Speech-to-Text recognition technology has come a long way since Bell Laboratories invented “Audrey” in the 1950s. Audrey could only comprehend numbers, and it wasn’t until a decade later that researchers added rudimentary word comprehension.
Today, the accuracy of AI Speech-to-Text transcription is fast approaching human levels. Machine learning and deep learning research have also pushed Speech-to-Text beyond asynchronous transcription to include real-time transcription, which has brought with it a boom in products and services that leverage the technology.
Significant strides in accuracy, accessibility, and affordability mean more companies are looking for industry-best Speech-to-Text APIs to power innovative products and features. With more Speech-to-Text APIs on the market than ever before, how do you choose the best one for your product or use case? Answering these six questions is a great starting point:
1. How Accurate is the API?
Accuracy is the most important consideration when comparing APIs. Word Error Rate, or WER, is the measure of accuracy of an Automatic Speech Recognition (ASR) system. It’s calculated by comparing the ASR’s transcription to a human transcription.
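WER is commonly defined as the number of word-level substitutions, deletions, and insertions needed to turn the ASR output into the human reference transcript, divided by the number of words in the reference. The Python sketch below computes it with a standard edit-distance table. The function name and the normalization (lowercasing and whitespace splitting only) are assumptions for illustration; a real benchmark would usually also normalize punctuation and number formats.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "black as night" with the hypothesis "black is night" gives one substitution out of three reference words. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.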
The most thorough way to gauge the accuracy of a Speech-to-Text API is to determine the WER on your own audio/video files. This takes real effort: among other steps, you need to have your files transcribed by a human, transcribe the same files with each Speech-to-Text API you are evaluating, and then compute the WER of each API against the human transcripts.
Because this process is laborious and error-prone, look for Speech-to-Text APIs that offer a free benchmark report on your audio/video files. The report should include the WER results, along with other key details such as features and pricing.
It should be noted that while WER is the recognized standard for determining accuracy, it is not without its limitations. For example, WER fails to take context into account--a more accurate transcription does not always equal a more understandable transcription. However, it’s always a good starting point for comparison.
Another great resource for comparing API accuracy is Diffchecker. Diffchecker lets you compare two blocks of text--say from two different APIs or from one API and one human transcription--and shows you what has been added and what has been removed. It also lets you eyeball the differences between two large blocks of text for a quick comparison.
When using Diffchecker, ask yourself: what did the APIs miss? Are proper nouns capitalized? Does the speaker’s accent or dialect throw off the transcription? Was context a factor?
See this text comparison using Diffchecker as an example:
As you can see, Text 1 has 12 removals and Text 2 has 11 additions. Look closely at the highlighted text to spot some of the nuances, such as “black as” in Text 1 vs. “Black is” in Text 2.
Together, WER and Diffchecker can be powerful tools for determining accuracy.
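If you want to script a similar comparison across many transcripts, a word-level diff can be approximated with Python's standard `difflib` module. This is a minimal sketch, not how Diffchecker itself works; the function name `word_diff` is hypothetical, and the comparison is deliberately case-sensitive so that capitalization differences show up.

```python
import difflib


def word_diff(text_a, text_b):
    """Return (removed, added): words only in text_a and words only in text_b."""
    a_words = text_a.split()
    b_words = text_b.split()
    # autojunk=False keeps frequent words from being ignored in long transcripts.
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)

    removed, added = [], []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(a_words[a_start:a_end])
        if tag in ("replace", "insert"):
            added.extend(b_words[b_start:b_end])
    return removed, added


removed, added = word_diff("black as night", "Black is night")
print("removed:", removed)
print("added:", added)
```

The removed and added word lists give a rough, scriptable counterpart to the removals and additions that a visual diff highlights.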
2. What Additional Features does the API Offer?
Next, you should see what additional features the API offers. This will help you get more out of the raw transcription.
Common AI-powered features include:
- Speaker diarization
- PII redaction
- Auto transcript highlights
- Topic detection
- Content safety detection
- Paragraph detection
- Confidence scores
- Automatic punctuation and casing
- Profanity filtering
- Custom vocabulary
When choosing a Speech-to-Text API, you should also evaluate how often new features are released and how often the models are updated.
The best Speech-to-Text APIs have a deep learning research team on staff working to continuously improve the API and release new features, such as AssemblyAI’s recent improvements to Topic Detection. In the field of ASR, some features have a long way to go before they reach human accuracy. The API you choose should always be working to improve its models and to boost accuracy.
Make sure you check the API’s changelog and updates, which should be transparent and easily accessible. For example, AssemblyAI ships updates weekly via its publicly accessible changelog. If an API doesn’t have a changelog, or doesn’t update it very often, this is a red flag.
3. What Kind of Support Can You Expect?
Too often, APIs offered by big tech companies like Google Cloud and AWS go unsupported and are infrequently updated.
It’s inevitable that you’ll have questions or need support as you leverage a Speech-to-Text API to build new features into your product. This is why you should look for an API that offers dedicated, quick support to you and your team of developers. Support should be offered 24/7 via multiple channels such as email, messaging, or Slack.
You should be assigned a dedicated account manager and support engineer who offer integration support, provide quick turnaround on support requests, and help you figure out the best features to integrate.
Other signals of reliable support to look for include:
- Uptime reports (should be at or near 100%)
- Customer reviews and awards on sites like G2
- Accessible changelog with detailed and frequent updates, as discussed above
4. Does the API Offer Transparent Pricing and Documentation?
API pricing shouldn’t be a guessing game. All APIs you are considering should offer transparent, easy-to-decipher pricing as well as volume discounts for high levels of usage. A free trial that lets you explore the API before committing to a purchase is even better.
Watch out for hidden extra costs that could substantially increase your bill. For example, Google Cloud’s Speech-to-Text API expects longer recordings to be hosted in Google Cloud Storage buckets, which adds storage and data transfer charges on top of transcription.
API documentation should also be readily and easily accessible. This will give you a better sense of how easy it will be to integrate the API into your application.
5. How Secure is Your Data?
Data security is always a top consideration when integrating an API into your tech stack.
Before you choose a Speech-to-Text API, make sure you ask:
- Does the API keep a copy of my audio/video files in order to improve its model?
- Does the API keep a copy of my transcription files?
- If it does keep a copy, can I request that it permanently delete my audio/video or transcription files at any time? How quickly will my request be met?
- Does the API monetize my data?
Unfortunately, many API providers answer “yes” to the above questions--don’t assume they prioritize your data security over their own gain! Instead, look for APIs like AssemblyAI that take data security seriously by answering “no” to each question.
6. Is Innovation a Priority?
The field of Speech-to-Text recognition is in a state of constant innovation. Any API you consider should have a strong focus on AI and deep learning research.
Also ensure that the API directs its research toward frequent model updates. Features like speaker diarization and sentiment analysis still have a way to go to reach human accuracy levels, so it’s important that the team is constantly working to improve these areas using the latest advances in deep learning and AI research.
The API’s changelogs are a good way to determine the difference between an API stating they prioritize innovation and an API demonstrating that they are truly innovating. Pay attention to descriptions of model versioning and how they split up model updates.
For example, AssemblyAI regularly ships detailed updates for features like Inverse Text Normalization (ITN) and Punctuation via its changelog. Others may have a changelog but give limited insight.
Comparing Speech-to-Text APIs
There is obviously a lot to think about when comparing Speech-to-Text APIs!
To summarize, here are the key questions to ask each API:
- How accurate is the API?
- What additional features does the API offer?
- What kind of support can you expect?
- Does the API offer transparent pricing and documentation?
- How secure is your data?
- Is innovation a priority?
Taking the time to do research now will set you up for long-term success with your Speech-to-Text API partner.
If you’re interested in learning more, feel free to schedule a call with one of our Deployment Engineers at AssemblyAI!