How to Choose the Best Speech-to-Text API

With more Speech-to-Text APIs on the market than ever before, how do you choose the best one for your product or use case? Answering these six questions is a great starting point.

Speech-to-text recognition technology has come a long way since Bell Laboratories invented “Audrey” in the 1950s. Audrey could only comprehend numbers, and it wasn’t until a decade later that researchers added rudimentary word comprehension.

Today, the accuracy of Speech-to-Text recognition and AI transcription is fast approaching human levels. Cutting-edge AI research has also pushed Speech-to-Text beyond asynchronous transcription to include real-time transcription, which has brought with it a boom in products and services leveraging Speech-to-Text technology.

Significant strides in accuracy, accessibility, and affordability mean more companies are looking for industry-best Speech-to-Text APIs to power innovative products and features. With more Speech-to-Text APIs on the market than ever before, how do you choose the best one for your product or use case? Answering these six questions is a great starting point:

1. How accurate is the API?

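Accuracy is the most important consideration when comparing APIs. Word Error Rate, or WER, is the standard measure of accuracy for an Automatic Speech Recognition (ASR) system. It’s calculated by comparing the ASR model’s transcribed text to a human transcription: the number of substituted, deleted, and inserted words is divided by the total number of words in the human transcription, so a lower WER means a more accurate transcript.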

The most thorough way to assess the accuracy of Speech-to-Text software or an API is to compute the WER on your own audio/video files. This approach takes a lot of work, though. Among other steps, it requires getting your audio/video files transcribed by a human, transcribing those same files with each Speech-to-Text API you are evaluating, and then computing the WER for each API against the human transcripts.
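As a rough illustration of that last step, here is a minimal Python sketch of a WER calculation. It assumes WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the human reference. The function name `word_error_rate` and the sample sentences are made up for this example, and the only text normalization applied is lowercasing; a real evaluation would likely also normalize punctuation and numbers before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; punctuation is left untouched.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic-programming table, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one deletion ("the").
human = "the cat sat on the mat"
api_output = "the cat sit on mat"
print(f"WER: {word_error_rate(human, api_output):.2%}")
```

Running the same function over every file and every API under evaluation, then averaging per API, gives a like-for-like comparison on your own data.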

If you’re looking for an accurate speech-to-text API that works on noisy audio, Conformer-2 by AssemblyAI is now available. Conformer-2 is a state-of-the-art speech recognition model trained on 1.1M hours of audio data that achieves near-human level performance and robustness across a variety of data.

Another great resource for comparing API accuracy is Diffchecker. Diffchecker lets you compare two blocks of text (say, from two different APIs, or from one API and one human transcription) and shows you what has been added and what has been removed. It also lets you eyeball the differences between two large blocks of text for a quick comparison.

When using Diffchecker, ask yourself: what did the APIs miss? Are proper nouns capitalized? Does the speaker’s accent or dialect throw off the transcription? Was context a factor?

In one example comparison, Diffchecker reported 12 removals in Text 1 and 11 additions in Text 2. Looking closely at the highlighted differences reveals nuances such as “black as” in Text 1 vs. “Black is” in Text 2.

Together, WER and Diffchecker can be powerful tools for determining accuracy.
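If you would rather script this kind of comparison than paste text into a web tool, a similar word-level diff can be sketched with Python's standard `difflib` module. This is not how Diffchecker itself works; it is only an approximation of the same idea. The helper name `diff_transcripts` and the sample sentences are invented for illustration.

```python
import difflib


def diff_transcripts(text_a: str, text_b: str) -> list[tuple[str, str, str]]:
    """List word-level differences as (operation, words_in_a, words_in_b) tuples."""
    words_a, words_b = text_a.split(), text_b.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long transcripts.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    changes = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert operations
            changes.append(
                (tag, " ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return changes


# Hypothetical outputs from two different APIs for the same audio.
for change in diff_transcripts("the night was black as coal", "the night was Black is coal"):
    print(change)
```

Unlike the WER calculation, this comparison is case-sensitive on purpose, so capitalization differences in proper nouns show up in the output.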

2. What additional features and models does the API offer?

Next, you should see what additional features the API offers. This will help you get more out of the raw transcription.

Common AI-powered features include Speaker Diarization, Sentiment Analysis, and summarization, among others.

When choosing a Speech-to-Text API, you should also evaluate how often new features are released and how often the models are updated.

The best Speech-to-Text APIs have an AI research team on-staff working to continuously improve the AI models based on new AI breakthroughs. In the field of ASR, some features have a long way to go before they reach human accuracy. The API you choose should always be working to improve its models and to boost accuracy.

Make sure you check the API’s changelog and updates, which should be transparent and easily accessible. For example, AssemblyAI ships updates weekly via its publicly accessible changelog. If an API doesn’t have a changelog, or doesn’t update it very often, this is a red flag.

3. What kind of support can you expect?

Too often, APIs offered by big tech companies like Google Cloud and AWS go unsupported and are infrequently updated.

It’s inevitable that you’ll have questions or need support as you leverage a Speech-to-Text API to build new features into your product. This is why you should look for an API that offers dedicated, quick support to you and your team of developers. Support should be offered 24/7 via multiple channels such as email, messaging, or Slack.

You should be assigned a dedicated account manager and a support engineer who offer integration support, provide quick turnaround on support requests, and help you figure out the best features to integrate.

Also consider:

4. Does the API offer transparent pricing and documentation?

API pricing shouldn’t be a guessing game. All APIs you are considering should offer transparent, easy-to-decipher pricing as well as volume discounts for high levels of usage. A free trial for the API that lets you explore the API before committing to purchase is even better.
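When pricing is published clearly, estimating your bill is simple arithmetic. The sketch below assumes graduated volume pricing, where each tier's rate applies only to the hours that fall inside that tier. The function name `estimate_monthly_cost` and the tier boundaries and prices are entirely hypothetical and do not reflect any provider's actual rates.

```python
def estimate_monthly_cost(hours: float, tiers: list[tuple[float, float]]) -> float:
    """Estimate cost under graduated pricing.

    tiers is a list of (upper_bound_in_hours, price_per_hour) pairs in ascending
    order; the last upper bound should be float("inf").
    """
    cost = 0.0
    lower = 0.0  # start of the current tier, in hours
    for upper, price in tiers:
        if hours <= lower:
            break  # usage does not reach this tier
        billable = min(hours, upper) - lower  # hours that fall inside this tier
        cost += billable * price
        lower = upper
    return cost


# Hypothetical price list: first 1,000 hours at $0.50/hour, anything above at $0.40/hour.
example_tiers = [(1000, 0.50), (float("inf"), 0.40)]
print(f"${estimate_monthly_cost(1500, example_tiers):,.2f}")
```

Plugging each provider's published tiers into the same function makes it easier to compare costs at your expected volume rather than at the headline rate.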

Watch out for hidden extra costs that could substantially increase your bill. For example, Google Cloud’s Speech-to-Text API requires longer audio files to be hosted in Google Cloud Storage buckets. OpenAI’s Whisper API limits uploads to 25MB per file, so larger files must be split into chunks before transcription, which makes it harder to scale or process large files.

API documentation should also be readily and easily accessible. This will give you a better sense of how easy it will be to integrate the API into your application.

5. How secure is your data?

Data security is always a top consideration when integrating an API into your tech stack.

Before you choose a Speech-to-Text API, make sure you ask:

  • Does the API keep a copy of my audio/video files in order to improve its model?
  • Does the API keep a copy of my transcription files?
  • If it does keep a copy, can I request that it permanently delete my audio/video or transcription files at any time? How quickly will my request be met?
  • Does the API monetize my data?

Unfortunately, many APIs answer “yes” to the above questions, so don’t assume they prioritize your data security over their own gain! Instead, look for APIs like AssemblyAI that take data security seriously by answering “no” to each question.

6. Is innovation a priority?

The field of Speech-to-Text recognition is in a state of constant innovation. Any API you consider should have a strong focus on AI research.

Also ensure that the API directs its research toward frequent model updates. Features and models like Speaker Diarization and Sentiment Analysis still have a way to go to reach human accuracy levels, so it’s important that the team is constantly working to improve these areas using the latest advances in AI research.

The API’s changelogs are a good way to determine the difference between an API stating they prioritize innovation and an API demonstrating that they are truly innovating. Pay attention to descriptions of model versioning and how they split up model updates.

For example, AssemblyAI regularly ships detailed updates for features like Inverse Text Normalization (ITN) and Punctuation via its changelog. Others may have a changelog but give limited insight.

Comparing Speech-to-Text APIs

There is obviously a lot to think about when comparing Speech-to-Text APIs!

To summarize, here are the key questions to ask each API:

  1. How accurate is the API?
  2. What additional features does the API offer?
  3. What kind of support can you expect?
  4. Does the API offer transparent pricing and documentation?
  5. How secure is your data?
  6. Is innovation a priority?

Taking the time to do research now will set you up for long-term success with your Speech-to-Text API partner.

If you’re interested in trying our API, get your free speech-to-text API key and transcribe your own audio data.

Are there free speech-to-text APIs?

Yes! There are several free open source Speech-to-Text engines, as well as providers that offer a free tier. Compare open source engines with Speech-to-Text APIs to see which fits your use case.

Is Whisper API the same as AssemblyAI?

AssemblyAI trains speech recognition models by applying and adapting breakthrough AI research from top AI research labs, such as OpenAI. Whisper is a great speech recognition model for small batch transcription, but there are some limitations if you're looking to build at scale. You can learn how to run Whisper here.

What can I do with my Speech-to-Text transcriptions?

Depending on your use case, you may want to build features and products that summarize your text into digestible bullets, detect sentiment in speech, and more. Browse through our AI models to see what you can build.

Is AssemblyAI secure and safe to use?

Yes, AssemblyAI uses enterprise-grade security practices to keep all data safe.

Can speech recognition lower churn on enterprise products?

Yes. Creating innovative features, such as customer call summaries and recommendation engines, is one of the ways brands are keeping customers engaged. Contact us directly to discuss your use case.