Choosing the best Speech-to-Text API, AI model, or open source engine to build with can be challenging. You’ll need to compare accuracy, model design, features, support options, documentation, security, and more.
But what if you have a small project to complete? Or simply want to experiment with an API or AI model before committing to building with one?
This post compares the best free Speech-to-Text APIs and AI models on the market today, including ones that have a free tier, to help you make an informed decision. We’ll also look at several free open source Speech-to-Text engines and explore why you might choose an API or AI model versus an open source library, or vice versa.
Free Speech-to-Text APIs and AI Models
APIs and AI models are typically more accurate, easier to integrate, and come with more out-of-the-box features than open source options. However, large-scale use of APIs and AI models usually comes with a cost.
But if you’re looking to use an API or AI model for a small project or for a trial run, many of today’s Speech-to-Text APIs and AI models have a free tier. This means that the API or model is free for anyone to use up to a certain volume per day, per month, or per year.
Let’s look at three of the most popular Speech-to-Text APIs and AI models with a free tier: AssemblyAI, Google, and AWS Transcribe.
AssemblyAI, an API platform for state-of-the-art AI models, is a leading name in the Speech-to-Text API market. This AI startup is growing quickly thanks to industry-best accuracy, an easy-to-use interface, and cutting-edge AI models such as Speaker Diarization, Topic Detection, Entity Detection, Automated Punctuation and Casing, Content Moderation, Sentiment Analysis, Text Summarization, and more. The company also just released LeMUR, the easiest way to build LLM apps on spoken data.
AssemblyAI recently improved transcription accuracy further with the release of its Conformer-2 model, which was trained on 1.1M hours of audio data. The model also provides improvements on proper nouns, alphanumerics, and robustness to noise.
The company offers several free transcription hours for audio files or video streams per month before transitioning to an affordable paid tier.
Its high accuracy and collection of AI models like Speaker Diarization and Sentiment Analysis make AssemblyAI a sound option for developers looking for a free Speech-to-Text API. The API also supports virtually every audio and video file format out-of-the-box for easier transcription.
AssemblyAI has expanded the languages it supports to include English, Spanish, French, German, Japanese, Korean, and many more, with additional languages being released monthly. The full list is available in the AssemblyAI Docs.
AssemblyAI’s easy-to-use models also allow for quick set-up and transcription in any programming language. You can even copy/paste code examples in your preferred language directly from the AssemblyAI Docs or use the AssemblyAI Python SDK.
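As a rough illustration of the Python SDK route mentioned above, here is a minimal sketch. It assumes the `assemblyai` package is installed and that your API key is stored in an environment variable named `ASSEMBLYAI_API_KEY` (that variable name and the file path are assumptions made for this example, not something the SDK requires).

```python
import os


def transcribe_file(audio_source: str) -> str:
    """Transcribe a local file path or public URL and return the text."""
    # Assumption: the key lives in this environment variable.
    api_key = os.environ.get("ASSEMBLYAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set ASSEMBLYAI_API_KEY before transcribing.")

    # Imported lazily so this module loads even if the SDK is not installed.
    import assemblyai as aai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio_source)

    # Surface API-side failures instead of returning empty text.
    if transcript.status == aai.TranscriptStatus.error:
        raise RuntimeError(transcript.error)
    return transcript.text


if __name__ == "__main__":
    # Hypothetical local file; replace with your own audio.
    print(transcribe_file("./meeting.mp3"))
```

Reading the key from the environment keeps it out of source control, which matters even for a small trial project.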
Google Speech-to-Text is a well-known speech transcription API. Google gives users 60 minutes of free transcription, with $300 in free credits for Google Cloud hosting.
However, since Google only supports transcribing files already in a Google Cloud Storage bucket, the free credits won’t get you very far. Google can also be a bit difficult to get started with: even to use the free tier, you need to sign up for a GCP account and project, a process that is surprisingly complicated.
Still, with good accuracy and 63+ languages supported, Google is a decent choice if you’re willing to put in some initial work.
AWS Transcribe offers one hour free per month for the first 12 months of use.
As with Google, you must first create an account if you don’t already have one, and setting up an AWS account is a complex process. AWS also has lower accuracy compared to alternative APIs and only supports transcribing files already in an Amazon S3 bucket.
However, if you’re looking for a specific feature, like medical transcription, AWS has some intriguing options. Its Transcribe Medical API is a medical-focused ASR option that is available today.
Open Source Speech-to-Text Transcription Engines
An alternative to APIs and AI models, open source Speech-to-Text libraries are completely free, with no limits on use. Some developers also see data security as a plus, since your data doesn’t have to be sent to a third party or to the cloud.
Be warned: open source engines take significant effort, so you must be comfortable putting in a lot of work to get the results you want, especially if you are trying to use these libraries at scale. Open source Speech-to-Text engines are also typically less accurate than the APIs discussed above.
If you want to go the open source route, however, here are some options worth exploring:
DeepSpeech is an open source embedded Speech-to-Text engine designed to run in real-time on a range of devices, from high-powered GPUs to a Raspberry Pi 4. The DeepSpeech library uses an end-to-end model architecture pioneered by Baidu.
DeepSpeech also has decent out-of-the-box accuracy for an open source option, and is easy to fine-tune and train on your own data.
Kaldi is a speech recognition toolkit that has been widely popular in the research community for many years.
Like DeepSpeech, Kaldi has good out-of-the-box accuracy and supports the ability to train your own models. It’s also been thoroughly tested: many companies use Kaldi in production and have done so for a while, which gives more developers confidence in applying it.
Wav2Letter is Facebook AI Research’s Automatic Speech Recognition (ASR) Toolkit. It is written in C++ and uses the ArrayFire tensor library.
Like DeepSpeech, Wav2Letter is decently accurate for an open source library and is easy to work with on a small project.
Overall, the platform is well-defined and constantly updated, making it a straightforward tool for training and finetuning.
Coqui is another deep learning toolkit for Speech-to-Text transcription. It is used for projects in over twenty languages and also offers a variety of essential inference and productionization features.
The platform also releases custom trained models and has bindings for various programming languages for easier deployment.
Whisper by OpenAI, released in September 2022, is comparable to other current state-of-the-art open source options.
Whisper can be used either in Python or from the command line and can also be used for multilingual translation.
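For the Python route, a minimal sketch might look like the following. It assumes the open source `openai-whisper` package and `ffmpeg` are installed; the helper name, the audio path, and the choice to validate only the five base size names (leaving out variants such as the English-only `.en` models) are illustrative assumptions.

```python
# The five base model sizes, smallest to largest.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


def transcribe_with_whisper(audio_path: str, size: str = "base") -> str:
    """Load a Whisper model of the given size and return the transcript text."""
    if size not in WHISPER_SIZES:
        raise ValueError(f"size must be one of {WHISPER_SIZES}, got {size!r}")

    # Imported lazily: the package is large and may not be installed.
    import whisper

    model = whisper.load_model(size)
    result = model.transcribe(audio_path)
    return result["text"]


if __name__ == "__main__":
    # Hypothetical file name; roughly equivalent command line usage:
    #   whisper audio.mp3 --model base
    print(transcribe_with_whisper("audio.mp3"))
```

Smaller sizes trade accuracy for speed, so starting with `base` and moving up only if the output is not good enough is a reasonable way to experiment.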
Whisper comes in five model sizes with varying capabilities, so you can pick one to match your use case. However, running Whisper at a large scale requires a fast GPU (the other open source options can reasonably be used on CPU) and an in-house team to maintain, scale, update, and monitor the model, which makes its total cost of ownership higher than that of other options.
As of March 2023, Whisper is also available via API, with increased speed and more cost-effective results. On-demand pricing starts at $0.006/minute.
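To put a per-minute price such as the $0.006/minute quoted above in perspective, a small back-of-the-envelope calculation helps. The function below is only a sketch: the `free_minutes` parameter is a hypothetical allowance added so the same arithmetic can be reused for services with a free tier, and real pricing may involve tiers or rounding rules not modeled here.

```python
def monthly_cost(
    minutes: float,
    price_per_minute: float = 0.006,  # on-demand price quoted in this post
    free_minutes: float = 0.0,        # hypothetical free allowance
) -> float:
    """Estimate a monthly bill in dollars for a given volume of audio."""
    billable = max(0.0, minutes - free_minutes)
    return round(billable * price_per_minute, 2)


if __name__ == "__main__":
    # 100 hours of audio in a month at the default price.
    print(monthly_cost(100 * 60))
    # 30 minutes of audio against a 60-minute free allowance costs nothing.
    print(monthly_cost(30, free_minutes=60))
```

Running the numbers for your expected volume makes it easier to judge when a free tier stops being enough and when self-hosting an open source engine might become worthwhile.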
Which Free Speech-to-Text API, AI model, or Open Source Engine is Right for Your Project?
The best free Speech-to-Text API, AI model, or open source engine will depend on your project. Do you have a small project and want something that is easy to use, has high accuracy, and includes additional out-of-the-box features? If so, one of these APIs might be right for you: AssemblyAI, Google Speech-to-Text, or AWS Transcribe.
Alternatively, you might want a completely free option with no data limits, if you don’t mind the extra work it will take to tailor a toolkit to your needs. If so, you might choose one of these open source libraries: DeepSpeech, Kaldi, Wav2Letter, Coqui, or Whisper.
Whichever you choose, make sure you find a product that can meet the needs of your project now and continue to meet them as your project grows.