The top free Speech-to-Text APIs, AI Models, and Open Source Engines

This post compares the best free Speech-to-Text APIs and AI models on the market today, including APIs that have a free tier. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API vs. an open-source library, or vice versa.

Choosing the best Speech-to-Text API, AI model, or open-source engine to build with can be challenging. You need to compare accuracy, model design, features, support options, documentation, security, and more.

Free Speech-to-Text APIs and AI Models

APIs and AI models are typically more accurate, easier to integrate, and ship with more out-of-the-box features than open-source options. However, large-scale use of APIs and AI models can cost more than running an open-source option yourself.

If you’re looking to use an API or AI model for a small project or a trial run, many of today’s Speech-to-Text APIs and AI models have a free tier. This means that the API or model is free for anyone to use up to a certain volume per day, per month, or per year.

Let’s compare three of the most popular Speech-to-Text APIs and AI models with a free tier: AssemblyAI, Google, and AWS Transcribe.

AssemblyAI

AssemblyAI is an API platform that offers AI models that accurately transcribe and understand speech, and enable users to extract insights from voice data. AssemblyAI offers cutting-edge AI models such as Speaker Diarization, Topic Detection, Entity Detection, Automated Punctuation and Casing, Content Moderation, Sentiment Analysis, Text Summarization, and more. These AI models help users get more out of voice data, with continuous improvements being made to accuracy.

AssemblyAI also offers LeMUR, which enables users to leverage Large Language Models (LLMs) to pull valuable information from their voice data—including answering questions, generating summaries and action items, and more. 

The company offers up to 100 free transcription hours for audio files or video streams, with a concurrency limit of 5, before transitioning to an affordable paid tier.
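A concurrency limit of 5 means at most five transcription jobs can be in flight at once. One way a client might stay under that cap is a bounded worker pool. The sketch below uses only the Python standard library; `fake_transcribe` is a hypothetical stand-in for a real API request, and the file names are made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # the free-tier concurrency limit quoted above

_lock = threading.Lock()
_active = 0  # jobs currently in flight
peak = 0     # highest number of simultaneous jobs observed


def fake_transcribe(name: str) -> str:
    """Hypothetical stand-in for a real transcription request."""
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # simulate network and processing latency
    with _lock:
        _active -= 1
    return f"transcript of {name}"


def transcribe_all(names):
    """Run every job, never more than MAX_CONCURRENCY at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(fake_transcribe, names))


results = transcribe_all([f"call_{i}.mp3" for i in range(12)])
```

Sizing the pool to the limit keeps the throttling in one place, so the rest of the code can submit as many files as it likes.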

Its high accuracy and diverse collection of AI models built by AI experts make AssemblyAI a sound option for developers looking for a free Speech-to-Text API. The API also supports virtually every audio and video file format out-of-the-box for easier transcription.

AssemblyAI supports English, Spanish, French, German, Japanese, Korean, and many more languages, with additional languages being released monthly. See the full list here.

AssemblyAI’s easy-to-use models also allow for quick set-up and transcription in any programming language. You can copy/paste code examples in your preferred language directly from the AssemblyAI Docs or use the AssemblyAI Python SDK or another one of its ready-to-use integrations.
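As a rough illustration of what that set-up can look like with the Python SDK, here is a minimal sketch modeled on the SDK's quick-start pattern. The wrapper function `transcribe_audio` and the example URL are hypothetical, and the SDK calls should be checked against the current AssemblyAI Docs before use.

```python
def transcribe_audio(audio: str, api_key: str) -> str:
    """Transcribe a local file path or public URL and return the text.

    `transcribe_audio` is an illustrative wrapper, not part of the SDK.
    """
    if not api_key:
        raise ValueError("an AssemblyAI API key is required")

    # Imported lazily so the module loads even without the SDK installed.
    # Install with: pip install assemblyai
    import assemblyai as aai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio)

    # The SDK reports failures on the transcript object
    if transcript.status == aai.TranscriptStatus.error:
        raise RuntimeError(transcript.error)
    return transcript.text


# Example call (placeholder URL and key):
# text = transcribe_audio("https://example.com/meeting.mp3", api_key="YOUR_API_KEY")
```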

Pricing

  • Free to test in the AI playground, plus 100 free hours of asynchronous transcription with an API sign-up
  • Speech-to-Text – $0.37 per hour
  • Real-time Transcription – $0.47 per hour
  • Audio Intelligence – varies, $0.01 to $0.15 per hour
  • LeMUR – varies
  • Enterprise pricing is also available

See the full pricing list here.
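To see how the free hours and the hourly rates combine, the arithmetic can be sketched as below. This is an estimate only: it assumes the 100 free hours are a one-time allowance that applies to asynchronous transcription, and the function name `estimate_cost` and the usage figures are made up for illustration.

```python
FREE_ASYNC_HOURS = 100  # free asynchronous hours from the list above
ASYNC_RATE = 0.37       # USD per hour, Speech-to-Text
REALTIME_RATE = 0.47    # USD per hour, Real-time Transcription


def estimate_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Return the estimated bill in USD after subtracting any free allowance."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    billable = max(0.0, hours - free_hours)
    return round(billable * rate_per_hour, 2)


# 250 hours of recorded audio: the first 100 are free, 150 are billed
async_cost = estimate_cost(250, ASYNC_RATE, FREE_ASYNC_HOURS)

# 40 hours of live audio, with no free allowance assumed
realtime_cost = estimate_cost(40, REALTIME_RATE)
```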

Pros 

  • High accuracy
  • Breadth of AI models available, built by AI experts
  • Continuous model iteration and improvement
  • Developer-friendly documentation and SDKs
  • Enterprise-grade support and security

Cons

  • Models are not open-source

Google

Google Speech-to-Text is a well-known speech transcription API. Google gives users 60 minutes of free transcription, with $300 in free credits for Google Cloud hosting.

Google only supports transcribing files already in a Google Cloud Bucket, so the free credits won’t get you very far. Google also requires you to sign up for a GCP account and project — whether you're using the free tier or paid.

With good accuracy and 125+ languages supported, Google is a decent choice if you’re willing to put in some initial work.

Pricing

  • 60 minutes of free transcription
  • $300 in free credits for Google Cloud hosting

Pros 

  • Free tier
  • Decent accuracy
  • Multi-language support

Cons

  • Only supports transcription of files in a Google Cloud Bucket
  • Difficult to get started
  • Lower accuracy than other similarly-priced APIs

AWS Transcribe

AWS Transcribe offers one hour free per month for the first 12 months of use.

As with Google, you must create an AWS account first if you don’t already have one. AWS also has lower accuracy compared to alternative APIs and only supports transcribing files already in an Amazon S3 bucket.

However, if you’re looking for a specific feature, like medical transcription, AWS has some options. Its Transcribe Medical API is a medical-focused ASR option that is available today.

Pricing

  • One hour free per month for the first 12 months of use
  • Tiered pricing, based on usage, ranges from $0.02400 down to $0.00780 per minute
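Tiered pricing means each block of usage is billed at its own rate, so the effective price falls as volume grows. The sketch below shows the calculation, assuming the quoted rates are per minute. Only the highest and lowest rates come from the list above; the tier sizes and the two middle rates are assumptions for illustration and should be checked against the AWS pricing page.

```python
# (minutes in tier, USD per minute); None means no upper bound.
# Tier sizes and the two middle rates are assumed, not taken from the article.
TIERS = [
    (250_000, 0.02400),
    (750_000, 0.01500),
    (4_000_000, 0.01020),
    (None, 0.00780),
]
FREE_MINUTES_PER_MONTH = 60  # one free hour per month, first 12 months only


def monthly_cost(minutes: int, free_minutes: int = FREE_MINUTES_PER_MONTH) -> float:
    """Bill each tier's share of usage at that tier's rate."""
    remaining = max(0, minutes - free_minutes)
    total = 0.0
    for size, rate in TIERS:
        if remaining <= 0:
            break
        used = remaining if size is None else min(remaining, size)
        total += used * rate
        remaining -= used
    return round(total, 2)


small_project = monthly_cost(1_060)    # 1,000 billable minutes, all in the first tier
large_project = monthly_cost(300_060)  # spills over into the second tier
```

Pass `free_minutes=0` to model usage after the first 12 months.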

Pros 

  • Integrates into existing AWS ecosystem
  • Medical language transcription
  • Decent accuracy

Cons

  • Difficult to get started from scratch
  • Only supports transcribing files already in an Amazon S3 bucket
  • Lower accuracy than other similarly-priced APIs

Open-Source Speech Transcription engines

As an alternative to APIs and AI models, open-source Speech-to-Text libraries are completely free, with no limits on use. Some developers also see data security as a plus, since your data doesn’t have to be sent to a third party or the cloud.

There is work involved with open-source engines, so you must be comfortable putting in a lot of time and effort to get the results you want, especially if you are trying to use these libraries at scale. Open-source Speech-to-Text engines are typically less accurate than the APIs discussed above.
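If you want to check accuracy on your own audio rather than rely on general claims, one common metric is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the engine's output into a reference transcript, divided by the number of reference words. The following is a minimal sketch, not a full evaluation harness. It lowercases and splits on whitespace only, whereas real benchmarks usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only two rows of the table.
    # prev[j] holds the distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the quick brown fox", "the quick brown box")
```

Running the same reference transcripts through each candidate engine and comparing the scores gives a like-for-like view of accuracy on your own data.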

If you want to go the open-source route, here are some options worth exploring:

DeepSpeech

DeepSpeech is an open-source embedded Speech-to-Text engine designed to run in real-time on a range of devices, from high-powered GPUs to a Raspberry Pi 4. The DeepSpeech library uses an end-to-end model architecture pioneered by Baidu.

DeepSpeech also has decent out-of-the-box accuracy for an open-source option and is easy to fine-tune and train on your own data.

Pros 

  • Easy to customize
  • Can use it to train your own model
  • Can be used on a wide range of devices

Cons

  • Lack of support
  • No model improvement outside of individual custom training
  • Heavy lift to integrate into production-ready applications

Kaldi

Kaldi is a speech recognition toolkit that has been widely popular in the research community for many years.

Like DeepSpeech, Kaldi has good out-of-the-box accuracy and supports the ability to train your own models. It’s also been thoroughly tested—a lot of companies currently use Kaldi in production and have used it for a while—making more developers confident in its application.

Pros 

  • Decent accuracy
  • Can use it to train your own models
  • Active user base

Cons

  • Can be complex and expensive to use
  • Uses a command-line interface
  • Heavy lift to integrate into production-ready applications

Flashlight ASR (formerly Wav2Letter)

Flashlight ASR, formerly Wav2Letter, is Facebook AI Research’s Automatic Speech Recognition (ASR) Toolkit. It is written in C++ and uses the ArrayFire tensor library.

Like DeepSpeech, Flashlight ASR is decently accurate for an open-source library and is easy to work with on a small project.

Pros 

  • Customizable
  • Easier to modify than other open-source options
  • Processing speed

Cons

  • Very complex to use
  • No pre-trained libraries available
  • Need to continuously source datasets for training and model updates, which can be difficult and costly

SpeechBrain

SpeechBrain is a PyTorch-based transcription toolkit. The platform releases open implementations of popular research works and offers a tight integration with Hugging Face for easy access.

Overall, the platform is well-defined and constantly updated, making it a straightforward tool for training and fine-tuning.

Pros 

  • Integration with PyTorch and Hugging Face
  • Pre-trained models are available
  • Supports a variety of tasks

Cons

  • Even its pre-trained models take a lot of customization to make them usable
  • Limited documentation makes it less user-friendly for anyone without extensive experience

Coqui

Coqui is another deep learning toolkit for Speech-to-Text transcription. Coqui is used for projects in over twenty languages and also offers a variety of essential inference and productionization features.

The platform also releases custom-trained models and has bindings for various programming languages for easier deployment.

Pros 

  • Generates confidence scores for transcripts
  • Large support community
  • Pre-trained models are available

Cons

  • No longer updated and maintained by Coqui
  • No model improvement outside of individual custom training
  • Heavy lift to integrate into production-ready applications

Whisper

Whisper by OpenAI, released in September 2022, is comparable to other current state-of-the-art open-source options.

Whisper can be used either in Python or from the command line and can also be used for multilingual translation.
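A minimal sketch of the Python route is shown below, using the open-source `openai-whisper` package. The wrapper function `transcribe_with_whisper` and the file name are hypothetical, and running it requires the package plus ffmpeg to be installed locally.

```python
# The five model sizes published with the open-source package
MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


def transcribe_with_whisper(audio_path: str, model_size: str = "base",
                            translate: bool = False) -> str:
    """Transcribe an audio file, or translate its speech into English.

    `transcribe_with_whisper` is an illustrative wrapper, not part of Whisper.
    """
    if model_size not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}")

    # Imported lazily so the module loads even without Whisper installed.
    # Install with: pip install -U openai-whisper
    import whisper

    model = whisper.load_model(model_size)
    task = "translate" if translate else "transcribe"
    result = model.transcribe(audio_path, task=task)
    return result["text"]


# Example call (placeholder file name):
# text = transcribe_with_whisper("interview.mp3", model_size="small")
#
# Rough command-line equivalent:
#   whisper interview.mp3 --model small
```

Smaller models run faster and need less memory, while larger ones are generally more accurate, so the model size is the main knob to tune for a given machine.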

Whisper has five different models of varying sizes and capabilities, depending on the use case, including v3 released in November 2023.

However, running Whisper at a large scale requires substantial computing power and an in-house team to maintain, scale, update, and monitor the model, making the total cost of ownership higher compared to other options.

As of March 2023, Whisper is also now available via API. On-demand pricing starts at $0.006/minute.

Pros 

  • Multilingual transcription
  • Can be used in Python
  • Five models are available, each with different sizes and capabilities

Cons

  • Need an in-house research team to maintain and update
  • Costly to run
  • Heavy lift to integrate into production-ready applications

Which free Speech-to-Text API, AI model, or Open Source engine is right for your project?

The best free Speech-to-Text API, AI model, or open-source engine will depend on your project. Do you want something that is easy to use, has high accuracy, and has additional out-of-the-box features? If so, one of these APIs might be right for you:

  • AssemblyAI
  • Google
  • AWS Transcribe

Alternatively, you might want a completely free option with no data limits, if you don’t mind the extra work it will take to tailor a toolkit to your needs. If so, you might choose one of these open-source libraries:

  • DeepSpeech
  • Kaldi
  • Flashlight ASR (formerly Wav2Letter)
  • SpeechBrain
  • Coqui
  • Whisper

Whichever you choose, make sure it can meet your project’s needs today and keep meeting them as the project grows.