In this benchmark report, we compare the transcription accuracy of AssemblyAI's latest v8 model architecture against Google Cloud Speech-to-Text and AWS Transcribe across a variety of audio use cases.
The use cases for Speech-to-Text transcription are almost limitless, but many of the most popular ones combine Automatic Speech Recognition (ASR) with meetings, phone calls, or media. Whether you are adding captions to your video content, reviewing phone calls to improve team member performance, or adding content safety and topic labels to a podcast, transcription accuracy is always a top priority.
Here, we will review audio files from a wide range of sources. Then we will present a side-by-side comparison showing which ASR model (AssemblyAI, Google Cloud Speech-to-Text, or AWS Transcribe) has the highest transcription accuracy.
We also share results for the same audio using AssemblyAI's AI models such as Topic Detection, Keyword Detection, PII Redaction, Content Safety Detection, Sentiment Analysis, Summarization, and Entity Detection.
Speech Recognition Accuracy
We used audio with a wide range of accents, audio quality, number of speakers, and industry-specific vocabularies. This included audio taken from product demos, tutorial videos, documentaries, podcasts, sports talk radio, and corporate earnings calls.
More about our dataset is below:
How We Calculate Accuracy
- First, we transcribe the files in our dataset automatically through the specified APIs (AssemblyAI, Google, and AWS).
- Second, we have human transcriptionists transcribe the same files to approximately 100% accuracy.
- Finally, we compare each API's transcription with our human transcription to calculate Word Error Rate (WER), described in more detail below.
In the table that follows, we outline the accuracy score that each transcription API achieved on each audio file. Each result is hyperlinked to a diff of the human transcript versus each API's automatically generated transcript. This helps highlight the key differences between the human transcripts and the automated transcripts.
The above accuracy scores were calculated using Word Error Rate (WER). WER is the industry-standard metric for calculating the accuracy of an Automatic Speech Recognition system. WER compares the API-generated transcription to the human transcription for each file, counting the number of insertions, deletions, and substitutions made by the automated system.
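To make the metric concrete, here is a minimal sketch of a WER calculation. WER is the number of substitutions, deletions, and insertions divided by the number of words in the human (reference) transcript. The function name `word_error_rate` is our own choice for illustration, and the sketch assumes both transcripts are already normalized and split on whitespace. It is not the exact tooling used to produce the scores in this report.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the word-level edit distance between the reference words
    # processed so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("hi my name is bob", "hi my name is rob"))  # 0.2
```

Because the denominator is the reference length, a transcript with many inserted words can produce a WER above 1.0.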
Before calculating the WER for a particular file, both the truth (human transcriptions) and the automated transcriptions must be normalized into the same format. This is an extremely important step that many teams miss, which results in misleading WER numbers.
That's because, according to the WER algorithm, "Hello" and "Hello!" would be labeled as two distinct words, since one has an exclamation mark and the other does not, which results in a skewed percentage. To perform the most accurate WER analysis, all punctuation and casing are removed from both the human and automated transcripts, and all numbers are converted to the spoken format, as outlined below.
Hi my name is Bob I am 72 years old.
hi my name is bob i am seventy two years old
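The normalization step above can be sketched in a few lines of Python. This is an illustrative sketch, not the normalizer used for this report: the helper names `spell_number` and `normalize` are hypothetical, and for brevity it only spells out whole numbers from 0 to 99, leaving longer numbers, decimals, and ordinals untouched.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer from 0 to 99 as space-separated words."""
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"


def normalize(text: str) -> str:
    """Lowercase, spell out small numbers, and strip punctuation."""
    text = text.lower()
    # Replace standalone one- or two-digit numbers with their spoken form.
    text = re.sub(r"\b\d{1,2}\b", lambda m: spell_number(int(m.group())), text)
    # Drop every character that is not a letter, digit, underscore, or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Hi my name is Bob I am 72 years old."))
    # hi my name is bob i am seventy two years old
```

Applying the same function to both the human and the automated transcript means that "Hello" and "Hello!" compare as the same word.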
Additional AssemblyAI Models
In addition to Speech-to-Text transcription, AssemblyAI provides NLP-based AI models. Below, we ran the same benchmark files through each of these models.
Our Topic Detection model uses the IAB Taxonomy and can classify transcription texts with up to 698 possible topics. For example, a Tesla release event streamed on Zoom would be classified with the "Automobiles>SelfDrivingCars" topic, among others.
You can find a list of all the topic categories that can be detected in the AssemblyAI Documentation.
AssemblyAI can also automatically extract keywords and phrases based on the transcription text. Below is an example of how the model works on a small sample of transcription text.
Hi I'm joy. Hi I'm Sharon. Do you have kids in school? I have grandchildren in school. Okay, well, my kids are in middle school in high school. Do you think there is anything wrong with the school system?
Using the same dataset, we included the topics and keywords automatically detected in each file below:
Audio and video files can contain sensitive customer information like credit card numbers, addresses, and phone numbers. Other sensitive information like birth dates and medical records can also be stored in recordings and transcripts. AssemblyAI offers PII Detection and Redaction for audio and video files transcribed using our API.
You can learn more about how this works in the AssemblyAI Documentation. The full list of entity types our PII Detection and Redaction model can detect is below:
With PII Detection and Redaction enabled, we ran the above files through our API to generate a redacted transcript for each recording. Below is an excerpt from one of the recordings with links to the full transcriptions.
Good afternoon. My name is [PERSON_NAME] and I will be your [OCCUPATION] [OCCUPATION] today. Hey, everyone, welcome to the [ORGANIZATION] one on one webinar. Before I get started, let's quickly cover a few housekeeping items. If you have any questions, we will have a Q and A session at the end, so please submit questions that pops up during the webinar. I have several colleagues on the line to help you answer questions during the Webinar. We have [OCCUPATION], [OCCUPATION], [OCCUPATION] [OCCUPATION], [OCCUPATION] [OCCUPATION], and so on. Everyone on here is on mute except me, so please use the Q amp a feature. This session is up to an hour long, so if you need to drop off, this webinar is recorded and will be shared with you later on. So it's up to you if you'd like to follow along with my examples or sit back and enjoy the show. So my name is [PERSON_NAME] [PERSON_NAME] and I'm the [OCCUPATION] [OCCUPATION] at [ORGANIZATION]. You're probably wondering what's our [OCCUPATION] [OCCUPATION] I work in our [ORGANIZATION] [ORGANIZATION] [ORGANIZATION] and I serve as an expert on the key use cases and features of the [ORGANIZATION] platform.
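As a rough illustration of the redaction step alone, the sketch below replaces already-detected spans with bracketed labels like the ones in the excerpt above. It does not show how the spans are detected and it is not AssemblyAI's implementation: the `redact` function, the `(start, end, label)` span format, and the sample sentence are all assumptions made for this example, and the spans are assumed not to overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each non-overlapping character span with its [LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep the text before the span
        pieces.append(f"[{label}]")        # substitute the label for the span
        cursor = end
    pieces.append(text[cursor:])           # keep whatever follows the last span
    return "".join(pieces)


if __name__ == "__main__":
    sentence = "My name is Bob Smith and I work at Acme."
    name_start = sentence.index("Bob Smith")
    org_start = sentence.index("Acme")
    spans = [
        (name_start, name_start + len("Bob Smith"), "PERSON_NAME"),
        (org_start, org_start + len("Acme"), "ORGANIZATION"),
    ]
    print(redact(sentence, spans))
    # My name is [PERSON_NAME] and I work at [ORGANIZATION].
```

Note that the excerpt above shows one label per redacted word, while this sketch emits one label per span.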
Content Safety Detection
This model can flag portions of a transcription that contain sensitive content such as hate speech, company financials, gambling, weapons, or violence. The full list of what the AssemblyAI API will flag can be found in the API documentation.
Legacy software that attempts to automate this process takes a "blacklist" approach, which looks for specific words (e.g., "gun") in order to flag sensitive content. This approach is extremely error-prone and brittle because it fails to take slang into account. For example, "That burrito is bomb!" would be flagged for containing the word "bomb" even though the sentence is harmless.
Since AssemblyAI's Content Safety Detection model is built using state-of-the-art AI models, it looks at the entire context of a word or sentence when deciding whether to flag a piece of content. We don't rely on the error-prone blacklist approach outlined above.
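To see why the blacklist approach is brittle, consider this minimal sketch of a keyword matcher. The word list and the `blacklist_flags` function are hypothetical and stand in for the legacy approach only; they say nothing about how the Content Safety Detection model works.

```python
import re
from typing import List

# Hypothetical blacklist for illustration only.
BLACKLIST = {"gun", "bomb"}


def blacklist_flags(text: str) -> List[str]:
    """Return every blacklisted word found in the text, ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word in BLACKLIST]


if __name__ == "__main__":
    # Harmless slang is flagged because the matcher never looks at context.
    print(blacklist_flags("That burrito is bomb!"))      # ['bomb']
    # Sensitive content is missed when it uses a word outside the list.
    print(blacklist_flags("He was carrying a firearm"))  # []
```

Both failure modes come from the same cause: the matcher sees isolated words and never the sentence around them.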
Below, we review the results of AssemblyAI's Content Safety Detection model on the above dataset:
Sentiment Analysis refers to detecting the sentiment of specific speech segments throughout an audio or video file. The goal is to take your audio or video file and label each speech segment with one of three possible outputs: positive, negative, or neutral.
Sentiment Analysis can be a powerful analytics tool that helps companies make better-informed decisions to improve products, customer relations, agent training, and more.
Learn more about the Sentiment Analysis model in the AssemblyAI Documentation.
Enabled by our summarization models, Auto Chapters provides a "summary over time" for audio content transcribed with AssemblyAI's Speech-to-Text API. It works by first breaking audio/video files into logical "chapters" as the topic of conversation changes, and then providing an automatically generated summary for each "chapter" of content.
Behind the Auto Chapters feature is a set of powerful AI Machine Learning models. The first model is able to segment an audio file into "chapters" (i.e., detect when the topic changes), and the second model summarizes those chapters into bite-sized summaries.
Check out the AssemblyAI Documentation to learn more about our Summarization model.
Entity Detection, also referred to as Named Entity Recognition, is a model that can identify and categorize key information in a text or a transcript derived from Automatic Speech Recognition (ASR) technology.
In its most basic form, Entity Detection is a two-step process: (1) identifying the entity and (2) classifying the entity. For example, you might identify the entity as New York City and the category as location or the entity as AssemblyAI and the category as
You can learn more about how this works in the AssemblyAI Documentation. The full list of entity types that can be detected is below:
By collecting this entity information, you empower your company with valuable customer or employee information, regardless of industry. Then, you can perform analytics to better understand customers, adjust marketing campaigns, modify products, and much more.