Media companies like NBC, social platforms like YouTube, and media monitoring solutions like Meltwater use Speech-to-Text technology to power everything from closed captioning to complex content moderation. Developers can now leverage Speech Recognition to unlock more insight into their content quickly and affordably at scale.
In this report, we look at 12 audio/video files from a variety of sources (shown in more detail below), and review how accurately AssemblyAI, AWS Transcribe, and Google Speech-to-Text are able to automatically transcribe them.
In addition to reviewing Speech Recognition accuracy, our research team reviewed the results of AssemblyAI's unique Content Safety, Topic Recognition, and Keyword Detection features on these 12 audio/video files. This report is meant to serve as a point of reference for comparing the best Automated Speech Recognition solutions on the market.
Speech Recognition Accuracy
We scoured the internet for a wide variety of media content, from news broadcasts on current events, to user-generated social media videos on TikTok, to public podcasts on Spotify. The files were chosen at random; our intention is to provide you with a healthy sample of content types from an array of sources as you begin your own comparative analysis.
Here is more about our dataset below:
How We Calculate Accuracy
- First, we transcribe the files in our dataset automatically through APIs (AssemblyAI, Google, and AWS).
- Second, we transcribe the files in our dataset by human transcriptionists—to approximately 100% accuracy.
- Finally, we compare the API's transcription with our human transcription to calculate Word Error Rate (WER)—more below.
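The comparison step can be sketched with a minimal WER implementation. The `wer` function below uses a standard word-level Levenshtein alignment; it is an illustration of the metric, not the exact scoring code used to produce this report.

```python
def wer(truth: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / words in truth."""
    ref, hyp = truth.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("hi my name is bob", "hi my name is rob"))  # one substitution in five words -> 0.2
```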
Below, we outline the accuracy score that each transcription API achieved on each audio file. Each result is hyperlinked to a diff of the human transcript versus each API's automatically generated transcript. This helps to highlight the key differences between the human transcripts and the automated transcripts.
The above accuracy scores were calculated using the Word Error Rate (WER). WER is the industry-standard metric for measuring the accuracy of an Automatic Speech Recognition system. It compares the API-generated transcript to the human transcript for each file, counting the number of insertions, deletions, and substitutions made by the automated system; that error count, divided by the number of words in the human transcript, yields the WER.
Before calculating the WER for a particular file, both the truth (human transcript) and the automated transcript must be normalized into the same format. This is a step teams commonly miss, and skipping it results in misleading WER numbers. That's because, to the WER algorithm, "Hello" and "Hello!" are two distinct words, since one ends with an exclamation point and the other does not. That's why, to perform the most accurate WER analysis, all punctuation and casing is removed from both the human and automated transcripts, and all numbers are converted to their spoken format, as outlined below.
truth -> Hi my name is Bob I am 72 years old.
normalized truth -> hi my name is bob i am seventy two years old
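A minimal normalizer for the example above might look like the sketch below. The digit-to-words helper only covers 0-99, which is enough for the sample; a real pipeline would use a full number-expansion library.

```python
import re

# Minimal digit-to-words helper: covers 0-99, enough for the sample above.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and expand digits into spoken form.
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = [number_to_words(int(w)) if w.isdigit() else w for w in text.split()]
    return " ".join(words)

print(normalize("Hi my name is Bob I am 72 years old."))
# -> hi my name is bob i am seventy two years old
```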
Content Safety Detection
Protecting brands and users from harmful content is crucial to a number of applications and use cases. With AssemblyAI's Content Safety Detection feature, we can flag portions of your transcription that contain sensitive content such as hate speech, gambling, weapons, or violence. The full list of what the AssemblyAI API will flag can be found in the API documentation.
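As a sketch of how this looks in practice: the payload below enables Content Safety on a transcription job (the `content_safety` flag and response shape follow AssemblyAI's public API documentation, while the audio URL and the response excerpt are placeholder values for illustration).

```python
import json

# Payload for a transcription request with Content Safety enabled;
# the audio URL here is a placeholder.
payload = {
    "audio_url": "https://example.com/episode.mp3",
    "content_safety": True,
}

# Illustrative excerpt of the shape of a content-safety result.
sample_response = json.loads("""
{
  "content_safety_labels": {
    "results": [
      {"text": "...", "labels": [{"label": "weapons", "confidence": 0.92}]}
    ]
  }
}
""")

# Collect every flagged label above a confidence threshold.
flagged = [
    label["label"]
    for result in sample_response["content_safety_labels"]["results"]
    for label in result["labels"]
    if label["confidence"] > 0.5
]
print(flagged)  # ['weapons']
```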
To moderate audio and video content on the internet today, large teams of people are required to manually go through this content in order to flag anything that might be abusive or in violation of a platform’s policies. For example, Facebook employs tens of thousands of people to manually review posts to their platform, and to flag those posts that include hate speech.
Legacy software that attempts to automate this process takes a "blacklist" approach, which looks for specific words (e.g., "gun") in order to flag sensitive content. This approach is extremely error-prone and brittle. Take, for example, "That burrito is bomb": a blacklist would flag this harmless slang simply because it contains the word "bomb".
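A toy blacklist filter makes the problem concrete: it flags any text containing a listed word, with no notion of context, so harmless slang gets caught.

```python
# A naive blacklist filter, shown only to illustrate its brittleness.
BLACKLIST = {"gun", "bomb"}

def blacklist_flag(text: str) -> bool:
    # Flags any text containing a blacklisted word, regardless of context.
    return any(word in BLACKLIST for word in text.lower().split())

print(blacklist_flag("That burrito is bomb"))  # True -- a false positive
print(blacklist_flag("He pulled out a gun"))   # True
```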
Since AssemblyAI's Content Safety Detection model is built on state-of-the-art deep learning, it looks at the entire context of a word/sentence when deciding whether to flag a piece of content - we don't rely on error-prone blacklist approaches.
Below we review the results of AssemblyAI's Content Safety Detection feature on the above dataset:
In addition to Content Safety Detection, we reviewed the topics and keywords detected by AssemblyAI's Topic Detection feature. Developers can use the Topic Detection feature to understand the topics discussed in their audio/video files. This information can be useful to drive better indexing of content, recommendations, and even to aid with ad targeting. Our Topic Detection feature uses the IAB Taxonomy, and can classify transcription text with up to 698 possible IAB categories (e.g., "Automobiles > Self Driving Cars").
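A sketch of enabling Topic Detection and reading back IAB labels: `iab_categories` is the documented request flag, while the audio URL and the response excerpt below are illustrative placeholders.

```python
import json

# Payload for a transcription request with Topic Detection enabled;
# the audio URL here is a placeholder.
payload = {
    "audio_url": "https://example.com/episode.mp3",
    "iab_categories": True,
}

# Illustrative excerpt of the shape of a topic-detection summary,
# mapping each IAB category to a relevance score.
sample_response = json.loads("""
{
  "iab_categories_result": {
    "summary": {
      "Automotive>SelfDrivingCars": 0.87,
      "Technology&Computing": 0.42
    }
  }
}
""")

# Keep topics the model is reasonably confident about, highest score first.
summary = sample_response["iab_categories_result"]["summary"]
topics = [
    topic
    for topic, score in sorted(summary.items(), key=lambda kv: kv[1], reverse=True)
    if score > 0.5
]
print(topics)  # ['Automotive>SelfDrivingCars']
```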
Our models can also automatically extract keywords and phrases from the transcription text. Below is an example of how the model works on a small sample of transcription text.
"Hi I'm joy. Hi I'm Sharon. Do you have kids in school? I have grandchildren in school. Okay, well, my kids are in middle school in high school. Do you think there is anything wrong with the school system Overcrowding, of course, ..."
"high school" "middle school" "kids" ...
Using the same dataset, we included the topics and keywords automatically detected in each file below:
Benchmark Your Own Data
Benchmarking accuracy across providers on your own content takes both time and money. We offer complimentary benchmark reports for any team seeking a transcription solution. To get started with a complimentary benchmark report, you can go here.