June 15, 2021

Speech-to-Text Accuracy on Podcasts, News Broadcasts, and Social Media

In this report, we look at 12 different audio/video files from various sources and review how accurately AssemblyAI, AWS Transcribe, and Google Speech-to-Text automatically transcribe each file.

Automatic Speech Recognition

Joe Zaghloul


Media companies like NBC, social platforms like YouTube, and media monitoring solutions like Meltwater use Speech-to-Text technology to power everything from closed captioning to complex content moderation. Developers can now leverage Speech Recognition to unlock more insight into their content quickly and affordably at scale.

In this report, we look at 12 different audio/video files from various sources (shown in more detail below) and review how accurately AssemblyAI, AWS Transcribe, and Google Speech-to-Text automatically transcribe these files.

In addition to reviewing Speech Recognition accuracy, our research team reviewed the results of AssemblyAI's unique Content Safety, Topic Recognition, and Keyword Detection features on these 12 audio/video files. This report is meant to serve as a point of reference for comparing the best Automated Speech Recognition solutions on the market.

Speech Recognition Accuracy

Our Dataset

We scoured the internet for a wide variety of media content, from news broadcasts on current events, to user-generated social media videos on TikTok, to public podcasts on Spotify. We chose the files at random, with the intention of providing you with a healthy sample of content types from an array of sources as you begin your own comparative analysis.

Here is more about our dataset below:

How We Calculate Accuracy

First, we transcribe the files in our dataset automatically through APIs (AssemblyAI, Google, and AWS).
Second, human transcriptionists transcribe the same files to approximately 100% accuracy.
Finally, we compare each API's transcription with our human transcription to calculate the Word Error Rate (WER), which is described in more detail below.

Below, we outline the accuracy score that each transcription API achieved on each audio file. Each result is hyperlinked to a diff of the human transcript versus each API's automatically generated transcript. This helps to highlight the key differences between the human transcripts and the automated transcripts.

Accuracy Averages

WER Methodology

The above accuracy scores were calculated using the Word Error Rate (WER), the industry-standard metric for measuring the accuracy of an Automatic Speech Recognition system. For each file, WER compares the API-generated transcription to the human transcription, counting the insertions, deletions, and substitutions made by the automated system.
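As a rough illustration of the counting described above, here is a minimal sketch of a WER calculation. It assumes the standard formulation, in which the total number of substitutions, deletions, and insertions is divided by the number of words in the human transcript; the function name `word_error_rate` is our own for this example, and this is not necessarily the exact code used to produce the scores in this report.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the human transcript, `hypothesis` the API transcript.
    Both are assumed to be normalized already and are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("hi my name is bob", "hi my name is rob"))
```

In this example one of the five reference words is substituted, so the WER is 1/5. Note that WER can exceed 1.0 when the automated transcript contains many extra words.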

Before calculating the WER for a particular file, both the truth (human transcriptions) and the automated transcriptions must be normalized into the same format. Teams commonly miss this step, which results in misleading WER numbers. That's because, according to the WER algorithm, "Hello" and "Hello!" are two distinct words, since one has an exclamation mark and the other does not. That's why, to perform the most accurate WER analysis, all punctuation and casing are removed from both the human and automated transcripts, and all numbers are converted to their spoken form.

For example:

truth -> Hi my name is Bob I am 72 years old.
normalized truth -> hi my name is bob i am seventy two years old
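The normalization step can be sketched along these lines. This is a simplified illustration rather than the normalizer used for this report: the helper names `spell_number` and `normalize` are hypothetical, and the sketch only spells out standalone whole numbers from 0 to 99, leaving larger numbers, decimals, and ordinals untouched.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer from 0 to 99 as space-separated words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and spell out small whole numbers."""
    text = text.lower()
    # Drop apostrophes so contractions stay a single token ("i'm" -> "im").
    text = re.sub(r"['’]", "", text)
    # Replace every other punctuation character with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Spell out standalone one- or two-digit numbers.
    text = re.sub(r"\b\d{1,2}\b", lambda m: spell_number(int(m.group())), text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Hi my name is Bob I am 72 years old."))
```

Applying the same function to both the human and the automated transcript is what matters: after normalization, "Hello" and "Hello!" both become "hello" and no longer count as an error.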

Content Safety Detection

Protecting brands and users from harmful content is crucial to a number of applications and use cases. With AssemblyAI's Content Safety Detection feature, we can flag portions of your transcription that contain sensitive content such as hate speech, gambling, weapons, or violence. The full list of what the AssemblyAI API will flag can be found in the API documentation.

To moderate audio and video content on the internet today, large teams of people are required to manually go through this content in order to flag anything that might be abusive or in violation of a platform’s policies. For example, Facebook employs tens of thousands of people to manually review posts to their platform, and to flag those posts that include hate speech.

Legacy software that attempts to automate this process takes a "blacklist" approach, which looks for specific words (e.g., "gun") in order to flag sensitive content. This approach is extremely error-prone and brittle. Take, for example, "That burrito is bomb": a blacklist would flag the word "bomb" even though the sentence is harmless praise for a burrito.
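To make the failure mode concrete, here is a minimal sketch of a word-list filter. The `BLACKLIST` contents and the `blacklist_flags` function are hypothetical and written only for this illustration; they do not represent any particular product.

```python
# Hypothetical word list for illustration only.
BLACKLIST = {"gun", "bomb"}


def blacklist_flags(text: str) -> set:
    """Return the blacklisted words that appear in `text`, ignoring context."""
    words = {word.strip(".,!?\"'").lower() for word in text.split()}
    return words & BLACKLIST


if __name__ == "__main__":
    # A harmless sentence is flagged because it contains a listed word.
    print(blacklist_flags("That burrito is bomb"))
    # A sentence about a weapon is missed because it uses an unlisted word.
    print(blacklist_flags("The suspect was armed with a rifle"))
```

The first sentence is flagged even though it is benign, while the second passes unflagged because "rifle" is not on the list. A filter like this matches words without any regard for the context around them.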

Since AssemblyAI's Content Safety Detection model is built using state-of-the-art Deep Learning models, it looks at the entire context of a word or sentence when deciding whether to flag a piece of content. We don't rely on error-prone blacklist approaches.

Below we review the results of AssemblyAI's Content Safety Detection feature on the above dataset:

Topic Detection

In addition to Content Safety Detection, we reviewed the topics and keywords detected by AssemblyAI's Topic Detection feature. Developers can use the Topic Detection feature to understand the topics discussed in their audio/video files. This information can be used to drive better indexing of content and recommendations, and even to aid with ad targeting. Our Topic Detection feature uses the IAB Taxonomy, and can classify transcription text into up to 698 possible IAB categories (e.g., "Automobiles > Self Driving Cars").

Keyword Detection

Our models can also automatically extract keywords and phrases from the transcription text. Below is an example of how the model works on a small sample of transcription text.

Original transcription:

"Hi I'm joy. Hi I'm Sharon. Do you have kids in school? I have grandchildren in school. Okay, well, my kids are in middle school in high school. Do you think there is anything wrong with the school system Overcrowding, of course, ..."

Detected keywords:

"high school" "middle school" "kids" ...

Using the same dataset, we included the topics and keywords automatically detected in each file below:

Benchmark Your Own Data

Benchmarking accuracy amongst providers takes both time and money to run on your content. We offer complimentary benchmark reports for any team seeking a transcription solution. To get started with a complimentary benchmark report, you can go here.
