Industry’s most accurate Speech AI models

Examine the performance of our Speech AI models across key metrics including accuracy, word error rate, and more.

Highest Word Accuracy Rate

AssemblyAI’s Universal model leads in accuracy, and is up to 40% more accurate than other speech-to-text models.

Language	AssemblyAI Universal	Amazon Transcribe	Google Latest-long	Microsoft Azure Batch v3.1	Deepgram Nova 3	OpenAI Whisper
English	93.4%	90.9%	85.81%	91.8%	91.4%	92.1%
Spanish	94.7%	93.8%	88.2%	92.9%	93.3%	94.4%
German	92.7%	91.8%	90.4%	91.3%	89.2%	90.8%

Average across all datasets - Updated October 2025

Lowest Word Error Rate

Fewer transcription errors are critical to building successful AI applications on voice data, including summaries, customer insights, metadata tagging, and action items.

Language	AssemblyAI Universal	Amazon Transcribe	Google Latest-long	Microsoft Azure Batch v3.1	Deepgram Nova 3	OpenAI Whisper
English	6.6%	9.0%	14.2%	8.6%	8.6%	7.9%
Spanish	5.3%	6.2%	11.7%	7.1%	6.7%	5.5%
German	7.3%	8.2%	13.8%	8.7%	10.8%	9.2%

Average across all datasets - Updated October 2025

Consecutive Error Types per Audio Hour

Universal shows a 30% reduction in hallucination rates compared to Whisper Large-v3. We define hallucinations as five or more consecutive insertions, substitutions, or deletions.
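The definition above can be sketched in code. This is a minimal illustration, not the benchmark's actual implementation: it aligns reference and predicted words with a standard Levenshtein alignment, then counts runs of five or more consecutive non-matching operations. The function names `align_ops` and `count_error_runs` are hypothetical, and treating mixed runs of insertions, substitutions, and deletions as a single run is an assumption about how the definition is applied.

```python
from typing import List


def align_ops(ref: List[str], hyp: List[str]) -> List[str]:
    """Return the edit operations ('ok', 'sub', 'ins', 'del') of one
    minimum-cost word-level Levenshtein alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append("ok" if cost == 0 else "sub")
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append("ins")
            j -= 1
        else:
            ops.append("del")
            i -= 1
    ops.reverse()
    return ops


def count_error_runs(ops: List[str], min_run: int = 5) -> int:
    """Count maximal runs of at least min_run consecutive error operations.
    Assumption: insertions, substitutions, and deletions may be mixed in a run."""
    runs = 0
    current = 0
    for op in ops:
        if op == "ok":
            if current >= min_run:
                runs += 1
            current = 0
        else:
            current += 1
    if current >= min_run:
        runs += 1
    return runs
```

Dividing the total run count by the number of audio hours processed would give a per-hour rate like the one named in the heading. Because several minimum-cost alignments can exist for the same pair of transcripts, run counts may differ slightly between alignment implementations.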

Metrics	AssemblyAI Universal	OpenAI Whisper
English	6.6%	7.9%
Omissions	5.3%	5.5%
Hallucinations	7.3%	7.8%

Average across all datasets - Updated October 2025

Ground-truth	AssemblyAI Universal	OpenAI Whisper
her jewelry shimmered	her jewelry shimmering	hadja luis sima addjilu sime subtitles by the amara org community
the taebaek mountain chain is often considered the backbone of the korean peninsula	the tabet mountain chain is often considered the backbone of the korean venezuela	the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루
the englishman said nothing	there's an englishman said nothing	does that mean we should not have interessant n
not in a month of sundays	marine a month of sundays	this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant

English Word Error Rate per dataset

Dataset	AssemblyAI Universal	Amazon Transcribe	Google Latest-long	Microsoft Azure Batch v3.1	Deepgram Nova 3	OpenAI Whisper
CommonVoice v5.1	6.48%	6.38%	17.42%	7.76%	10.45%	8.51%
Meanwhile	4.41%	6.13%	11.75%	6.96%	5.28%	4.10%
Noisy	10.43%	24.73%	26.94%	14.26%	14.12%	11.98%
Podcast	10.30%	11.23%	15.87%	11.24%	11.00%	11.15%
Telephony (internal)	8.56%	11.30%	16.06%	11.38%	10.08%	10.24%
LibriSpeech Clean	1.68%	2.05%	5.56%	2.32%	2.56%	2.29%
LibriSpeech Test-Other	3.00%	4.30%	11.58%	5.07%	5.48%	4.64%
Broadcast (internal)	4.39%	5.59%	8.73%	6.06%	5.85%	4.84%
Earnings 2021	9.37%	8.37%	14.85%	7.82%	11.45%	9.77%
Webinar	5.85%	10.12%	12.18%	10.06%	9.54%	6.87%
TEDLIUM	7.30%	6.18%	8.87%	6.60%	6.37%	7.58%
Average	6.52%	8.76%	13.62%	8.14%	8.38%	7.45%

Last updated in October 2025
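The Average row in the English table appears consistent with an unweighted mean of the per-dataset word error rates. The sketch below checks this for two columns, using values copied from the table; the helper name `macro_average` is hypothetical, and whether the benchmark computes its averages this way is an assumption.

```python
# Per-dataset English WER (%) copied from the table above, in row order.
english_wer = {
    "AssemblyAI Universal": [6.48, 4.41, 10.43, 10.30, 8.56, 1.68,
                             3.00, 4.39, 9.37, 5.85, 7.30],
    "OpenAI Whisper": [8.51, 4.10, 11.98, 11.15, 10.24, 2.29,
                       4.64, 4.84, 9.77, 6.87, 7.58],
}


def macro_average(values):
    """Unweighted mean: every dataset counts equally, regardless of its size."""
    return sum(values) / len(values)


averages = {name: round(macro_average(v), 2) for name, v in english_wer.items()}
print(averages)  # {'AssemblyAI Universal': 6.52, 'OpenAI Whisper': 7.45}
```

An unweighted mean gives a small dataset the same influence as a large one. A mean weighted by file count or audio duration would produce different figures.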

Spanish Word Error Rate per dataset

Dataset	AssemblyAI Universal	Amazon Transcribe	Google Latest-long	Microsoft Azure Batch v3.1	Deepgram Nova 2	OpenAI Whisper
CommonVoice v9	3.96%	4.29%	7.10%	6.38%	7.34%	4.66%
Voxpopuli	8.22%	8.67%	17.4%	8.89%	8.75%	8.63%
Fleurs	3.65%	7.06%	12.1%	4.77%	5.81%	2.84%
Multilingual LS	3.94%	4.13%	9.89%	6.07%	4.26%	4.34%
Average	5.27%	6.21%	11.76%	7.12%	6.70%	5.56%

Updated in October 2025

German Word Error Rate per dataset

Dataset	AssemblyAI Universal	Amazon Transcribe	Google Latest-long	Microsoft Azure Batch v3.1	Deepgram Nova 2	OpenAI Whisper
CommonVoice v9	4.18%	5.66%	10.27%	6.04%	9.19%	5.88%
Private	8.20%	10.60%	13.21%	9.23%	10.34%	8.53%
Multilingual LS	3.75%	3.18%	14.11%	4.55%	7.24%	6.19%
Voxpopuli	12.64%	14.66%	10.70%	12.50%	14.80%	11.18%
Fleurs	7.95%	11.34%	5.00%	8.79%	11.47%	7.31%
Average	7.34%	8.21%	9.64%	8.65%	11.58%	9.21%

Updated in October 2025

Benchmark Report Methodology

250+ hours of audio data

80,000+ audio files

26 datasets

Datasets

This benchmark was performed using 3 open-source datasets (LibriSpeech, Rev16, and Meanwhile) and 4 in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio data covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse set of English audio that spans phone calls, broadcasts, accented speech, and heavy jargon.

Methodology

We measured the performance of each provider on 7 datasets. For each vendor, we made API calls to their most accurate model for each file, and for Whisper we generated outputs using a self-hosted instance of Whisper-Large-V3. After receiving results for each file from each vendor, we normalized both the prediction from the model and the ground truth transcript using the open-source Whisper Normalizer. From there, we calculated the average metrics for each file across datasets to measure performance.
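The scoring steps described above (normalize, compare, average) can be sketched as follows. This is an illustration under stated assumptions, not the benchmark code: `normalize` is a simplified stand-in and not the Whisper Normalizer, the names `word_errors`, `file_wer`, and `dataset_wer` are hypothetical, and averaging per-file WER within a dataset is one reading of the description.

```python
import re
from statistics import mean


def normalize(text: str) -> str:
    """Simplified stand-in normalizer (assumption): lowercase, replace
    punctuation other than apostrophes with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance: the minimum number of substitutions,
    deletions, and insertions that turn ref_words into hyp_words."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            curr.append(min(
                prev[j - 1] + (r != h),  # match or substitution
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        prev = curr
    return prev[-1]


def file_wer(reference: str, prediction: str) -> float:
    """WER for one file: errors divided by the number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(prediction).split()
    if not ref:
        raise ValueError("reference transcript is empty after normalization")
    return word_errors(ref, hyp) / len(ref)


def dataset_wer(pairs) -> float:
    """Mean per-file WER over (reference, prediction) pairs."""
    return mean(file_wer(ref, hyp) for ref, hyp in pairs)
```

Word accuracy is then `1 - WER`. Averaging per file weights short and long files equally. The other common choice, dividing total errors by total reference words across a dataset, weights files by length and can give a different result.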

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.