Industry’s most accurate Speech AI models

Examine the performance of our Speech AI models across key metrics, including word accuracy, word error rate, and hallucination rate.

Highest Word Accuracy Rate

AssemblyAI’s Universal model leads in accuracy and is up to 40% more accurate than other speech-to-text models.

| Language | AssemblyAI Universal-3 Pro | Amazon Transcribe | ElevenLabs Scribe V2 | Microsoft Batch Transcription | Deepgram Nova 3 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| English | 94.1% | 92.4% | 93.5% | 92.5% | 92.1% | 92.4% |
| Multilingual | 91.3% | 89.9% | 91.9% | 88.9% | 89.2% | 92.6% |
Average across all datasets - Updated February 2026

Lowest Word Error Rate

A low word error rate is critical to building successful AI applications around voice data, including summaries, customer insights, metadata tagging, and action items.
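
For context, word error rate (WER) is the fraction of reference words a model gets wrong: WER = (substitutions + deletions + insertions) / reference word count, and word accuracy is simply 100% minus WER. A minimal sketch of the arithmetic (the error counts below are illustrative, not benchmark data):

```python
# Minimal sketch of the WER / word-accuracy relationship.
# The error counts below are illustrative, not taken from the benchmark.
def word_error_rate(subs: int, dels: int, ins: int, ref_words: int) -> float:
    """WER = (S + D + I) / N, where N is the reference word count."""
    return (subs + dels + ins) / ref_words

# Example: 100 reference words with 4 substitutions, 1 deletion, 1 insertion.
wer = word_error_rate(subs=4, dels=1, ins=1, ref_words=100)
print(f"WER: {wer:.1%}, word accuracy: {1 - wer:.1%}")  # WER: 6.0%, word accuracy: 94.0%
```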

| Language | AssemblyAI Universal-3 Pro | Amazon Transcribe | ElevenLabs Scribe V2 | Microsoft Batch Transcription | Deepgram Nova 3 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| English | 5.9% | 7.6% | 6.5% | 7.5% | 8.1% | 6.5% |
| Multilingual | 8.7% | 10.1% | 8.1% | 11.1% | 10.8% | 7.4% |
Average across all datasets - Updated February 2026

Consecutive Error Types per Audio Hour

Universal shows a 30% reduction in hallucination rates compared to Whisper Large-v3. We define hallucinations as five or more consecutive insertions, substitutions, or deletions.
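
As a rough sketch of how such events can be counted, assume a per-word alignment of each transcript against its reference is already available (for example, from jiwer's process_words output); the op labels and helper below are illustrative, not AssemblyAI's implementation:

```python
# Hedged sketch: count runs of consecutive errors in a word-level alignment.
# `ops` is assumed to hold one label per aligned position: "ok" for a match,
# or "ins" / "sub" / "del" for an error (illustrative labels, not a real API).
from itertools import groupby

def consecutive_error_events(ops: list[str], threshold: int = 5) -> int:
    """Count runs of >= threshold consecutive non-matching operations."""
    events = 0
    for is_error, run in groupby(ops, key=lambda op: op != "ok"):
        if is_error and sum(1 for _ in run) >= threshold:
            events += 1
    return events

# A six-word run of insertions counts as one event; shorter runs count as none.
print(consecutive_error_events(["ok"] + ["ins"] * 6 + ["ok"]))  # 1
print(consecutive_error_events(["ok", "sub", "sub", "ok"]))     # 0
```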

| Metric | AssemblyAI Universal-3 Pro | OpenAI Whisper |
| --- | --- | --- |
| Fabrications | 6.6% | 7.9% |
| Omissions | 5.3% | 5.5% |
| Hallucinations | 7.3% | 7.8% |
Average across all datasets - Updated February 2026

Hallucination Examples

Ground-truth transcripts compared against each model's output. The examples below show Whisper producing hallucinated text where Universal stays close to the reference.

| Ground truth | AssemblyAI Universal-3 Pro | OpenAI Whisper |
| --- | --- | --- |
| her jewelry shimmered | her jewelry shimmering | hadja luis sima addjilu sime subtitles by the amara org community |
| the Taebaek mountain chain is often considered the backbone of the Korean Peninsula | the Taebaek mountain chain is often considered the backbone of the Korean Peninsula | the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루 |
| the englishman said nothing | the englishman said nothing | does that mean we should not have interessant n |
| not in a month of sundays | not in a month of sundays | this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant |

English Word Error Rate per dataset

| Dataset | AssemblyAI Universal-3 Pro | Amazon Transcribe | ElevenLabs Scribe V2 | Microsoft Batch Transcription | Deepgram Nova 3 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice | 4.13% | 5.16% | 5.38% | 7.76% | 10.45% | 8.52% |
| Meanwhile | 3.58% | 5.05% | 2.91% | 6.94% | 5.28% | 4.51% |
| Noisy | 9.97% | 24.73% | 13.72% | 14.26% | 14.12% | 11.63% |
| Podcast | 6.65% | 11.23% | 10.90% | 11.37% | 10.23% | 10.32% |
| Rev16 | 7.93% | 11.30% | 10.08% | 11.23% | 10.81% | 11.61% |
| LibriSpeech Clean | 1.46% | 2.05% | 2.17% | 2.32% | 2.56% | 2.28% |
| LibriSpeech Test-Other | 2.56% | 4.30% | 3.05% | 5.07% | 5.48% | 4.64% |
| Broadcast (internal) | 4.24% | 5.33% | 7.30% | 6.06% | 5.85% | 4.75% |
| Earnings 2021 | 9.70% | 8.37% | 6.61% | 7.82% | 11.38% | 9.87% |
| Webinar | 5.51% | 10.12% | 9.78% | 10.07% | 9.54% | 6.99% |
| Tedlium | 7.22% | 6.18% | 6.03% | 6.60% | 6.36% | 8.70% |
| Average | 5.72% | 8.14% | 7.08% | 8.14% | 8.38% | 7.45% |
Updated February 2026

Multilingual Word Error Rate per dataset

| Dataset | AssemblyAI Universal-3 Pro | Amazon Transcribe | ElevenLabs Scribe V2 | Microsoft Batch Transcription | Deepgram Nova 3 | OpenAI Whisper |
| --- | --- | --- | --- | --- | --- | --- |
| CommonVoice | 6.80% | 7.75% | 4.22% | 10.10% | 11.09% | 8.46% |
| Fleurs | 6.09% | 8.72% | 6.20% | 9.91% | 10.60% | 5.88% |
| Voxpopuli | 13.65% | 10.20% | 13.80% | 12.49% | 13.00% | 12.90% |
| Private Call Center | 12.58% | 16.24% | 10.76% | 17.99% | 16.59% | 9.92% |
| Average | 9.78% | 10.73% | 8.75% | 12.62% | 12.82% | 9.29% |
Updated February 2026

Benchmark Report Methodology

250+ hours of audio data
80,000+ audio files
26 datasets
Datasets

This benchmark was performed using multiple open-source datasets (CommonVoice, Fleurs, Voxpopuli) and additional in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse mix of English and multilingual audio spanning phone calls, broadcasts, accented speech, and heavy jargon.

Methodology

We measured the performance of each provider across the datasets above. For each vendor, we made API calls to their most accurate model for each file; for Whisper, we generated outputs using a self-hosted instance of Whisper Large-v3. After receiving results for each file from each vendor, we normalized both the model's prediction and the ground-truth transcript using the open-source Whisper Normalizer, then calculated metrics for each file and averaged them across each dataset.
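
As a rough illustration of that scoring loop (not AssemblyAI's exact harness), the sketch below normalizes one prediction/reference pair with the openai-whisper package's EnglishTextNormalizer and scores it with the jiwer library; the texts are illustrative:

```python
# Sketch of per-file scoring: normalize both texts, then compute WER.
# Requires: pip install openai-whisper jiwer
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

def file_wer(reference: str, hypothesis: str) -> float:
    """WER for one file, after normalizing ground truth and prediction alike."""
    return jiwer.wer(normalizer(reference), normalizer(hypothesis))

# Illustrative pair: normalization removes casing/punctuation differences.
ref = "The Englishman said nothing."
hyp = "the englishman said nothing"
print(f"{file_wer(ref, hyp):.1%}")  # 0.0%

# Dataset-level WER: average the per-file scores.
pairs = [(ref, hyp)]  # (ground_truth, prediction) for each file
print(sum(file_wer(r, h) for r, h in pairs) / len(pairs))
```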

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.