Industry’s most accurate Speech AI models

Examine the performance of our Speech AI models across key metrics, including word accuracy rate, word error rate, and hallucination rate.

Highest Word Accuracy Rate

AssemblyAI’s Universal-1 model leads in accuracy and is up to 40% more accurate than other speech-to-text models.


Word accuracy rate by language (average across all datasets)
Language | AssemblyAI Universal-1 | OpenAI Whisper Large-v3 | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | Amazon Transcribe | Google Latest-long
English | 92.7% | 91.6% | 90.6% | 90.5% | 89.4% | 86.4%
Spanish | 95.2% | 94.0% | 92.8% | 92.4% | 94.2% | 91.3%
German | 92.5% | 91.6% | 91.1% | 88.8% | 91.1% | 87.3%

Lowest Word Error Rate

Fewer transcription errors are critical to building successful AI applications on voice data, including summaries, customer insights, metadata tagging, and action items.


Word error rate by language (average across all datasets)
Language | AssemblyAI Universal-1 | OpenAI Whisper Large-v3 | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | Amazon Transcribe | Google Latest-long
English | 7.3% | 8.4% | 9.4% | 9.5% | 10.6% | 13.6%
Spanish | 4.8% | 6.0% | 7.2% | 7.6% | 5.8% | 8.7%
German | 7.5% | 8.4% | 8.9% | 11.2% | 8.9% | 12.7%
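Word error rate (WER) is the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn a model's transcript into the reference transcript, divided by the number of reference words (N): WER = (S + D + I) / N. The Python sketch below computes it with a word-level edit distance. It assumes both texts are already normalized and split on whitespace, and it illustrates the metric rather than reproducing AssemblyAI's scoring code. In the word accuracy table, each figure equals 100% minus the corresponding WER figure, so word accuracy is computed here as 1 − WER. That value can go negative when an output contains many inserted words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed as word-level Levenshtein distance / N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
                prev[j] + 1,             # deletion: reference word missing from output
                curr[j - 1] + 1,         # insertion: extra word in output
            )
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Word accuracy as 1 - WER, matching how the accuracy and WER tables relate."""
    return 1.0 - word_error_rate(reference, hypothesis)


# Ground-truth / Universal-1 pairs taken from the examples on this page
pairs = [
    ("her jewelry shimmered", "her jewelry shimmering"),
    ("the englishman said nothing", "there's an englishman said nothing"),
]
for reference, hypothesis in pairs:
    print(f"{word_error_rate(reference, hypothesis):.1%}")  # 33.3%, then 50.0%
```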

Consecutive Error Types per Audio Hour

Universal-1 has a 30% lower hallucination rate than Whisper Large-v3 (12.9% vs. 18.4%). We define a hallucination as five or more consecutive insertions, substitutions, or deletions.
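The hallucination definition can be sketched in Python in three steps. First, align the ground truth and the model output word by word using a minimum-edit-distance alignment. Next, label each aligned position as correct (C), substitution (S), deletion (D), or insertion (I). Finally, count runs of five or more consecutive errors. The sketch makes three assumptions that the page does not confirm:

- A run may mix error types.
- Ties between equally cheap alignments are broken by preferring diagonal moves. A different tie-break can shift run boundaries.
- Runs are not split into fabrications and omissions.

```python
from typing import List


def align_ops(ref: List[str], hyp: List[str]) -> List[str]:
    """Align two word lists with minimum edit distance and return one op per
    aligned position: 'C' correct, 'S' substitution, 'D' deletion, 'I' insertion."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Walk back from the bottom-right cell, preferring diagonal moves on ties.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                ops.append("C" if cost == 0 else "S")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    ops.reverse()
    return ops


def count_error_runs(ops: List[str], min_run: int = 5) -> int:
    """Count maximal runs of at least `min_run` consecutive non-'C' ops
    (any mix of S, D, and I)."""
    runs, current = 0, 0
    for op in ops + ["C"]:  # trailing sentinel closes a run at the end
        if op == "C":
            if current >= min_run:
                runs += 1
            current = 0
        else:
            current += 1
    return runs


# First ground-truth sentence from the examples on this page
truth = "her jewelry shimmered"
outputs = {
    "Universal-1": "her jewelry shimmering",
    "Whisper Large-v3": "hadja luis sima addjilu sime subtitles by the amara org community",
}
for model, text in outputs.items():
    print(model, count_error_runs(align_ops(truth.split(), text.split())))
# Universal-1 0, then Whisper Large-v3 1
```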


Consecutive error rates by type (average across all datasets)
Error type | AssemblyAI Universal-1 | OpenAI Whisper Large-v3
Fabrications | 5.2% | 8.8%
Omissions | 5.6% | 7.1%
Hallucinations | 12.9% | 18.4%
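The 30% figure can be checked arithmetically. It matches the relative reduction in the Hallucinations row: (18.4 − 12.9) / 18.4 ≈ 0.30. The Python snippet below applies the same formula to all three rows of the table. The `relative_reduction` helper is illustrative only.

```python
# Consecutive-error rates from the table above (percent): (Universal-1, Whisper Large-v3)
rates = {
    "Fabrications": (5.2, 8.8),
    "Omissions": (5.6, 7.1),
    "Hallucinations": (12.9, 18.4),
}


def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


for metric, (universal_1, whisper) in rates.items():
    print(f"{metric}: {relative_reduction(universal_1, whisper):.0%} lower")
# Fabrications: 41% lower
# Omissions: 21% lower
# Hallucinations: 30% lower
```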

See the difference

The examples below compare Universal-1 and Whisper Large-v3 transcripts with the ground truth, showing the hallucinations Whisper Large-v3 can produce.

Example 1
Ground truth: her jewelry shimmered
AssemblyAI Universal-1: her jewelry shimmering
OpenAI Whisper Large-v3: hadja luis sima addjilu sime subtitles by the amara org community

Example 2
Ground truth: the taebaek mountain chain is often considered the backbone of the korean peninsula
AssemblyAI Universal-1: the tabet mountain chain is often considered the backbone of the korean venezuela
OpenAI Whisper Large-v3: the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루

Example 3
Ground truth: the englishman said nothing
AssemblyAI Universal-1: there's an englishman said nothing
OpenAI Whisper Large-v3: does that mean we should not have interessant n

Example 4
Ground truth: not in a month of sundays
AssemblyAI Universal-1: marine a month of sundays
OpenAI Whisper Large-v3: this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant

English Word Error Rate per dataset

Dataset | AssemblyAI Universal-2 | AssemblyAI Universal-1 | OpenAI Whisper Large-v3 | Deepgram Nova 2 | Microsoft Azure Batch v3.1 | Amazon Transcribe | Google Latest-long
CommonVoice v5.1 | 7.08% | 7.32% | 8.83% | 12.43% | 7.81% | 8.98% | 17.59%
Meanwhile | 4.77% | 4.58% | 9.75% | 5.56% | 6.73% | 7.27% | 11.67%
Noisy | 10.19% | 10.31% | 11.86% | 16.03% | 16.13% | 27.62% | 26.63%
Podcast | 8.25% | 8.48% | 8.93% | 9.12% | 9.74% | 10.55% | 14.22%
Telephony (internal) | 11.02% | 10.78% | 12.55% | 13.22% | 16.04% | 16.08% | 23.83%
LibriSpeech Clean | 1.71% | 1.95% | 2.22% | 3.13% | 2.74% | 2.87% | 6.46%
LibriSpeech Test-Other | 3.27% | 3.49% | 4.09% | 7.36% | 6.15% | 6.64% | 13.20%
Broadcast (internal) | 4.06% | 4.87% | 4.55% | 5.53% | 6.01% | 5.92% | 8.61%
Earnings 2021 | 9.39% | 9.80% | 9.55% | 11.05% | 7.58% | 8.33% | 14.56%
CORAAL | 16.74% | 17.92% | 19.08% | 16.86% | 17.97% | 19.62% | 33.71%
TEDLIUM | 7.02% | 7.25% | 7.30% | 8.98% | 9.27% | 9.12% | 11.69%
Average | 7.59% | 7.89% | 8.97% | 9.93% | 9.65% | 11.18% | 16.56%
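The Average row matches a plain unweighted mean of the per-dataset rows. Every dataset counts equally, regardless of its size. The Spanish and German Average rows agree with the same calculation to within about 0.1 percentage point. The short Python check below recomputes two of the English averages. The unweighted-mean reading is inferred from the numbers rather than stated in the methodology.

```python
from statistics import fmean

# Per-dataset English WER (%) for two columns of the table above, in table row order
english_wer = {
    "Universal-2": [7.08, 4.77, 10.19, 8.25, 11.02, 1.71, 3.27, 4.06, 9.39, 16.74, 7.02],
    "Universal-1": [7.32, 4.58, 10.31, 8.48, 10.78, 1.95, 3.49, 4.87, 9.80, 17.92, 7.25],
}


def average_across_datasets(per_dataset_wer):
    """Unweighted mean: each dataset contributes equally, whatever its hours or file count."""
    return fmean(per_dataset_wer)


for model, values in english_wer.items():
    print(f"{model}: {average_across_datasets(values):.2f}%")
# Universal-2: 7.59%
# Universal-1: 7.89%
```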

Spanish Word Error Rate per dataset

Dataset | AssemblyAI Universal-1 | OpenAI Whisper Large-v3 | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | Amazon Transcribe | Google Latest-long
CommonVoice v9 | 3.4% | 5.0% | 7.5% | 8.0% | 4.7% | 7.1%
Private | 4.6% | 6.6% | 8.9% | 8.2% | 6.2% | 13.7%
Multilingual LS | 3.3% | 5.7% | 5.7% | 5.8% | 3.4% | 7.2%
Voxpopuli | 7.5% | 9.6% | 8.8% | 9.3% | 8.6% | 10.7%
Fleurs | 5.0% | 2.8% | 4.9% | 7.0% | 6.2% | 5.0%
Average | 4.8% | 6.0% | 7.2% | 7.6% | 5.8% | 8.7%

German Word Error Rate per dataset

Dataset | AssemblyAI Universal-1 | OpenAI Whisper Large-v3 | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | Amazon Transcribe | Google Latest-long
CommonVoice v9 | 4.2% | 6.0% | 7.4% | 9.4% | 5.8% | 10.1%
Private | 7.4% | 8.7% | 9.5% | 11.1% | 9.7% | 12.1%
Voxpopuli | 12.2% | 12.6% | 14.7% | 15.1% | 16.8% | 17.4%
Fleurs | 10.2% | 7.5% | 7.2% | 12.4% | 9.0% | 12.1%
Multilingual LS | 3.6% | 7.4% | 5.7% | 8.2% | 3.2% | 12.0%
Average | 7.5% | 8.4% | 8.9% | 11.2% | 8.9% | 12.7%

Benchmark Report Methodology

250+ hours of audio data
80,000+ audio files
26 datasets

Datasets

This benchmark was performed using 3 open-source datasets (LibriSpeech, Rev16, and Meanwhile) and 4 in-house datasets curated by AssemblyAI. For the in-house datasets, we sourced 60+ hours of human-labeled audio data covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. Collectively, these datasets comprise a diverse set of English audio that spans phone calls, broadcasts, accented speech, and heavy jargon.

Methodology

We measured the performance of each provider on 7 datasets. For each commercial vendor, we sent every file to the vendor's most accurate model through its API. For Whisper, we generated outputs using a self-hosted instance of Whisper Large-v3. After receiving the result for each file from each vendor, we normalized both the model's prediction and the ground-truth transcript using the open-source Whisper Normalizer. We then computed metrics for each file and averaged them across datasets.
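The normalize-then-score procedure above can be sketched in Python. This is an illustration under stated assumptions, not AssemblyAI's pipeline:

- `normalize` is a deliberately simplified, hypothetical stand-in for the open-source Whisper Normalizer. The real normalizer handles many more cases, such as spelled-out numbers and British versus American spellings.
- The two-level averaging is one reading of the methodology, not a confirmed detail. Per-file WER is averaged within each dataset, and then the datasets are weighted equally.
- The dataset names are hypothetical.

```python
import re
from statistics import fmean
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Simplified stand-in for the Whisper Normalizer: lowercase, turn
    punctuation other than apostrophes into spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j - 1] + (r != h), prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[-1] / len(ref)


def benchmark_wer(datasets: Dict[str, List[Tuple[str, str]]]) -> float:
    """Normalize each (ground_truth, prediction) pair, score every file,
    average per dataset, then average the dataset scores with equal weight."""
    per_dataset = []
    for pairs in datasets.values():
        file_wers = [word_error_rate(normalize(truth), normalize(pred)) for truth, pred in pairs]
        per_dataset.append(fmean(file_wers))
    return fmean(per_dataset)


# Hypothetical mini-benchmark built from example sentences on this page
datasets = {
    "dataset_a": [("Her jewelry shimmered.", "her jewelry shimmering")],
    "dataset_b": [("The Englishman said nothing.", "the englishman said nothing")],
}
print(f"{benchmark_wer(datasets):.1%}")  # 16.7%
```

Because casing and punctuation are stripped from both sides before scoring, "The Englishman said nothing." and "the englishman said nothing" count as an exact match.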

AssemblyAI combines industry-leading accuracy with a robust feature set

Language Detection

Speaker Labels

Word Timings

Real-time Streaming

Custom Vocabulary

Dual Channel

LLM Text Generation

Profanity Filtering

Speech Threshold

Advanced PII Redaction

Extract maximum value from voice data

AssemblyAI’s models deliver the highest accuracy and an expansive set of capabilities, making it easy to build on top of voice data.

Rapid innovation, constant iteration

AssemblyAI ships new features, model improvements, and product updates weekly to ensure you have access to cutting-edge Speech AI capabilities.



Trusted by thousands of developers in businesses across every industry

"With AssemblyAI, we get a trusted partner that supports us in enhancing the value we deliver for our customers."

Ryan Johnson, Chief Product Officer, CallRail

It’s fast and free to start building with AssemblyAI

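import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder: your AssemblyAI API key
audio_url = "https://example.com/audio.mp3"  # placeholder: a public audio URL or a local file path
config = aai.TranscriptionConfig(speaker_labels=True)  # optional features, e.g. speaker labels

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_url, config=config)

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)  # reason the transcription job failed
else:
    print(transcript.text)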