Benchmarks

Compare AssemblyAI

See how our models stack up on accuracy, entity recognition, diarization, multilingual support, latency, and cost.

Word error rate

Word error rate (WER) is calculated as (substitutions + insertions + deletions) / total words in the reference transcript. It is the standard metric for evaluating speech-to-text accuracy.
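The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and an exact word match; the function name `wer_counts` is hypothetical, and the sample strings are a lowercased excerpt of the demo transcripts on this page, not benchmark data.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Align two transcripts word by word and count each error type."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell holds (total_errors, substitutions, insertions, deletions)
    # for the best alignment of ref[:i] against hyp[:j].
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, 0, i)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, j, 0)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no new error
                continue
            t, s, n, d = table[i - 1][j - 1]
            substitute = (t + 1, s + 1, n, d)
            t, s, n, d = table[i][j - 1]
            insert = (t + 1, s, n + 1, d)
            t, s, n, d = table[i - 1][j]
            delete = (t + 1, s, n, d + 1)
            table[i][j] = min(substitute, insert, delete)

    total, subs, ins, dels = table[-1][-1]
    return {
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "wer": total / len(ref),
    }


reference = "and then i take ramipril for the blood pressure"
hypothesis = "and then i take olanopril for the blood pressure"
result = wer_counts(reference, hypothesis)
print(result)  # one substitution over nine reference words
```

Because the denominator is the reference length, a hypothesis with many inserted words can push WER above 100%.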

Figure: Average normalized WER across selected datasets, by provider (lower is better).

WER by dataset

Dataset	AssemblyAI Universal-3 Pro	ElevenLabs Scribe V2	Mistral Voxtral Mini	OpenAI GPT-4o Transcribe	Cohere Transcribe	Qwen3 ASR	Gladia	Deepgram Nova-3	Azure Batch	Grok	Soniox
Synthetic medical	0.25%	0.41%	1.25%	0.55%	1.33%	1.15%	1.23%	0.51%	1.55%	1.38%	0.72%
Accented English (India)	5.62%	5.92%	6.44%	6.49%	6.39%	6.61%	6.78%	7.77%	8.04%	6.60%	53.48%
General speech	6.29%	7.19%	6.64%	7.45%	8.35%	7.01%	11.08%	8.81%	8.42%	9.76%	7.55%
Webinar speech	5.86%	9.96%	6.65%	6.87%	5.91%	8.85%	6.79%	9.55%	10.07%	10.90%	7.79%
Average	4.50%	5.87%	5.24%	5.34%	5.50%	5.91%	6.47%	6.66%	7.02%	7.16%	17.39%
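The Average row appears consistent with an unweighted mean of the four dataset rows. The sketch below checks that reading for three columns copied from the table; the averaging rule itself is an inference from the published numbers, not a stated part of the methodology.

```python
# Per-dataset WER (%) copied from the table above, in row order:
# synthetic medical, accented English (India), general speech, webinar speech.
wer_by_dataset = {
    "AssemblyAI Universal-3 Pro": [0.25, 5.62, 6.29, 5.86],
    "ElevenLabs Scribe V2": [0.41, 5.92, 7.19, 9.96],
    "Soniox": [0.72, 53.48, 7.55, 7.79],
}


def unweighted_mean(values: list[float]) -> float:
    """Every dataset counts equally, regardless of how much audio it holds."""
    return sum(values) / len(values)


averages = {name: unweighted_mean(v) for name, v in wer_by_dataset.items()}
for name, value in averages.items():
    print(f"{name}: {value:.3f}%")
```

Under this reading, a single outlier dataset can dominate the average: the Soniox figure is driven almost entirely by the accented English row.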

Missed entity rate

Missed entity rate (MER) measures transcription errors on named-entity categories: names, organizations, locations, medical terms, money, occupations, temporal expressions, and URLs. Unlike aggregate WER, it isolates the tokens that carry the most semantic weight in downstream applications.
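A rough sketch of the idea, using the demo transcripts from the "Listen to the difference" section of this page. The matching rule here (an entity counts as missed when its normalized form does not appear in the output) and the hand-picked entity list are assumptions made for illustration; the scoring rules of the benchmark itself are not specified on this page.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting does not affect matching."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(words)


def missed_entity_rate(entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities that do not appear in the hypothesis."""
    if not entities:
        raise ValueError("at least one reference entity is required")
    padded = f" {normalize(hypothesis)} "  # padding enforces whole-word matches
    missed = [e for e in entities if f" {normalize(e)} " not in padded]
    return len(missed) / len(entities)


# Hypothetical entity labels for the demo sentence (both are medical terms).
entities = ["Lipitor", "Ramipril"]

assemblyai_output = (
    "I take this medication called, like, it's like a Lipitor. "
    "And then I take, um, Ramipril for the blood pressure."
)
competitor_output = (
    "I take this medication called like, it's like a Lipitor. "
    "And then I take olanopril for the blood pressure."
)

print(missed_entity_rate(entities, assemblyai_output))  # 0.0
print(missed_entity_rate(entities, competitor_output))  # 0.5
```

The filler words in the first output do not affect the score, while the single wrong medication name in the second output accounts for half of the entities.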

Figure: Missed entity rate by provider and entity type (lower is better).

MER by entity type

Entity type	AssemblyAI Universal-3 Pro	Mistral Voxtral Mini	AssemblyAI Universal-2	OpenAI GPT-4o Transcribe	ElevenLabs Scribe V2	Deepgram Nova-3	NVIDIA Canary 1B
Name	23.63%	25.89%	25.37%	29.23%	20.87%	25.58%	36.08%
Organization	17.02%	20.80%	20.96%	23.40%	16.57%	19.25%	31.60%
Location	8.61%	9.98%	12.40%	12.21%	11.54%	13.57%	24.27%
Medical	13.15%	19.78%	18.43%	16.07%	10.80%	15.69%	31.07%
Money	43.56%	39.22%	30.54%	37.30%	76.88%	76.82%	80.40%
Occupation	9.03%	9.43%	9.86%	13.01%	9.12%	9.67%	10.59%
Temporal	8.17%	6.21%	9.82%	13.78%	20.42%	19.42%	29.70%
URL	50.49%	65.05%	71.84%	65.53%	47.09%	54.79%	98.06%

Diarization

Diarization segments audio by speaker. We report cpWER (concatenated minimum-permutation word error rate), which jointly evaluates transcription and speaker assignment. Because inserted words count as errors against the reference word count, cpWER can exceed 100%.
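As a minimal sketch of how cpWER works: concatenate each speaker's utterances, try every mapping of hypothesis speakers to reference speakers, and keep the mapping with the fewest word errors. The function names and toy transcripts below are illustrative, and brute-force permutation is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cp_wer(reference: dict[str, list[str]], hypothesis: dict[str, list[str]]) -> float:
    """Both arguments map a speaker label to that speaker's utterances in time order."""
    ref_streams = [" ".join(utts).split() for utts in reference.values()]
    hyp_streams = [" ".join(utts).split() for utts in hypothesis.values()]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[]] * (n - len(ref_streams))
    hyp_streams += [[]] * (n - len(hyp_streams))

    total_ref_words = sum(len(stream) for stream in ref_streams)
    if total_ref_words == 0:
        raise ValueError("reference must contain at least one word")

    # Speaker labels are arbitrary, so score the best possible label mapping.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_words


reference = {"A": ["hello there", "how are you"], "B": ["i am fine"]}
relabeled = {"spk0": ["i am fine"], "spk1": ["hello there", "how are you"]}
misattributed = {"spk0": ["hello there", "i am fine"], "spk1": ["how are you"]}

print(cp_wer(reference, relabeled))      # 0.0
print(cp_wer(reference, misattributed))  # 0.5
```

The second case has every word transcribed correctly, yet half the reference words are counted as errors because one utterance is assigned to the wrong speaker. That is why cpWER figures run much higher than plain WER.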

Figure: Average cpWER across DiPCo and NOTSOFAR, by provider (lower is better).

Diarization summary

Provider	Average cpWER
AssemblyAI Universal-3 Pro	33.34%
Deepgram	43.21%
Gladia	44.04%
Speechmatics	46.11%
Grok	46.23%
Google	57.14%
Soniox	112.28%

Multilingual

We report global WER across the evaluated multilingual benchmark suite, with language-level breakdowns where available.

Figure: Global multilingual WER by provider (lower is better).

Global WER and language breakdown

Provider	Global WER	Evaluated languages	German	Spanish	French	Italian	Portuguese
Speechmatics Enhanced	8.22%	5/20	10.30%	5.84%	10.09%	7.88%	7.01%
AssemblyAI Universal-3 Pro	8.23%	5/20	9.19%	6.21%	10.63%	7.79%	7.34%
OpenAI GPT-4o Transcribe	9.52%	15/20	—	—	—	—	—
Mistral Voxtral Mini	10.50%	10/20	—	—	—	—	—
ElevenLabs Scribe V2	11.13%	19/20	—	—	—	—	—
Cohere Transcribe	13.36%	—	—	—	—	—	—
OpenAI Whisper-1	14.39%	—	—	—	—	—	—
Deepgram Nova-3	15.71%	—	—	—	—	—	—

Demo

Listen to the difference

Compare the transcripts of the same audio sample side by side.

Medication name changes

In the competitor transcript, Ramipril becomes a different medication name, which can change the patient record.


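import assemblyai as aai

# Replace with your AssemblyAI API key.
aai.settings.api_key = "YOUR_API_KEY"

# Request the Universal-3 Pro model.
config = aai.TranscriptionConfig(
    speech_model="universal-3-pro",
)

# Upload the local file and wait for the finished transcript.
transcript = aai.Transcriber().transcribe(
    "patient-intake.mp3", config=config
)

# Surface API-side failures instead of printing an empty transcript.
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(f"Transcription failed: {transcript.error}")

print(transcript.text)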

Truth

I take this medication called, I think it's like a Lipitor. And then I take Ramipril for the blood pressure.

AssemblyAI

I take this medication called, like, it's like a Lipitor. And then I take, um, Ramipril for the blood pressure.

Competitor

I take this medication called like, it's like a Lipitor. And then I take olanopril for the blood pressure.


Run your own benchmark

Test on your own audio. Want to do it right? Read our benchmarking guide in the docs.


Word error rate (streaming)

WER is calculated as (substitutions + insertions + deletions) / total words in the reference transcript. Streaming models transcribe with less context than pre-recorded models. Accuracy is evaluated on final transcripts.

Average WER (lower is better)

Provider	Average WER
AssemblyAI Universal-3 Pro Streaming	5.53%
Deepgram Flux	8.87%
AssemblyAI Universal Streaming	8.89%
Deepgram Nova-3	9.39%
Cartesia	10.10%
Deepgram Nova-3 Multi	10.64%


Methodology

Each audio file was sent to every provider's production API using default settings. No custom model tuning or prompt engineering was applied. All providers were tested on identical audio files under the same conditions.
Transcription outputs were normalized using the OpenAI Whisper text normalizer before WER computation. This removes formatting differences (casing, punctuation, number formats) so that accuracy comparisons reflect transcription quality, not output styling.
We evaluate across benchmark datasets spanning general speech, entity recognition, diarization, multilingual audio, and code switching.
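To illustrate why normalization matters before scoring, here is a deliberately simplified stand-in. It is not the OpenAI Whisper normalizer: it handles only casing and punctuation, not number formats, and the function name `simple_normalize` is hypothetical.

```python
import re


def simple_normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep letters, digits, apostrophes
    return text.split()


# Two outputs that differ only in styling, based on the demo sentence.
styled = "And then I take, um, Ramipril for the blood pressure."
plain = "and then i take um ramipril for the blood pressure"

print(styled.split() == plain.split())                        # False
print(simple_normalize(styled) == simple_normalize(plain))    # True
```

Without normalization, the two outputs would be scored as different on several words even though the same words were recognized.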

Word error rate

Selected speech sets covering synthetic medical, accented English, general speech, and webinar audio.

Missed entity rate

PrivateAI Named Entities.

Multilingual

Common Voice, FLEURS, and VoxPopuli multilingual speech datasets.

Code switching

Common Voice-derived English paired with German, Spanish, French, Italian, and Portuguese.

Diarization

DiPCo (the Dinner Party Corpus) and NOTSOFAR.

Last updated June 8, 2026.
