Benchmarks

Compare AssemblyAI

See how our models stack up on accuracy, entity recognition, diarization, multilingual support, latency, and cost.

Word error rate

WER is calculated as (substitutions + insertions + deletions) / total words in the reference transcript. It is the standard metric for evaluating speech-to-text accuracy.

Average normalized WER across selected datasets

*lower is better*

4.50%
5.24%
5.34%
5.50%
5.87%
5.91%
6.47%
6.66%
7.02%
7.16%
17.39%

AssemblyAI Universal-3 Pro

Mistral Voxtral Mini

OpenAI GPT-4o Transcribe

Cohere Transcribe

ElevenLabs Scribe V2

Qwen3 ASR

Gladia

Deepgram Nova-3

Azure Batch

Grok

Soniox

WER by dataset
Dataset AssemblyAI Universal-3 Pro ElevenLabs Scribe V2 Mistral Voxtral Mini OpenAI GPT-4o Transcribe Cohere Transcribe Qwen3 ASR Gladia Deepgram Nova-3 Azure Batch Grok Soniox
Synthetic medical 0.25% 0.41% 1.25% 0.55% 1.33% 1.15% 1.23% 0.51% 1.55% 1.38% 0.72%
Accented English (India) 5.62% 5.92% 6.44% 6.49% 6.39% 6.61% 6.78% 7.77% 8.04% 6.60% 53.48%
General speech 6.29% 7.19% 6.64% 7.45% 8.35% 7.01% 11.08% 8.81% 8.42% 9.76% 7.55%
Webinar speech 5.86% 9.96% 6.65% 6.87% 5.91% 8.85% 6.79% 9.55% 10.07% 10.90% 7.79%
Average 4.50% 5.87% 5.24% 5.34% 5.50% 5.91% 6.47% 6.66% 7.02% 7.16% 17.39%
Demo

Listen to the difference

Play the preloaded sample and compare the transcripts side by side.

Medication name changes

Ramipril becomes a different medication, which can change the patient record.

Truth

I take this medication called, I think it's like a Lipitor. And then I take Ramipril for the blood pressure.

AssemblyAI

I take this medication called, like, it's like a Lipitor. And then I take, um, Ramipril for the blood pressure.

Competitor

I take this medication called like, it's like a Lipitor. And then I take olanopril for the blood pressure.

Benchmarks

Run your own benchmark

Test on your own audio. Want to do it right? Read our benchmarking guide in the docs.

Methodology

Last updated June 8, 2026.