Speech-to-text evals that actually understand model performance

Your WER benchmark might be lying to you. Many speech-to-text models look good in demos,
then fail on real audio (noise, numbers, domain terms, overlap, edge cases).

Compare your truth files

Which evaluation metric is right for your use-case?

Speech-to-text accuracy can be measured with standard ASR metrics such as Word Error Rate (WER), or with domain-specific evals such as Missed Entity Rate (MER) and Semantic WER.

Build a find-and-replace list for domain-specific equivalences (healthcare/health care, alright/all right) before running WER calculations.
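As a rough illustration of that step, the sketch below applies a small equivalence list before scoring. The `EQUIVALENCES` entries, the `normalize` helper, and the sample sentences are assumptions made up for this example, and the WER function is a plain word-level edit distance, not any particular vendor's implementation.

```python
import re

# Hypothetical equivalence list: each variant maps to one canonical form.
EQUIVALENCES = {
    "health care": "healthcare",
    "all right": "alright",
}

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and apply the equivalence list."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    text = " ".join(text.split())
    for variant, canonical in EQUIVALENCES.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text.split()

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(reference), 1)

reference = "Alright, the healthcare plan is ready."
hypothesis = "all right the health care plan is ready"

# Without normalization, formatting differences are scored as errors.
raw = wer(reference.lower().split(), hypothesis.split())
# With normalization, the two transcripts are word-for-word identical.
normalized = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw:.2%}, normalized WER: {normalized:.2%}")
```

In this toy case every raw "error" is a spelling or punctuation variant, so the normalized score drops to zero while the raw score stays high.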

Compare speech-to-text models

Not all errors are equal

WER counts "gonna → going to" and "lisinopril → listening a pill" as the same kind of error, with no regard for meaning. The first preserves all context. The second breaks your downstream pipeline.

Better models often get penalized for catching what humans missed in “ground truth.” Use the Truth File Corrector to surface disagreements between your ground truth and the model, listen back, and update your reference transcripts.

Update your truth files

Real-World Performance - Speech-to-Text Benchmarks

Results across real-world datasets. WER shown alongside Missed Entity Rate (MER) — the metric that matters when specific words drive downstream decisions.
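This page does not spell out how MER is computed. One plausible reading, assumed here for illustration only, is the share of entities in the reference transcript that never appear in the model output. The `missed_entity_rate` function, the entity list, and the sample transcript below are all hypothetical.

```python
def missed_entity_rate(entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities absent from the hypothesis transcript.

    Assumed definition for illustration; the real metric may differ.
    """
    if not entities:
        return 0.0
    # Pad with spaces so entities only match on whole-word boundaries.
    hyp = " " + " ".join(hypothesis.lower().split()) + " "
    missed = [e for e in entities if f" {e.lower()} " not in hyp]
    return len(missed) / len(entities)

# Hypothetical entities tagged in a reference transcript.
entities = ["lisinopril", "march 3rd", "boston"]
hypothesis = "Patient in Boston started listening a pill on March 3rd"

rate = missed_entity_rate(entities, hypothesis)
print(f"Missed Entity Rate: {rate:.2%}")
```

Here the date and location survive but the drug name does not, so one of three entities is missed even though most words in the transcript are correct.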

Model	Average WER	MER - Medical Terms	MER - Date & Time	MER - Locations
AssemblyAI Universal-3 Pro	5.93%	13.61%	7.5%	8.26%
OpenAI GPT-4o	6.87%	16.50%	12.29%	12.15%
Deepgram Nova-3	7.9%	16.95%	18.69%	13.94%

WER calculated across 26 public datasets

Word Error Rate vs Missed Entity Rate - Streaming Benchmarks

Build voice agents that sound natural and understand context for downstream conversations. Lower is better for both metrics.

Model	Average Streaming WER	Average Streaming MER
AssemblyAI Universal-3 Pro Streaming	8.14%	16.7%
OpenAI GPT-4o	9.90%	23.3%
Deepgram Nova-3	11.06%	25.2%

WER calculated across 26 public datasets

Fix your benchmarks

Truth File Corrector Tool

Submit your existing truth files and your audio to surface every discrepancy between the AI output and your human truth file.
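The underlying idea can be sketched with Python's standard `difflib`: align the two transcripts word by word and list every region where they disagree, so a human can listen back to just those spots. This is a minimal sketch, not the Truth File Corrector's actual method, and the `discrepancies` helper and sample transcripts are assumptions.

```python
from difflib import SequenceMatcher

def discrepancies(truth: str, model_output: str) -> list[tuple[str, str, str]]:
    """Return (operation, truth_span, model_span) for every non-matching region."""
    truth_words = truth.lower().split()
    model_words = model_output.lower().split()
    matcher = SequenceMatcher(a=truth_words, b=model_words, autojunk=False)
    return [
        (tag, " ".join(truth_words[i1:i2]), " ".join(model_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Hypothetical case where the human truth file contains the mistake.
truth = "take ten milligrams of listening a pill daily"
model_output = "take ten milligrams of lisinopril daily"

for op, truth_span, model_span in discrepancies(truth, model_output):
    print(f"{op}: truth={truth_span!r} model={model_span!r}")
```

Each flagged span is a candidate for review, not an automatic correction: the audio decides whether the truth file or the model was right.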

Correct Ground Truth Files

Semantic Word List Generation

Use Truth File Corrector to build an array of word groupings that should be treated as equivalent in your evaluations.

Create Semantic Word Lists

Benchmarking SDK

Build your own internal benchmarking pipeline using corrected truth files and semantic word lists.

Open GitHub

Why your word error rate (WER) benchmark might be lying to you

WER benchmarks can mislead you. Learn why standard transcription metrics often favor older models, and how AssemblyAI rethought evaluation with Universal-3 Pro.

Read what WER is and isn't measuring

What builders say about AssemblyAI

"In the medical context, accuracy is highly important… [and] there can be multiple people present. Separating them is key to accuracy. The biggest impact AssemblyAI has had has been in enabling our technical team to focus on workflow-specific features rather than a general speech-to-text pipeline."

Jackson Bierfeldt, Cofounder + CTO, JotPsych

36% improvement in WER

Like many AI-forward companies, Dovetail experiments constantly with evolving models and technologies. With partners like AssemblyAI providing the transcription backbone, Dovetail’s teams are free to dream, build, and ship features once thought impossible.

Test with your own audio in under 10 minutes

No commits, no minimums. Run the benchmarking SDK on your data and compare against any model. Starts at $0.15/hr.

Compare Models in Playground

See Evaluation Docs