Speech-to-text evals that actually understand model performance

Your WER benchmark might be lying to you. Many speech-to-text models look good in demos,
then fail on real audio: noise, numbers, domain terms, overlapping speech, and other edge cases.

Which evaluation metric is right for your use-case?

Speech-to-text accuracy can be measured with standard ASR metrics such as Word Error Rate (WER), or with domain-specific evals such as Missed Entity Rate (MER) and Semantic WER.

Before running WER calculations, build a find-and-replace list of domain-specific equivalences (healthcare/health care, alright/all right) so harmless spelling variants are not counted as errors.
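A minimal sketch of that normalization step in Python, assuming a hand-maintained equivalence dictionary (the EQUIVALENCES map and normalize helper are illustrative names, not part of any SDK):

```python
import re

# Hypothetical equivalence list: map domain-specific variants to one canonical
# form before scoring, so spelling differences don't count as WER errors.
EQUIVALENCES = {
    "health care": "healthcare",
    "all right": "alright",
    "e-mail": "email",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and apply whole-phrase replacements."""
    text = text.lower()
    text = re.sub(r"[^\w\s'-]", "", text)  # drop punctuation that isn't part of a word
    for variant, canonical in EQUIVALENCES.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text

reference  = normalize("The patient asked about health care coverage.")
hypothesis = normalize("the patient asked about healthcare coverage")
# Both now read "the patient asked about healthcare coverage";
# feed the normalized strings into whatever WER implementation you use.
```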

Compare speech-to-text models

Not all errors are equal

WER treats "gonna → going to" identically to "lisinopril → listening a pill." One preserves all context. One breaks your downstream pipeline.
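As a toy illustration (not any vendor's scoring code), the sketch below uses a self-contained word-level edit distance; the sentences are invented, and each hypothesis differs from the reference by exactly one substituted word, so plain WER scores them identically even though only one of them is dangerous downstream:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference   = "she is gonna start lisinopril"
hyp_style   = "she is going start lisinopril"   # harmless style difference
hyp_medical = "she is gonna start allopurinol"  # wrong drug entirely

print(wer(reference, hyp_style))    # 0.2
print(wer(reference, hyp_medical))  # 0.2 -- same score, very different risk
```

Entity-aware metrics like MER exist precisely to pull those two cases apart.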

Better models often get penalized for catching what humans missed in “ground truth.” Use the Truth File Corrector to surface disagreements between your ground truth and the model, listen back, and update your reference transcripts.
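The Truth File Corrector is AssemblyAI tooling; purely as an illustration of the underlying idea (this is not the product's API), the sketch below uses Python's standard difflib to surface word-level disagreements between a reference transcript and a model output, so a reviewer can listen back and decide which side is right:

```python
import difflib

def disagreements(reference: str, hypothesis: str):
    """Yield (reference_span, hypothesis_span) pairs where the transcripts differ."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])

reference  = "patient denies chest pain or dyspnea"   # existing truth file
hypothesis = "patient denies chest pain and dyspnea"  # model output
for truth_span, model_span in disagreements(reference, hypothesis):
    print(f"truth file: {truth_span!r}  model: {model_span!r}")  # queue for review
```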

Update your truth files

Real-World Performance - Speech-to-Text Benchmarks

Results across real-world datasets. WER shown alongside Missed Entity Rate (MER) — the metric that matters when specific words drive downstream decisions.

| Model | Average WER | MER - Medical Terms | MER - Date & Time | MER - Locations |
|---|---|---|---|---|
| AssemblyAI Universal-3 Pro | 5.93% | 13.61% | 7.5% | 8.26% |
| OpenAI GPT-4o | 6.87% | 16.50% | 12.29% | 12.15% |
| Deepgram Nova-3 | 7.9% | 16.95% | 18.69% | 13.94% |
WER calculated across 26 public datasets
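MER, as the name suggests, tracks how many reference entities (medical terms, dates and times, locations) a model fails to produce. Below is a minimal sketch of one way to compute it, assuming you already have an entity list per utterance; the matching rule and the sample data are assumptions, not the methodology behind the table above:

```python
def missed_entity_rate(samples: list[tuple[list[str], str]]) -> float:
    """Fraction of reference entities that never appear in the model's transcript."""
    total, missed = 0, 0
    for entities, hypothesis in samples:
        hyp_text = hypothesis.lower()
        for entity in entities:
            total += 1
            if entity.lower() not in hyp_text:
                missed += 1
    return missed / total if total else 0.0

# (entity list, model transcript) pairs -- invented examples
samples = [
    (["lisinopril", "10 milligrams"], "start lisinopril ten milligrams daily"),
    (["Springfield"], "the clinic in springfield is closed"),
]
print(missed_entity_rate(samples))  # 1 of 3 entities missed -> ~0.33
```

Note that the "10 milligrams" miss here is really a formatting mismatch (ten vs 10), which is why the equivalence list described earlier matters for entity scoring as well.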

Word Error Rate vs Missed Entity Rate - Streaming Benchmarks

Build voice agents that sound natural and keep context for downstream conversations. Lower is better.

| Model | Average Streaming WER | Average Streaming MER |
|---|---|---|
| AssemblyAI Universal-3 Pro Streaming | 8.14% | 16.7% |
| OpenAI GPT-4o | 9.90% | 23.3% |
| Deepgram Nova-3 | 11.06% | 25.2% |
WER calculated across 26 public datasets

Test with your own audio in under 10 minutes

No commitments, no minimums. Run the benchmarking SDK on your data and compare against any model. Starts at $0.15/hr.