
Speech-to-text evals that actually understand model performance
Models can top clean benchmarks, then fail on real audio (noise, numbers, domain terms, overlap, edge cases).
Which evaluation metric is right for your use-case?
Speech-to-text accuracy can be measured with standard ASR metrics such as Word Error Rate (WER), or with domain-specific evals such as Missed Entity Rate (MER) and Semantic WER.
Build a find-and-replace list for domain-specific equivalences (healthcare/health care, alright/all right) before running WER calculations.
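A minimal sketch of that normalization step, assuming a hand-built equivalence map (the `EQUIVALENCES` pairs below are illustrative, not a shipped list):

```python
import re

# Hypothetical equivalence list: each variant is mapped to a canonical
# form before scoring, so WER doesn't penalize acceptable spellings.
EQUIVALENCES = {
    "health care": "healthcare",
    "all right": "alright",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and apply domain equivalences."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    for variant, canonical in EQUIVALENCES.items():
        text = text.replace(variant, canonical)
    return text

print(normalize("Health care is alright."))  # -> "healthcare is alright"
```

Run both the reference and the hypothesis through the same `normalize` before computing WER, so spelling variants cancel out instead of counting as substitutions.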
Not all errors are equal
WER treats "gonna → going to" identically to "lisinopril → listening a pill." One preserves all context. One breaks your downstream pipeline.
Better models often get penalized for catching what humans missed in “ground truth.” Use the Truth File Corrector to surface disagreements between your ground truth and the model, listen back, and update your reference transcripts.
Real-World Performance - Speech-to-Text Benchmarks
Results across real-world datasets. WER shown alongside Missed Entity Rate (MER) — the metric that matters when specific words drive downstream decisions.
| Model | Average WER | MER - Medical Terms | MER - Date & Time | MER - Locations |
|---|---|---|---|---|
| AssemblyAI Universal-3 Pro | 5.93% | 13.61% | 7.5% | 8.26% |
| OpenAI GPT-4o | 6.87% | 16.50% | 12.29% | 12.15% |
| Deepgram Nova-3 | 7.9% | 16.95% | 18.69% | 13.94% |
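One simple way to approximate a missed-entity score, assuming you have entity annotations (drug names, dates, locations) for each reference transcript; this substring check is a sketch, not the benchmark's exact matching logic:

```python
def missed_entity_rate(entities: list[str], hypothesis: str) -> float:
    """Fraction of annotated reference entities absent from the hypothesis.

    `entities` is an assumed input: the entity strings labeled in the
    reference transcript for this utterance.
    """
    hyp = hypothesis.lower()
    missed = sum(1 for e in entities if e.lower() not in hyp)
    return missed / len(entities)

ents = ["lisinopril", "10 mg", "March 3rd"]
hyp = "patient started listening a pill ten milligrams on march 3rd"
print(missed_entity_rate(ents, hyp))  # 2 of 3 entities missed
```

Unlike WER, this score is unchanged by filler words or harmless rewording; it only moves when a term that drives a downstream decision goes missing.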
Word Error Rate vs Missed Entity Rate - Streaming Benchmarks
Build voice agents that sound natural and understand context for downstream conversations. *Lower is better
| Model | Average Streaming WER | Average Streaming MER |
|---|---|---|
| AssemblyAI Universal-3 Pro Streaming | 8.14% | 16.7% |
| OpenAI GPT-4o | 9.90% | 23.3% |
| Deepgram Nova-3 | 11.06% | 25.2% |
What builders say about AssemblyAI
"In the medical context, accuracy is highly important… [and] there can be multiple people present. Separating them is key to accuracy. The biggest impact AssemblyAI has had has been in enabling our technical team to focus on workflow-specific features rather than a general speech-to-text pipeline."

Like many AI-forward companies, Dovetail experiments constantly with evolving models and technologies. With partners like AssemblyAI providing the transcription backbone, Dovetail’s teams are free to dream, build, and ship features once thought impossible.

Test with your own audio in under 10 minutes
No commitments, no minimums. Run the benchmarking SDK on your data and compare against any model. Starts at $0.15/hr.