Word error rate is broken: How to actually evaluate speech-to-text in 2026
Our team recently held a workshop to discuss why the industry's standard benchmark is leading teams toward the wrong vendor, and what to use instead.



Word error rate (WER) measures the percentage of words a speech recognition system gets wrong compared to a human reference transcript. It counts substitutions, insertions, and deletions, producing a single number that's been the default evaluation metric for decades. The problem: WER treats every word equally — missing a filler word counts the same as missing a drug name — and it can actually penalize more accurate models for transcribing words that human transcribers missed.
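As a concrete reference point, WER is just a word-level edit distance over the reference transcript. Here's a minimal sketch in Python — illustrative only; production evaluations typically use a tested library such as jiwer:

```python
# Minimal WER sketch: word-level edit distance (substitutions,
# insertions, deletions) divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("my cadillac isn't doing great",
          "my cataracts aren't doing great"))  # 2 errors / 5 words = 0.4
```

Note that the two substitutions here count exactly the same as any other pair of word errors — the metric has no notion of which words matter.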
On March 31, 2026, AssemblyAI's applied AI engineers — Zach, Griffin, and Ryan — ran a live workshop called "Your Ground Truth Is Wrong." The claim: word error rate, the metric the entire speech-to-text industry uses to compare models, is fundamentally broken. And they could prove it with real audio, in real time, in front of an audience.
What followed was a 60-minute session where the team opened actual ground truth files, played back the source audio, and showed — word by word — where the benchmarks were wrong.
The paradox: when a better model scores worse
When AssemblyAI launched Universal-3 Pro, customers came back with a surprising claim: the new model was performing worse on their benchmarks than the older Universal-2 model. That didn't match anything the team was seeing internally.
So they did something simple. They listened to the audio.
The vast majority of Universal-3 Pro's "errors" weren't errors at all.
Here's what happened: Universal-3 Pro was transcribing words that the human ground truth files had missed entirely. On difficult audio — noisy environments, accented speech, overlapping speakers, fast talkers — the model was outperforming the human-labeled transcription it was being evaluated against. More accurate transcription meant more "insertions," which meant a higher word error rate. The metric was penalizing the model for being more correct.
This isn't a theoretical edge case. Universal-3 Pro achieves a 1.52% WER on LibriSpeech Test Clean and a mean WER of 5.6% across 26 real-world datasets — the lowest error rates of any tested provider. When models reach this level of accuracy, the bottleneck shifts from model quality to evaluation quality.
And then there's the downstream problem. Consider a substitution: "My Cadillac isn't doing great" transcribed as "My cataracts aren't doing great." WER calls that one error. But if you're feeding that transcript into an LLM for clinical analysis, the entire downstream context is destroyed. It's not a transcription error. It's a pipeline failure. And WER has no way to distinguish between the two.
WER was designed for an era when AI transcription was rough and every word mattered equally. That era is over.
Three reasons WER fails in production
Not all errors are equal
WER counts "gonna" transcribed as "going to" the same as "lisinopril" transcribed as "listening a pill." One is a formatting difference that preserves all meaning. The other is a medication name that, if wrong, could harm a patient.
A single-number metric that weights every word identically can't capture this distinction. And it matters more than ever now that transcripts flow directly into LLMs, analytics pipelines, and automated decision systems. In a voice agent context, a substitution (a plausible guess) is actually preferable to a deletion (a missed word) — because a deletion can cause a "hanging" turn where the agent receives nothing and the conversation stalls. Traditional WER treats both error types identically.
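The distinction becomes visible as soon as you count the three error types separately instead of summing them. A rough sketch using Python's difflib for word-level alignment (a production evaluation would use a proper Levenshtein alignment):

```python
# Sketch: break WER into its error types. A deletion (the agent hears
# nothing) and a substitution (the agent hears a plausible guess) have
# very different downstream effects, but WER sums them into one number.
from difflib import SequenceMatcher

def error_breakdown(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    counts = {"sub": 0, "ins": 0, "del": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":
            m, n = i2 - i1, j2 - j1
            counts["sub"] += min(m, n)          # paired word swaps
            counts["del"] += max(0, m - n)      # leftover reference words
            counts["ins"] += max(0, n - m)      # leftover hypothesis words
        elif op == "delete":
            counts["del"] += i2 - i1
        elif op == "insert":
            counts["ins"] += j2 - j1
    return counts

print(error_breakdown("can you confirm the code", "can you the code"))
# {'sub': 0, 'ins': 0, 'del': 1}  -- a deletion: the agent heard nothing
```

A voice-agent evaluation could then weight `del` more heavily than `sub`, which plain WER cannot do.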
Ground truth files are often wrong
Human transcriptionists miss words. Specifically, they miss backchannels ("okay," "uh-huh"), overlapping speech, quiet affirmations, false starts, and disfluencies. When a more capable model captures these, WER counts them as insertions — errors.
This is the moment from the workshop that changed how AssemblyAI thinks about benchmarking. In a medical conversation, the human ground truth was missing "I want." Universal-3 Pro had it. Zach played back the audio. The AI was right. Word error rate called it an error. The team called it a win.
There's a whole category of speech that human transcribers are trained (or inclined) to skip: a patient whispering "okay" under their breath while a doctor is still talking. Universal-3 Pro catches it. In a WER benchmark, that shows up as an error. The benchmark becomes a measure of how closely the AI matches flawed human labels, not how accurately it transcribes audio.
This is why Artificial Analysis, an independent AI benchmarking organization, had to create proprietary evaluation datasets with manually corrected ground truths when building their Speech-to-Text leaderboard. Existing public datasets contain systematic human transcription errors that penalize models which are actually more accurate.
Normalization gaps hide real differences
The Whisper Normalizer — the open-source Python package most teams use to standardize transcriptions before comparing them — catches some formatting variations. "Don't" maps to "do not." But it misses others.
"Healthcare" vs. "health care." "All right" vs. "alright." "Setup" vs. "set up." Each of these is a valid formatting choice, but in a WER evaluation, every one registers as an error. Since Universal-3 Pro has a language model decoder at its core, it has a stronger command of grammar and formatting than traditional transcription models — which means it produces more of these "different-but-correct" outputs. And gets penalized for it.
The fix isn't to abandon normalization. It's to build semantic word lists — domain-specific equivalences that your normalizer applies before running WER calculations — so that no model gets unfairly penalized for valid formatting choices.
For a deeper dive into the insertion paradox and semantic equivalence problem, read Zach Klebanoff's companion post: "Why your word error rate (WER) benchmark might be lying to you."
The metrics that actually work
If WER is broken, what should you use instead? The answer isn't one metric — it's a portfolio matched to your use case. Here are the four that matter in 2026.
Semantic WER
What it is: A modified version of WER that applies normalization and domain-specific word lists before calculating errors. Instead of treating every word difference as an error, Semantic WER classifies differences into categories: no penalty for variant spellings, contractions, and number format differences; a minor penalty (0.5) for single-character errors; and a major penalty (1.0) for meaning-altering substitutions.
When to use it: Voice agents and LLM-powered pipelines where meaning preservation matters more than exact word matching. If your transcripts feed into an LLM rather than being read directly by humans, Semantic WER is almost always a better evaluation signal than traditional WER.
Frameworks like Pipecat's open-source STT benchmark have begun standardizing Semantic WER as an evaluation tool, using reasoning models as judges to reduce scoring bias.
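The tiered penalty scheme can be sketched roughly as follows. The equivalence table and the single-character check are illustrative assumptions — a real Semantic WER implementation aligns words first and typically uses an LLM judge to decide whether meaning changed:

```python
# Sketch of tiered Semantic WER penalties: 0.0 for valid formatting
# variants, 0.5 for single-character errors, 1.0 for meaning-altering
# substitutions. EQUIVALENT is a tiny illustrative sample.
EQUIVALENT = {
    ("gonna", "going to"), ("don't", "do not"),
    ("healthcare", "health care"), ("alright", "all right"),
}

def single_char_diff(a: str, b: str) -> bool:
    """True when a and b differ by exactly one character edit."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long = (a, b) if len(a) < len(b) else (b, a)
    return any(short == long[:i] + long[i + 1:] for i in range(len(long)))

def word_penalty(ref_word: str, hyp_word: str) -> float:
    if ref_word == hyp_word:
        return 0.0
    pair = (ref_word, hyp_word)
    if pair in EQUIVALENT or pair[::-1] in EQUIVALENT:
        return 0.0                 # no penalty: formatting variant
    if single_char_diff(ref_word, hyp_word):
        return 0.5                 # minor penalty
    return 1.0                     # major penalty: meaning changed

print(word_penalty("healthcare", "health care"))  # 0.0
print(word_penalty("color", "colour"))            # 0.5
print(word_penalty("cadillac", "cataracts"))      # 1.0
```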
Missed Entity Rate (MER)
What it is: Instead of asking "how many words did you get wrong?", MER asks "did you get the words that actually matter?" You define the entities — drug names, credit card numbers, email addresses, product names, dates — and measure survival rate. It's the metric that directly reflects business outcomes in regulated domains.
When to use it: Medical, legal, and financial applications where specific terminology carries disproportionate importance. Also critical for voice agents where entity accuracy (account numbers, confirmation codes) drives whether the conversation succeeds or fails.
In the workshop, the team walked through the MER results on medical audio: Universal-3 Pro with Medical Mode achieved a 0% missed entity rate. Whisper large-v3 came in at 8.3% — and the entity it missed was the drug name. One metric. One word. That result tells you more about production readiness than any WER number could.
AssemblyAI's Universal-3 Pro delivers the lowest missed entity rates across categories on real-world audio — 12.0% on medical terms, 13.1% on names, and 19.6% on phone numbers — compared to the next-best competitors at 13.0%, 14.6%, and 20.1% respectively.
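At its core, MER is a survival check over a list of critical terms. A minimal sketch — exact substring matching is a simplifying assumption here; real pipelines use entity recognition plus fuzzy matching:

```python
# Sketch of Missed Entity Rate: the fraction of critical entities from
# the reference that do not survive into the transcript.
def missed_entity_rate(entities: list[str], transcript: str) -> float:
    text = transcript.lower()
    missed = [e for e in entities if e.lower() not in text]
    return len(missed) / len(entities)

entities = ["lisinopril", "metformin", "atorvastatin"]
transcript = ("patient is on metformin and atorvastatin, "
              "listening a pill discontinued last week")
print(missed_entity_rate(entities, transcript))  # 1 of 3 missed -> 0.33...
```

Note how the "listening a pill" substitution — which WER would score as just three word errors — shows up here as a missed drug name.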
LLM-as-a-Judge (LASER scoring)
What it is: LASER (LLM-based ASR Evaluation Rubric) is a published evaluation metric (Parulekar & Jyothi, EMNLP 2025) that uses a large language model to align each word in the ASR output against the reference transcription and assign per-word penalties. Unlike Semantic WER, which outputs a single number, LASER provides structured per-error feedback — making it useful for prompt optimization workflows where you need to understand why a prompt performed poorly, not just how much error there was.
When to use it: Conversational AI, clean transcription evaluation, and any workflow where you're iterating on prompts or system configurations and need diagnostic feedback alongside a quality score.
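One way to wire this up is to hand a rubric to the judge model as a prompt and parse structured per-word penalties back. The rubric wording below is an illustrative assumption, not the published LASER prompt — feed the resulting string to whatever LLM client you use:

```python
# Sketch of an LLM-as-a-judge request in the spirit of LASER: the
# judge aligns hypothesis to reference and returns per-word penalties
# as JSON, giving diagnostic feedback rather than a single number.
RUBRIC = """You are grading an ASR hypothesis against a reference transcript.
Align the hypothesis to the reference word by word. For each difference,
emit a JSON object with: ref_word, hyp_word, penalty (0.0 = formatting
variant, 0.5 = minor error, 1.0 = meaning-altering error), and a
one-line reason.

Reference: {reference}
Hypothesis: {hypothesis}
"""

def build_judge_prompt(reference: str, hypothesis: str) -> str:
    return RUBRIC.format(reference=reference, hypothesis=hypothesis)

prompt = build_judge_prompt(
    "my cadillac isn't doing great",
    "my cataracts aren't doing great",
)
print(prompt)
```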
Comparison: which metric for which use case
The right approach for most production teams: run all four in parallel against your evaluation dataset, then weight the results based on your use case. A voice agent team should prioritize Semantic WER and emission latency. A medical scribe team should look at MER first. A meeting transcription product should care most about traditional WER on normalized output, because humans are reading those transcripts directly.
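Running the portfolio in practice can be as simple as weighting each metric per use case. The weights below are illustrative assumptions, not recommendations — tune them to what your users actually experience:

```python
# Sketch: combine the four metrics (each an error rate in [0, 1],
# lower is better) into one weighted score per use case.
USE_CASE_WEIGHTS = {
    "voice_agent":    {"semantic_wer": 0.5, "laser": 0.2, "mer": 0.2, "wer": 0.1},
    "medical_scribe": {"mer": 0.6, "semantic_wer": 0.2, "laser": 0.1, "wer": 0.1},
    "meeting_notes":  {"wer": 0.5, "laser": 0.3, "semantic_wer": 0.2, "mer": 0.0},
}

def weighted_score(metrics: dict[str, float], use_case: str) -> float:
    weights = USE_CASE_WEIGHTS[use_case]
    return sum(weights[name] * metrics[name] for name in weights)

# Hypothetical vendor results from a benchmark run:
vendor_a = {"wer": 0.08, "semantic_wer": 0.03, "mer": 0.00, "laser": 0.05}
print(round(weighted_score(vendor_a, "medical_scribe"), 4))  # 0.019
```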
How to build a benchmark that actually works
The workshop didn't just diagnose the problem — it walked through the fix. Here's the step-by-step evaluation framework, with links to the open-source tools you need.
Step 1: Fix your ground truth
Before you evaluate any model, audit your reference transcripts. The Truth File Corrector (available in the AssemblyAI dashboard) surfaces every discrepancy between your human-labeled ground truth and the model output. For each disagreement, it plays back the corresponding audio segment inline. You click one of three options: the AI was right (update the truth file), the AI was wrong (keep the truth file), or neither is correct (write in what was actually spoken).
You only have to do this once. The corrected truth files can be reused for all future benchmarks, including competitive comparisons against other vendors. In AssemblyAI's testing, the majority of "insertions" flagged by WER were cases where Universal-3 Pro correctly transcribed audio that the human transcriber had missed.
Step 2: Build semantic word lists
Create a find-and-replace list for domain-specific equivalences before running WER calculations. "Healthcare" and "health care." "Alright" and "all right." "Setup" and "set up." Port this list into your evaluation pipeline so that these formatting differences don't penalize any vendor unfairly.
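A minimal sketch of applying such a list before scoring — both sides of the comparison are mapped onto a canonical form so valid variants stop registering as errors. The mappings shown are a small illustrative sample, and entries like "set up" → "setup" need case-by-case judgment:

```python
# Sketch: canonicalize domain-specific formatting variants before
# running WER, so "different-but-correct" spellings are not penalized.
import re

SEMANTIC_WORD_LIST = {
    "health care": "healthcare",
    "all right": "alright",
    "set up": "setup",   # caution: only valid when the noun is intended
}

def canonicalize(text: str) -> str:
    text = text.lower()
    for variant, canonical in SEMANTIC_WORD_LIST.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text

print(canonicalize("Health care setup is all right"))
# "healthcare setup is alright"
```

Run both the reference and the hypothesis through `canonicalize` before computing WER, and these formatting differences disappear from the error count.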
The benchmarking SDK on GitHub includes a Semantic Word List Generation tool that helps you build these lists systematically from your evaluation data.
Step 3: Evaluate with the right metric for your use case
Don't default to WER. Match the metric to the job:
• Building a medical scribe? Missed Entity Rate on drug names, conditions, and procedures. Enable Medical Mode (domain: "medical-v1") to significantly improve accuracy on clinical terminology — it corrects terms like "lisprohumalog" to the standard "Lispro (Humalog)" format automatically.
• Building a voice agent? Emission latency + streaming WER + Semantic WER. Universal-3 Pro Streaming delivers 8.14% WER across all English domains with industry-leading entity accuracy on names, phone numbers, and emails.
• Building a meeting summarizer? LLM-as-a-Judge on summary quality, not transcript WER. The transcript is an intermediate artifact — what matters is whether the summary captures the meeting accurately.
• Running a competitive evaluation? Corrected ground truth + multiple metrics in parallel. See the AssemblyAI docs evaluation guide for the full step-by-step process.
Step 4: A/B test in production
Benchmarks on static datasets tell you part of the story. Production tells you the rest. Roll a percentage of live traffic to the new model and measure what actually matters to your business: support ticket volume, task completion rate, human correction frequency, NPS.
These metrics don't show up in a WER table. But they're the metrics your users experience.
The benchmarking SDK calculates WER, Semantic WER, MER, and LASER scoring on your own audio. Compare any vendor. It's free and open-source on GitHub.
Benchmark results: where things stand in 2026
Numbers talk. Here's where the major providers land on the metrics that matter, using AssemblyAI's published benchmarks evaluated across 250+ hours of audio, 80,000+ files, and 26 datasets.
Pre-recorded (async) transcription
LibriSpeech Test Clean dataset. Standard text normalization applied. Updated February 2026.
Streaming (real-time) transcription
250+ hours of audio across 26 datasets. Emission latency = time from word spoken to word returned. Updated March 2026.
Missed entity rate by category
This is where the practical impact of model accuracy shows up — the entities that drive business outcomes are the ones most models get wrong.
Lower is better. Real-world customer audio.
A note on benchmarks: models are sometimes trained on the same public datasets used for evaluation, inflating their scores through overfitting. Always run your own evaluation on your actual audio data before making a vendor decision.
When WER still makes sense
This post makes a strong case against relying on WER alone. But WER isn't useless — and pretending otherwise would be dishonest.
WER is still valuable as a quick directional baseline, especially when you're comparing the same model across versions or audio conditions. If you're tracking how a model improves from v1 to v2 on the same test set, WER gives you a clean, comparable signal.
For read-speech benchmarks — audiobooks, news broadcasts, prepared remarks — WER correlates reasonably well with perceived quality. The audio is clean. The ground truth is reliable. The failure modes this post describes (missed backchannels, insertion penalties, normalization gaps) don't bite as hard on prepared, single-speaker content.
The problem isn't WER itself. It's using WER as the only metric, and using it on domains and audio types it was never designed for. Think of WER as a thermometer: useful for a quick health check, but you wouldn't diagnose a patient based on temperature alone.
The industry is moving toward evaluation frameworks that combine multiple metrics matched to use case. The AssemblyAI docs evaluation guide walks through the full process — from dataset preparation to ground truth correction to metric selection — so you can build an evaluation that actually reflects what your users experience.
Frequently asked questions
What is word error rate (WER) and why is it important for speech recognition?
Word error rate measures the percentage of words an automatic speech recognition system gets wrong compared to a human reference transcript, counting substitutions, insertions, and deletions. WER has been the standard evaluation metric for decades because it provides a single, comparable number across systems. However, WER has significant limitations: it treats all word errors equally (missing a drug name counts the same as missing a filler word), and it can penalize more accurate models — like AssemblyAI's Universal-3 Pro — for transcribing words human transcribers missed. Teams evaluating speech-to-text in 2026 should use WER alongside newer metrics like Semantic WER and Missed Entity Rate.
How do I reduce my speech-to-text word error rate?
WER is calculated as (Substitutions + Insertions + Deletions) / Total Words in Reference × 100. To reduce it: use a model trained on audio similar to your domain (AssemblyAI's Universal-3 Pro achieves 5.6% mean WER across 26 real-world datasets), enable domain-specific features like Medical Mode for clinical content, provide custom vocabulary or keyterm prompts for uncommon terminology, and ensure your input audio is high quality (16kHz+ sample rate, low background noise). That said, reducing WER may not always improve real-world performance — a model with slightly higher WER but 0% missed entity rate on critical terms may serve your application better than one that scores lower on WER but misses the words that matter.
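Plugging numbers into that formula — say 2 substitutions, 1 insertion, and 0 deletions against a 10-word reference:

```python
# Worked example of the WER formula with hypothetical error counts.
S, I, D, N = 2, 1, 0, 10
wer_percent = (S + I + D) / N * 100
print(wer_percent)  # 30.0
```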
What metrics measure transcription accuracy and quality beyond WER?
The four main metrics for measuring speech-to-text quality are: (1) Word Error Rate (WER), which counts raw word-level errors against a reference transcript; (2) Semantic WER, which applies normalization and domain-specific equivalences before calculating errors to remove false positives from valid formatting differences; (3) Missed Entity Rate (MER), which measures whether critical named entities — drug names, account numbers, email addresses — survived transcription; and (4) LLM-as-a-Judge scoring (LASER), which uses a language model to evaluate meaning preservation with structured per-word feedback. For production applications, the best approach combines multiple metrics: AssemblyAI's open-source benchmarking SDK calculates all four against your own audio files.
What factors affect the accuracy of speech-to-text transcripts?
Key factors include audio quality (sample rate, background noise, microphone distance), speaker characteristics (accents, speaking rate, overlapping speakers), domain vocabulary (medical terminology, legal jargon, product names), and the model itself. Environmental factors have the largest impact — a model that achieves 1.5% WER on clean audio may exceed 10% on noisy call center recordings. For domain-specific content, models with specialized modes like AssemblyAI's Medical Mode significantly improve accuracy on clinical terminology, correctly formatting drug names like "Lispro (Humalog)" that general-purpose models transcribe incorrectly.
Does real-time transcription sacrifice accuracy for speed?
There is a tradeoff, but it's narrowing. Streaming speech-to-text models operate on partial audio context, which historically meant lower accuracy than batch processing. In current benchmarks, AssemblyAI's Universal-3 Pro Streaming achieves 8.14% WER across all English domains — compared to 5.6% mean WER for the async Universal-3 Pro model. For voice agents and real-time applications, emission latency (how quickly finalized text arrives after the user stops speaking) often matters more than raw WER, since delays compound into noticeable lag. AssemblyAI's Universal Streaming model delivers 305ms median emission latency — fast enough that conversations flow naturally.
How do I set up an automated speech-to-text evaluation pipeline?
Start by building a representative evaluation dataset of at least 25 audio files that match your production use case. Generate corrected ground truth using AssemblyAI's Truth File Corrector, which surfaces disagreements between your human reference and the model output with inline audio playback. Then run evaluations using the open-source benchmarking SDK, which calculates WER, Semantic WER, Missed Entity Rate, and LASER scoring. Track results over time to detect drift, and pair quantitative metrics with qualitative analysis — side-by-side transcript comparison and LLM-as-a-judge evaluation help catch differences that numbers alone miss. The AssemblyAI evaluation docs walk through the full process.




