Transcription accuracy vs. transcription quality: why the gap matters
Your speech-to-text model has a great word error rate — so why are users still complaining? Because WER doesn't capture speaker mislabeling, formatting issues, or entity errors that shape how accurate a transcript feels.



Your speech-to-text model has a great word error rate. Your benchmarks look solid. So why are users still complaining that the transcription "feels wrong"?
Because WER doesn't measure what customers care about.
The gap between accuracy and perception
Word error rate is the industry standard for measuring speech-to-text accuracy, and for good reason—it's quantifiable, comparable, and well-understood. But here's the problem: WER measures whether the right words appear in the transcript. It says nothing about whether the transcript looks right to the person reading it.
Think about it. A transcript can have a near-perfect WER and still feel broken if speakers are mislabeled, if stray audio tags clutter the output, or if punctuation is inconsistent. Conversely, a transcript with a slightly higher WER but clean formatting, accurate speaker labels, and natural paragraph breaks will feel more reliable to users.
This is the perceived quality gap—and it rarely shows up in published benchmarks.
The data backs this up. According to AssemblyAI's Voice Agent Report, 55% of end users cite "having to repeat themselves" as their top frustration with voice agents, and 45% cite "frequently misheard words"—even though 82.5% of builders feel confident in their ability to build. On the builder side, 52.5% name transcription accuracy as their single biggest challenge. The gap between builder confidence and user frustration is the perceived quality gap made measurable: teams think they've solved accuracy, but users disagree.
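To make concrete what WER does and does not count, here is a minimal sketch of the standard calculation: word-level edit distance divided by the number of reference words. The function name, the lowercase-and-split normalization, and the sample sentences are illustrative assumptions, not any provider's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the invoice to dana by friday"
wrong_name = "please send the invoice to donna by friday"
wrong_article = "please send an invoice to dana by friday"

wer_name = word_error_rate(reference, wrong_name)
wer_article = word_error_rate(reference, wrong_article)
print(wer_name, wer_article)
```

Both hypotheses score the same, because each contains exactly one substituted word. The metric has no notion that a wrong name is more damaging to the reader than a wrong article.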
Audio tags: when accuracy backfires
Here's a concrete example of how doing the "right thing" can make transcription quality worse.
Many speech-to-text systems insert audio event tags into transcripts—things like [MUSIC], [NOISE], or [LAUGHTER]. From a technical standpoint, this is accurate. The model detected non-speech audio and labeled it. WER doesn't penalize you for it. If anything, it's a feature.
But when we looked at how users responded to transcripts with audio tags, the picture was different. Users reported that tagged transcripts felt less accurate than untagged ones, even when the underlying words were identical. A [MUSIC] tag dropped into the middle of a meeting transcript made people doubt everything around it. "If the system is picking up background noise, how do I know the words are right?"
It's not a rational response, but it's a real one, and user perception drives outcomes like NPS scores and renewal rates.
So we removed audio tags from transcripts by default. The WER didn't change. The perceived quality went up.
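If your provider still emits tags, a post-processing step along these lines is one way to get the same effect. This is a sketch under assumptions: it treats any bracketed, all-caps token as an audio event tag, which may not match the tag format your system actually produces.

```python
import re

# Assumed tag format: bracketed, all-caps labels such as [MUSIC] or [NOISE].
AUDIO_TAG = re.compile(r"\[[A-Z_ ]+\]")


def strip_audio_tags(text: str) -> str:
    """Remove audio event tags and collapse the whitespace they leave behind."""
    without_tags = AUDIO_TAG.sub(" ", text)
    return " ".join(without_tags.split())


tagged = "Thanks everyone for joining [MUSIC] let's start with the roadmap [LAUGHTER]"
clean = strip_audio_tags(tagged)
print(clean)
```

The spoken words come through untouched. Only the markers that made readers second-guess the transcript are gone.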
Speaker diarization: the trust multiplier
Speaker mislabeling is a far more damaging perception problem than audio tags.
Consider a contact center running thousands of calls through a speech-to-text pipeline every day. Each call gets transcribed with speaker diarization—Speaker 1 is the agent, Speaker 2 is the customer. Downstream systems use those labels to analyze agent performance, flag compliance issues, and generate call summaries.
Now imagine Speaker 1 and Speaker 2 get swapped on 3% of calls.
From a WER perspective, nothing went wrong. Every word is correct. But from the customer's perspective, their analytics are corrupted. Agent performance scores are unreliable. Compliance flags fire on the wrong speaker. The entire pipeline's credibility is undermined by a problem that WER can't even see.
We've worked with enterprise customers pushing multichannel speaker diarization to its limits in production—hundreds of concurrent sessions, variable audio quality, speakers talking over each other. At that scale, diarization accuracy isn't a nice-to-have. It's a trust requirement. One mislabeled speaker in a compliance-critical transcript doesn't just create an error. It destroys confidence in the system.
This is why speech-to-text accuracy can't be reduced to a single number. The accuracy that matters is contextual—right words and right speaker, delivered in a structure users can trust.
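The swapped-speaker scenario is easy to reproduce on a toy call. The sketch below uses made-up turns and a deliberately simple position-by-position comparison, which only works here because the word sequences are identical. Real diarization scoring aligns words and time ranges first.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker label, utterance text)

reference: List[Turn] = [
    ("agent", "thank you for calling how can i help"),
    ("customer", "i want to cancel my subscription"),
    ("agent", "i can help with that"),
]

# Same words, but the two speaker labels are swapped for the whole call.
swapped = {"agent": "customer", "customer": "agent"}
hypothesis: List[Turn] = [(swapped[speaker], text) for speaker, text in reference]


def words_with_speakers(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Flatten turns into (speaker, word) pairs."""
    return [(speaker, word) for speaker, text in turns for word in text.split()]


ref_words = words_with_speakers(reference)
hyp_words = words_with_speakers(hypothesis)

word_matches = sum(r[1] == h[1] for r, h in zip(ref_words, hyp_words))
speaker_matches = sum(r[0] == h[0] for r, h in zip(ref_words, hyp_words))

word_accuracy = word_matches / len(ref_words)
speaker_accuracy = speaker_matches / len(ref_words)
print(word_accuracy, speaker_accuracy)
```

Every word is right and every attribution is wrong. A word-only metric reports a perfect call, which is why speaker attribution needs its own number.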
The hard problem of streaming corrections
Streaming speech-to-text introduces a unique challenge for perceived quality: the output is live.
When you're transcribing pre-recorded audio, you have the luxury of processing the entire file before returning results. You can recluster speakers at the end, clean up edge cases, and deliver a polished final transcript. With streaming, you're committing to output in real time—sub-300ms latency for Universal-3 Pro Streaming—which means you sometimes have to make decisions with incomplete information.
Speaker assignment is a perfect example. Early in a conversation, the model hasn't heard enough audio to confidently distinguish speakers. It makes its best guess and moves on. Later, with more context, it might realize that the initial speaker assignments need correction.
The brute-force solution is end-of-stream reclustering: wait until the conversation ends, then reprocess all speaker labels with full context. That works for some use cases. But for applications where transcripts are consumed in real time—live agent assist, real-time coaching, compliance monitoring—waiting until the end isn't an option. Users have already seen the initial labels. A late correction feels like an error, even when it's an improvement.
So we're developing a different approach: speaker revision messages that arrive shortly after the initial output. Each one is a delayed correction that updates speaker labels while the conversation is still active, rather than waiting for the end. Recent streaming diarization improvements have already delivered measurable gains: a 56% reduction in phantom speaker detections, word-level speaker labels (rather than utterance-level), and fewer false alarm speakers across production workloads. This is a significant engineering investment that doesn't improve WER at all. What it improves is the user experience: the transcript stays accurate as it's being read, not just after it's finished.
That's the kind of investment you make when you understand that transcription quality is a perception problem, not just a measurement problem.
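On the consuming side, a client has to be able to apply a late correction without re-rendering the whole conversation. The sketch below shows one way that could look. The message shape (`SpeakerRevision` mapping word ids to corrected labels) and every class name are hypothetical, invented for illustration. They are not AssemblyAI's streaming API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Word:
    word_id: int
    text: str
    speaker: str


@dataclass
class SpeakerRevision:
    # Hypothetical message: word id -> corrected speaker label.
    corrections: Dict[int, str]


class LiveTranscript:
    def __init__(self) -> None:
        self._words: Dict[int, Word] = {}

    def add_word(self, word: Word) -> None:
        self._words[word.word_id] = word

    def apply_revision(self, revision: SpeakerRevision) -> int:
        """Update labels in place and return how many words actually changed."""
        changed = 0
        for word_id, speaker in revision.corrections.items():
            word = self._words.get(word_id)
            if word is not None and word.speaker != speaker:
                word.speaker = speaker
                changed += 1
        return changed

    def render(self) -> List[str]:
        """Group consecutive words by speaker into display lines."""
        lines: List[str] = []
        current: Optional[str] = None
        for word_id in sorted(self._words):
            word = self._words[word_id]
            if word.speaker != current:
                lines.append(f"{word.speaker}: {word.text}")
                current = word.speaker
            else:
                lines[-1] += f" {word.text}"
        return lines


transcript = LiveTranscript()
initial = [("hi", "A"), ("there", "A"), ("hello", "A"), ("thanks", "B")]
for i, (text, speaker) in enumerate(initial):
    transcript.add_word(Word(i, text, speaker))

before = transcript.render()
changed = transcript.apply_revision(SpeakerRevision({2: "B"}))
after = transcript.render()
print(before, changed, after)
```

Keying corrections by word id rather than by utterance means only the affected words move. That keeps a revision visually small, which matters when the reader has already seen the first version.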
What "quality" means in production
Here's a framework for thinking about speech-to-text accuracy that goes beyond WER:
Word-level accuracy is the foundation. You need the right words. Across 26 real-world datasets, Universal-3 Pro achieves a 94.1% word accuracy rate, compared to 93.5% for ElevenLabs Scribe, 92.5% for Microsoft, 92.4% for both OpenAI and Amazon, and 92.1% for Deepgram Nova-3. That's table stakes for serious applications, and the gap between providers is wider than it looks: a 2 percentage point difference in word accuracy means roughly 20,000 more wrong words per million words transcribed, and at production scale that adds up across millions of utterances.
Entity accuracy is where differentiation starts. Getting names, numbers, email addresses, and domain-specific terms right matters disproportionately. A transcript that nails common words but mangles a customer's name or a dollar amount is worse than one with a slightly higher overall error rate that gets the important things right. This is exactly what Missed Entity Rate (MER) measures—and the gaps are significant. Universal-3 Pro achieves a 13.1% MER on names (vs. 15.3–19.4% for competitors), 12.0% on medical terms (vs. 15.3–18.4%), and 34.3% on emails and URLs (vs. 62–72% for every other provider tested). For applications where getting a customer's name or email right on the first try determines whether the interaction succeeds, entity accuracy is the metric that matters.
Structural accuracy is the perception layer. Are speakers correctly labeled? Is punctuation natural? Are sentence boundaries in the right places? Does the transcript read like something a human would produce? This is what determines whether users trust the output.
Temporal accuracy matters for streaming. Are corrections timely enough that users don't notice them? Does the transcript stay coherent as it's being generated? Real-time applications add a fourth dimension to quality that batch processing doesn't have to worry about.
Most transcription quality best practices focus on the first two layers. But production applications—especially those where humans read the transcripts—live or die on layers three and four. For a complete walkthrough of how to evaluate speech recognition models across all four layers, including ground truth correction and metric selection, see our evaluation guide.
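The entity layer is simple to approximate on your own data. The sketch below assumes one plausible definition of Missed Entity Rate: the share of reference entities that do not appear verbatim in the hypothesis. The exact definition behind the figures above is not given here, so treat the function, its normalization, and the sample data as assumptions.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is not layout-sensitive."""
    return " ".join(text.lower().split())


def missed_entity_rate(reference_entities: List[str], hypothesis: str) -> float:
    """Fraction of reference entities absent from the hypothesis transcript."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    # Pad with spaces so entities only match on whole-word boundaries.
    padded = f" {_normalize(hypothesis)} "
    missed = [e for e in reference_entities if f" {_normalize(e)} " not in padded]
    return len(missed) / len(reference_entities)


entities = ["dana okafor", "dana@example.com", "4,200 dollars"]
hypothesis = "the customer donna okafor at dana@example.com disputed 4,200 dollars"

mer = missed_entity_rate(entities, hypothesis)
print(round(mer, 3))
```

One wrong word out of ten is a modest WER, but it is one missed entity out of three, and it is the customer's name.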
Why this is hard to benchmark
The challenge with perceived quality is that it resists standardization. You can publish a WER number. You can't publish a "perceived quality" number.
That doesn't mean you can't measure it. User satisfaction surveys, support ticket categorization, A/B testing of formatting choices, monitoring of downstream pipeline accuracy—these are all proxies for perceived quality. They're harder to run than a benchmark, but they tell you something a benchmark can't.
The industry is starting to build better tools for this. Semantic WER—an emerging metric that uses an LLM as a judge to evaluate whether meaning is preserved, rather than checking word-for-word accuracy—is one promising direction. Instead of penalizing a model for transcribing "cannot" instead of "can't," Semantic WER asks whether the intent was preserved. Combined with Missed Entity Rate and domain-specific keyword accuracy, these newer metrics get closer to measuring what users actually experience. We've written extensively about why WER alone is insufficient and what to use instead.
Improving transcription quality in production comes down to closing the loop between raw model output and user experience. Measure what your users see, not just what your model produces. If they're complaining about readability, formatting, or speaker accuracy, a strong WER score doesn't settle the question. You still have a quality problem.
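The "cannot" versus "can't" case can be shown without an LLM. The sketch below is not Semantic WER: a tiny hand-written equivalence table stands in for the judge, purely to show how the score shifts once meaning-preserving differences stop counting. The table, function names, and sentences are all illustrative assumptions.

```python
from typing import Callable, List

# Toy stand-in for the LLM judge: forms we choose to treat as equivalent.
EQUIVALENT = {"can't": "cannot", "won't": "will not", "it's": "it is"}


def strict_tokens(text: str) -> List[str]:
    return text.lower().split()


def lenient_tokens(text: str) -> List[str]:
    """Expand known contractions so equivalent forms compare equal."""
    words: List[str] = []
    for token in text.lower().split():
        words.extend(EQUIVALENT.get(token, token).split())
    return words


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str,
               tokenize: Callable[[str], List[str]]) -> float:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    return edit_distance(ref, hyp) / len(ref)


reference = "i can't log in to my account"
hypothesis = "i cannot log in to my account"

strict = error_rate(reference, hypothesis, strict_tokens)
lenient = error_rate(reference, hypothesis, lenient_tokens)
print(round(strict, 3), lenient)
```

A lookup table only covers the cases someone thought to list. The appeal of an LLM judge is that it can make the same call on paraphrases nobody anticipated.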
Perceived quality is the next competitive frontier
Here's where the industry is heading: as speech-to-text WER converges across providers—and it will, because the underlying research is increasingly shared—perceived quality becomes the primary differentiator. Two providers with near-identical WER will deliver radically different user experiences depending on how they handle formatting, speaker attribution, and real-time corrections.
This means the evaluation criteria for speech-to-text are shifting. Teams that only benchmark on WER are optimizing for a metric that's becoming commoditized. The teams building durable products are asking different questions: Do users trust what they see? Does the output hold up under real-time consumption? Can the system correct itself without breaking the reader's confidence?
If you're building an application where humans read transcripts—a contact center agent reviewing a live summary, or a developer building a voice agent pipeline—the question isn't just "how accurate is the speech-to-text?" It's "how accurate does it feel?"
That gap between raw accuracy and perceived quality is where the next generation of Voice AI products will be won or lost.
Frequently asked questions
What is perceived transcription quality and how does it differ from WER?
Perceived transcription quality measures how accurate a transcript feels to the person reading it—factoring in speaker labels, formatting, punctuation, and entity accuracy—rather than just word-for-word correctness. WER only counts substitutions, insertions, and deletions against a reference transcript. A transcript with perfect WER can still feel broken if speakers are mislabeled or punctuation is inconsistent, while a slightly higher-WER transcript with clean formatting and correct names often feels more reliable.
Why do users complain about transcription quality even when WER is low?
Because WER treats all words equally and ignores structural elements users care about. A misheard filler word and a mangled customer name both count as one error in WER, but users notice the name error far more. Speaker mislabeling, stray audio tags, and inconsistent punctuation also degrade perceived quality without affecting WER at all. AssemblyAI's Voice Agent Report found that 55% of end users cite "having to repeat themselves" as their top frustration—a perception problem, not a WER problem.
How do you measure transcription quality beyond word error rate?
Newer metrics like Semantic WER (which uses an LLM judge to evaluate meaning preservation), Missed Entity Rate (which tracks accuracy on names, numbers, emails, and domain-specific terms), and domain-specific keyword accuracy get closer to what users experience. AssemblyAI's Universal-3 Pro achieves a 13.1% MER on names (vs. 15.3–19.4% for competitors) and 34.3% on emails and URLs (vs. 62–72%), roughly half the competitors' miss rate on emails and URLs.
What is speaker diarization and why does it affect perceived transcription quality?
Speaker diarization identifies "who spoke when" in multi-speaker audio, assigning labels like Speaker A and Speaker B throughout a transcript. When diarization is wrong, even on just 3% of calls, it corrupts downstream analytics, compliance flags, and call summaries. Users lose trust in the entire system because the errors are visible and disruptive, even though WER stays the same. AssemblyAI's diarization achieves a 2.9% speaker count error rate, with recent streaming improvements reducing phantom speaker detections by 56%.
How does streaming transcription handle speaker label accuracy in real time?
Streaming diarization must assign speaker labels as audio arrives, without the full-conversation context that batch processing can use to recluster speakers at the end. Early in a conversation, limited audio context means speaker assignments may be less stable. AssemblyAI is developing speaker revision messages, which are delayed corrections that update labels while the conversation is still active, alongside recent improvements such as word-level speaker labeling and fewer false alarm speakers across production workloads.
What is Semantic WER and how does it improve speech-to-text evaluation?
Semantic WER uses a reasoning model (like Claude) to evaluate whether a transcript preserves the meaning of what was said, rather than checking exact word matches. "Cannot" vs. "can't" registers as an error in traditional WER but scores identically in Semantic WER because the meaning is preserved. This matters especially for voice agent pipelines where transcripts feed directly into LLMs—the downstream model doesn't care about exact wording, only intent. Combined with Missed Entity Rate, Semantic WER provides a more complete picture of real-world transcription quality.