May 19, 2026

How speech recognition errors compound in production voice agents

Standard benchmarks measure WER on clean audio. Production voice agents don't run in clean rooms. Here's why entity accuracy — not word error rate — is the metric that predicts whether your voice agent actually works.

Devon Malloy

Staff Growth Manager

Voice Agent API

The patient says: "I need to refill my prescription. The number is RX-7704132. It's Metoprolol, 80 milligrams."

That's not a hard sentence. No heavy accent, no road noise, no other voices in the background. It's a 14-word utterance delivered clearly into a phone. Your voice agent handles millions of calls. This is not an edge case.

Now the agent mishears "RX-7704132" as "RX-7704182." One digit. That's a single wrong token in the whole utterance, which barely registers on a word error rate benchmark. The patient gets a callback about a prescription that doesn't match their record. They're asked to repeat it. They've already said it once, clearly, into a phone, and the agent still got it wrong. They hang up.

Or worse: the agent doesn't ask. It routes the wrong prescription number downstream. The clinical record picks up a one-digit error. Someone downstream acts on it.

This isn't a story about AI hallucination or prompt engineering or context window limitations. It's a story about a word error rate benchmark that told you your speech-to-text accuracy was acceptable—and was right. The benchmark was accurate. It just wasn't measuring the words that matter.

Why standard benchmarks miss the point

Word error rate is simple and defensible. Take the model's transcript, compare it with a reference transcript, count the words that were substituted, dropped, or inserted, divide by the number of words in the reference, and express the result as a percentage. A model with 5% WER on a 20-word sentence makes, on average, one error per sentence.

The problem is what that number hides.

At 5% WER, a sentence like "I need to refill my prescription, the number is RX-7704132, it's Metoprolol, 80 milligrams" comes back with about one word wrong. But not all words are equal. "I," "to," "it's," "my": these are function words. A transcription model can drop or mangle all of them and the LLM will still understand what the user said. The meaning is recoverable. But "RX-7704132," "Metoprolol," and "80 milligrams" aren't recoverable from context. They're exact values. They have to be right. A 5% WER score whose one miss lands on "RX-7704132" tells you almost nothing about whether your voice agent will work.
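To make that concrete, here is a minimal sketch of WER as word-level edit distance, run on the prescription sentence. The function name and the whitespace tokenization are assumptions made for illustration; production scoring tools normalize text more carefully. It compares two single-word errors: one on the prescription number, one on a function word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "I need to refill my prescription the number is RX-7704132 it's Metoprolol 80 milligrams"
# One digit of the prescription number misheard.
entity_error = "I need to refill my prescription the number is RX-7704182 it's Metoprolol 80 milligrams"
# A function word misheard instead ("my" -> "a").
function_word_error = "I need to refill a prescription the number is RX-7704132 it's Metoprolol 80 milligrams"

wer_entity = word_error_rate(reference, entity_error)
wer_function = word_error_rate(reference, function_word_error)
print(f"entity error:        {wer_entity:.3f}")
print(f"function-word error: {wer_function:.3f}")
```

Both transcripts get the same score, because WER counts one wrong word either way. Only the first one sends the wrong prescription number downstream.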

This is the gap that standard benchmarks don't close—and it explains a finding from the AssemblyAI Voice Agent Report (January 2026) that should have been a wake-up call for the field. Seventy-six percent of voice agent builders rated STT accuracy as the single most important non-negotiable factor in their platform decisions—above latency, above cost, above integration capabilities. Not "important." Most important. By a wide margin.

If accuracy matters that much to the people actually building production voice agents, why do most platforms treat it as a solved problem? Because the benchmark they're running doesn't test the words their users will actually say. WER on clean, conversational audio drawn from benchmark corpora isn't the same thing as accuracy on account numbers, prescription IDs, email addresses, medication names, and customer surnames—the words voice agents actually need to capture.

The data bears this out. In a head-to-head comparison between AssemblyAI and Deepgram on production-representative audio, AssemblyAI's missed entity rate was 16.7%—roughly 1 in 6 named entities (account numbers, proper names, alphanumeric strings) transcribed incorrectly. Deepgram's missed entity rate on the same audio was 25.5%. Both companies would show you a WER number and call it close. But when you move the benchmark to the thing that actually breaks voice agents—entity accuracy—the gap is significant.

| Provider | Missed entity rate | Entities captured (per 100) |
| --- | --- | --- |
| AssemblyAI | 16.7% | 83 |
| Deepgram | 25.5% | 75 |

Entity accuracy isn't a niche concern. It's the main concern. Voice agents exist to capture specific information. The transcript's job isn't to produce readable prose. It's to get the right values into the right fields, reliably, in real time. Benchmarks that don't test that capability don't measure what builders care about.
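As a rough illustration of what a missed entity rate measures, the sketch below scores a transcript against a list of reference entities. It is not the methodology behind the comparison above, which this article doesn't describe. The function name and the exact-substring matching rule are assumptions; a real evaluation would also need to normalize spoken forms, such as "eighty" versus "80".

```python
def missed_entity_rate(reference_entities: list[str], transcript: str) -> float:
    """Fraction of reference entities that do not appear verbatim in the transcript."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    # Lowercase and collapse whitespace so trivial formatting differences don't count as misses.
    normalized = " ".join(transcript.lower().split())
    missed = [
        entity for entity in reference_entities
        if " ".join(entity.lower().split()) not in normalized
    ]
    return len(missed) / len(reference_entities)


# The three exact values from the prescription example.
entities = ["RX-7704132", "Metoprolol", "80 milligrams"]
transcript = "I need to refill my prescription the number is RX-7704182 it's Metoprolol 80 milligrams"

rate = missed_entity_rate(entities, transcript)
print(f"missed entity rate: {rate:.3f}")
```

On this transcript one of the three entities is missed, so the missed entity rate is one in three, even though only one of the 14 words is wrong.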

The compounding mechanism

Here is the piece of the accuracy argument that rarely gets stated explicitly, even though it's the most important part: transcription errors don't stay isolated. They compound.

Each turn of a voice agent conversation is a function of what the STT layer heard in the previous turn. The LLM doesn't have access to the audio. It only has the transcript. When the transcript is wrong, the LLM works from a wrong premise—and responds accordingly. That response is what the user hears next.

Walk through three examples, in increasing order of consequence.

Lead qualification. The agent asks the prospect for their name. The prospect says "Joaquin Reyes." The STT layer transcribes it as "Wakeem Race." The agent continues through the qualification flow, captures the rest of the information correctly, and routes the lead to the CRM as "Wakeem Race." The rep who picks up the follow-up call asks for Wakeem. The prospect, who isn't Wakeem, either corrects the rep or hangs up. Either way, the first moment of human contact starts with the company demonstrating that it couldn't hear the prospect's name. The lead conversion rate on that call isn't high.

Appointment booking. The agent is scheduling a service call. The customer says "Wednesday at two." The STT layer transcribes "Wednesday at noon," a common phonetic near-miss between "two" and "noon" in certain accents and audio environments. The agent confirms the booking. The customer shows up at 2pm. The service team is slotted for 12pm. The customer calls back angry. The mix-up takes 20 minutes to resolve. This is a recoverable failure, but it required two additional customer contacts, a scheduling conflict, and an irritated customer who will remember that the company's "automated system" got it wrong.

Clinical intake. The patient says "I'm allergic to Lisinopril." The STT layer, running in a general-purpose mode without domain vocabulary boosting, transcribes it as "I'm allergic to Bisoprolol." Both are cardiovascular medications. The downstream EHR system records the wrong allergy. The prescribing physician, trusting the intake record, doesn't ask again. The patient receives Lisinopril.

These aren't hypothetical edge cases. They're the ordinary failure modes of deploying voice agents into the industries adopting them fastest: healthcare, financial services, contact centers, sales. The stakes vary across industries—a wrong appointment time is annoying, a wrong medication allergy is a patient safety event—but the mechanism is identical. A wrong word in turn 1 corrupts the context that turn 2 builds on. A wrong word in turn 2 that goes uncorrected corrupts turn 3. A conversation that diverges from reality in turn 1 can recover, but recovery requires the agent to recognize the error, surface it, and ask the user to repeat themselves. Fifty-five percent of voice agent users, per the same Voice Agent Report, say the experience they hate most is being asked to repeat themselves.

The accuracy problem and the experience problem are the same problem.

This isn't a marketing claim about feature completeness. It's a diagnosis of where in the stack quality actually gets decided. Get the transcript right, and everything downstream—the LLM reasoning, the tool calls, the CRM write, the clinical record—works from a correct foundation. Get it wrong, and you're compounding errors across a conversation that can't see the audio to correct itself. STT accuracy isn't one factor among many. It's the factor everything else depends on.
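One rough way to put numbers on the compounding: suppose a call only succeeds if every entity in it is captured correctly. The sketch below makes a simplifying assumption that is not from the Voice Agent Report: each entity is missed independently at the quoted missed entity rate. Real errors are likely correlated, since a noisy line hurts every entity on the call, so treat the output as an illustration rather than a prediction. The function name and the entity counts are hypothetical.

```python
def chance_all_entities_correct(missed_entity_rate: float, entities_per_call: int) -> float:
    """Probability that every entity in a call is captured, assuming independent misses."""
    if not 0.0 <= missed_entity_rate <= 1.0:
        raise ValueError("missed_entity_rate must be between 0 and 1")
    if entities_per_call < 0:
        raise ValueError("entities_per_call must be non-negative")
    return (1.0 - missed_entity_rate) ** entities_per_call


# Missed entity rates are the ones quoted in this article.
# Entity counts per call are illustrative.
for rate in (0.167, 0.255):
    for count in (1, 3, 5):
        p = chance_all_entities_correct(rate, count)
        print(f"missed rate {rate:.1%}, {count} entities: {p:.1%} of calls fully correct")
```

Under these assumptions, a three-entity call like the prescription refill comes out fully correct roughly 58% of the time at a 16.7% missed entity rate and roughly 41% of the time at 25.5%. The per-entity gap widens as the number of values the agent has to capture grows.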

The specific words that break most models

Five categories concentrate voice agent accuracy failures disproportionately. Understanding them is the first step to asking the right questions when evaluating a platform.

Proper nouns and names. The most common entities voice agents are deployed to capture are also the most unpredictable for generic STT models. Customer names, patient names, business names, and place names sit outside the distribution of benchmark training data. A model trained on clean LibriSpeech audio has never heard "Kawamoto" or "Przybylski" or "DeShawn." It will approximate. The approximation will be wrong. Even with speaker diarization labeling who said what, the words themselves still have to be right.

Alphanumeric strings. Account numbers, prescription IDs, confirmation codes, order numbers—these are the backbone of transactional voice agent use cases, and they're unusually brittle. Acoustic similarity between letters (B/V/P, D/T, M/N) and between numbers (5/9, 4/14) creates confusion that phonetic models struggle with. The absence of semantic context—a string of digits and letters has no meaning to fall back on—means there's no recovery path. The model either gets it right or it doesn't.

Domain terminology. Medication names, legal terms, financial product names, industry jargon—every deployment context has vocabulary a general-purpose model has seen rarely or never. The ability to prompt the STT layer with domain context, or to boost specific keyterms, separates a model that adapts to a deployment context from one doing its best with what it knows.

Accented and multilingual speech. The Voice Agent Report documented significant concerns about accuracy degradation on non-native English speech and non-English languages. A voice agent deployed in a contact center doesn't choose its callers' accents. AssemblyAI's Universal-3 supports six languages with equivalent accuracy across all six—not a model tuned for American English that "also supports" five others.

Backchannels and disfluencies. "Uh-huh," "yeah," "ok," "mm-hmm," "right"—these vocalizations don't carry semantic content, but they carry turn-taking signals. A model that treats every "yeah" as the end of the user's turn will interrupt. A model that ignores silence because it's waiting for more words will create unnatural pauses. Turn detection is a transcription accuracy problem, not just a streaming architecture problem.
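The alphanumeric case is worth a concrete look, because the misheard string is usually still well-formed. The sketch below assumes a hypothetical ID format inferred from the example "RX-7704132" (the prefix "RX-" followed by seven digits). It shows that a downstream format check accepts the wrong number just as readily as the right one.

```python
import re

# Hypothetical format: "RX-" followed by exactly seven digits.
# This pattern is an assumption for illustration, not a real prescription ID scheme.
RX_PATTERN = re.compile(r"RX-\d{7}")


def looks_valid(candidate: str) -> bool:
    """True if the string matches the assumed ID format."""
    return RX_PATTERN.fullmatch(candidate) is not None


spoken = "RX-7704132"
heard = "RX-7704182"

# Count the character positions where the two strings differ.
differing_positions = sum(a != b for a, b in zip(spoken, heard))

print(looks_valid(spoken), looks_valid(heard))  # both pass the format check
print(spoken == heard, differing_positions)     # but they are different IDs
```

Format validation catches malformed strings. It doesn't catch a well-formed ID that differs from the one the caller actually said, which is why the transcript has to be right the first time.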

What "purpose-built" actually means in this context

The phrase "purpose-built for voice agents" gets used loosely. Here's what it means in practice, and why it matters for the accuracy problem.

A general-purpose STT model adapted for streaming can produce low-latency transcripts. That's necessary but not sufficient. The failure modes described above—proper nouns, alphanumeric strings, domain terminology, accented speech—need a model designed from the ground up for the constraints of voice agent deployments, not retrofitted for them afterward. See our real-time speech recognition primer for how the streaming architecture differs from batch.

Universal-3 Pro Streaming ships with three specific capabilities that address the accuracy problem directly. First, promptability: tell the model what domain it's operating in, and it adjusts its priors for domain-specific vocabulary before a single word is spoken. Second, keyterms: boost specific words or phrases—medication names, account ID formats, product names—so the model recognizes them reliably even in degraded audio. Third, turn detection that distinguishes a long pause mid-sentence from an end-of-turn signal—so the model doesn't interrupt a patient trying to read out a twelve-digit prescription number. The streaming docs walk through how to wire these into a live session.

These aren't convenience features. They're the mechanisms that close the gap between "accurate on benchmarks" and "accurate on the words your users will actually say." The benchmark doesn't know your users are going to say "Metoprolol." The model, if it's been built for production deployment, can be told.
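As a sketch of what "telling the model" could look like on the application side, the snippet below assembles a domain prompt and a keyterm list into a session payload. The `SessionConfig` class and the `prompt` and `keyterms` field names are placeholders invented for illustration, not the documented streaming API; check the streaming docs for the actual parameter names and connection flow. No network call is made here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionConfig:
    """Hypothetical container for per-session STT hints (names are illustrative only)."""
    domain_prompt: str
    keyterms: list[str] = field(default_factory=list)

    def to_payload(self) -> str:
        # Strip whitespace, drop empty entries, and de-duplicate while keeping order.
        cleaned = (term.strip() for term in self.keyterms)
        unique = list(dict.fromkeys(term for term in cleaned if term))
        # "prompt" and "keyterms" are placeholder keys, not confirmed API fields.
        return json.dumps({"prompt": self.domain_prompt, "keyterms": unique})


config = SessionConfig(
    domain_prompt="Pharmacy refill line. Callers read out prescription numbers and medication names.",
    keyterms=["Metoprolol", "Lisinopril", "Bisoprolol", " Metoprolol ", ""],
)
payload = config.to_payload()
print(payload)
```

The vocabulary list lives in the deployment, not in the model. A pharmacy line and a bank's support line would send different prompts and keyterms to the same STT layer.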

The right question to ask

Return to the prescription refill. RX-7704132. Metoprolol. 80 milligrams.

The voice agents being deployed into healthcare, financial services, contact centers, and sales don't handle toy tasks. They capture information that downstream systems act on, that clinical records are built from, that financial transactions are confirmed by. The standard for accuracy in these contexts isn't "low word error rate on clean audio." It's "correct on the specific values that the agent exists to capture." For teams building in regulated environments, AssemblyAI is a business associate under HIPAA and signs a Business Associate Addendum (BAA) for customers processing PHI.

Seventy-six percent of the builders actually shipping voice agents into production have already figured this out. They rank accuracy first—not latency, not cost, not the TTS voice quality—because they've seen what happens when the transcript is wrong. They've watched the compounding failures accumulate across conversation turns. They've gotten the support tickets. For a broader look at the architecture decisions that go into a production deployment, see our voice agents guide and the meeting intelligence and notetaker solutions page.

The right question to ask any voice agent platform isn't "what's your WER?" It's "what's your missed entity rate on the words my users will actually say?" Those are different questions. The first one has a comfortable answer. The second one has a correct one.

Frequently asked questions

What is entity accuracy in speech-to-text?

Entity accuracy measures how often a transcription model correctly captures named entities—proper names, account numbers, alphanumeric IDs, dates, medication names, and other specific values. Unlike word error rate, which weights every word equally, entity accuracy focuses on the tokens that downstream systems depend on. For voice agents, it's a much stronger predictor of production quality than aggregate WER.

How does word error rate differ from missed entity rate?

Word error rate (WER) counts every word substitution, deletion, and insertion equally across an entire transcript. Missed entity rate counts only errors on named entities—names, numbers, IDs, domain terms. A model can hit a low WER by getting function words right while still missing 1 in 4 of the entities a voice agent actually needs. The two metrics often disagree, and missed entity rate is the one that maps to real-world agent failures.

Why do voice agent errors compound across turns?

The LLM driving a voice agent never sees the audio—only the transcript. A wrong word in turn 1 enters the conversation context and shapes the agent's response in turn 2, which shapes turn 3, and so on. Without explicit error detection, the conversation drifts further from what the user actually said. The only recovery path is to ask the user to repeat—which 55% of voice agent users say is the experience they hate most.

What can I do to improve accuracy on names and account numbers?

Use a streaming STT model with keyterms boosting and domain promptability. Pass the model a list of expected proper nouns, ID formats, and domain vocabulary before the session starts so it weights those tokens more heavily during decoding. Universal-3 Pro Streaming exposes both controls through the streaming API.

Is AssemblyAI a good fit for healthcare voice agents?

AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing PHI. Combined with keyterms boosting for medication names and clinical terminology, that makes the platform a fit for ambient scribe, intake, and triage workloads. See pricing and contact sales to execute a BAA.

Where can I see streaming STT pricing and limits?

The pricing page lists per-minute rates for streaming and async transcription. The developer docs cover rate limits and concurrency, and include the voice agents integration guide.
