Insights & Use Cases
March 23, 2026

Why your word error rate (WER) benchmark might be lying to you

WER benchmarks can mislead you. Learn why standard transcription metrics often favor older models, and how AssemblyAI rethought evaluation with Universal-3 Pro.

Zackary Klebanoff
Applied AI Lead

At AssemblyAI, we've spent years helping customers evaluate speech-to-text performance. Word Error Rate (WER) has long been the gold standard metric for this — and for good reason. It's simple, reproducible, and gives you a number you can compare across vendors.

But something unexpected happened when we launched Universal-3 Pro, our most advanced transcription model to date. Some customers came back to us saying their benchmarks showed the new model performing worse than our older models. That didn't match anything we were seeing internally.

So we dug in. What we found changed how we think about transcription evaluation entirely.

How traditional WER benchmarking works

Before we get into the issue, let's level-set on the standard evaluation process — because understanding how it works is key to understanding what's breaking down.

Here's the typical workflow:

  1. Collect 10–20 audio files representative of your core use cases.
  2. Submit them to a human transcription provider to get a high-quality ground truth (or "truth file").
  3. Submit the same audio files to multiple speech-to-text vendors to get AI-generated transcriptions.
  4. Normalize both sets of transcriptions using the Whisper Normalizer — an open-source Python package that lowercases everything, strips punctuation, and maps semantically equivalent words to a common form (e.g., don't → do not).
  5. Run WER evaluation using an open-source Python package like jiwer, which compares your normalized truth file against your normalized AI predictions.

WER is the number of word-level errors (substitutions, deletions, and insertions) divided by the total number of words in the reference transcript. Lower is better.
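
In practice you'd run steps 4 and 5 with the Whisper Normalizer and jiwer, but a from-scratch sketch makes the arithmetic explicit. The normalization below is deliberately cruder than the real normalizer (lowercase and punctuation stripping only, no contraction mapping):

```python
import re

def normalize(text):
    # Lowercase and strip punctuation -- a rough stand-in for the
    # Whisper Normalizer's fuller rule set (which also maps e.g. don't -> do not).
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("My Cadillac isn't doing great",
          "My cataracts aren't doing great"))  # 2 errors / 5 words = 0.4
```

The same number comes out of `jiwer.wer(truth, prediction)` once both sides are normalized; the point of writing it out is to see that WER is just an edit distance over words.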

This process has been the primary evaluation method for the better part of a decade. It's how customers have compared vendors, justified switching decisions, and tracked model improvements over time.

The three types of WER errors

When you run a WER evaluation, every error falls into one of three categories:

  • Insertions (Hallucinations): Words the AI added that weren't in the human transcript.
  • Deletions: Words present in the human transcript that the AI missed entirely.
  • Substitutions: Words where the AI transcribed one word instead of another. For example, a human said "My Cadillac isn't doing great" but the AI transcribed "My cataracts aren't doing great." Same number of words — very different meaning.
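
Tools like jiwer report these three counts alongside the overall rate. As a rough sketch, you can get the same breakdown from Python's standard library using `difflib` (whose alignment is heuristic rather than a guaranteed minimum-edit-distance alignment, so counts can differ slightly from jiwer's on messier pairs):

```python
from difflib import SequenceMatcher

def error_counts(reference, hypothesis):
    # Align the two word sequences and bucket each mismatch as a
    # substitution, insertion, or deletion.
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    subs = ins = dels = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "replace":
            overlap = min(i2 - i1, j2 - j1)
            subs += overlap
            dels += (i2 - i1) - overlap  # extra reference words
            ins += (j2 - j1) - overlap   # extra hypothesis words
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
    return {"substitutions": subs, "insertions": ins, "deletions": dels}

print(error_counts("my cadillac isn't doing great",
                   "my cataracts aren't doing great"))
# {'substitutions': 2, 'insertions': 0, 'deletions': 0}
```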

Substitutions are particularly problematic downstream, because if you're feeding transcriptions into a large language model or any AI-based analytics pipeline, a substitution like Cadillac → cataracts doesn't just create a transcription error. It corrupts the contextual information entirely.

That's why one of our primary focuses when building Universal-3 Pro was reducing these kinds of errors — specifically around what we call Missed Entity Rate: how accurately the model transcribes rare and high-value words like proper nouns, company names, medication names, alphanumerics, phone numbers, dates, and addresses.
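
The post doesn't give AssemblyAI's exact formula for Missed Entity Rate, but as a rough illustration you could measure the fraction of labeled entity mentions from the ground truth that don't survive verbatim into the hypothesis (the entity list and example strings below are made up for the sketch):

```python
def missed_entity_rate(reference, hypothesis, entities):
    # Illustrative only: counts how many known entity strings present in
    # the ground truth fail to appear verbatim in the hypothesis.  The
    # entity list would come from labeling or NER, not shown here.
    ref, hyp = reference.lower(), hypothesis.lower()
    present = [e for e in entities if e.lower() in ref]
    missed = [e for e in present if e.lower() not in hyp]
    return len(missed) / len(present) if present else 0.0

rate = missed_entity_rate(
    "Take 20mg of atorvastatin, call Dr. Okafor at 555-0142",
    "Take 20mg of a torva statin, call Dr. O'Connor at 555-0142",
    entities=["atorvastatin", "Dr. Okafor", "555-0142", "20mg"],
)
print(rate)  # 2 of 4 entities missed -> 0.5
```

A per-entity metric like this surfaces exactly the failures that a blended WER number averages away: a single garbled medication name barely moves WER but can invalidate the whole transcript downstream.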

See how Universal-3 Pro performs on your audio

Create a free account and run Universal-3 Pro on your own files. Measure substitutions, insertions, and deletions against your ground truth before you benchmark anything else.

Sign up free

What's actually happening with Universal-3 Pro

So back to the mystery. Customers were seeing worse WER numbers with our new model. We were confident the model was better. Who was right?

When we started reviewing customer benchmarks in detail, we noticed something striking: a disproportionate number of insertions. Not deletions, not substitutions — insertions. The new model was inserting words that weren't in the human truth files.

So we did something simple: we went back and actually listened to those insertions in the original audio.

The vast majority of them were correct.

The AI was transcribing words that were genuinely spoken in the audio — words that the human transcriptionist had missed. On difficult audio files (noisy environments, accented speech, overlapping speakers, fast talkers), our model was frequently outperforming the human-labeled transcription it was being evaluated against.

This creates a fundamental problem with the traditional WER evaluation framework: if your ground truth is wrong, your evaluation is wrong. And the better your AI model gets, the more this problem is exposed.

A second issue: Semantic equivalence

There's a related but distinct issue we also uncovered around substitutions.

Because Universal-3 Pro has a large language model decoder at its core, it has a much stronger command of grammar and formatting than traditional transcription models. In practice, this means it sometimes transcribes things correctly in ways that differ from how a human transcriptionist formatted the same content.

A few common examples: a human transcriptionist might write "all right" (two words) while our model writes "alright" (one word). Or a human writes "health care" while our model writes "healthcare." A third case is compound words: a human might write "set up" as two words while our model writes it as "setup." The meaning is identical in each case, but in a WER evaluation, every one registers as an error.

The Whisper Normalizer handles some of this automatically (the don't/do not case, for example), but it can't cover every domain-specific or stylistic variation that might appear in your audio.

The tool we built to fix this

We recognize that manually auditing every error in your truth files is unreasonable, especially when you're dealing with hour-long recordings. So we built a tool to make this fast and systematic.

Here's what it does:

1. Truth File Correction

You submit your existing truth files and your audio. The tool runs a transcription with Universal-3 Pro and surfaces every discrepancy between the AI output and your human truth file — one by one, with the corresponding audio segment playable inline.

For each discrepancy, you click one of three options:

  • AssemblyAI got this right — update the truth file to match.
  • AssemblyAI got this wrong — keep the truth file as-is.
  • ✏️ Neither is correct — write in what was actually spoken. (You can also label segments as [inaudible] where the audio is genuinely unclear.)

As you work through the errors, the WER updates live in the tool so you can see the impact in real time. You only have to do this once — and the corrected truth files can be reused for all future benchmark evaluations, including competitive comparisons against other vendors.
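
The tool itself isn't open source in this post, but the three-way review logic is simple enough to sketch. Assuming a hypothetical discrepancy record (the field and option names below are illustrative, not the tool's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Discrepancy:
    truth_word: str   # what the human truth file says
    ai_word: str      # what the model transcribed
    # resolution: "ai_right" | "truth_right" | hand-corrected text
    resolution: str

def apply_resolution(d):
    # Mirrors the three review options: accept the AI output, keep the
    # truth file, or write in what was actually spoken (including
    # labeling genuinely unclear segments as "[inaudible]").
    if d.resolution == "ai_right":
        return d.ai_word
    if d.resolution == "truth_right":
        return d.truth_word
    return d.resolution  # hand-corrected text

fixes = [
    Discrepancy("cataracts", "Cadillac", "ai_right"),
    Discrepancy("healthcare", "health care", "truth_right"),
    Discrepancy("mumbled", "rumbled", "[inaudible]"),
]
print([apply_resolution(d) for d in fixes])
# ['Cadillac', 'healthcare', '[inaudible]']
```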

2. Semantic Word List Generation

For the formatting and equivalence issues described above, the tool lets you build a semantic word list — essentially an array of word groupings that should be treated as equivalent in your evaluations.

Once you've built this list, you can port it into your evaluation pipeline so that these stylistic differences don't penalize any vendor unfairly. This gives you cleaner, more meaningful WER numbers across the board.
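
One simple way to apply such a word list in your own pipeline is to canonicalize every variant before scoring. A minimal sketch (naive substring replacement, so a production version would want word-boundary matching):

```python
def build_semantic_map(groups):
    # Each group is a list of spellings that should count as equivalent;
    # map every variant to the group's first entry as the canonical form.
    return {variant.lower(): group[0].lower()
            for group in groups for variant in group}

def apply_semantic_map(text, mapping):
    # Replace longest variants first so multi-word forms like
    # "all right" are handled before any shorter overlapping variant.
    out = text.lower()
    for variant, canonical in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(variant, canonical)
    return out

groups = [["alright", "all right"],
          ["healthcare", "health care"],
          ["setup", "set up"]]
m = build_semantic_map(groups)
print(apply_semantic_map("All right, the health care set up is done", m))
# alright, the healthcare setup is done
```

Run both the truth file and every vendor's output through the same map before computing WER, and the stylistic variants stop counting as substitutions for anyone.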

Build a more accurate WER benchmark today

Access Universal-3 Pro, corrected truth file tooling, and our benchmarking GitHub repo through a free account. Start measuring what your transcription model actually gets right.

Sign up free

A note on benchmarking streaming vs. async models

One quick aside for teams doing more sophisticated benchmarks: async and streaming models should be evaluated separately. Async models are straightforward — send the full audio file, get back a complete transcript. Streaming models require you to feed audio bytes in real time, which means your benchmark has to run in parallel for the full duration of each audio file. It's significantly more complex to set up correctly.
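
The real-time pacing is the part async benchmarks never have to deal with. Without assuming any particular streaming API, the core of a streaming benchmark harness looks something like this (the byte-rate math assumes 16 kHz, 16-bit mono PCM):

```python
import time

def stream_chunks(audio_bytes, bytes_per_second, chunk_ms=50):
    # Yield audio in real time: each chunk is paced so the whole file
    # takes roughly as long to send as it does to play back.  A streaming
    # benchmark has to run this loop for the full duration of every file,
    # which is why such runs are typically parallelized.
    chunk_size = int(bytes_per_second * chunk_ms / 1000)
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]
        time.sleep(chunk_ms / 1000)

# 16 kHz * 2 bytes per sample = 32,000 bytes per second
fake_audio = bytes(32000)  # one second of silence
chunks = list(stream_chunks(fake_audio, bytes_per_second=32000))
print(len(chunks))  # 32000 / 1600 = 20 chunks
```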

Universal-3 Pro is available in both async and streaming form. If you're benchmarking both, make sure you're not mixing results.

The GitHub repository

Alongside this tool, we're releasing a GitHub repository that shows you how to build your own internal benchmarking pipeline using corrected truth files and semantic word lists. Whether you want to run evaluations on a schedule, integrate WER tracking into your CI/CD pipeline, or run competitive benchmarks across multiple vendors, the repo gives you a solid starting point.

The bigger picture

WER has served the industry well. It's still a useful metric and we're not suggesting you abandon it.

But the industry is at an inflection point. Models are getting good enough — in some cases better than human transcriptionists on difficult audio — that the traditional evaluation framework is starting to show its age. When your ground truth has more errors than your AI, the number your benchmark produces isn't just unhelpful. It's actively misleading.

The right response isn't to throw out metric-based evaluation. It's to make the evaluation more accurate. That means better truth files, smarter normalization, and a clearer understanding of what WER is and isn't measuring.

That's what we're trying to help you do.

Have questions about setting up your own WER benchmark or evaluating Universal-3 Pro? Reach out to our team or open an issue in the GitHub repo.

Learn the latest WER best practices

Join our free workshop to explore current best practices for measuring and improving Word Error Rate — and what it means for building accurate, production-ready voice AI. Register here.
