Insights & Use Cases
March 23, 2026

Why your word error rate (WER) benchmark might be lying to you

WER benchmarks can mislead you. Learn why standard transcription metrics often favor older models, and how AssemblyAI rethought evaluation with Universal-3 Pro.

Zackary Klebanoff
Applied AI Lead

At AssemblyAI, we've spent years helping customers evaluate speech-to-text performance. Word Error Rate (WER) has long been the gold standard metric for this — and for good reason. It's simple, reproducible, and gives you a number you can compare across vendors.

But something unexpected happened when we launched Universal-3 Pro, our most advanced transcription model to date. Some customers came back to us saying their benchmarks showed the new model performing worse than our older models. That didn't match anything we were seeing internally.

So we dug in. What we found changed how we think about transcription evaluation entirely.

How traditional WER benchmarking works

Before we get into the issue, let's level-set on the standard evaluation process — because understanding how it works is key to understanding what's breaking down.

Here's the typical workflow:

  1. Collect 10–20 audio files representative of your core use cases.
  2. Submit them to a human transcription provider to get a high-quality ground truth (or "truth file").
  3. Submit the same audio files to multiple speech-to-text vendors to get AI-generated transcriptions.
  4. Normalize both sets of transcriptions using the Whisper Normalizer — an open-source Python package that lowercases everything, strips punctuation, and maps semantically equivalent words to a common form (e.g., don't → do not).
  5. Run WER evaluation using an open-source Python package like jiwer, which compares your normalized truth file against your normalized AI predictions.

WER is the number of word-level errors (substitutions, deletions, and insertions) divided by the total number of words in the reference transcript. Lower is better.
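
In practice you'd run steps 4 and 5 with the Whisper Normalizer and jiwer, but a from-scratch sketch makes the arithmetic explicit. The normalization below is deliberately cruder than the real normalizer (lowercase and punctuation stripping only, no contraction mapping):

```python
import re

def normalize(text):
    # Lowercase and strip punctuation -- a rough stand-in for the
    # Whisper Normalizer's fuller rule set (which also maps e.g. don't -> do not).
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("My Cadillac isn't doing great",
          "My cataracts aren't doing great"))  # 2 errors / 5 words = 0.4
```

The same number comes out of `jiwer.wer(truth, prediction)` once both sides are normalized; the point of writing it out is to see that WER is just an edit distance over words.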

This process has been the primary evaluation method for the better part of a decade. It's how customers have compared vendors, justified switching decisions, and tracked model improvements over time.

The three types of WER errors

When you run a WER evaluation, every error falls into one of three categories:

  • Insertions (Hallucinations): Words the AI added that weren't in the human transcript.
  • Deletions: Words present in the human transcript that the AI missed entirely.
  • Substitutions: Words where the AI transcribed one word instead of another. For example, a human said "My Cadillac isn't doing great" but the AI transcribed "My cataracts aren't doing great." Same number of words — very different meaning.
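
Tools like jiwer report these three counts alongside the overall rate. As a rough sketch, you can get the same breakdown from Python's standard library using `difflib` (whose alignment is heuristic rather than a guaranteed minimum-edit-distance alignment, so counts can differ slightly from jiwer's on messier pairs):

```python
from difflib import SequenceMatcher

def error_counts(reference, hypothesis):
    # Align the two word sequences and bucket each mismatch as a
    # substitution, insertion, or deletion.
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    subs = ins = dels = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "replace":
            overlap = min(i2 - i1, j2 - j1)
            subs += overlap
            dels += (i2 - i1) - overlap  # extra reference words
            ins += (j2 - j1) - overlap   # extra hypothesis words
        elif tag == "delete":
            dels += i2 - i1
        elif tag == "insert":
            ins += j2 - j1
    return {"substitutions": subs, "insertions": ins, "deletions": dels}

print(error_counts("my cadillac isn't doing great",
                   "my cataracts aren't doing great"))
# {'substitutions': 2, 'insertions': 0, 'deletions': 0}
```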

Substitutions are particularly problematic downstream, because if you're feeding transcriptions into a large language model or any AI-based analytics pipeline, a substitution like Cadillac → cataracts doesn't just create a transcription error. It corrupts the contextual information entirely.

That's why one of our primary focuses when building Universal-3 Pro was reducing these kinds of errors — specifically around what we call Missed Entity Rate: how accurately the model transcribes rare and high-value words like proper nouns, company names, medication names, alphanumerics, phone numbers, dates, and addresses.
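
The post doesn't give AssemblyAI's exact formula for Missed Entity Rate, but as a rough illustration you could measure the fraction of labeled entity mentions from the ground truth that don't survive verbatim into the hypothesis (the entity list and example strings below are made up for the sketch):

```python
def missed_entity_rate(reference, hypothesis, entities):
    # Illustrative only: counts how many known entity strings present in
    # the ground truth fail to appear verbatim in the hypothesis.  The
    # entity list would come from labeling or NER, not shown here.
    ref, hyp = reference.lower(), hypothesis.lower()
    present = [e for e in entities if e.lower() in ref]
    missed = [e for e in present if e.lower() not in hyp]
    return len(missed) / len(present) if present else 0.0

rate = missed_entity_rate(
    "Take 20mg of atorvastatin, call Dr. Okafor at 555-0142",
    "Take 20mg of a torva statin, call Dr. O'Connor at 555-0142",
    entities=["atorvastatin", "Dr. Okafor", "555-0142", "20mg"],
)
print(rate)  # 2 of 4 entities missed -> 0.5
```

A per-entity metric like this surfaces exactly the failures that a blended WER number averages away: a single garbled medication name barely moves WER but can invalidate the whole transcript downstream.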

See how Universal-3 Pro performs on your audio

Create a free account and run Universal-3 Pro on your own files. Measure substitutions, insertions, and deletions against your ground truth before you benchmark anything else.

Sign up free

What's actually happening with Universal-3 Pro

So back to the mystery. Customers were seeing worse WER numbers with our new model. We were confident the model was better. Who was right?

When we started reviewing customer benchmarks in detail, we noticed something striking: a disproportionate number of insertions. Not deletions, not substitutions — insertions. The new model was inserting words that weren't in the human truth files.

So we did something simple: we went back and actually listened to those insertions in the original audio.

The vast majority of them were correct.

The AI was transcribing words that were genuinely spoken in the audio — words that the human transcriptionist had missed. On difficult audio files (noisy environments, accented speech, overlapping speakers, fast talkers), our model was frequently outperforming the human-labeled transcription it was being evaluated against.

This creates a fundamental problem with the traditional WER evaluation framework: if your ground truth is wrong, your evaluation is wrong. And the better your AI model gets, the more this problem is exposed.

A second issue: Semantic equivalence

There's a related but distinct issue we also uncovered around substitutions.

Because Universal-3 Pro has a large language model decoder at its core, it has a much stronger command of grammar and formatting than traditional transcription models. In practice, this means it sometimes transcribes things correctly in ways that differ from how a human transcriptionist formatted the same content.

A few common examples: a human transcriptionist might write "all right" (two words) while our model writes "alright" (one word). Or a human writes "health care" while our model writes "healthcare." A third case is compound words: a human might write "set up" as two words while our model writes it as "setup." The meaning is identical in each case, but in a WER evaluation, every one registers as an error.

The Whisper Normalizer handles some of this automatically (the don't/do not case, for example), but it can't cover every domain-specific or stylistic variation that might appear in your audio.

The tool we built to fix this

We recognize that manually auditing every error in your truth files is unreasonable, especially when you're dealing with hour-long recordings. So we built a tool to make this fast and systematic.

Here's what it does:

1. Truth File Correction

You submit your existing truth files and your audio. The tool runs a transcription with Universal-3 Pro and surfaces every discrepancy between the AI output and your human truth file — one by one, with the corresponding audio segment playable inline.

For each discrepancy, you click one of three options:

  • AssemblyAI got this right — update the truth file to match.
  • AssemblyAI got this wrong — keep the truth file as-is.
  • ✏️ Neither is correct — write in what was actually spoken. (You can also label segments as [inaudible] where the audio is genuinely unclear.)

As you work through the errors, the WER updates live in the tool so you can see the impact in real time. You only have to do this once — and the corrected truth files can be reused for all future benchmark evaluations, including competitive comparisons against other vendors.
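
The tool itself isn't open source in this post, but the three-way review logic is simple enough to sketch. Assuming a hypothetical discrepancy record (the field and option names below are illustrative, not the tool's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Discrepancy:
    truth_word: str   # what the human truth file says
    ai_word: str      # what the model transcribed
    # resolution: "ai_right" | "truth_right" | hand-corrected text
    resolution: str

def apply_resolution(d):
    # Mirrors the three review options: accept the AI output, keep the
    # truth file, or write in what was actually spoken (including
    # labeling genuinely unclear segments as "[inaudible]").
    if d.resolution == "ai_right":
        return d.ai_word
    if d.resolution == "truth_right":
        return d.truth_word
    return d.resolution  # hand-corrected text

fixes = [
    Discrepancy("cataracts", "Cadillac", "ai_right"),
    Discrepancy("healthcare", "health care", "truth_right"),
    Discrepancy("mumbled", "rumbled", "[inaudible]"),
]
print([apply_resolution(d) for d in fixes])
# ['Cadillac', 'healthcare', '[inaudible]']
```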

2. Semantic Word List Generation

For the formatting and equivalence issues described above, the tool lets you build a semantic word list — essentially an array of word groupings that should be treated as equivalent in your evaluations.

Once you've built this list, you can port it into your evaluation pipeline so that these stylistic differences don't penalize any vendor unfairly. This gives you cleaner, more meaningful WER numbers across the board.
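
One simple way to apply such a word list in your own pipeline is to canonicalize every variant before scoring. A minimal sketch (naive substring replacement, so a production version would want word-boundary matching):

```python
def build_semantic_map(groups):
    # Each group is a list of spellings that should count as equivalent;
    # map every variant to the group's first entry as the canonical form.
    return {variant.lower(): group[0].lower()
            for group in groups for variant in group}

def apply_semantic_map(text, mapping):
    # Replace longest variants first so multi-word forms like
    # "all right" are handled before any shorter overlapping variant.
    out = text.lower()
    for variant, canonical in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(variant, canonical)
    return out

groups = [["alright", "all right"],
          ["healthcare", "health care"],
          ["setup", "set up"]]
m = build_semantic_map(groups)
print(apply_semantic_map("All right, the health care set up is done", m))
# alright, the healthcare setup is done
```

Run both the truth file and every vendor's output through the same map before computing WER, and the stylistic variants stop counting as substitutions for anyone.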

Build a more accurate WER benchmark today

Access Universal-3 Pro, corrected truth file tooling, and our benchmarking GitHub repo through a free account. Start measuring what your transcription model actually gets right.

Sign up free

A note on benchmarking streaming vs. async models

One quick aside for teams doing more sophisticated benchmarks: async and streaming models should be evaluated separately. Async models are straightforward — send the full audio file, get back a complete transcript. Streaming models require you to feed audio bytes in real time, which means your benchmark has to run in parallel for the full duration of each audio file. It's significantly more complex to set up correctly.
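
The real-time pacing is the part async benchmarks never have to deal with. Without assuming any particular streaming API, the core of a streaming benchmark harness looks something like this (the byte-rate math assumes 16 kHz, 16-bit mono PCM):

```python
import time

def stream_chunks(audio_bytes, bytes_per_second, chunk_ms=50):
    # Yield audio in real time: each chunk is paced so the whole file
    # takes roughly as long to send as it does to play back.  A streaming
    # benchmark has to run this loop for the full duration of every file,
    # which is why such runs are typically parallelized.
    chunk_size = int(bytes_per_second * chunk_ms / 1000)
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]
        time.sleep(chunk_ms / 1000)

# 16 kHz * 2 bytes per sample = 32,000 bytes per second
fake_audio = bytes(32000)  # one second of silence
chunks = list(stream_chunks(fake_audio, bytes_per_second=32000))
print(len(chunks))  # 32000 / 1600 = 20 chunks
```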

Universal-3 Pro is available in both async and streaming form. If you're benchmarking both, make sure you're not mixing results.

The GitHub repository

Alongside this tool, we're releasing a GitHub repository that shows you how to build your own internal benchmarking pipeline using corrected truth files and semantic word lists. Whether you want to run evaluations on a schedule, integrate WER tracking into your CI/CD pipeline, or run competitive benchmarks across multiple vendors, the repo gives you a solid starting point.

The bigger picture

WER has served the industry well. It's still a useful metric and we're not suggesting you abandon it.

But the industry is at an inflection point. Models are getting good enough — in some cases better than human transcriptionists on difficult audio — that the traditional evaluation framework is starting to show its age. When your ground truth has more errors than your AI, the number your benchmark produces isn't just unhelpful. It's actively misleading.

The right response isn't to throw out metric-based evaluation. It's to make the evaluation more accurate. That means better truth files, smarter normalization, and a clearer understanding of what WER is and isn't measuring.

That's what we're trying to help you do.

Have questions about setting up your own WER benchmark or evaluating Universal-3 Pro? Reach out to our team or open an issue in the GitHub repo.

Learn the latest WER best practices

Join our free workshop to explore current best practices for measuring and improving Word Error Rate — and what it means for building accurate, production-ready voice AI. Register here.
