June 23, 2026

How we measure medical transcription: MER, and why WER lies to you

WER weights a filler word the same as a wrong drug name — which makes it the wrong metric for clinical audio. Here's why, and what Missed Entity Rate measures instead.

Kelsey Foster

Growth

Medical

Word error rate has a flattering quality. It rolls every mistake into one clean percentage, and a clean percentage is easy to put on a slide.

It's also lying to you about medical audio.

Here's the problem in one sentence: WER treats every word as equal. Missing an "um" costs exactly the same as turning "hydrochlorothiazide" into "hydrocortisone." A filler word and a diuretic swapped for a steroid, weighted identically.

That's fine for a podcast. It's indefensible for a clinical transcript. So let me walk through why WER misleads evaluators, what we measure instead, and how the numbers actually shake out.

What WER counts, and what it ignores

WER is substitutions plus insertions plus deletions, divided by the number of reference words. Every error in that numerator carries weight one. The metric has no concept of which words matter.

Think about what a medical conversation actually contains. The vast majority of words are connective tissue—"the," "and," "patient," "we'll," "okay." A handful of words carry the entire clinical meaning: the drug, the dose, the diagnosis, the procedure. Maybe 5% of the tokens hold 95% of the risk.

WER averages across all of them. So a model can flub the 5% that matter and still post a great-looking score, because it nailed the 95% that don't. We've written before about how word error rate is broken and how your WER benchmark might be lying to you. This is the same disease in its most dangerous form.

Consider two transcription errors:

"uh, the patient" → "the patient" (dropped a filler)
"metoprolol" → "metformin" (a cardiac drug became a diabetes drug)

To WER, both are one error. One of them is noise. The other could change a treatment decision. A metric that can't tell them apart is the wrong metric for healthcare.
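
To see the equal weighting concretely, here's a minimal sketch of WER as a word-level edit distance. It's an illustration, not our scoring code: the function name, the whitespace tokenization, and the sample sentence are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample sentence with seven reference words.
reference = "uh the patient takes metoprolol twice daily"
dropped_filler = "the patient takes metoprolol twice daily"
wrong_drug = "uh the patient takes metformin twice daily"

print(word_error_rate(reference, dropped_filler))
print(word_error_rate(reference, wrong_drug))
```

Both calls print the same value, one error over seven reference words, even though only the second changes the medication.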

Missed Entity Rate, defined

So we measure the thing that actually matters: how often the clinically meaningful words come out wrong.

Missed Entity Rate (MER) is the percentage of medical entities not correctly transcribed. By "entities" we mean the words a clinician or a downstream coding system depends on:

Drug names—generic and brand
Diagnoses and conditions
Procedures
Dosages and units

MER ignores whether the model dropped an "um." It asks a narrower, harder question: when a drug name appeared in the audio, did the transcript get it right? When a procedure was named, did it survive intact, or did "echocardiogram" become "echo cardiogram" and fall out of your entity extraction?

This is the metric that maps to clinical risk. A model can sit at a respectable WER and still have a terrible MER, and if you're building anything that touches patient care, MER is the number you should be staring at.

The benchmark

Here's where the two metrics part ways. We benchmarked Universal-3 Pro with Medical Mode against the providers evaluators actually shortlist, measuring both MER and WER. Lower is better for each; MER is the share of entities not correctly transcribed.

Provider / configuration	MER	WER
AssemblyAI Universal-3 Pro w/ Medical Mode	3.2%	5.3%
Deepgram	3.6%	5.5%
Speechmatics Enhanced Medical	4.7%	6.1%
Deepgram Nova-3 Medical	8.7%	5.9%
AWS Transcribe Medical	24.4%	12.9%
Google Medical Conversation	—	—

Read the table for a second before the conclusion, because the rows tell the story better than I can.

Look at Deepgram Nova-3 Medical: 5.9% WER, 8.7% MER. If you'd evaluated on WER alone, you'd see a gap of 0.6 points from the top of the table and shrug. But its MER is about 2.7 times ours: it's missing entities at almost three times the rate while looking nearly identical on the headline metric. That's WER's flattery in action.

AWS Transcribe Medical makes the gap impossible to miss: 12.9% WER, but a 24.4% MER. Nearly a quarter of medical entities not transcribed correctly. The WER alone wouldn't scream that loudly.
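
The arithmetic behind those comparisons is easy to check. The snippet below copies the MER and WER values from the table (the Google row has no reported values, so it's left out) and computes each row's WER gap and MER ratio against the lowest in the set. The function and variable names are just for illustration.

```python
# (MER %, WER %) copied from the benchmark table; rows without values are omitted.
results = {
    "AssemblyAI Universal-3 Pro w/ Medical Mode": (3.2, 5.3),
    "Deepgram": (3.6, 5.5),
    "Speechmatics Enhanced Medical": (4.7, 6.1),
    "Deepgram Nova-3 Medical": (8.7, 5.9),
    "AWS Transcribe Medical": (24.4, 12.9),
}


def relative_to_best(rows: dict) -> dict:
    """For each row: (WER gap in points, MER as a multiple of the lowest MER)."""
    best_mer = min(mer for mer, _ in rows.values())
    best_wer = min(wer for _, wer in rows.values())
    return {
        name: (round(wer - best_wer, 1), round(mer / best_mer, 1))
        for name, (mer, wer) in rows.items()
    }


for name, (wer_gap, mer_ratio) in relative_to_best(results).items():
    print(f"{name}: WER +{wer_gap} pts, MER {mer_ratio}x the lowest")
```

The pattern to look for is a row where the WER gap is small and the MER ratio is large, which is what a WER-only evaluation would hide.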

Universal-3 Pro with Medical Mode posts the lowest MER in the set at 3.2%. That's the claim that matters here, and it's an entity claim, not a word-count claim. Full methodology and the rest of the numbers are on /benchmarks.

How we measure it, at a high level

You shouldn't take a benchmark on faith, so here's the shape of how MER is computed.

Start with reference transcripts—human-verified ground truth for real clinical-style audio. Identify the medical entities in each reference: the drugs, diagnoses, procedures, and dosages. Run each model over the same audio. Then, for every reference entity, check whether the model's output contains a correct match. The share that don't match is the MER.
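
The steps above can be sketched in a few lines. This is a simplified illustration, not our benchmark pipeline: the `Entity` type, the function names, and the sample transcript are hypothetical, entities are assumed to be already annotated on the reference, and matching is exact on lowercased word sequences.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # e.g. "drug", "dosage", "diagnosis", "procedure"


def words(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_entity(hypothesis_words: list[str], entity: Entity) -> bool:
    """True if the entity's words appear as a contiguous run in the hypothesis."""
    target = words(entity.text)
    n = len(target)
    return any(
        hypothesis_words[i:i + n] == target
        for i in range(len(hypothesis_words) - n + 1)
    )


def missed_entity_rate(reference_entities: list[Entity], hypothesis: str) -> float:
    """Share of reference entities with no exact match in the model output."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    hyp = words(hypothesis)
    missed = [e for e in reference_entities if not contains_entity(hyp, e)]
    return len(missed) / len(reference_entities)


# Hypothetical annotated reference and model output.
reference_entities = [
    Entity("metoprolol", "drug"),
    Entity("25 mg", "dosage"),
    Entity("hypertension", "diagnosis"),
    Entity("echocardiogram", "procedure"),
]
hypothesis = (
    "okay we'll keep the metformin at 25 mg for the hypertension "
    "and order an echocardiogram"
)

print(missed_entity_rate(reference_entities, hypothesis))
```

Here one of the four reference entities (the drug name) has no match, so the rate is 0.25. Exact matching like this is deliberately strict and would also count formatting differences as misses.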

The detail that does the work is entity alignment. "Echocardiogram" rendered as "echo cardiogram" is two tokens where the reference has one, so a naive comparison can misfire. Robust entity matching has to handle tokenization differences, casing, and generic-versus-brand naming, so that you're scoring clinical correctness rather than punishing formatting. This is the same care we describe in how to evaluate speech recognition models.
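
Here's one way that normalization could look, as a hedged sketch rather than our actual matcher. The alias table, the function names, and the `max_extra_tokens` window are hypothetical; a real system would draw brand and generic names from a curated drug vocabulary.

```python
import re

# Hypothetical brand -> generic alias table, for illustration only.
BRAND_TO_GENERIC = {"lopressor": "metoprolol", "glucophage": "metformin"}


def tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (handles casing and punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def canonical(text: str) -> str:
    """Map brand names to generics, then join tokens so spacing differences vanish."""
    return "".join(BRAND_TO_GENERIC.get(t, t) for t in tokens(text))


def entity_matches(entity: str, hypothesis: str, max_extra_tokens: int = 2) -> bool:
    """True if some run of whole hypothesis tokens normalizes to the entity."""
    target = canonical(entity)
    hyp = tokens(hypothesis)
    # Allow the hypothesis to split the entity into a few more tokens than the reference.
    longest = len(tokens(entity)) + max_extra_tokens
    for start in range(len(hyp)):
        for end in range(start + 1, min(start + longest, len(hyp)) + 1):
            if canonical(" ".join(hyp[start:end])) == target:
                return True
    return False


print(entity_matches("echocardiogram", "we ordered an Echo cardiogram today"))
print(entity_matches("metoprolol", "continue Lopressor 25 mg"))
print(entity_matches("metoprolol", "continue metformin 25 mg"))
```

The first two calls count as matches because only spacing, casing, or brand naming differs; the third does not, because the drug itself is wrong. Comparing whole-token runs rather than raw substrings keeps an entity from matching inside an unrelated longer word.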

It's worth saying plainly: no public benchmark is your benchmark. Our test set reflects our distribution of accents, recording conditions, specialties, and drug frequencies. Yours will differ. The point of publishing MER isn't "trust our number"—it's "measure the right thing, then measure it on your audio."

Why this isn't just an AssemblyAI talking point

I'd make the MER argument even if we lost the benchmark, because the alternative is worse. An industry that evaluates medical transcription on WER alone is optimizing models to be confident and wrong about exactly the words that carry risk. A model trained and tuned to minimize WER has every incentive to get common words perfect and treat a rare drug name as a rounding error.

That's backwards. The rare words are the whole job. If you want to understand how far speech-to-text has come on the words that aren't rare, we cover that in how accurate is speech-to-text in 2026—but general accuracy and medical entity accuracy are different problems, and conflating them is how teams ship clinical tools that look fine in a demo and fail on the third encounter.

Frequently asked questions

What's the difference between MER and WER?

WER measures the share of all words transcribed incorrectly, weighting every word equally. MER measures only the share of medical entities—drug names, diagnoses, procedures, dosages—transcribed incorrectly. A model can have a low WER and a high MER if it gets common words right and clinical terms wrong.

Why is WER a poor metric for medical transcription?

Because it treats a dropped "um" the same as a wrong drug name. In clinical audio, a small fraction of words carry almost all the meaning and risk, and WER averages them away. See word error rate is broken for the longer argument.

Which model has the lowest MER?

In our benchmark, Universal-3 Pro with Medical Mode posts the lowest MER at 3.2%, ahead of the Deepgram, Speechmatics, and AWS configurations in the table. The Google Medical Conversation row has no reported values. The full table is on /benchmarks.

Can I reproduce this on my own audio?

Yes—and you should. Public benchmarks reflect the publisher's data distribution, not yours. Run your own representative clinical files through the playground and compare entities against your ground truth.

Does a low MER mean I can skip human review?

No. A lower MER means fewer entities to catch and less QA burden, but clinical workflows still warrant human verification. The value of MER is telling you where the residual risk lives—in the entities—so you can review the right things.

