June 23, 2026

Wrong drug name in, wrong SOAP note out: error propagation in clinical AI pipelines

The LLM writing your SOAP note never hears the audio — it trusts the transcript. So one misheard drug name becomes fluent, confident, completely wrong documentation. Here's why.

Kelsey Foster

Growth

Medical

A clinical AI pipeline looks deceptively simple on a whiteboard. Audio goes in. A speech-to-text model turns it into text. An LLM reads that text and writes the SOAP note, pulls the medication list, maybe suggests ICD-10 codes for billing. Three boxes, two arrows, done.

The problem lives in the arrows.

Every stage in that pipeline inherits whatever the stage before it produced. Not "mostly inherits." Inherits completely. The LLM that writes your assessment never hears the audio. It never sees the waveform. It sees text—and it treats that text as ground truth, because it has no other choice. If the transcription layer hands it the word "hydrocortisone" when the clinician said "hydrochlorothiazide," the model doesn't flag the discrepancy. It can't. There's nothing to compare against. It writes a fluent, confident, clinically coherent note about the wrong drug.

That's the failure mode nobody benchmarks for. Garbage in, fluent garbage out.

The error doesn't stay where it started

Here's what makes entity errors in clinical pipelines worse than they look at first glance: they don't fail loudly. A misrecognized drug name doesn't produce a typo or a [INAUDIBLE] tag. It produces a different real word that the downstream model accepts without suspicion.

Walk through one swap. A physician dictates a follow-up for a patient on hydrochlorothiazide for hypertension. The STT layer transcribes it as hydrocortisone—a real drug, spelled correctly, phonetically adjacent, and completely wrong therapeutically. One's a thiazide diuretic for blood pressure. The other's a corticosteroid.

Now watch the blast radius.

The medication list updates to hydrocortisone. The LLM writing the assessment reads "patient on hydrocortisone" and reasons accordingly—maybe it notes steroid considerations, maybe it adjusts the plan around a drug the patient never took. The coding step downstream maps to the wrong therapeutic class. The billing record reflects a medication that isn't in the chart for any clinical reason. And every one of those artifacts reads cleanly. A reviewer skimming the SOAP note sees fluent, plausible prose. Nothing screams "error" because the error has been laundered into grammatical, confident English at every stage.

Try another. The clinician says metformin—a first-line drug for type 2 diabetes. The transcript says metronidazole, an antibiotic. Suddenly the note implies a diabetic patient is being managed with an antimicrobial, the problem list drifts, and the coding logic follows the transcript off a cliff. The LLM can't recover metformin. It was never given metformin to recover.

This is the part that gets underappreciated: the LLM isn't the weak link here. A frontier model writing the note might be doing flawless work—faithfully, accurately summarizing the text it received. The accuracy ceiling for the entire pipeline was set upstream, at the transcription layer, before the LLM ever woke up.
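To make the propagation concrete, here is a toy sketch of the stages after transcription. Everything in it is an assumption for illustration: the `THERAPEUTIC_CLASS` table, the `extract_medications` stand-in for the LLM step, and the `run_downstream` helper are hypothetical and not part of any real pipeline. The point is only that each stage does its job correctly on the text it was handed.

```python
from dataclasses import dataclass

# Hypothetical lookup table for illustration only; not a clinical reference.
THERAPEUTIC_CLASS = {
    "hydrochlorothiazide": "thiazide diuretic",
    "hydrocortisone": "corticosteroid",
    "metformin": "antidiabetic",
    "metronidazole": "antibiotic",
}


@dataclass
class Artifacts:
    medication_list: list[str]
    assessment: str
    drug_classes: list[str]


def extract_medications(transcript: str) -> list[str]:
    """Stand-in for the LLM extraction step: it only ever sees text."""
    words = [w.strip(".,;:").lower() for w in transcript.split()]
    return [w for w in words if w in THERAPEUTIC_CLASS]


def run_downstream(transcript: str) -> Artifacts:
    """Build the medication list, class mapping, and assessment from text."""
    meds = extract_medications(transcript)
    classes = [THERAPEUTIC_CLASS[m] for m in meds]
    assessment = f"Patient continues {', '.join(meds)} ({', '.join(classes)})."
    return Artifacts(meds, assessment, classes)


spoken = "Patient continues hydrochlorothiazide for hypertension."
transcribed = "Patient continues hydrocortisone for hypertension."

intended = run_downstream(spoken)
actual = run_downstream(transcribed)

print(actual.medication_list)  # the wrong drug, extracted faithfully
print(actual.drug_classes)     # the wrong class, mapped faithfully
print(actual.assessment)       # fluent prose about the wrong drug
```

No stage in this sketch contains a bug. The only difference between `intended` and `actual` is the one word the transcript got wrong.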

Why fixing it downstream doesn't work

The natural instinct is to patch this at the LLM stage. Add a verification prompt. Ask the model to double-check medications. Build a reconciliation step.

It doesn't hold up, and the reason is information-theoretic, not a matter of prompt engineering. You cannot reconstruct information that was destroyed before it reached you. When the STT layer collapsed "hydrochlorothiazide" into "hydrocortisone," the signal that distinguished those two words (the acoustic detail in the audio) was already gone by the time the text existed. The LLM has only the wrong word, plus its own prior that the wrong word is perfectly reasonable in context. Asking it to catch the error is asking it to detect a problem it has no evidence for.
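A small sketch of why a text-only check struggles here. It assumes a hypothetical `FORMULARY` set and a hypothetical `flag_suspicious_drug_tokens` helper. A validity check like this can catch a garbled non-word, but a real-word substitution passes, because the wrong drug is still a valid drug.

```python
# Hypothetical formulary for illustration; a real one would be far larger.
FORMULARY = {"hydrochlorothiazide", "hydrocortisone", "metformin", "metronidazole"}


def flag_suspicious_drug_tokens(candidates: list[str]) -> list[str]:
    """Return candidate drug names that are not in the formulary."""
    return [c for c in candidates if c.lower() not in FORMULARY]


# A non-word error is detectable from text alone.
garbled = flag_suspicious_drug_tokens(["hydroclorthiazid"])

# A real-word substitution is not: nothing in the text marks it as wrong.
swapped = flag_suspicious_drug_tokens(["hydrocortisone"])

print(garbled)
print(swapped)
```

Richer checks are possible, such as comparing the drug against the stated indication, but they add heuristics on top of the transcript. They do not restore the audio evidence.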

You could feed the audio to the LLM directly and skip the transcript. Some teams will. But for the vast majority of production clinical pipelines—ambient scribes, dictation workflows, coding automation—text is the interface, the audit trail, and the thing humans actually review. The transcript is load-bearing. Which means transcript accuracy, specifically entity accuracy, is the highest-leverage place in the entire system to prevent propagation.

Fix it at the source and everything downstream gets cheaper, safer, and more trustworthy. Fix it downstream and you're building increasingly elaborate machinery to compensate for a problem that should never have entered the pipeline.

Want to see how entity accuracy holds up on your own clinical audio? Talk to our team or test a file in the playground.

Why WER isn't the metric that matters here

For years the industry has graded speech-to-text on word error rate. It's a useful number. It's also the wrong number for clinical work, and the reason traces directly back to error propagation.

WER treats every word as equally important. Miss "the" or miss "hydrochlorothiazide": both count as one error against the same denominator. A model can post an impressive WER while quietly fumbling the exact words that propagate downstream. We've written before about why WER is broken and why your WER benchmark might be lying to you. The short version is that a single aggregate number hides the distribution of which words get missed.

In a clinical pipeline, the distribution is everything. The words that matter are the entities: drug names, dosages, proper nouns, lab values, anatomical terms. A model that nails common words and drops entities will look fine on WER and fail catastrophically on the thing you actually care about.

That's why we track missed entity rate (MER) as a first-class metric, not an afterthought. MER measures how often the model drops or mangles the clinical entities that drive every downstream decision. If you're only watching WER, you're watching the wrong gauge. Our full methodology lives in how to evaluate speech recognition models.
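Here is a toy comparison of the two metrics. The `missed_entity_rate` function below is an assumed, simplified definition (the share of single-word reference entities that do not appear verbatim in the hypothesis), not the methodology behind the benchmark numbers in this post. The sentences are made up for illustration.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


def missed_entity_rate(entities: list[str], hypothesis: list[str]) -> float:
    """Assumed definition: share of entities absent from the hypothesis."""
    hyp_words = set(hypothesis)
    missed = [e for e in entities if e not in hyp_words]
    return len(missed) / len(entities)


reference = "the patient takes metformin twice a day with food".split()
entities = ["metformin"]

# Hypothesis A swaps the drug name; hypothesis B drops the word "the".
hyp_a = "the patient takes metronidazole twice a day with food".split()
hyp_b = "patient takes metformin twice a day with food".split()

for name, hyp in (("A", hyp_a), ("B", hyp_b)):
    wer = word_error_rate(reference, hyp)
    mer = missed_entity_rate(entities, hyp)
    print(f"{name}: WER={wer:.3f} MER={mer:.3f}")
```

Both hypotheses contain exactly one word error, so they score the same WER. Only the entity-level metric separates the transcript that lost the drug name from the one that lost an article.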

What Medical Mode actually does about it

Universal-3 Pro already delivers best-in-market entity accuracy on drug names, proper nouns, and rare words. Medical Mode tightens that further for clinical audio specifically.

Turning it on is one parameter:

import assemblyai as aai

# Replace with your own API key.
aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    domain="medical-v1",  # turns on Medical Mode
)
transcript = transcriber.transcribe("audio.wav", config)

That domain="medical-v1" flag works on both Universal-3 Pro and Universal-3.5 Pro Realtime, across English, Spanish, German, and French.

The numbers, measured on our benchmarks: Universal-3 Pro with Medical Mode posts a 3.2% MER—the lowest across every provider we tested, and roughly 20% fewer missed medical entities than Universal-3 Pro alone. Here's how that stacks up:

Provider	MER	WER
AssemblyAI Universal-3 Pro w/ Medical Mode	3.2%	5.3%
Deepgram	3.6%	5.5%
Speechmatics Enhanced Medical	4.7%	6.1%
Deepgram Nova-3 Medical	8.7%	5.9%
AWS Transcribe Medical	24.4%	12.9%

Look at the AWS row. A 24.4% MER means roughly one in four clinical entities gets missed or mangled. In a pipeline that propagates every error downstream, that's not a transcription model—it's an error generator with a confident downstream LLM amplifying its mistakes into polished, wrong documentation.
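One way to read these rates at the level of a whole note: if a note contains n entities and each is missed independently with probability equal to the MER, the chance that at least one entity is wrong is 1 - (1 - MER)^n. The independence assumption and the choice of 10 entities per note are simplifications made here for illustration, not measured figures.

```python
def p_at_least_one_miss(mer: float, entities_per_note: int) -> float:
    """Chance a note has one or more missed entities, assuming independence."""
    return 1 - (1 - mer) ** entities_per_note


# MER values taken from the table above; 10 entities per note is hypothetical.
ENTITIES_PER_NOTE = 10
for provider, mer in (("Lowest MER in table", 0.032), ("Highest MER in table", 0.244)):
    risk = p_at_least_one_miss(mer, ENTITIES_PER_NOTE)
    print(f"{provider}: {risk:.0%} of notes with at least one missed entity")
```

Under those assumptions, a per-entity rate that sounds small still touches a meaningful share of notes, and a rate near one in four touches almost all of them.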

Pricing is a $0.15/hr add-on on top of Universal-3 Pro's $0.21/hr, so $0.36/hr all in. Full details on the pricing page.
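For budgeting, the arithmetic is simple enough to sketch. The two rates come from the paragraph above; the `estimated_cost` helper and the 1,000-hour workload are hypothetical, and the sketch ignores any volume discounts or other pricing terms.

```python
from decimal import Decimal

BASE_RATE = Decimal("0.21")      # Universal-3 Pro, USD per audio hour
MEDICAL_ADDON = Decimal("0.15")  # Medical Mode add-on, USD per audio hour


def estimated_cost(audio_hours: int) -> Decimal:
    """Total cost in USD for the given number of audio hours."""
    return (BASE_RATE + MEDICAL_ADDON) * audio_hours


print(estimated_cost(1))     # 0.36
print(estimated_cost(1000))  # hypothetical monthly workload
```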

The rest of a pipeline you can actually trust

Entity accuracy is the biggest lever, but it's not the only one. Two more pieces matter for a clinical pipeline you'd put your name on.

First, speaker diarization. In an exam-room recording, who said what changes the meaning of the note. A symptom the patient reports and an instruction the clinician gives are different clinical objects, and diarization keeps them separated before the LLM ever tries to structure them. Get the speaker attribution wrong and you've introduced a different flavor of the same propagation problem.
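As a rough sketch of what keeping speakers separated can look like before the text reaches the LLM: the `Utterance` shape, the `to_speaker_tagged_text` helper, and the mapping from diarization labels to roles are all assumptions for illustration. Diarization typically yields anonymous labels, so deciding which label is the clinician is a separate step that this sketch simply takes as given.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # anonymous diarization label, e.g. "A" or "B"
    text: str


def to_speaker_tagged_text(utterances: list[Utterance], roles: dict[str, str]) -> str:
    """Render one line per utterance, prefixed with the speaker's role."""
    lines = []
    for u in utterances:
        role = roles.get(u.speaker, f"Speaker {u.speaker}")
        lines.append(f"{role}: {u.text}")
    return "\n".join(lines)


visit = [
    Utterance("A", "Any dizziness since the last visit?"),
    Utterance("B", "A little when I stand up quickly."),
    Utterance("A", "Keep taking the hydrochlorothiazide every morning."),
]

tagged = to_speaker_tagged_text(visit, {"A": "Clinician", "B": "Patient"})
print(tagged)
```

With the roles attached, the reported symptom and the clinician's instruction arrive at the LLM as distinct, attributed lines rather than one undifferentiated block of text.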

Second, PII redaction. A trustworthy pipeline doesn't just transcribe accurately—it controls what flows downstream. Redacting protected identifiers at the transcript layer means the LLM and every system after it operate on the minimum data they need.

On the infrastructure side: AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA). The service runs on BAA-eligible infrastructure, and the BAA is included.

Frequently asked questions

Why can't a good LLM just catch a wrong drug name?

Because it never receives the audio. The LLM only sees the transcript. Once the speech-to-text layer has substituted one real drug name for another, the acoustic information that distinguished them is gone. The model has no evidence anything is wrong, so it writes a fluent note around the incorrect entity.

Isn't a low word error rate enough to trust a clinical transcript?

No. WER weights every word equally, so a model can score well overall while consistently missing the drug names, dosages, and proper nouns that drive downstream decisions. Missed entity rate measures the words that actually propagate. That's why we report MER alongside WER.

How do I turn on Medical Mode?

Set one parameter: domain="medical-v1". It works on Universal-3 Pro and Universal-3.5 Pro Realtime, in English, Spanish, German, and French.

What does Medical Mode cost?

It's a $0.15/hr add-on. Combined with Universal-3 Pro's $0.21/hr, the total is $0.36/hr.

What if my audio has rare or facility-specific drug names?

Use keyterms prompting to load them—up to 1,000 terms on async, up to 100 on streaming, updatable mid-stream. It's the mechanism for specialized vocabulary that even a strong general model might not weight heavily enough.
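Before loading a facility's vocabulary, it can help to clean the list and check it against those limits. The `prepare_keyterms` helper below is hypothetical and does not call any API; it only prepares a list you would then pass to keyterms prompting. The limits are the ones quoted in the answer above.

```python
KEYTERM_LIMITS = {"async": 1000, "streaming": 100}  # limits quoted above


def prepare_keyterms(terms: list[str], mode: str) -> list[str]:
    """Trim terms, drop blanks and case-insensitive duplicates, enforce the limit."""
    limit = KEYTERM_LIMITS[mode]
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = term.strip()
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the {mode} limit of {limit}")
    return cleaned


formulary_terms = prepare_keyterms(
    ["hydrochlorothiazide", "Hydrochlorothiazide ", "metformin", ""],
    mode="streaming",
)
print(formulary_terms)
```

Raising an error instead of silently truncating is a deliberate choice in this sketch: dropping the last few drug names without notice would recreate the problem the list is meant to solve.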