June 23, 2026

Medical transcription in Spanish, German, and French: multilingual clinical accuracy

In medicine, multilingual transcription isn't about how many languages you support — it's getting both halves of a code-switched clinical sentence right. Here's what breaks, and what holds.

Kelsey Foster

Growth

Medical

Most multilingual transcription stories are about coverage. How many languages does the model support? Forty? A hundred?

That's the wrong question for medicine.

In a real clinic, the hard problem isn't supporting French. It's the moment a Spanish-speaking patient in a US clinic says "me duele el pecho, like a pressure" in one breath, and your transcript has to get both halves right—the Spanish symptom and the English qualifier—without anyone touching a language setting. Coverage counts languages. Clinical accuracy counts the words inside a single messy sentence. Those are not the same achievement.

Medical Mode launches benchmarked across four languages—English, Spanish, German, and French—for both pre-recorded and streaming audio. But the part I want to spend this post on is the part that actually breaks competing systems: what happens at the seam between languages.

Why multilingual medical transcription is genuinely hard

Let's be honest about the difficulty, because it's easy to wave at "multilingual support" as if it were a switch you flip.

First, medical vocabulary is treacherous across languages precisely because so much of it shares Latin and Greek roots. English "hypertension," Spanish "hipertensión," German "Hypertonie," and French "hypertension" look and sound like cousins, and that similarity is a trap, not a help. A model has to map each one to the right clinical entity in the right language, not blur them into an average. Near-identical isn't identical, and in a medication or diagnosis field, close is wrong.

Second, accents. A francophone clinician in Montreal, a Swabian physician, and a Madrid pharmacist don't pronounce their own languages the way a textbook does, let alone the way a model's training distribution assumes.

Third, and this is the one that quietly wrecks deployments: real clinical encounters are mixed-language. Think of US Hispanic care, the DACH region with its English-heavy medical training, or francophone clinics that use English drug brand names. People code-switch. They start a sentence in one language and finish it in another, drop an English drug name into a Spanish sentence, or answer a German question in English because that's how they learned the term.

A pipeline built on "detect the language, then transcribe in that language" has no good answer here. By the time it's committed to Spanish, the English half of the sentence is already a casualty.
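To make that failure mode concrete, here is a toy sketch, not any vendor's actual pipeline. Everything in it is a hypothetical stand-in: the word lists, the `detect_language` helper (a majority vote over the whole utterance), and the `transcribe_in` helper (a monolingual decoder that loses any word outside its vocabulary). Real systems work on audio rather than tokens, so treat this as an illustration of the commitment problem only.

```python
# Toy illustration of a detect-then-route pipeline. All names and word
# lists here are hypothetical; real systems work on audio, not tokens.
SPANISH = {"me", "duele", "el", "pecho"}
ENGLISH = {"like", "a", "pressure"}
VOCAB = {"es": SPANISH, "en": ENGLISH}


def detect_language(tokens):
    """Pick a single language for the whole utterance by majority vote."""
    counts = {lang: sum(t in words for t in tokens) for lang, words in VOCAB.items()}
    return max(counts, key=counts.get)


def transcribe_in(lang, tokens):
    """Simulate a monolingual decoder: out-of-vocabulary words are lost."""
    return [t if t in VOCAB[lang] else "<unk>" for t in tokens]


utterance = "me duele el pecho like a pressure".split()
lang = detect_language(utterance)        # commits to one language up front
routed = transcribe_in(lang, utterance)  # the other language's words drop out
```

Because the vote lands on Spanish, the three English tokens come back as placeholders. That is the casualty described above, and no amount of tuning the detector fixes it, since any single choice loses one half.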

Code-switching is the whole differentiator

So here's where it gets interesting.

Universal-3 Pro handles native intra-utterance code-switching across all six of its languages—English, Spanish, French, German, Italian, and Portuguese. Intra-utterance is the operative phrase. Not "we detect a language per file," and not even "per sentence." Within a single utterance, the model can follow a speaker as they move between languages and keep transcribing accurately the entire way through.

For live audio, Universal-3.5 Pro Realtime extends that same native code-switching to 18 languages, so a streaming clinical encounter that mixes languages mid-sentence holds the thread across a much wider range. Medical Mode itself is benchmarked on the four launch languages (English, Spanish, German, and French), but the code-switching it relies on is built into the underlying models across every language they support.

No language toggle. No separate pipeline per language. No detect-then-route logic you have to build and maintain.

That means the Spanglish sentence, "me duele el pecho, like a pressure," comes through intact, both halves, because the model was never forced to pick a lane. The clinician who slips into English for the drug name and back into French for the symptom doesn't break the transcript. The patient who answers in their first language mid-question doesn't either.

We went deeper on how this works in our piece on multilingual streaming and native code-switching, and it's worth reading if you're serving any population where two languages share a room.

The practical upshot for product teams: you build one integration. Not one per market.

Where the accuracy actually shows up

Code-switching is the headline, but it only matters if the underlying transcription is clinically accurate, so let's tie it back to numbers.

Medical Mode delivers a 3.2% Missed Entity Rate—the lowest across the providers we benchmarked—and roughly 20% fewer missed medical entities than Universal-3 Pro running on its own. MER is the right yardstick here because it measures what a clinician actually cares about: how often a clinically meaningful entity, a drug or a dose or a diagnosis, gets dropped or mangled. And critically, Medical Mode is benchmarked across all four launch languages, so that 3.2% isn't an English-only figure dressed up as multilingual. See the full breakdown on the benchmarks page.
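This post doesn't spell out the exact MER formula, so the sketch below assumes the simplest reading: missed reference entities divided by total reference entities, where an entity counts as captured only if its text appears verbatim (case-insensitively) in the transcript. The `missed_entity_rate` helper and the example entities are invented for illustration and are not benchmark data.

```python
def missed_entity_rate(reference_entities, transcript_text):
    """Fraction of reference entities that do not appear in the transcript.

    Assumption: an entity counts as captured only if its lowercased text
    appears verbatim in the lowercased transcript.
    """
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    text = transcript_text.lower()
    missed = [e for e in reference_entities if e.lower() not in text]
    return len(missed) / len(reference_entities)


# Invented example: four entities a reviewer marked as clinically meaningful.
reference = ["metoprolol", "50 mg", "hipertensión", "chest pain"]
hypothesis = "Paciente con hipertensión, toma metoprolol 15 mg, reports chest pain."
mer = missed_entity_rate(reference, hypothesis)
```

Here the mangled dose ("15 mg" for "50 mg") is the only miss, so the rate is 1 in 4. A word-level metric would barely register that one-character error, which is why an entity-level measure is the more useful yardstick for clinical text.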

Activation is the same single parameter in every language.

import assemblyai as aai

# Set your API key before creating the transcriber.
aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    domain="medical-v1",  # turns on Medical Mode
)
transcript = transcriber.transcribe("clinical-audio.wav", config)
print(transcript.text)

You can let the model auto-detect language, but you don't need a language toggle to get code-switching, because it's native. The domain="medical-v1" flag is identical whether the audio is Spanish, German, French, English, or a mix of them in one recording.

Activation works the same way for both modes. Pre-recorded audio goes through Universal-3 Pro at $0.21/hr. Live audio runs on Universal-3.5 Pro Realtime at $0.45/hr base, with end-of-turn detection landing around 300ms. Medical Mode adds $0.15/hr in either case, so Universal-3 Pro plus Medical Mode is $0.36/hr, and the streaming variant plus Medical Mode is $0.60/hr. The full table is on the pricing page.
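If you want to sanity-check that arithmetic against your own volume, here is a minimal sketch. The rates are the ones quoted in this post; `hourly_rate` and `estimate_cost` are hypothetical helper names, not part of any SDK, and the pricing page remains the source of truth.

```python
# Hourly prices quoted in this post (USD); check the pricing page for current values.
BASE_RATE = {"pre_recorded": 0.21, "streaming": 0.45}
MEDICAL_MODE_ADDON = 0.15


def hourly_rate(mode, medical_mode=True):
    """Return the per-hour price for a mode, with or without Medical Mode."""
    rate = BASE_RATE[mode]
    if medical_mode:
        rate += MEDICAL_MODE_ADDON
    return round(rate, 2)


def estimate_cost(mode, audio_hours, medical_mode=True):
    """Estimate total cost for a given number of audio hours."""
    return round(hourly_rate(mode, medical_mode) * audio_hours, 2)
```

For example, `estimate_cost("streaming", 1000)` prices a thousand hours of live audio with Medical Mode at the combined $0.60/hr rate, which comes to $600.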

Want to hear it handle your languages? Get a free API key or try it in the playground.

Region-specific drug names and keyterms prompting

One more wrinkle that matters specifically for multilingual healthcare: drug brand names don't translate, they vary by market.

The same molecule ships under different brand names in the US, Germany, France, and Spain. A general medical model can't possibly anticipate every regional formulary. So when you're deploying into the DACH region or francophone care, keyterms prompting lets you hand the model the specific brand names, local terminology, and formulary your clinicians actually use—before it processes a single second of audio. It's the difference between a model that knows the active ingredient and one that also knows what your pharmacist calls it.
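One way to organize this is a per-market term list that gets merged into the request. The sketch below builds a plain dict rather than calling an SDK. The `domain` value comes from this post, but `keyterms_prompt` is an assumed parameter name, so check it against the API reference. The brand names are illustrative examples for one molecule, and `build_keyterms` and `build_request` are hypothetical helpers.

```python
# Illustrative per-market brand names for one molecule (paracetamol /
# acetaminophen). Replace with your own formulary; do not treat as complete.
KEYTERMS_BY_MARKET = {
    "US": ["Tylenol", "acetaminophen"],
    "DE": ["ben-u-ron", "Paracetamol"],
    "FR": ["Doliprane", "paracétamol"],
    "ES": ["Gelocatil", "paracetamol"],
}


def build_keyterms(markets, extra_terms=()):
    """Merge keyterm lists for the given markets, dropping case-insensitive
    duplicates while preserving first-seen order."""
    seen, merged = set(), []
    candidates = [t for m in markets for t in KEYTERMS_BY_MARKET[m]] + list(extra_terms)
    for term in candidates:
        key = term.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged


def build_request(markets, extra_terms=()):
    """Assemble request options as a plain dict.

    `domain` comes from this post; `keyterms_prompt` is an assumed
    parameter name, so verify it against the API reference before use.
    """
    return {
        "speech_models": ["universal-3-pro"],
        "domain": "medical-v1",
        "keyterms_prompt": build_keyterms(markets, extra_terms),
    }
```

Keeping the lists keyed by market means entering a new region is a data change, not a new code path, which fits the one-integration approach this post argues for.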

Pair that with diarization for multi-speaker encounters and redact_pii for sensitive data, and you've got a clinical pipeline that holds up across markets.

On the privacy side: AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA), which HIPAA requires to ensure that AssemblyAI appropriately safeguards PHI. That processing runs on BAA-eligible infrastructure, with the BAA included.

The integration you don't have to fork

Here's the insight worth sitting with if you're choosing an approach for a multilingual patient population.

The hidden cost of "supports N languages" architectures isn't accuracy—it's branching. A detect-then-route pipeline forces you to maintain a code path per language, test each one, and handle the seams between them, and those seams are exactly where mixed-language encounters live. Native code-switching collapses that whole tree into one path. You're not just getting a better transcript on Spanglish; you're getting an integration that doesn't fork every time you enter a new market.

Build one pipeline, serve every patient who walks in, regardless of which language they reach for mid-sentence.

Ready to build it? Get your free API key or test your own clinical audio in the playground.

Frequently asked questions

Which languages does Medical Mode support?

Medical Mode is benchmarked across four languages at launch—English, Spanish, German, and French—for both pre-recorded and streaming audio. The underlying Universal-3 Pro model supports six languages for pre-recorded audio (adding Italian and Portuguese) with native code-switching, and Universal-3.5 Pro Realtime extends native code-switching to 18 languages for live audio.

What is intra-utterance code-switching, and why does it matter for clinics?

It means the model can follow a speaker who changes languages within a single utterance—not just per file or per sentence. In real clinical settings where patients and clinicians mix languages mid-thought, this keeps the transcript intact without a language toggle or a separate pipeline per language.

How accurate is multilingual medical transcription?

Medical Mode posts a 3.2% Missed Entity Rate, the lowest across the providers we benchmarked, with around 20% fewer missed medical entities than Universal-3 Pro alone—and it's benchmarked across all four launch languages, not English only. Details are on the benchmarks page.

How do I handle region-specific drug brand names?

Use keyterms prompting to supply the specific brand names, local terminology, and formulary your clinicians use before transcription runs. This is especially useful across markets where the same molecule ships under different brand names.

Does code-switching require a different setup than single-language transcription?

No. You activate Medical Mode with the single domain="medical-v1" parameter, and code-switching is native—there's no toggle to enable and no separate multilingual pipeline to configure. On streaming, if a session is genuinely monolingual you can pass language_code to bias the model toward one language; leave it off to keep native code-switching.
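As a rough sketch of that choice, the helper below only sets `language_code` when you know the session is monolingual. The `domain` and `language_code` names come from this post; the dict shape, the "de" value, and the `streaming_options` helper itself are hypothetical and not an SDK call.

```python
def streaming_options(monolingual_language=None):
    """Build streaming session options as a plain dict.

    `domain` and `language_code` are named in this post; the dict shape
    and this helper are hypothetical, not an SDK call.
    """
    options = {"domain": "medical-v1"}
    if monolingual_language is not None:
        # Bias toward one language only when the session is truly monolingual.
        options["language_code"] = monolingual_language
    return options


mixed_session = streaming_options()    # no language_code: code-switching stays on
german_only = streaming_options("de")  # biased toward a single language
```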