June 23, 2026

Why streaming transcription drifts to English on multilingual audio — and how to fix language steering

Streaming speech-to-text keeps defaulting to English on multilingual audio. Here's why it drifts — and how to steer it back to the language actually spoken.

Kelsey Foster

Growth

You built a multilingual voice product, tested it on Spanish audio, and it worked. Then it hit production traffic and started handing back English. A caller says "necesito ayuda con mi factura" and the transcript reads "I need help with my invoice" — translated, not transcribed — or worse, a phonetic English mush that means nothing downstream.

If you're evaluating streaming speech-to-text for a multilingual product, this is the failure mode that slips past a clean test set and then surfaces in production. It's not random, and it's not unfixable. Here's why streaming models drift to English, and how to steer them back.

The drift is a confidence problem, not a language problem

Most multilingual streaming models can transcribe your target languages. The trouble is what they do when they're unsure — and streaming makes them unsure far more often than batch processing does.

A pre-recorded model reads the whole file before committing. A streaming model has to decide what it heard within a few hundred milliseconds, working from a short rolling window of audio. Less evidence per decision means more uncertainty, and uncertainty is exactly where the default kicks in.

That default is usually English. Most ASR training data skews heavily English, so when a model can't confidently place a sound in Spanish or German, the safest bet — statistically — is English. The model isn't broken. It's guessing, and its prior says English.

Three situations push a streaming model into that low-confidence zone over and over:

Short utterances. "Sí." "Vale." "Mmhmm." A one-word turn gives the detector almost nothing to work with, so it falls back to its prior.
Code-switching boundaries. When a speaker drops an English word into a Spanish sentence — "mándame el invoice" — a model that detects language per turn can latch onto the English token and flip the whole turn.
Noise and accents. 8 kHz phone audio, background chatter, or an underrepresented regional accent all lower confidence, and low confidence trends toward English.

So the drift isn't your audio being "too multilingual." It's the model hitting uncertainty and resolving it the wrong way.
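To make the "prior wins when evidence is thin" argument concrete, here is a toy calculation. It is a sketch, not how any production model is implemented: the prior, the per-frame likelihoods, and the function name `language_posterior` are all made-up illustration values.

```python
def language_posterior(prior, frame_likelihood, n_frames):
    """Posterior over languages after n_frames of equally informative audio."""
    # Unnormalized posterior: prior times the per-frame likelihood, once per frame.
    scores = {
        lang: prior[lang] * frame_likelihood[lang] ** n_frames
        for lang in prior
    }
    total = sum(scores.values())
    return {lang: score / total for lang, score in scores.items()}


# Illustrative numbers only: an English-heavy prior, and audio in which every
# frame is modestly more consistent with Spanish than with English.
PRIOR = {"en": 0.7, "es": 0.3}
FRAME_LIKELIHOOD = {"en": 0.4, "es": 0.6}

for n in (1, 2, 3, 6):
    post = language_posterior(PRIOR, FRAME_LIKELIHOOD, n)
    best = max(post, key=post.get)
    print(f"{n} frame(s): en={post['en']:.2f} es={post['es']:.2f} -> {best}")
```

With these numbers the audio is Spanish throughout, yet English still wins at one and two frames. Spanish only overtakes the prior from the third frame on. That is the short-utterance problem in miniature.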

Test drift on your own multilingual audio

Run your real production audio — short turns, code-switching, phone quality — through streaming transcription and see how it holds up. Start with a free account and clear docs.

Why "just force the language" usually backfires

The obvious instinct is to pin the language down — hard-code one language and be done with it. Sometimes that helps. Often it makes things worse, for one big reason: hard-coding one language breaks the moment a real conversation code-switches. Your Spanish caller says "my tracking number is..." and a single-language lock either drops it or mangles it. Real bilingual speech doesn't stay in one lane, so forcing one lane fights the audio.

The nuance worth getting right: there's a difference between blindly forcing a language and correctly telling the model what you already know. If your audio genuinely is monolingual — a support line in Osaka that runs in Japanese, a clinic intake in Madrid that runs in Spanish — committing the model to that language is now the recommended way to steer (more on the language_code parameter below). The mistake is forcing a single language onto audio that actually mixes languages. For that, you steer with context and let the model code-switch.

The fix isn't a bigger hammer. It's matching the model to your language reality and then giving its detector the signals it needs.

How to fix language steering

Think of language steering as five levers, in rough order of impact.

1. Match the model to how your users actually speak

This is the highest-leverage decision, and it's where most drift gets solved before it starts. AssemblyAI offers three streaming paths that behave differently on multilingual audio:

Universal-3.5 Pro Realtime (universal-3-5-pro) — native code-switching across 18 languages in a single stream: English, Spanish, French, German, Italian, Portuguese, Arabic, Danish, Dutch, Hebrew, Hindi, Japanese, Mandarin, Vietnamese, Finnish, Norwegian, Swedish, and Turkish. It treats a mid-sentence switch — Hinglish included — as ordinary speech instead of a language to re-detect, which is exactly the behavior that prevents drift on bilingual calls. This is the model that holds the line, and it's the new default for realtime transcription.
Universal-Streaming Multilingual (universal-streaming-multilingual) — a cost-effective path that covers a smaller set of languages with per-turn language detection. Per-turn detection is fine when speakers change languages between turns, but it's more prone to flipping on intra-sentence switches.
Whisper-Streaming (whisper-rt) — the 99+ language fallback when your languages fall outside the core set. Automatic language detection is built in and mandatory here.

Picking a per-turn model for an intra-sentence code-switching product is the single most common cause of drift we see. Match the model first.
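The decision above can be written down as a small routing function. Treat this as a sketch under assumptions: the model identifiers come from the list above, but `pick_streaming_model`, the ISO 639-1 codes used to label languages, and the `budget_languages` argument are hypothetical. The article does not enumerate which languages the per-turn model covers, so the caller has to supply that set.

```python
# Codes for the 18 languages listed above for Universal-3.5 Pro Realtime.
# The ISO 639-1 spellings are an assumption about how you label your own
# traffic, not a confirmed list of API values.
PRO_REALTIME_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "ar", "da", "nl",
    "he", "hi", "ja", "zh", "vi", "fi", "no", "sv", "tr",
})


def pick_streaming_model(languages, switches_mid_sentence,
                         budget_languages=frozenset()):
    """Return a model id for the languages and switching pattern you expect."""
    wanted = set(languages)
    if not wanted <= PRO_REALTIME_LANGUAGES:
        # At least one language falls outside the core set: use the fallback.
        return "whisper-rt"
    if (not switches_mid_sentence and budget_languages
            and wanted <= set(budget_languages)):
        # Per-turn detection is acceptable only when switches happen between turns.
        return "universal-streaming-multilingual"
    return "universal-3-5-pro"


print(pick_streaming_model({"es", "en"}, switches_mid_sentence=True))
```

The ordering of the checks is the point: mid-sentence switching rules out the per-turn model before cost is even considered.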

2. Tell the model the language when you actually know it

Universal-3.5 Pro Realtime runs in multilingual mode by default — the right behavior when you don't know what's coming. But most production calls aren't a guessing game. When you know the session is monolingual, pass the new language_code connection parameter. It commits the model to one language instead of asking it to detect one, which is the cleanest way to head off the wrong-language slips that creep in on short or ambiguous audio. This is now the recommended way to bias toward a single language.

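# Assumes the AssemblyAI Python SDK's streaming v3 client; check the SDK docs for your version.
from assemblyai.streaming.v3 import StreamingClient, StreamingClientOptions, StreamingParameters

client = StreamingClient(StreamingClientOptions(api_key="<YOUR_API_KEY>"))
client.connect(
    StreamingParameters(
        sample_rate=16000,
        speech_model="universal-3-5-pro",
        language_code="es",   # commit to Spanish when you know the call is monolingual
    )
)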

Omit language_code and you keep full multilingual code-switching. For calls that genuinely mix languages, leave it off and steer with context instead — that's what the next lever is for.
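One way to keep that rule from being forgotten is to derive the parameters from what you know about the session. The helper below is a hypothetical sketch: `connection_params` is not part of any SDK, and it only builds a plain dictionary using the parameter names shown above.

```python
def connection_params(expected_languages, sample_rate=16000):
    """Build keyword arguments for a streaming session.

    language_code is added only when exactly one language is expected;
    otherwise it is omitted so the model stays in multilingual mode.
    """
    params = {"sample_rate": sample_rate, "speech_model": "universal-3-5-pro"}
    languages = sorted(set(expected_languages))
    if len(languages) == 1:
        params["language_code"] = languages[0]
    return params


print(connection_params(["es"]))        # monolingual: language_code is set
print(connection_params(["es", "en"]))  # mixed: language_code is left out
```

The resulting dictionary could then be unpacked into whatever parameters object your client expects, assuming it accepts these field names.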

3. Give the model the conversation as context

A big share of low-confidence drift comes from the model hearing each moment cold, with no sense of what came before. Universal-3.5 Pro Realtime fixes that two ways. It keeps a short, rolling memory of the conversation (Context Carryover) and uses it as context for whatever comes next — on by default, nothing to configure. And for voice agents, you can pass the agent's own question in with agent_context, so a mumbled or short reply resolves through the lens of what was just asked. More context per decision means fewer of the uncertain moments that resolve toward English. You can also describe the audio with the prompt parameter to prime the model for a noisy line or a specific domain.
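Context Carryover happens inside the model and needs no code. What your application does own is knowing the agent's most recent question, so it has something to supply as agent_context. The sketch below shows only that bookkeeping. The `TurnContext` class is hypothetical, and how and when agent_context is actually sent is not covered here, so check the API docs for that step.

```python
from collections import deque


class TurnContext:
    """Keeps the last few turns and finds the most recent agent utterance."""

    def __init__(self, max_turns=4):
        # A bounded deque drops the oldest turn automatically when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, speaker, text):
        self._turns.append((speaker, text))

    def last_agent_utterance(self):
        # Walk backwards so the newest agent turn wins.
        for speaker, text in reversed(self._turns):
            if speaker == "agent":
                return text
        return None

    def recent(self):
        return list(self._turns)


ctx = TurnContext(max_turns=3)
ctx.add("agent", "¿Cuál es su número de factura?")
ctx.add("user", "Sí, un momento.")
agent_context = ctx.last_agent_utterance()
print(agent_context)
```

A short reply like "Sí" is far easier to place when the question that prompted it was asked in Spanish, which is the intuition behind passing the agent's turn along.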

4. Anchor vocabulary with keyterms — and hear the speaker, not the room

Drift often shows up first on the words that matter most: names, product terms, account types, medication names. Universal-3.5 Pro Realtime includes keyterms prompting at no extra cost, and it applies across all supported languages at once. Feeding the model your domain vocabulary keeps those terms anchored in the right language instead of getting "Englished" the moment confidence dips. And because background speech — a TV, a passenger — gets transcribed as phantom words that drag confidence down, voice_focus isolates the primary speaker and suppresses the rest (use near_field for headsets and phones, far_field for rooms and kiosks).
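Keyterm lists tend to be assembled from CRM exports and product catalogs, so they usually need cleaning before you send them. Here is a minimal sketch of that cleanup plus the voice_focus mapping described above. The function name, the lookup table, and the `max_terms` and `max_chars` defaults are assumptions for illustration, not documented limits.

```python
def normalize_keyterms(terms, max_terms=100, max_chars=50):
    """Trim, de-duplicate (case-insensitively), and cap a keyterm list.

    max_terms and max_chars are placeholder limits, not documented values.
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = " ".join(term.split())  # collapse stray whitespace
        key = term.casefold()
        if not term or len(term) > max_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    return cleaned[:max_terms]


# Capture setup -> voice_focus value, following the guidance above.
VOICE_FOCUS_BY_CAPTURE = {
    "headset": "near_field",
    "phone": "near_field",
    "room": "far_field",
    "kiosk": "far_field",
}

print(normalize_keyterms(["  factura ", "Factura", "número de seguimiento", ""]))
print(VOICE_FOCUS_BY_CAPTURE["phone"])
```

Keeping terms in their original language and spelling matters here: the goal is to anchor "factura" as Spanish, not to translate it.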

5. Give the detector room to work

A lot of self-inflicted drift comes from starving the model of audio. If you've cranked silence thresholds down to chase latency, you're chopping audio into fragments too short to place a language confidently. Leave enough audio per turn for the model to commit, and you'll see fewer fallbacks, especially on the short acknowledgments that trip every system. Our real-time transcription guide shows a working streaming setup you can adapt to test these settings.
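A quick way to tell whether your endpointing settings are fragmenting speech is to look at how many turns come back very short. The sketch below assumes you already log a duration per finalized turn. The `short_turn_share` name and the 500 ms cutoff are arbitrary choices for illustration, so tune the cutoff to your own traffic.

```python
def short_turn_share(turn_durations_ms, min_ms=500):
    """Fraction of turns shorter than min_ms (0.0 for an empty list)."""
    if not turn_durations_ms:
        return 0.0
    short = sum(1 for duration in turn_durations_ms if duration < min_ms)
    return short / len(turn_durations_ms)


# Hypothetical turn durations from the same call under two silence settings.
aggressive = [220, 310, 1800, 260, 950]
relaxed = [900, 2100, 1400]

print(short_turn_share(aggressive))  # 3 of 5 turns are under 500 ms
print(short_turn_share(relaxed))     # no turns are under 500 ms
```

If the share jumps when you tighten silence thresholds, the extra fragments are a likely source of new English fallbacks.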

See language steering live

Drop multilingual and code-switched audio into the playground and watch real-time transcription hold the right language. No setup required.

What "fixed" looks like in production

When the model matches your audio and has the signals it needs, the drift stops being a coin flip. A bilingual support call stays bilingual in the transcript. The same speaker keeps producing Spanish when they speak Spanish and English when they speak English — including mid-sentence — and your downstream intent detection, routing, and analytics stop choking on phantom English.

Universal-3.5 Pro Realtime does this without trading away responsiveness: end-of-turn detection reads tonality, pacing, and rhythm rather than silence alone and lands around 300 ms, and the model posts a market-leading 6.99% pooled word error rate on Pipecat's open benchmark of real agent conversations. You don't have to choose between "stays in the right language" and "fast enough for a voice agent." If you're benchmarking options, our guide on how to evaluate speech recognition models walks through testing this properly, and the Streaming Speech-to-Text API docs cover the configuration.
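When you run your own benchmark, it helps to know what a pooled figure does and does not show. The sketch below uses the common reading of "pooled" (total word errors divided by total reference words across all utterances). Whether a given benchmark defines it exactly this way is an assumption, and the function names and sample pairs are illustrative.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pooled_wer(pairs):
    """Total errors over total reference words across all utterances."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def mean_utterance_wer(pairs):
    """Average of per-utterance WER, so every utterance counts equally."""
    return sum(word_errors(r, h) / len(r.split()) for r, h in pairs) / len(pairs)


pairs = [
    ("sí", "see"),  # one-word turn that drifted to English
    ("necesito ayuda con mi factura", "necesito ayuda con mi factura"),
]
print(round(pooled_wer(pairs), 3), mean_utterance_wer(pairs))
```

In this tiny example the drifted one-word turn costs one error out of six reference words when pooled, but half the score when utterances are averaged. Pooled WER under-weights short turns, so measure drift on them separately.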

The bottom line

English drift isn't a quirk of your audio — it's a model resolving uncertainty toward its English-heavy prior, and streaming creates that uncertainty constantly. You fix it by steering, not guessing: match the model to your real language mix, tell it the language with language_code when you actually know it, feed it conversation context, anchor your vocabulary with keyterms, and give the detector enough audio to commit. Blindly pin a single language onto mixed audio and you'll win the demo and lose production. Steer instead, and the messy, code-switched, real-world call stays in the language it was actually spoken in.

If multilingual reliability is a make-or-break for your product, test with the audio that actually breaks things — short turns, accents, mid-sentence switches, phone-quality lines. That's the audio that tells you whether a model steers or drifts.
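A simple scorecard makes that test repeatable. The sketch below assumes you have labeled each test utterance with a bucket, the dominant language actually spoken, and the language the transcript came back in (by hand or with a language-ID step of your choosing). The function name, tags, and sample rows are made up for illustration.

```python
from collections import defaultdict


def english_drift_by_tag(samples):
    """Per-tag share of non-English utterances whose transcript came back English.

    Each sample is a (tag, spoken_language, transcript_language) tuple.
    """
    drifted = defaultdict(int)
    total = defaultdict(int)
    for tag, spoken, transcribed in samples:
        if spoken == "en":
            continue  # English audio cannot drift to English
        total[tag] += 1
        if transcribed == "en":
            drifted[tag] += 1
    return {tag: drifted[tag] / total[tag] for tag in total}


samples = [
    ("short_turn", "es", "en"),
    ("short_turn", "es", "es"),
    ("phone", "de", "de"),
    ("code_switch", "es", "en"),
    ("phone", "en", "en"),
]
print(english_drift_by_tag(samples))
```

Breaking the rate out by bucket shows where a model drifts, which a single aggregate number would hide.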

Frequently asked questions

Why does my streaming transcription keep switching to English on non-English audio? Streaming models work from short audio windows, so they hit low-confidence moments often — short utterances, code-switching boundaries, noise, and accents. When confidence drops, most models fall back to an English-heavy statistical prior because that's what dominates ASR training data. The fix is to match the model to your language mix and feed it better signals — the language when you know it, plus conversation context — rather than guessing.

Should I force a single language to stop the drift? Only when your audio really is monolingual. In that case, pass language_code to commit Universal-3.5 Pro Realtime to one language — that's the recommended way to steer. But hard-coding one language onto audio that actually code-switches breaks the moment a speaker mixes languages, so for genuinely bilingual calls you leave it off and steer with context instead.

Which AssemblyAI model is best for multilingual audio that code-switches mid-sentence? Universal-3.5 Pro Realtime (universal-3-5-pro). It supports native code-switching across 18 languages in one stream — including Hinglish and other mixed-language speech — and treats mid-sentence switches as ordinary speech rather than re-detecting language each turn, which is what prevents English drift. For languages outside those 18, Whisper-Streaming (whisper-rt) covers 99+ languages.

What's the difference between native code-switching and per-turn language detection? Native code-switching transcribes language changes as they happen, including inside a sentence. Per-turn detection only identifies the language at a turn boundary, so a single English word inside a Spanish sentence can flip the whole turn to English. For natural bilingual conversation, native code-switching is the one that resists drift.

How do I reduce language drift in production? Choose the model that matches your audio (Universal-3.5 Pro Realtime for mid-sentence code-switching), set language_code for monolingual sessions, and lean on the model's rolling conversation memory and agent_context so each turn is decided with context rather than cold. Pair that with keyterms prompting to keep domain vocabulary anchored in the correct language, and voice_focus to stop background speech from dragging confidence down.