Universal-3.5 Pro Realtime: the first streaming STT model that takes the agent's question as input

Our new flagship realtime model takes direction from your agent, remembers the conversation on its own, and hears the speaker instead of the room — now across 18 languages, at the same $0.45/hr.

Written by AssemblyAI Team. Published on 23 June 2026.

A customer spells out an email address and the agent writes "user at gmail dot com" as a sentence. A caller slides from Hindi to English mid-sentence and the transcript loses the thread. Neither is an edge case. Both happen because the model hears each moment on its own, with no sense of what came before it, and no conversation works that way.

Today we're releasing Universal-3.5 Pro Realtime, our new flagship realtime model. Two things define this release. The first is context: the model takes direction from your agent, remembers the conversation on its own, and hears the speaker instead of the room. The second is languages: 18 of them at full accuracy, with mid-sentence code-switching, plus steering to commit the model to just one when you already know it. Same $0.45/hr. Same pipeline. One line to upgrade.

Context: your agent knows the question. Now the model does too.

A voice agent has something no transcription model has ever had access to: it knows what it just asked. Universal-3.5 Pro Realtime closes that gap. Pass the question in with agent_context and the model hears the reply through the lens of the question. Prime it with "What's your email address?" and a mumbled answer resolves to user@assemblyai.com instead of "user at assembly a i dot com." Spelled-out account IDs, street addresses, one-word confirmations: the short utterances that wreck most realtime models finally have the context to come out right.
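
As a rough sketch of what that looks like in an agent loop: right after the agent speaks, it packages its own question for the transcription session. Only the agent_context parameter name comes from this post. The JSON message shape, the "type" value, and the helper name are assumptions for illustration, so check the realtime guide for the actual wire format.

```python
import json


def build_agent_context_message(agent_question: str) -> str:
    """Serialize the agent's last question as a context update.

    Assumption: context is sent as a JSON text frame. The "type" value
    below is a placeholder, not a confirmed message name.
    """
    question = agent_question.strip()
    if not question:
        raise ValueError("agent_context should carry the question the agent just asked")
    return json.dumps({
        "type": "UpdateConfiguration",  # assumed message type
        "agent_context": question,      # parameter name taken from the post
    })


message = build_agent_context_message("What's your email address?")
print(message)
```

The empty-string check is a design choice for the sketch: sending blank context would silently behave like sending none, which is harder to debug than an early error.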

And it's measurable. Across a benchmark of 20,000 voice agent audio files, passing agent context cut word error rate by 10.2%, with the gains concentrated exactly where agents hurt most.

Word error rate reduction

Fabrications -18.3%

Hallucinations -17.2%

Place-name entities -15.5%

Short-utterance errors -13.7%

Name entities -9.4%

Medical entities -9.4%

Technical entities -7.0%

Entity errors (overall) -5.1%

Even when you pass nothing, the model no longer starts each turn cold. It keeps a short, rolling memory of the conversation and uses it as context for whatever comes next. On by default. Nothing to configure.

This is where accuracy compounds. One voice agent team paired agent context with prompting and cut their utterance error rate from 26% to 9% on their own production audio.
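
To put that customer result in both common framings, here is a small worked calculation. It uses only the two figures quoted above (26% and 9%); the function name is ours.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was eliminated."""
    return (before - after) / before


before, after = 0.26, 0.09
absolute_points = (before - after) * 100               # percentage points
relative_pct = relative_reduction(before, after) * 100  # percent of the original rate

print(f"{absolute_points:.0f} points absolute, {relative_pct:.1f}% relative")
```

Going from 26% to 9% is a drop of 17 percentage points, or roughly 65% of the original errors.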

Low-latency STT with access to more context is exactly what I've wanted to see from next-generation models. The Context Carryover feature of Universal 3.5 Pro delivers on that.

Kwindla Hultman Kramer, CEO at Pipecat

Sharpen it further with prompting

Universal-3.5 Pro is highly accurate out of the box, but for challenging audio like short clips with limited context, noisy environments, or audio with very niche references, providing a brief description in the prompt parameter can meaningfully improve accuracy. For example, here's a 2-second clip from a League of Legends pro interview:

League of Legends pro interview

Prompt

This is a League of Legends Pro Interview.

Without prompting

And so look who I’ve been a dear.

With prompting

In solo queue, I ban Azir.

Context goes beyond the words

It's also whose voice matters, what the full call reveals, and what you know that the model doesn't:

It hears the speaker, not the room. Background speech is worse than background noise: a TV or a passenger doesn't just add static, it gets transcribed as words, and phantom words fire false interruptions. voice_focus isolates the primary speaker and suppresses everything else. Use near_field for headsets and phones, far_field for rooms, kiosks, and drive-thrus.

It gets a second look at every speaker. Speaker labels are weakest in the opening seconds, before the model has heard enough of each voice to tell them apart. So the model labels speakers live, then re-clusters every voice when the stream ends and sends a single revision correcting any labels it now knows were wrong. Live labels during the call, async-grade accuracy within about half a second of it ending, up to 10 speakers.

It takes the context you bring. Feed it your domain vocabulary with keyterm prompting and "metoprolol succinate" doesn't turn into something else. Update the prompt over the same WebSocket as the call moves from verification to troubleshooting to escalation.
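
Here is a minimal sketch of how those pieces might fit together in client code: session-level settings for voice_focus and key terms, plus a prompt that changes as the call moves through stages. The voice_focus values and the prompt parameter come from this post. The keyterms_prompt field name, the JSON message shape, and the stage prompts are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field

# Hypothetical per-stage prompts; the wording is illustrative only.
STAGE_PROMPTS = {
    "verification": "The caller is confirming their identity and account details.",
    "troubleshooting": "The caller is describing a technical problem.",
    "escalation": "The caller is asking to speak with a supervisor.",
}


@dataclass
class SessionSettings:
    voice_focus: str = "near_field"  # near_field or far_field, per the post
    keyterms: list = field(default_factory=list)

    def to_params(self) -> dict:
        if self.voice_focus not in ("near_field", "far_field"):
            raise ValueError(f"unknown voice_focus: {self.voice_focus}")
        # "keyterms_prompt" is an assumed parameter name for keyterm prompting.
        return {"voice_focus": self.voice_focus, "keyterms_prompt": list(self.keyterms)}


def prompt_update(stage: str) -> str:
    """Build a JSON text frame that swaps the prompt for a new call stage.

    Assumption: the update is a JSON object carrying a "prompt" field.
    """
    return json.dumps({"prompt": STAGE_PROMPTS[stage]})


settings = SessionSettings(voice_focus="near_field", keyterms=["metoprolol succinate"])
print(settings.to_params())
for stage in ("verification", "troubleshooting", "escalation"):
    print(prompt_update(stage))
```

In a real client, each prompt_update string would be sent over the already-open WebSocket when the agent's dialog state changes.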

All of it shows up where voice agents are actually measured. On Pipecat's open STT benchmark, which uses real agent conversations rather than clean read speech, Universal-3.5 Pro Realtime posts a market-leading pooled word error rate of 6.99%.

Languages: flagship accuracy in 18, and steering when you know which one

Universal-3 Pro Realtime ran in six languages. Universal-3.5 Pro Realtime runs in 18: English, Spanish, French, German, Italian, Portuguese, Arabic, Danish, Dutch, Hebrew, Hindi, Japanese, Mandarin, Vietnamese, Finnish, Norwegian, Swedish, and Turkish, all at flagship accuracy, with mid-sentence code-switching so bilingual calls (Hinglish included) never pause for the model to catch up.

See Universal-3.5 Pro code-switching in action:

English ↔ French

I said something like, j’ai dit à mes étudiants que it’s time to really pay attention to what the idea of code-switching is.

Automatic detection is the right default when you don't know what's coming. Most production calls aren't a guessing game, though: a support line in Osaka runs in Japanese, a clinic intake in Madrid runs in Spanish. When you know the language, tell the model. The new language_code parameter commits it to one language instead of asking it to detect one. That is the cleanest way to head off the wrong-language slips that creep in on short or ambiguous audio, and it is now the recommended way to bias toward a single language. For calls that genuinely mix languages, keep using prompting. That's what it's built for.
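
The decision above can be sketched as a small helper that returns only the language-related parameters for a session. The language_code and prompt names come from this post. The code values ("ja", "es") and the helper itself are assumptions, so confirm the accepted code format in the docs.

```python
from typing import Optional


def language_params(known_language: Optional[str] = None,
                    mixed_language_prompt: Optional[str] = None) -> dict:
    """Pick language-related session parameters.

    - Mixed-language call: steer with a prompt, leave detection automatic.
    - Single known language: commit with language_code.
    - Unknown: send nothing and rely on automatic detection.
    """
    if mixed_language_prompt is not None:
        return {"prompt": mixed_language_prompt}
    if known_language is not None:
        return {"language_code": known_language}  # assumed code format, e.g. "ja"
    return {}


print(language_params("ja"))   # support line in Osaka
print(language_params("es"))   # clinic intake in Madrid
print(language_params(mixed_language_prompt="The caller switches between Hindi and English."))
print(language_params())       # unknown caller: automatic detection
```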

Languages supported by Universal-3.5 Pro Realtime

Choose your mode

For live audio, open a WebSocket and pick a mode instead of hand-tuning a stack of low-level flags: min_latency for the fastest transcripts, balanced (the default) for strong all-around performance, or max_accuracy for noisy, far-field audio. End-of-turn detection reads how someone speaks (tonality, pacing, rhythm), not just silence, and lands around 300 ms.
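
One possible way to encode that choice in application code is a tiny policy function. The three mode names come from this post. The inputs and the priority order (accuracy wins when audio is noisy) are our own assumptions, not a recommendation from the API docs.

```python
def pick_mode(noisy_or_far_field: bool = False, latency_critical: bool = False) -> str:
    """Map deployment conditions to one of the three modes named in the post."""
    if noisy_or_far_field:
        return "max_accuracy"   # noisy, far-field audio
    if latency_critical:
        return "min_latency"    # fastest transcripts
    return "balanced"           # the default


print(pick_mode())
print(pick_mode(noisy_or_far_field=True))
print(pick_mode(latency_critical=True))
```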

How it compares

Accuracy claims are easy to make and hard to verify, so here's how Universal-3.5 Pro Realtime stacks up against the realtime models teams most often evaluate against us. These are the metrics that decide whether a voice agent feels responsive and gets the hard parts right: pooled word error rate on real agent conversations (measured with Pipecat), short utterance error rate, fabrication error rate, omission error rate, and hallucination error rate.

Pooled word error rate on agent conversations (lower is better): Universal-3.5 Pro Realtime 6.99%, Deepgram Flux 15.58%, ElevenLabs Scribe v2 9.76%, Google Chirp3 9.04%.

Full benchmark, all metrics (lower is better)

Metric	Universal-3.5 Pro Realtime	Deepgram Flux	ElevenLabs Scribe v2	Google Chirp3
Word error rate	6.99%	15.58%	9.76%	9.04%
Entity error rate	15.31%	50.50%	39.70%	21.51%
Names	16.92%	39.21%	38.03%	22.10%
Places	6.28%	14.86%	34.06%	10.04%
Phone numbers	3.55%	10.41%	4.78%	4.95%
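
For readers who want to reproduce this kind of number on their own audio, here is a minimal sketch of a pooled word error rate. It assumes the common definition (total word-level edits divided by total reference words across all files) and does no text normalization. The benchmark's exact scoring pipeline isn't described in this post, so treat this as an approximation.

```python
def word_edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pooled_wer(pairs: list) -> float:
    """Total errors over total reference words, across all files."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def mean_wer(pairs: list) -> float:
    """Average of per-file WER; short files weigh as much as long ones."""
    rates = [word_edit_distance(r.split(), h.split()) / len(r.split()) for r, h in pairs]
    return sum(rates) / len(rates)


# Toy (reference, hypothesis) pairs: a one-word reply and a longer request.
pairs = [
    ("yes", "yeah"),
    ("please check why calls keep dropping", "please check why calls keep dropping"),
]
print(f"pooled: {pooled_wer(pairs):.3f}, per-file mean: {mean_wer(pairs):.3f}")
```

The toy data shows why the distinction matters for voice agents: one wrong word in a one-word reply is a single error out of seven words when pooled, but a 100% error rate for that file when averaged per file.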

Same audio, side by side

Real transcripts from the benchmark above, on files where Universal-3.5 Pro Realtime scored a perfect word error rate. Press play, then switch competitors to hear and read exactly what each model returned.

A phone number gets the digits wrong

A caller reads a number to dial back. Every competitor changes a digit or invents one, so the agent calls the wrong line.

Truth

Uh, please check why calls from Amazon to 555-133-3139 keep dropping.

AssemblyAI

Uh, please check why calls from Amazon to 555-133-3139 keep dropping.

Competitor

Please check, like, calls from Amazon to five five five dash one three two dash two one three nine keep dropping.

Trusted by the teams building voice agents

Universal-3.5 Pro Realtime is already live across the platforms voice teams build on, alongside the API direct from us. The teams running production traffic on it are the ones who pushed hardest on accuracy, and here's what they're seeing.

For Retell, where agents handle real phone calls in regulated industries, the bar isn't "good enough." It's getting every word right.

Retell is the platform teams use to build and deploy voice AI agents for automating real-world phone calls. In industries like healthcare and finance, getting a word wrong isn't an option: accuracy has to win over speed every time. That's exactly why we're excited about high accuracy mode powered by AssemblyAI Universal 3.5-Pro.

Bing Wu, Co-Founder & CEO at Retell AI

LiveKit makes the model available through LiveKit Inference, and what stood out to them was the part most speech models don't do: using the conversation itself to get sharper.

We're excited to make AssemblyAI's Universal-3.5 Pro available on LiveKit Inference. What really stands out is their pace of innovation with Context Carryover — it intelligently applies conversation context to improve transcription accuracy in a way most speech models don't, removing the need for users to predefine key terms.

David Zhao, Co-founder at LiveKit

And for teams evaluating models head-to-head for their own agent pipelines, the combination of accuracy, latency, and language handling is what settles it.

We were searching for the best realtime ASR model for our voice agent pipeline in Fireflies. The new Universal 3.5 Pro speech model from Assembly is best so far in terms of accuracy, latency and language switching.

Foysal Osmany, Software Engineer at Fireflies

More model, same price

$0.45/hr ($0.0075/min) base, unchanged from Universal-3 Pro Realtime. Context is included, both the rolling memory and agent_context, and so is keyterm prompting.

Add-ons stack only as you use them: diarization with revision (+$0.12/hr), prompting (+$0.05/hr), voice isolation (+$0.10/hr). Unlimited concurrency, no rate limits, no upfront commitments. Volume discounts at scale.
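
A quick way to estimate a bill from the rates above. This sketch assumes each add-on applies to every hour streamed and ignores volume discounts. The dictionary keys are our own labels, not API parameter names.

```python
from decimal import Decimal

BASE_PER_HOUR = Decimal("0.45")
ADD_ONS_PER_HOUR = {
    "diarization": Decimal("0.12"),      # diarization with revision
    "prompting": Decimal("0.05"),
    "voice_isolation": Decimal("0.10"),
}


def hourly_rate(add_ons=()) -> Decimal:
    """Base rate plus the selected add-ons, in dollars per hour."""
    return BASE_PER_HOUR + sum((ADD_ONS_PER_HOUR[name] for name in add_ons), Decimal("0"))


def estimate_cost(hours, add_ons=()) -> Decimal:
    """Estimated spend in dollars for a number of streamed hours."""
    return hourly_rate(add_ons) * Decimal(str(hours))


selected = ["diarization", "voice_isolation"]
print(f"${hourly_rate(selected)}/hr")
print(f"${estimate_cost(1000, selected)} for 1,000 hours")
```

With diarization and voice isolation enabled, the rate works out to $0.67/hr, or $670 for 1,000 hours. Decimal is used instead of float so the cents add up exactly.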

One line to upgrade

Universal-3.5 Pro Realtime is the new default for realtime transcription and the speech foundation under our Voice Agent API, so anything you've built on it gets sharper today. Most teams get the upgrade automatically. If you pin model versions, point speech_model at the new model. Your current model stays available as a pinnable snapshot, and we'll publish a migration timeline so you can move on your own schedule.
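
If you do pin versions, the change might look something like the sketch below, which builds connection parameters with and without a pinned model. Only the speech_model parameter name comes from this post. Passing it as a query string, the sample_rate parameter, and the placeholder identifier are assumptions; take the real model identifier from the docs.

```python
from typing import Optional
from urllib.parse import urlencode


def connection_query(pinned_model: Optional[str] = None, **other_params) -> str:
    """Build a query string, adding speech_model only when a version is pinned."""
    params = dict(other_params)
    if pinned_model is not None:
        params["speech_model"] = pinned_model
    return urlencode(params)


# Placeholder, not a real model identifier.
MODEL_ID = "MODEL_ID_FROM_DOCS"

unpinned = connection_query(sample_rate=16000)                    # follows the default model
pinned = connection_query(pinned_model=MODEL_ID, sample_rate=16000)
print(unpinned)
print(pinned)
```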

Test it in the Playground with your own audio, or read the realtime guide to get started.
