Universal-3 Pro Streaming API
Overview
This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS — no LiveKit, Pipecat, or other orchestrator in the loop.
Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages — this guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.
If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead — semantic interruption handling is built in there.
Quickstart
A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:
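A sketch of such a consumer is below. The endpoint URL, query parameters, and event field names are assumptions for illustration; check the Universal-3 Pro message sequence reference for the real values. The third-party `websockets` package (`pip install websockets`) handles the connection; event routing is kept in a plain function so it can run without a live socket.

```python
import asyncio
import json

# Hypothetical endpoint and query string -- confirm against the official docs.
WS_URL = "wss://streaming.assemblyai.com/v3/ws?sample_rate=16000"


def handle_event(msg: dict, state: dict) -> None:
    """Route one decoded server event into simple session bookkeeping."""
    kind = msg.get("type")
    if kind == "Begin":
        state["session_id"] = msg.get("id")          # session established
    elif kind == "SpeechStarted":
        state["user_speaking"] = True                # candidate barge-in signal
    elif kind == "Turn":
        state["user_speaking"] = False
        state["turns"].append(msg.get("transcript", ""))
    elif kind == "Termination":
        state["done"] = True                         # server is closing the session


async def consume(api_key: str) -> dict:
    """Connect, read events until Termination, and return collected state."""
    import websockets  # pip install websockets (not stdlib)

    state = {"turns": [], "done": False}
    # Header name and kwarg may differ by websockets version
    # (extra_headers vs. additional_headers).
    async with websockets.connect(WS_URL, extra_headers={"Authorization": api_key}) as ws:
        async for raw in ws:
            handle_event(json.loads(raw), state)
            if state["done"]:
                break
    return state
```

In a real agent you would also stream microphone audio to the socket from a second task; this sketch only shows the receive side, which is where turn and barge-in decisions happen.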
For the full message protocol — including all event fields, audio framing, and termination — see the Universal-3 Pro message sequence reference.
Turn detection
Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:
Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
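As an illustration of how such tuning parameters are typically passed, the sketch below builds a connection URL carrying them as query parameters. The parameter names here are placeholders, not documented names; take the real ones from the Universal-3 Pro overview.

```python
from urllib.parse import urlencode

# Hypothetical parameter names for illustration only -- see the official
# overview page for the actual turn-detection parameters and their ranges.
params = {
    "sample_rate": 16000,
    "end_of_turn_confidence_threshold": 0.7,        # lower = faster end-of-turn
    "min_end_of_turn_silence_when_confident": 160,  # ms of silence before finalizing
}

url = "wss://streaming.assemblyai.com/v3/ws?" + urlencode(params)
```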
Interruption handling
While the agent is speaking, users often produce backchannel utterances — “mhm”, “yeah”, “um”, “okay” — that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.
The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.
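A minimal sketch of that filter follows. The names (`_should_suppress_interrupt`, `MIN_WORDS`, `BACKCHANNELS`) match the description on this page; the class wrapper, the exact backchannel vocabulary, and the timing mechanics are assumptions you should adapt to your agent loop.

```python
import time

# Domain-customizable filler vocabulary; "yes"/"no" deliberately excluded
# because they often carry real confirmations in booking/IVR flows.
BACKCHANNELS = {"mhm", "mm-hmm", "uh-huh", "yeah", "um", "uh", "okay", "ok", "right", "hmm"}
MIN_WORDS = 2        # raise if two-word filler ("uh okay") slips through
GRACE_SECONDS = 1.0  # keep filtering briefly after agent speech ends


class InterruptGate:
    """Decides whether a Turn event during agent speech is a real barge-in."""

    def __init__(self) -> None:
        self._agent_speaking = False
        self._speech_ended_at = 0.0

    def agent_started_speaking(self) -> None:
        self._agent_speaking = True

    def agent_stopped_speaking(self) -> None:
        self._agent_speaking = False
        self._speech_ended_at = time.monotonic()

    def _in_guard_window(self) -> bool:
        # Filter applies while speaking and during the grace window afterward.
        return self._agent_speaking or (
            time.monotonic() - self._speech_ended_at
        ) < GRACE_SECONDS

    def _should_suppress_interrupt(self, transcript: str) -> bool:
        tokens = [t.strip(".,!?").lower() for t in transcript.split()]
        if len(tokens) < MIN_WORDS:
            return True                          # too short to be a real interrupt
        return all(t in BACKCHANNELS for t in tokens)  # pure filler

    def on_turn(self, transcript: str) -> bool:
        """Return True if this Turn should trigger barge-in."""
        if self._in_guard_window() and self._should_suppress_interrupt(transcript):
            return False
        return True
```

`SpeechStarted` events during agent speech should only set a pending flag; call `on_turn` once the corresponding `Turn` transcript arrives, so barge-in never fires on audio alone.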
How it works:
- While the agent is speaking (plus a 1-second grace window after speech ends), each `Turn` event is checked. `_should_suppress_interrupt` returns `True` when the transcript has fewer than `MIN_WORDS` tokens or when every token is a known backchannel. Either condition drops the event.
- Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
- `SpeechStarted` is gated through the same `Turn`-level check rather than firing barge-in directly: this prevents a race where a backchannel triggers `SpeechStarted` before the gating logic sees the transcript.
The `BACKCHANNELS` set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios; edit the set for your use case. `MIN_WORDS = 2` is a reasonable default — raise it if you see frequent two-word filler (“uh okay”, “yeah right”) slipping through.
If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling — see LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.
Recommended configuration
Three presets covering most voice-agent use cases:
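As a purely illustrative sketch (preset names, parameter names, and values below are assumptions, not documented defaults), such presets might be expressed as a config table keyed by use case:

```python
# Hypothetical presets -- consult the official configuration docs for
# recommended parameter names and values before using any of these.
PRESETS = {
    # Fast IVR menus: close turns aggressively for snappy responses.
    "low_latency_ivr": {
        "end_of_turn_confidence_threshold": 0.5,
        "min_end_of_turn_silence_when_confident": 120,
    },
    # General conversational agents: balanced latency vs. turn accuracy.
    "conversational": {
        "end_of_turn_confidence_threshold": 0.7,
        "min_end_of_turn_silence_when_confident": 160,
    },
    # Dictation-style capture: tolerate pauses, avoid splitting entities.
    "careful_capture": {
        "end_of_turn_confidence_threshold": 0.9,
        "min_end_of_turn_silence_when_confident": 400,
    },
}
```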
For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.