Universal-3 Pro Streaming API
Overview
This guide is for voice agents that connect AssemblyAI’s Universal-3 Pro Streaming WebSocket directly to a custom LLM and TTS — no LiveKit, Pipecat, or other orchestrator in the loop.
Universal-3 Pro Streaming is optimized for real-time audio under 10 seconds with low-latency turn detection, native multilingual code switching, and prompting support. The protocol is documented in detail on the Universal-3 Pro overview and message sequence pages — this guide focuses on the voice-agent loop and how to handle barge-in and interruptions correctly.
If you’re building on AssemblyAI’s Voice Agent API (a managed endpoint with built-in LLM and turn detection), see Turn detection and interruptions instead — semantic interruption handling is built in there.
Quickstart
A minimal Python consumer that connects to the streaming WebSocket and reacts to Begin, Turn, SpeechStarted, and Termination events:
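A sketch of such a consumer is below. The endpoint URL, query parameters, and event field names are assumptions for illustration; check the Universal-3 Pro message sequence reference for the real values. The third-party `websockets` package (`pip install websockets`) handles the connection; event routing is kept in a plain function so it can run without a live socket.

```python
import asyncio
import json

# Hypothetical endpoint and query string -- confirm against the official docs.
WS_URL = "wss://streaming.assemblyai.com/v3/ws?sample_rate=16000"


def handle_event(msg: dict, state: dict) -> None:
    """Route one decoded server event into simple session bookkeeping."""
    kind = msg.get("type")
    if kind == "Begin":
        state["session_id"] = msg.get("id")          # session established
    elif kind == "SpeechStarted":
        state["user_speaking"] = True                # candidate barge-in signal
    elif kind == "Turn":
        state["user_speaking"] = False
        state["turns"].append(msg.get("transcript", ""))
    elif kind == "Termination":
        state["done"] = True                         # server is closing the session


async def consume(api_key: str) -> dict:
    """Connect, read events until Termination, and return collected state."""
    import websockets  # pip install websockets (not stdlib)

    state = {"turns": [], "done": False}
    # Header name and kwarg may differ by websockets version
    # (extra_headers vs. additional_headers).
    async with websockets.connect(WS_URL, extra_headers={"Authorization": api_key}) as ws:
        async for raw in ws:
            handle_event(json.loads(raw), state)
            if state["done"]:
                break
    return state
```

In a real agent you would also stream microphone audio to the socket from a second task; this sketch only shows the receive side, which is where turn and barge-in decisions happen.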
For the full message protocol — including all event fields, audio framing, and termination — see the Universal-3 Pro message sequence reference.
Turn detection
Universal-3 Pro Streaming uses punctuation-based turn detection controlled by two parameters:
Lower values produce faster transcripts at the cost of occasional entity splits across turns. See the Universal-3 Pro overview for tuning guidance and the message sequence reference for the full event protocol.
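As an illustration of how such tuning parameters are typically passed, the sketch below builds a connection URL carrying them as query parameters. The parameter names here are placeholders, not documented names; take the real ones from the Universal-3 Pro overview.

```python
from urllib.parse import urlencode

# Hypothetical parameter names for illustration only -- see the official
# overview page for the actual turn-detection parameters and their ranges.
params = {
    "sample_rate": 16000,
    "end_of_turn_confidence_threshold": 0.7,        # lower = faster end-of-turn
    "min_end_of_turn_silence_when_confident": 160,  # ms of silence before finalizing
}

url = "wss://streaming.assemblyai.com/v3/ws?" + urlencode(params)
```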
Interruption handling
While the agent is speaking, users often produce backchannel utterances — “mhm”, “yeah”, “um”, “okay” — that you don’t want to treat as interruptions. A barge-in trigger that fires on every SpeechStarted (or every short Turn) will cause the agent to stop mid-sentence even though the user didn’t intend to interrupt.
The recommended fix is a single combined filter applied to each Turn event during agent speech: skip the barge-in if the transcript is short or if every token is a known backchannel. Reset the filter once the agent has finished speaking.
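A minimal sketch of that filter follows. The names (`_should_suppress_interrupt`, `MIN_WORDS`, `BACKCHANNELS`) match the description on this page; the class wrapper, the exact backchannel vocabulary, and the timing mechanics are assumptions you should adapt to your agent loop.

```python
import time

# Domain-customizable filler vocabulary; "yes"/"no" deliberately excluded
# because they often carry real confirmations in booking/IVR flows.
BACKCHANNELS = {"mhm", "mm-hmm", "uh-huh", "yeah", "um", "uh", "okay", "ok", "right", "hmm"}
MIN_WORDS = 2        # raise if two-word filler ("uh okay") slips through
GRACE_SECONDS = 1.0  # keep filtering briefly after agent speech ends


class InterruptGate:
    """Decides whether a Turn event during agent speech is a real barge-in."""

    def __init__(self) -> None:
        self._agent_speaking = False
        self._speech_ended_at = 0.0

    def agent_started_speaking(self) -> None:
        self._agent_speaking = True

    def agent_stopped_speaking(self) -> None:
        self._agent_speaking = False
        self._speech_ended_at = time.monotonic()

    def _in_guard_window(self) -> bool:
        # Filter applies while speaking and during the grace window afterward.
        return self._agent_speaking or (
            time.monotonic() - self._speech_ended_at
        ) < GRACE_SECONDS

    def _should_suppress_interrupt(self, transcript: str) -> bool:
        tokens = [t.strip(".,!?").lower() for t in transcript.split()]
        if len(tokens) < MIN_WORDS:
            return True                          # too short to be a real interrupt
        return all(t in BACKCHANNELS for t in tokens)  # pure filler

    def on_turn(self, transcript: str) -> bool:
        """Return True if this Turn should trigger barge-in."""
        if self._in_guard_window() and self._should_suppress_interrupt(transcript):
            return False
        return True
```

`SpeechStarted` events during agent speech should only set a pending flag; call `on_turn` once the corresponding `Turn` transcript arrives, so barge-in never fires on audio alone.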
How it works:
- While the agent is speaking (plus a 1-second grace window after speech ends), each `Turn` event is checked. `_should_suppress_interrupt` returns `True` when the transcript has fewer than `MIN_WORDS` tokens or when every token is a known backchannel. Either condition drops the event.
- Utterances with any non-filler content past the threshold (e.g., “yeah I’d like the suite”) always pass through.
- `SpeechStarted` is gated through the same `Turn`-level check rather than firing barge-in directly: this prevents a race where a backchannel triggers `SpeechStarted` before the gating logic sees the transcript.
The `BACKCHANNELS` set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios; edit the set for your use case. `MIN_WORDS = 2` is a reasonable default — raise it if you see frequent two-word filler (“uh okay”, “yeah right”) slipping through.
If you’re building on LiveKit, prefer LiveKit’s adaptive interruption handling — see LiveKit > Interruption handling. The strategy on this page is the equivalent for direct-WebSocket integrations.
Recommended configuration
Three presets covering most voice-agent use cases:
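As a purely illustrative sketch (preset names, parameter names, and values below are assumptions, not documented defaults), such presets might be expressed as a config table keyed by use case:

```python
# Hypothetical presets -- consult the official configuration docs for
# recommended parameter names and values before using any of these.
PRESETS = {
    # Fast IVR menus: close turns aggressively for snappy responses.
    "low_latency_ivr": {
        "end_of_turn_confidence_threshold": 0.5,
        "min_end_of_turn_silence_when_confident": 120,
    },
    # General conversational agents: balanced latency vs. turn accuracy.
    "conversational": {
        "end_of_turn_confidence_threshold": 0.7,
        "min_end_of_turn_silence_when_confident": 160,
    },
    # Dictation-style capture: tolerate pauses, avoid splitting entities.
    "careful_capture": {
        "end_of_turn_confidence_threshold": 0.9,
        "min_end_of_turn_silence_when_confident": 400,
    },
}
```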
For cross-cutting topics like dynamic configuration updates, scaling, latency budgeting, and evals, see the voice agent best practices guide.