Universal-3 Pro (Streaming)
Set up and configure Universal-3 Pro (Streaming) for real-time streaming transcription.
Universal-3 Pro for streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.
This model is fantastic for voice agents, agent assist, and all streaming use cases that don’t require partial transcriptions for every single subword — partials are only produced during periods of silence, with at most one partial per silence period (see Partials behavior for details). Universal-3 Pro streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names — all with sub-300ms time to complete transcript latency.
Already using AssemblyAI streaming?
If you’re an existing AssemblyAI streaming user, you can quickly test
Universal-3 Pro by switching the speech_model parameter to "u3-rt-pro" in
your connection parameters. No other code changes are required — just update
the model and start streaming.
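If you connect over a raw WebSocket, the switch is just one query parameter. A minimal sketch (the v3 endpoint URL shown here is assumed from the existing streaming API; verify it against your current connection code):

```python
from urllib.parse import urlencode

# Connection parameters from an existing streaming setup; the only change
# needed for Universal-3 Pro is the speech_model value.
params = {"sample_rate": 16000, "speech_model": "u3-rt-pro"}

# Assumed v3 streaming WebSocket endpoint; check against your current URL.
url = "wss://streaming.assemblyai.com/v3/ws?" + urlencode(params)
print(url)
```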
Quickstart
Get started with Universal-3 Pro streaming using the code below. This example streams audio from your microphone and prints transcription results in real time — no custom prompt is needed, since Universal-3-Pro automatically applies a default prompt optimized for turn detection.
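A minimal microphone-streaming sketch with the Python SDK is shown below. The class and event names follow the SDK's streaming v3 module, and the `speech_model` field on `StreamingParameters` is an assumption; the connection only runs when an `ASSEMBLYAI_API_KEY` environment variable is set.

```python
import os

def main():
    # Imports are deferred so this sketch stays importable without the SDK installed.
    import assemblyai as aai
    from assemblyai.streaming.v3 import (
        StreamingClient,
        StreamingClientOptions,
        StreamingEvents,
        StreamingParameters,
        TurnEvent,
    )

    client = StreamingClient(
        StreamingClientOptions(api_key=os.environ["ASSEMBLYAI_API_KEY"])
    )

    def on_turn(self, event: TurnEvent):
        # end_of_turn: true marks the single, formatted final transcript per turn
        print(f"{event.transcript} (end_of_turn={event.end_of_turn})")

    client.on(StreamingEvents.Turn, on_turn)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            speech_model="u3-rt-pro",  # select Universal-3 Pro streaming
        )
    )
    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__" and os.environ.get("ASSEMBLYAI_API_KEY"):
    main()
```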
Prompting
Universal-3-Pro supports custom prompts and keyterms prompting to improve transcription accuracy for your use case. For detailed guidance on crafting effective prompts, default prompt behavior, and keyterms prompting, see the Prompting Guide (Streaming).
You can also boost recognition of specific terms using the keyterms_prompt parameter. See Keyterms prompting for details.
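For example, at connection time (the term list is hypothetical, and `keyterms_prompt` as a list of strings is assumed from the existing Universal-Streaming keyterms feature):

```python
# Hypothetical key terms for a support agent; keyterms_prompt takes a list
# of words and phrases to boost (names, product terms, jargon, etc.).
params = {
    "speech_model": "u3-rt-pro",
    "keyterms_prompt": ["AssemblyAI", "Universal-3 Pro", "SOC 2"],
}
print(params["keyterms_prompt"])
```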
Configuring turn detection
Universal-3-Pro uses a punctuation-based turn detection system controlled by two parameters: min_turn_silence and max_turn_silence.
When silence reaches min_turn_silence, the model transcribes the audio and checks for terminal punctuation (. ? !):
- Terminal punctuation found — the turn ends and is emitted as a final transcript (end_of_turn: true).
- No terminal punctuation — a partial transcript is emitted (end_of_turn: false) and the turn continues waiting.
- If silence continues to max_turn_silence, the turn is forced to end as a final transcript (end_of_turn: true) regardless of punctuation.
This differs from Universal-Streaming English and Multilingual, which use a confidence-based end-of-turn system controlled by end_of_turn_confidence_threshold.
Instead, Universal-3-Pro makes turn decisions based on ending punctuation after min_turn_silence has elapsed. Because of this, end_of_turn_confidence_threshold has no impact.
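The decision flow above can be sketched as a small pure-Python function. The threshold defaults and return labels here are illustrative, not SDK values:

```python
from typing import Optional

TERMINAL_PUNCTUATION = (".", "?", "!")

def eot_decision(transcript: str, silence_ms: int,
                 min_turn_silence: int = 160,
                 max_turn_silence: int = 2400) -> Optional[str]:
    """Illustrative sketch of the punctuation-based end-of-turn check.

    Returns "final" (end_of_turn: true), "partial" (end_of_turn: false),
    or None when silence hasn't reached min_turn_silence yet.
    Default thresholds are example values, not documented defaults.
    """
    if silence_ms < min_turn_silence:
        return None  # speech may still be ongoing; no check fires yet
    if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "final"  # terminal punctuation ends the turn
    if silence_ms >= max_turn_silence:
        return "final"  # forced endpoint regardless of punctuation
    return "partial"  # keep waiting for more speech

print(eot_decision("How can I help you today?", 200))   # final
print(eot_decision("My number is four two four", 200))  # partial
```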
end_of_turn and turn_is_formatted
Because formatting is built into the end-of-turn system in Universal-3-Pro
streaming, there is only ever one end-of-turn transcript per turn and it is
always formatted. This means end_of_turn and turn_is_formatted always have
the same value for Universal-3-Pro streaming. You can reliably use
end_of_turn: true to detect a formatted, final end-of-turn transcript.
For example, to configure both parameters:
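A sketch of that configuration, showing min_turn_silence and max_turn_silence together (values are illustrative and assumed to be in milliseconds):

```python
# Illustrative values (assumed milliseconds); tune against your own audio.
params = {
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 160,   # how soon the punctuation check fires
    "max_turn_silence": 2400,  # hard cap before a turn is forced to end
}
print(params)
```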
Partials behavior
Partials are Turn events where end_of_turn is false. They are produced whenever min_turn_silence is reached but the transcript's ending punctuation doesn't signal the end of a turn.
There can be multiple partial transcripts per turn, but each period of silence can produce at most one partial. If silence exceeds min_turn_silence, but speech resumes before max_turn_silence, the partial is emitted and the EOT check resets until the next period of silence.
If you’re running eager LLM inference on partial transcripts, we recommend setting min_turn_silence to 100.
Entity splitting (accuracy) vs. model latency trade-off
Setting min_turn_silence too low can split entities like phone numbers and emails across multiple partials. We have found that downstream LLM steps can repair these splits for voice agents, but we recommend testing carefully with your use case.
Formatting and turn detection
Because the model applies punctuation and formatting intelligently, it pairs well with formatting-based turn detection. For example, the same word can be punctuated differently based purely on vocal tone:
- "Pizza." — Statement
- "Pizza?" — Questioning tone
- "Pizza---" — Trailing off
The punctuation quality has been excellent when paired with custom turn detection models.
From testing, mid-turn emission looks like this — where each line is an additional partial leading up to the final end-of-turn transcript:
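An illustrative progression (a made-up example, not captured output; note that only the final line is formatted):

```text
my phone number is
my phone number is five five five
my phone number is five five five zero one
My phone number is 555-0123.
```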
Each partial is emitted during a silence period within the turn. The final line with terminal punctuation triggers the end of turn.
Forcing a turn endpoint
You can force the current turn to end immediately by sending a ForceEndpoint message:
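Over a raw WebSocket connection this is a one-line JSON message (the message shape is assumed from the streaming API's named message types):

```python
import json

# ForceEndpoint carries no fields beyond its type.
force_endpoint = json.dumps({"type": "ForceEndpoint"})
print(force_endpoint)  # -> {"type": "ForceEndpoint"}
# With a connected client, send it on the open socket, e.g.: ws.send(force_endpoint)
```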
This is useful when your application knows the user has finished speaking based on external signals (e.g., a button press).
Updating configuration mid-stream
You can update configuration during an active streaming session using UpdateConfiguration. This applies changes without needing to reconnect. The recommended approach is to dynamically update keyterms_prompt based on the current stage of your voice agent flow — if you expect certain answers or terminology at a specific stage, proactively add those as keyterms so the model recognizes them accurately.
For example, if your voice agent is currently asking for the caller’s name and date of birth, send the expected terms for that stage:
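A sketch of that message (the term list is hypothetical, and the message shape is assumed from the UpdateConfiguration message type named above):

```python
import json

# Hypothetical terms for a "collect name and date of birth" stage.
update = {
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["Siobhan", "Nguyen", "date of birth", "February"],
}
message = json.dumps(update)
# Send on the open streaming connection, e.g.: ws.send(message)
```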
Then, when the conversation moves to a different stage (e.g., medical intake), update with the relevant terms:
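For instance, for a medical-intake stage (again with hypothetical terms and an assumed message shape):

```python
import json

# Hypothetical terms for a medical-intake stage; this replaces the
# key terms sent for the previous stage.
update = {
    "type": "UpdateConfiguration",
    "keyterms_prompt": ["metformin", "lisinopril", "penicillin allergy"],
}
message = json.dumps(update)
```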
You can also update prompt, max_turn_silence, min_turn_silence, or any combination at the same time:
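A combined update might look like this (all values are illustrative; silence thresholds are assumed to be in milliseconds):

```python
import json

# Multiple fields can go in one UpdateConfiguration message.
update = {
    "type": "UpdateConfiguration",
    "prompt": "Format spoken numbers as digits.",
    "min_turn_silence": 240,
    "max_turn_silence": 4000,
    "keyterms_prompt": ["card number", "expiration date", "CVV"],
}
message = json.dumps(update)
```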
Common reasons to update configuration mid-stream:
- keyterms_prompt — Dynamically add terms relevant to the current stage of your voice agent flow. This is the most effective way to improve recognition accuracy mid-stream. See Keyterms prompting for details.
- prompt — Pass updated behavioral or formatting instructions into the STT stream.
- max_turn_silence — Increase for moments where you'd expect a longer pause, such as when a caller is reading out a credit card number, ID number, or address. Decrease it again afterward to resume snappier turn detection.
- min_turn_silence — Tune how quickly speculative EOT checks fire. Lower values produce faster partials for eager LLM inference, while higher values reduce entity splitting for utterances with numbers or proper nouns.