Overview
This guide covers integrating AssemblyAI’s Universal-3 Pro Streaming speech-to-text model into a LiveKit voice agent using the Agents framework.
Turn detection
In LiveKit, how your agent detects the end of a user’s turn is controlled by the turn_detection parameter inside TurnHandlingOptions, which is passed to AgentSession via the turn_handling argument.
Universal-3 Pro Streaming uses a punctuation-based turn detection system, which checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score.
This means the min_turn_silence and max_turn_silence parameters you pass to AssemblyAI directly control when transcripts are emitted and when turns end. For more details on how this works, see Configuring turn detection.
Default parameter differences
Universal-3 Pro Streaming’s endpointing is controlled by two AssemblyAI API parameters, min_turn_silence and max_turn_silence, that you pass to the STT plugin. These are separate from LiveKit’s endpointing.min_delay and endpointing.max_delay (set inside TurnHandlingOptions).
| Parameter | AssemblyAI API default | LiveKit plugin default | Description |
|---|---|---|---|
| min_turn_silence | 100 ms | 100 ms | Silence before a speculative end-of-turn check. If terminal punctuation (. ? !) is found, the turn ends. If not, a partial is emitted and the turn continues. |
| max_turn_silence | 1000 ms | 100 ms | Maximum silence before forcing the turn to end, regardless of punctuation. |
With turn_detection="stt", explicitly set max_turn_silence=1000 if you want to mimic the behavior of streaming directly to the API without LiveKit.
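To make the difference in defaults concrete, the following minimal sketch models the two sets of defaults from the table as plain dictionaries and shows which values end up in effect. The names API_DEFAULTS, PLUGIN_DEFAULTS, and effective_silence_params are illustrative only and are not part of the AssemblyAI or LiveKit APIs.

```python
# Defaults taken from the table above, in milliseconds.
API_DEFAULTS = {"min_turn_silence": 100, "max_turn_silence": 1000}
PLUGIN_DEFAULTS = {"min_turn_silence": 100, "max_turn_silence": 100}


def effective_silence_params(overrides=None, via_plugin=True):
    """Return the silence parameters in effect after applying overrides.

    Hypothetical helper: it only illustrates which defaults apply.
    """
    base = PLUGIN_DEFAULTS if via_plugin else API_DEFAULTS
    params = dict(base)
    params.update(overrides or {})
    return params


# Through the plugin with no overrides, the forced turn end fires at 100 ms.
print(effective_silence_params())
# Setting max_turn_silence=1000 explicitly matches the direct-API defaults.
print(effective_silence_params({"max_turn_silence": 1000}))
```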
Tuning endpointing parameters
These are the default values used when no parameters are explicitly provided. You will likely need to experiment with different values depending on your use case:
- Increase min_turn_silence when brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking.
- Increase max_turn_silence when the forced turn end is cutting off users mid-thought or splitting entities like phone numbers across turns. A higher value lets the model wait longer before forcing the turn to end when it is unsure.
STT-based Turn Detection (recommended)
With turn_detection="stt", AssemblyAI’s built-in punctuation-based turn detection determines when the user has finished speaking. AssemblyAI’s end_of_turn signals are then used directly by LiveKit to commit the turn.
In this mode, we recommend explicitly setting min_turn_silence=100 and max_turn_silence=1000. These are AssemblyAI’s API defaults and provide a good balance of responsiveness and accuracy.
The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100, which might be too aggressive for STT-based turn detection.
Recommended starting parameters (set on assemblyai.STT(), not on AgentSession):
| Parameter | Default | Description |
|---|---|---|
| min_turn_silence | 100 ms | Silence duration before a speculative end-of-turn (EOT) check fires. |
| max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end. |
- User speaks → audio streams to AssemblyAI
- User pauses for 100 ms → AssemblyAI checks for terminal punctuation
- If terminal punctuation (. ? !) → turn ends immediately
- If no terminal punctuation → partial emitted, turn continues waiting
- If silence reaches 1000 ms → turn is forced to end regardless of punctuation
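The decision flow above can be sketched as a small pure function. This is a simplified model of the described behavior, not AssemblyAI’s implementation; the function name turn_decision and its return labels are made up for illustration.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(silence_ms, transcript, min_turn_silence=100, max_turn_silence=1000):
    """Classify what happens after `silence_ms` of silence following `transcript`.

    Simplified model of punctuation-based turn detection (illustrative only).
    """
    reached_min = silence_ms >= min_turn_silence
    if reached_min and transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
        return "end_of_turn"   # speculative check found terminal punctuation
    if silence_ms >= max_turn_silence:
        return "forced_end"    # turn ends regardless of punctuation
    if reached_min:
        return "partial"       # partial emitted, turn keeps waiting
    return "listening"         # not enough silence for a check yet


print(turn_decision(100, "I'd like to book a room."))
print(turn_decision(100, "I'd like to book"))
print(turn_decision(1000, "I'd like to book"))
```

With the plugin defaults (min_turn_silence=100, max_turn_silence=100), the same unpunctuated transcript is forced to end at the first check, which is why those defaults can be too aggressive in this mode.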
LiveKit turn detection (with MultilingualModel())
As a third-party turn detection model, LiveKit’s turn detector runs on top of STT output to make turn decisions. AssemblyAI’s role is then just to provide transcripts as quickly as possible, while the turn detection model decides when the user is actually done speaking.
Use MultilingualModel() rather than EnglishModel(), as Universal-3 Pro Streaming supports English, Spanish, German, French, Portuguese, and Italian. MultilingualModel() covers support for all of these languages.
The LiveKit plugin defaults of min_turn_silence=100 and max_turn_silence=100 work well here, as max_turn_silence is brought down to match min_turn_silence so that transcripts are handed off to the turn detection model as fast as possible.
MultilingualModel parameters (set inside TurnHandlingOptions → endpointing, not on the STT plugin):
| Parameter | Default | Description |
|---|---|---|
| endpointing.min_delay | 0.5 s | Time to wait before committing a turn when the model predicts a likely boundary. |
| endpointing.max_delay | 3.0 s | Maximum time to wait when the model predicts the user will continue speaking. Has no effect without a turn detector model. |
- User speaks → audio streams to AssemblyAI
- User pauses for 100 ms → AssemblyAI emits transcript (final and partial are the same) immediately
- LiveKit’s MultilingualModel() evaluates the transcript in conversational context
- If the model predicts a likely turn boundary → waits min_delay (0.5 s) then commits the turn
- If the model predicts the user will continue → waits up to max_delay (3.0 s) for more speech
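As a rough illustration of the last two steps, the sketch below models how the wait before committing depends on the turn detector’s prediction. The helper names are hypothetical and the logic is a simplification of what LiveKit does internally.

```python
def commit_wait_seconds(predicts_turn_boundary, min_delay=0.5, max_delay=3.0):
    """Return how long to wait before committing the turn, in seconds.

    When the model predicts the user will continue, this is an upper bound.
    """
    return min_delay if predicts_turn_boundary else max_delay


def turn_is_committed(predicts_turn_boundary, resume_after_s=None,
                      min_delay=0.5, max_delay=3.0):
    """Return True if the wait elapses before the user starts speaking again.

    `resume_after_s` is None when the user stays silent.
    """
    wait = commit_wait_seconds(predicts_turn_boundary, min_delay, max_delay)
    return resume_after_s is None or resume_after_s > wait


# Likely boundary: committed after 0.5 s of waiting.
print(turn_is_committed(True))
# Model expects more speech and the user resumes after 1.2 s: not committed.
print(turn_is_committed(False, resume_after_s=1.2))
```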
Other turn detection modes
- vad:
  - Detect end of turn from speech and silence data alone using Silero VAD.
  - Turn boundaries are determined purely by voice activity without semantic context.
  - AssemblyAI’s turn detection parameters still control when transcripts are emitted, but it is recommended to leave them at the plugin defaults (min_turn_silence=100, max_turn_silence=100) so transcripts arrive as quickly as possible.
- manual:
  - Disable automatic turn detection entirely.
  - You control turns explicitly using session.commit_user_turn(), session.clear_user_turn(), and session.interrupt().
  - See the manual turn control docs for details.
Entity splitting tradeoff
Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.
min_turn_silence too low
- Speculative check fires too early, splitting entities on punctuation.
- Example: User spells out an email address with brief pauses between parts. The speculative check fires at 100ms of silence, and the model adds terminal punctuation to each segment, ending the turn prematurely.
max_turn_silence too low
- Forced turn-end cuts off user mid-thought.
- Example: User pauses longer than 1 second to think mid-sentence. The forced end fires at 1000ms, splitting the utterance into two turns regardless of punctuation.
LLMs downstream can usually piece together split entities, but if your use case involves alphanumeric dictation or entity extraction, consider increasing min_turn_silence and max_turn_silence during those portions of the conversation. You can update configuration mid-stream to raise max_turn_silence temporarily (e.g., to 2000–4000 ms) when expecting entity input, then lower it again afterward.
Increase min_turn_silence or max_turn_silence if users are likely to speak slowly or dictate entities. While this adds latency, it improves accuracy by giving the model more audio context before emitting a transcript and by keeping the full entity within the same turn.
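To see how the two parameters split an utterance, here is a small simulation under simplifying assumptions: speech arrives as (text, pause_after_ms) segments, and a turn ends either on a forced end or on a speculative check that finds terminal punctuation. The function split_into_turns and the sample data are illustrative, not part of any API.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def split_into_turns(segments, min_turn_silence=100, max_turn_silence=1000):
    """Group (text, pause_after_ms) segments into turns.

    A turn ends after a segment when the pause reaches max_turn_silence, or
    when it reaches min_turn_silence and the text ends in terminal punctuation.
    """
    turns, current = [], []
    for text, pause_ms in segments:
        current.append(text)
        forced = pause_ms >= max_turn_silence
        punctuated = pause_ms >= min_turn_silence and text.endswith(TERMINAL_PUNCTUATION)
        if forced or punctuated:
            turns.append(" ".join(current))
            current = []
    if current:
        turns.append(" ".join(current))
    return turns


# A caller reads a phone number with a 1.2 s pause in the middle.
phone = [("my number is 555", 1200), ("867 5309.", 1500)]
print(split_into_turns(phone))                         # split across two turns
print(split_into_turns(phone, max_turn_silence=2500))  # kept in one turn
```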
VAD configuration
With turn_detection="stt", AssemblyAI also sends SpeechStarted events that LiveKit uses for barge-in/interruption handling. SpeechStarted is only emitted when the model produces a transcript.
Silero VAD is not strictly required in this mode, but it is still recommended as Silero runs locally and it can be faster than waiting for AssemblyAI’s SpeechStarted signal. LiveKit respects whichever signal arrives first, so Silero provides faster interruption while AssemblyAI’s signal serves as a reliable backup.
With MultilingualModel(), Silero VAD is required, as it is the only source of START_OF_SPEECH events for interruption in this mode. AssemblyAI’s SpeechStarted event is not used.
Threshold alignment
LiveKit’s Silero VAD defaults to an activation_threshold of 0.5. AssemblyAI’s vad_threshold defaults to 0.3. For best performance, we recommend setting both to 0.3.
Both should be adjusted together to the same value to ensure accurate transcription and consistent barge-in thresholds.
When the thresholds are mismatched, you get a dead zone: if Silero is at 0.5 and AssemblyAI is at 0.3, AssemblyAI will be actively transcribing speech that LiveKit hasn’t detected yet, delaying interruption. Keeping them aligned eliminates this.
Always adjust the stt.vad_threshold and vad.activation_threshold values together.
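The dead zone can be illustrated with a short sketch. It assumes a simplified VAD that treats a frame as speech when its speech probability is at or above the threshold; the helpers is_speech and dead_zone are hypothetical and exist only for this example.

```python
import math


def is_speech(probability, threshold):
    """Simplified VAD decision: speech at or above the threshold."""
    return probability >= threshold


def dead_zone(stt_vad_threshold, silero_activation_threshold):
    """Return the (low, high) probability band where only one VAD hears speech.

    Returns None when the two thresholds are aligned.
    """
    low = min(stt_vad_threshold, silero_activation_threshold)
    high = max(stt_vad_threshold, silero_activation_threshold)
    return None if math.isclose(low, high) else (low, high)


# A quiet frame with speech probability 0.4:
print(is_speech(0.4, 0.3))  # AssemblyAI side at 0.3 treats it as speech
print(is_speech(0.4, 0.5))  # Silero side at 0.5 does not
print(dead_zone(0.3, 0.5))  # mismatched thresholds leave a band
print(dead_zone(0.3, 0.3))  # aligned thresholds leave none
```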
Interruption handling
In voice agent conversations, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) while the agent is speaking. These short fillers can trigger LiveKit’s interruption logic, causing the agent to stop mid-sentence even though the user didn’t intend to interrupt. LiveKit Cloud users should reach for adaptive interruption handling first. Self-hosted deployments can combine the two custom filters described below.
Recommended: Adaptive interruption handling
LiveKit Agents v1.5.0 introduced adaptive interruption handling, a server-side model that classifies overlapping speech as a real interrupt or a backchannel. On a false interrupt, the agent’s TTS resumes from where it left off, so no re-generation is needed. The feature is recommended for LiveKit Cloud users and is included on all Cloud plans.
Self-hosted alternative
If you’re self-hosting LiveKit, can’t use adaptive interruption handling, or need explicit control over which utterances count as filler, the two complementary filters below provide strong guardrails for interruption and barge-in handling. A working reference implementation is available on GitHub.
Upstream backchannel filter
The backchannel filter intercepts STT events at the stt_node level, before they reach LiveKit’s audio_recognition. It checks each transcript event for known disfluencies and backchannels (“um”, “mhm”, “yeah”, “okay”, etc.) and drops the event entirely. Because the event never enters the pipeline, none of the downstream orchestration (interrupt gates, end-of-turn detection, and preemptive LLM generation) ever reacts to it.
The filter is implemented as a mixin class that wraps Agent.stt_node:
- While the agent is speaking (plus a 1-second grace window after speech ends), the filter inspects each STT transcript event
- It strips punctuation and checks whether every token in the transcript matches the BACKCHANNELS set
- Pure-filler transcripts like “mhm” or “yeah okay” are dropped and never reach LiveKit’s pipeline
- Utterances with any non-filler token (e.g., “yeah I want the suite”) always pass through
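The matching rule in the list above can be sketched in a few lines. This is not the reference implementation: the contents of BACKCHANNELS, the GRACE_SECONDS constant, and both function names are assumptions chosen for the example.

```python
import string

# Example set only; "yes" and "no" are deliberately left out.
BACKCHANNELS = {"um", "mhm", "yeah", "okay"}
GRACE_SECONDS = 1.0  # grace window after the agent stops speaking

_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def is_pure_backchannel(transcript, backchannels=BACKCHANNELS):
    """Return True if every token in the transcript is a known filler."""
    tokens = transcript.lower().translate(_STRIP_PUNCTUATION).split()
    return bool(tokens) and all(token in backchannels for token in tokens)


def should_drop(transcript, agent_speaking, seconds_since_agent_stopped=None):
    """Drop pure-filler transcripts while the agent speaks or just after."""
    in_window = agent_speaking or (
        seconds_since_agent_stopped is not None
        and seconds_since_agent_stopped < GRACE_SECONDS
    )
    return in_window and is_pure_backchannel(transcript)


print(should_drop("Mhm.", agent_speaking=True))                    # filler, dropped
print(should_drop("yeah I want the suite", agent_speaking=True))   # passes through
print(should_drop("mhm", agent_speaking=False,
                  seconds_since_agent_stopped=1.5))                # outside window
```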
The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case.
Short-utterance buffer clearing
LiveKit accumulates committed FINAL transcripts in a private buffer (_audio_transcript) across the user’s uncommitted turn. The _interrupt_by_audio_activity method checks the running word count against min_words to decide whether to pause TTS.
Without intervention, two consecutive short fillers like “yeah” + “um” sum to two words and trip the interrupt gate, even though each utterance on its own is below threshold.
This filter listens on the user_input_transcribed event and wipes the buffer whenever the agent is speaking and the user input falls below the configured min_words threshold. Each short utterance is evaluated independently rather than against the accumulated total.
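The effect of clearing the buffer can be shown with a toy model of the word-count gate. This sketch does not touch LiveKit internals; it assumes the gate trips once the accumulated word count reaches min_words, and the function names are hypothetical.

```python
def count_words(text):
    return len(text.split())


def interrupts(utterances, min_words=2, clear_short_utterances=False):
    """Return True if the utterances trip the word-count gate.

    Models transcripts arriving one by one while the agent is speaking.
    """
    buffer = ""
    for utterance in utterances:
        if clear_short_utterances and count_words(utterance) < min_words:
            # Short utterance: wipe the buffer so it is judged on its own.
            buffer = ""
            continue
        buffer = f"{buffer} {utterance}".strip()
        if count_words(buffer) >= min_words:
            return True
    return False


# Two one-word fillers accumulate to two words without the filter.
print(interrupts(["yeah", "um"]))
# With buffer clearing, each filler is evaluated independently.
print(interrupts(["yeah", "um"], clear_short_utterances=True))
# A genuine two-word request still interrupts.
print(interrupts(["suite please"], clear_short_utterances=True))
```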
Wiring both filters into your agent
The two filters have non-overlapping failure modes: the backchannel filter stops known fillers before they enter the pipeline, while the buffer-clearing filter catches any short utterance (including unknown fillers or stutters) that slips past. Running both provides the strongest coverage. Three changes are needed:
1. Mix in the backchannel filter on your agent class. Place it before Agent in the class bases so its stt_node runs first.
2. Attach the buffer-clearing filter to your session so that it receives user_input_transcribed events.
3. Set interruption.min_words >= 2. This enables the word-count gate that the buffer-clearing filter depends on.
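The reason base-class order matters is Python’s method resolution order. The sketch below uses a stand-in Agent class with a simplified, synchronous stt_node (the real method is part of the LiveKit framework and has a different signature), and a hypothetical BackchannelFilterMixin, to show that the mixin only runs when it is listed first.

```python
class Agent:
    """Stand-in for the framework's Agent class (simplified, synchronous)."""

    def stt_node(self, events):
        yield from events


class BackchannelFilterMixin:
    """Hypothetical mixin that wraps stt_node and drops pure-filler events."""

    BACKCHANNELS = {"um", "mhm", "yeah", "okay"}

    def stt_node(self, events):
        for text in super().stt_node(events):
            tokens = text.lower().split()
            if tokens and all(token in self.BACKCHANNELS for token in tokens):
                continue  # drop the event before it reaches the pipeline
            yield text


class FilteredAgent(BackchannelFilterMixin, Agent):
    """Mixin first: its stt_node wraps Agent.stt_node."""


class UnfilteredAgent(Agent, BackchannelFilterMixin):
    """Agent first: Agent.stt_node wins and the mixin never runs."""


events = ["mhm", "suite please"]
print(list(FilteredAgent().stt_node(events)))
print(list(UnfilteredAgent().stt_node(events)))
```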
Expected behavior
The table below shows how utterances are handled when both filters are active with min_words=2:
| User utterance (during agent speech) | Backchannel filter | Buffer-clearing filter | Result |
|---|---|---|---|
| “mhm” | Dropped | - | Agent continues |
| “um” | Dropped | - | Agent continues |
| “yeah yeah” | Dropped | - | Agent continues |
| Unknown short word | Passes through | Buffer cleared | Agent continues |
| “yeah I’d like the suite” | Passes through | Passes through | Agent interrupts |
| “suite please” | Passes through | Passes through | Agent interrupts |
| “mhm” (1+ second after agent stops) | Passes through | - | Agent responds |
Prompt engineering
Universal-3 Pro Streaming supports a prompt parameter for custom transcription instructions. When no prompt is provided, a default prompt optimized for native (i.e. STT-based) turn detection is used automatically.
- Start with no prompt: the default prompt delivers strong accuracy out of the box. Only add a custom prompt if you need to alter this behavior.
- Specify the audio context: accent, domain, expected utterance length, etc.
- Define punctuation rules: this can improve downstream LLM processing.
- Preserve speech patterns: instruct the model to keep disfluencies and filler words for more natural agent interactions.
Key terms boosting
Instead of prompt, use keyterms_prompt to boost recognition of specific names, brands, or domain terms:
Updating configuration mid-stream
You can update prompt, keyterms_prompt, min_turn_silence, and max_turn_silence during an active session using update_options.
This is useful for dynamically adjusting turn detection behavior, like increasing max_turn_silence when expecting entity dictation, then lowering it again afterward. You can also update keyterms_prompt and prompt mid-stream after a database is loaded or crucial conversational information has been detected. For more information, see update configuration mid-stream.
Build and run your agent
Installation
Install the plugin and necessary packages (silero, codecs, dotenv) from PyPI:
If you use MultilingualModel(), you also need to install the turn detector plugin:
Noise cancellation can introduce audio artifacts that negatively impact
transcription quality. In most cases, the artifacts introduced by noise
cancellation cause more harm than the background noise itself, so we recommend
not adding any audio pre-processing before it reaches Universal-3 Pro
Streaming.
Authentication
Set your API keys in a .env file:
Recommended configuration
The following example uses turn_detection="stt" (recommended).
Pay close attention to the comments for using with MultilingualModel().
Running your agent
Start in development mode
Test in the LiveKit Playground
- Go to agents-playground.livekit.io
- Connect to your LiveKit Cloud project (same credentials as your .env)
- Click Connect: a room will be created, your agent will join, and you can start talking
Parameters reference
Universal-3 Pro Streaming parameters
These are the key parameters to tune for LiveKit when using Universal-3 Pro Streaming:
- model: Set to "u3-rt-pro" for Universal-3 Pro Streaming.
- keyterms_prompt: List of terms to boost recognition for. Appended to the default prompt automatically.
- prompt: Custom transcription instructions for the model. When not provided, a default prompt optimized for native turn detection is automatically applied.
- min_turn_silence: Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation to decide whether the turn has ended.
- max_turn_silence: Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to 100. Set to 1000 when using turn_detection="stt".
- vad_threshold: AssemblyAI’s internal Silero VAD threshold. Universal-3 Pro Streaming defaults to 0.3, unlike Universal-Streaming’s 0.4. Align with LiveKit’s Silero activation_threshold for consistent behavior.
- language_detection: Universal-3 Pro Streaming code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages. Defaults to true in the LiveKit plugin, but false when using the API directly.
General STT parameters
These parameters apply to all AssemblyAI streaming models and can remain the same between models:
- sample_rate: The sample rate of the audio stream.
- encoding: The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
Legacy parameters
These parameters apply to the universal-streaming-english and universal-streaming-multilingual AssemblyAI streaming models, but do not affect Universal-3 Pro Streaming:
- end_of_turn_confidence_threshold: Confidence threshold for end-of-turn detection. Universal-3 Pro Streaming uses punctuation-based turn detection instead.
- format_turns: Whether to return formatted final transcripts. Universal-3 Pro Streaming always returns formatted transcripts, so this parameter no longer applies.
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Extra latency with turn_detection="stt" | LiveKit’s endpointing min_delay is additive in STT mode | Set endpointing={"min_delay": 0} inside TurnHandlingOptions on AgentSession |
| No interruption handling | Missing VAD | Ensure vad=silero.VAD is set, with activation_threshold equal to vad_threshold (default 0.3) |
| Turn over-segmentation | min_turn_silence too low | Increase from 100 to 200–500 |
| Entities split across turns | max_turn_silence too low | Increase max_turn_silence (e.g., 1500–3500) |
| Latency on non-terminal utterances | max_turn_silence too high | Lower max_turn_silence |
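The first row of the table is easiest to see as arithmetic. The sketch below assumes that, in STT mode, the time from end of speech to turn commit is simply the silence AssemblyAI waits for plus LiveKit’s min_delay; the function name is hypothetical and real latency also includes network and processing time.

```python
def stt_mode_commit_latency_ms(end_of_turn_silence_ms, min_delay_s):
    """Estimate time from end of speech to turn commit, in milliseconds.

    Assumes the two waits simply add up (illustrative only).
    """
    return end_of_turn_silence_ms + round(min_delay_s * 1000)


# Punctuated turn ends at 100 ms of silence, then min_delay=0.5 is added on top.
print(stt_mode_commit_latency_ms(100, 0.5))
# With min_delay set to 0, only AssemblyAI's silence window remains.
print(stt_mode_commit_latency_ms(100, 0))
```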
Migration from standard AssemblyAI STT
If you are migrating from the standard AssemblyAI streaming model:
| Change | From | To |
|---|---|---|
| Model | assemblyai.STT() | assemblyai.STT(model="u3-rt-pro") |
| Turn detection | turn_detection="stt" or EnglishModel() | turn_detection="stt" or MultilingualModel() |
| VAD | Optional | Set vad=silero.VAD.load() to match vad_threshold |
| min_turn_silence | 400 (old default) | 100 (new default) |
| max_turn_silence | 1280 (old default) | 1000 (API default) or 100 (with 3rd-party turn detector) |
| end_of_turn_confidence_threshold | Configurable | Not applicable: Universal-3 Pro Streaming uses punctuation-based turn detection |
| endpointing.min_delay (formerly min_endpointing_delay) | Default 0.5 | Set to 0 inside TurnHandlingOptions when using turn_detection="stt" |
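As a rough aid, the STT-related rows of the table can be expressed as a small helper that rewrites a dictionary of old keyword arguments. The function migrate_stt_kwargs is hypothetical and only operates on a plain dictionary; check the resulting values against your own use case before passing them to the plugin.

```python
# Parameters that no longer apply to Universal-3 Pro Streaming (from the table).
LEGACY_ONLY = {"end_of_turn_confidence_threshold"}


def migrate_stt_kwargs(old_kwargs, third_party_turn_detector=False):
    """Return starting STT keyword arguments for Universal-3 Pro Streaming.

    Hypothetical helper: drops legacy parameters and applies the new
    starting values for the silence parameters.
    """
    new_kwargs = {k: v for k, v in old_kwargs.items() if k not in LEGACY_ONLY}
    new_kwargs["model"] = "u3-rt-pro"
    new_kwargs["min_turn_silence"] = 100
    # 100 ms when a third-party turn detector decides turns, else the API default.
    new_kwargs["max_turn_silence"] = 100 if third_party_turn_detector else 1000
    return new_kwargs


old = {"end_of_turn_confidence_threshold": 0.4, "max_turn_silence": 1280}
print(migrate_stt_kwargs(old))
print(migrate_stt_kwargs(old, third_party_turn_detector=True))
```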