Overview
This guide covers integrating AssemblyAI’s Universal 3.5 Pro Realtime speech-to-text model into a LiveKit voice agent using the Agents framework. Everything here applies equally to Universal-3 Pro Streaming (u3-rt-pro) — both belong to the same U3 Pro family and share every parameter in this guide, so you can swap the model string without changing anything else.
Universal 3.5 Pro Realtime is in preview. It’s the next generation of our flagship streaming model, with broader language coverage, improved prompting, and enhanced conversation context.
To try it, set model="universal-3-5-pro" and install the latest livekit-agents. Version 1.6+ is required for Universal 3.5 Pro Realtime, Conversation Context, and Voice Focus.
Turn detection
Decide when the user is done speaking — modes, defaults, and entity tuning.
Latency
Shorten the gap between the user finishing and the agent replying.
Accuracy
Prompting, key terms, conversation context, and noise handling.
Interruptions
Natural barge-in without false triggers from backchannels.
Quickstart
Get a working, talking agent in a few minutes, then optimize from there.
Install the SDK
Install the plugin and supporting packages (silero, codecs, dotenv) from PyPI.
If you plan to use LiveKit turn detection with MultilingualModel(), also install the turn detector plugin.
Build a minimal agent
The following example uses turn_detection="stt" (recommended). Pay close attention to the comments for using with MultilingualModel().
For a complete voice agent, you will also need to install LLM and TTS plugins for your chosen providers. See the LiveKit plugins documentation for available options.
Run and test
Start your agent in development mode, then test it in the LiveKit Playground:
- Go to agents-playground.livekit.io
- Connect to your LiveKit Cloud project (same credentials as your .env)
- Click Connect: a room will be created, your agent will join, and you can start talking
Parameters reference
Universal 3.5 Pro Realtime parameters
These are the key parameters to tune for LiveKit when using Universal 3.5 Pro Realtime:
- model: The streaming model. Defaults to "universal-3-5-pro" (preview); also accepts "u3-rt-pro" (the recommended GA model). Both belong to the U3 Pro family and share every parameter below.
- mode: Accuracy/latency preset: "min_latency", "balanced", or "max_accuracy". Sets the defaults for mode-dependent fields (min_turn_silence, vad_threshold, interruption_delay, continuous_partials, previous_context_n_turns); any value you set explicitly still takes precedence. Leave unset for the server’s default preset. Connect-time only; cannot be changed with update_options. U3 Pro family only. See Optimizing accuracy and latency.
- keyterms_prompt: List of terms to boost recognition for. Applied automatically alongside any contextual prompt.
- prompt: Contextual prompt, a natural-language description of what the audio is about (domain, scenario, or full details). Transcription behavior is built in and optimized automatically.
- agent_context: Your agent’s most recent spoken reply, up to ~1500 characters, used as context for transcribing the next user turn. Can be set at construction time and updated mid-stream with update_options. U3 Pro family only. See Conversation context.
- previous_context_n_turns: How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover. Leave unset for the server default (3). Connect-time only; cannot be changed with update_options. U3 Pro family only.
- min_turn_silence: Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation to decide whether the turn has ended.
- max_turn_silence: Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to 100. Set to 1000 when using turn_detection="stt".
- vad_threshold: AssemblyAI’s internal Silero VAD threshold. Universal 3.5 Pro Realtime defaults to 0.3, unlike Universal-Streaming’s 0.4. Align with LiveKit’s Silero activation_threshold for consistent behavior.
- voice_focus: Server-side noise suppression that isolates the primary speaker. "near-field" for close-talking mics, "far-field" for distant capture. Connect-time only. U3 Pro family only. See Voice focus.
- voice_focus_threshold: How aggressively voice_focus suppresses background audio. 0.0–1.0; higher is more aggressive. Connect-time only. U3 Pro family only.
- continuous_partials: Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When enabled (default on both the API and the LiveKit plugin), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. See Continuous partials for details.
- interruption_delay: How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay: 0 → ~300ms effective, interruption_delay: 500 → ~800ms effective). See Tuning early partial timing for details.
- language_detection: Universal 3.5 Pro Realtime code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages. Defaults to true in the LiveKit plugin, but false when using the API directly.
General STT parameters
These parameters apply to all AssemblyAI streaming models and can remain the same between models:
- sample_rate: The sample rate of the audio stream.
- encoding: The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
Legacy parameters
These parameters apply to the universal-streaming-english and universal-streaming-multilingual AssemblyAI streaming models, but do not affect Universal 3.5 Pro Realtime:
- end_of_turn_confidence_threshold: Confidence threshold for end-of-turn detection. Universal 3.5 Pro Realtime uses punctuation-based turn detection instead.
- format_turns: Whether to return formatted final transcripts. Universal 3.5 Pro Realtime always returns formatted transcripts, so this parameter no longer applies.
Turn detection
In LiveKit, how your agent detects the end of a user’s turn is controlled by the turn_detection parameter inside TurnHandlingOptions, which is passed to AgentSession via the turn_handling argument.
Universal 3.5 Pro Realtime uses a punctuation-based turn detection system, which checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score.
This means the min_turn_silence and max_turn_silence parameters you pass to AssemblyAI directly control when transcripts are emitted and when turns end. For more details on how this works, see Configuring turn detection.
Default parameter differences
Universal 3.5 Pro Realtime’s endpointing is controlled by two AssemblyAI API parameters, min_turn_silence and max_turn_silence, that you pass to the STT plugin. These are separate from LiveKit’s endpointing.min_delay and endpointing.max_delay (set inside TurnHandlingOptions).
| Parameter | AssemblyAI API default | LiveKit plugin default | Description |
|---|---|---|---|
| min_turn_silence | 100 ms | 100 ms | Silence before a speculative end-of-turn check. If terminal punctuation (. ? !) is found, the turn ends. If not, a partial is emitted and the turn continues. |
| max_turn_silence | 1000 ms | 100 ms | Maximum silence before forcing the turn to end, regardless of punctuation. |
| continuous_partials | true | true | Emit partial transcripts every ~3 seconds during long turns for steady mid-turn updates to downstream consumers. Enabled by default on both the API and the LiveKit plugin. |
When using turn_detection="stt", you should explicitly set max_turn_silence=1000 if you’d like to mimic the behavior of streaming directly to the API without LiveKit.
Tuning endpointing parameters
These are the default values used when no parameters are explicitly provided. You will likely need to experiment with different values depending on your use case:
- Increase min_turn_silence: when brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking.
- Increase max_turn_silence: when the forced turn end is cutting off users mid-thought or splitting entities like phone numbers across turns. A higher value lets the model wait longer before forcing the turn to end when it is unsure.
STT-based Turn Detection (recommended)
With turn_detection="stt", AssemblyAI’s built-in punctuation-based turn detection determines when the user has finished speaking. AssemblyAI’s end_of_turn signals are then used directly by LiveKit to commit the turn.
In this mode, we recommend explicitly setting min_turn_silence=100 and max_turn_silence=1000. These are AssemblyAI’s API defaults and provide a good balance of responsiveness and accuracy.
The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100, which might be too aggressive for STT-based turn detection.
Recommended starting parameters (set on assemblyai.STT(), not on AgentSession):
| Parameter | Default | Description |
|---|---|---|
| min_turn_silence | 100 ms | Silence duration before a speculative end-of-turn (EOT) check fires. |
| max_turn_silence | 1000 ms | Maximum silence before a turn is forced to end. |
- User speaks → audio streams to AssemblyAI
- User pauses for 100ms → AssemblyAI checks for terminal punctuation
- If terminal punctuation (. ? !) → turn ends immediately
- If no terminal punctuation → partial emitted, turn continues waiting
- If silence reaches 1000ms → turn is forced to end regardless of punctuation
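The flow above can be sketched as a small decision function. This is only a client-side illustration of the logic described in this section, not AssemblyAI's server implementation; the function name and return labels are hypothetical.

```python
# Illustrative model of punctuation-based end-of-turn detection.
# Names and return labels are hypothetical; the real check runs server-side.
TERMINAL_PUNCTUATION = (".", "?", "!")


def check_end_of_turn(
    silence_ms: int,
    transcript: str,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
) -> str:
    """Return 'wait', 'partial', or 'end_of_turn' for the current silence."""
    # Forced end: silence reached the maximum, punctuation no longer matters.
    if silence_ms >= max_turn_silence:
        return "end_of_turn"
    # Speculative check: fires once the minimum silence has elapsed.
    if silence_ms >= min_turn_silence:
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        return "partial"
    # Not enough silence yet to run any check.
    return "wait"
```

With the plugin defaults (both values at 100), the forced end fires at the same moment as the speculative check, which is why every pause ends the turn unless max_turn_silence is raised.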
LiveKit turn detection (with MultilingualModel())
As a third-party turn detection model, LiveKit’s turn detector runs on top of STT output to make turn decisions. AssemblyAI’s role is then just to provide transcripts as quickly as possible, while the turn detection model decides when the user is actually done speaking.
Use MultilingualModel() rather than EnglishModel(): Universal 3.5 Pro Realtime supports 18 languages, and MultilingualModel() covers all of them.
The LiveKit plugin defaults of min_turn_silence=100 and max_turn_silence=100 work well here, as max_turn_silence is brought down to match min_turn_silence so that transcripts are handed off to the turn detection model as fast as possible.
MultilingualModel parameters (set inside TurnHandlingOptions → endpointing, not on the STT plugin):
| Parameter | Default | Description |
|---|---|---|
| endpointing.min_delay | 0.5 s | Time to wait before committing a turn when the model predicts a likely boundary. |
| endpointing.max_delay | 3.0 s | Maximum time to wait when the model predicts the user will continue speaking. Has no effect without a turn detector model. |
- User speaks → audio streams to AssemblyAI
- User pauses for 100ms → AssemblyAI emits transcript (final and partial are the same) immediately
- LiveKit’s MultilingualModel() evaluates the transcript in conversational context
- If the model predicts a likely turn boundary → waits min_delay (0.5s) then commits the turn
- If the model predicts the user will continue → waits up to max_delay (3.0s) for more speech
Other turn detection modes
- vad:
  - Detect end of turn from speech and silence data alone using Silero VAD.
  - Turn boundaries are determined purely by voice activity without semantic context.
  - AssemblyAI’s turn detection parameters still control when transcripts are emitted, but it is recommended to leave them at the plugin defaults (min_turn_silence=100, max_turn_silence=100) so transcripts arrive as quickly as possible.
- manual:
  - Disable automatic turn detection entirely.
  - You control turns explicitly using session.commit_user_turn(), session.clear_user_turn(), and session.interrupt().
  - See the manual turn control docs for details.
VAD configuration
With turn_detection="stt", AssemblyAI also sends SpeechStarted events that LiveKit uses for barge-in/interruption handling. SpeechStarted is only emitted when the model produces a transcript.
Silero VAD is not strictly required in this mode, but it is still recommended as Silero runs locally and it can be faster than waiting for AssemblyAI’s SpeechStarted signal. LiveKit respects whichever signal arrives first, so Silero provides faster interruption while AssemblyAI’s signal serves as a reliable backup.
With MultilingualModel(), Silero VAD is required, as it is the only source of START_OF_SPEECH events for interruption in this mode. AssemblyAI’s SpeechStarted event is not used.
Threshold alignment
LiveKit’s Silero VAD defaults to an activation_threshold of 0.5. AssemblyAI’s vad_threshold defaults to 0.3. For best performance, we recommend setting both to 0.3.
Both should be adjusted together to the same value to ensure accurate transcription and consistent barge-in thresholds.
When the thresholds are mismatched, you get a dead zone: if Silero is at 0.5 and AssemblyAI is at 0.3, AssemblyAI will be actively transcribing speech that LiveKit hasn’t detected yet, delaying interruption. Keeping them aligned eliminates this.
Always adjust the stt.vad_threshold and vad.activation_threshold thresholds together.
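One way to keep the two thresholds from drifting apart is to derive both from a single constant. The sketch below builds plain keyword-argument dictionaries rather than importing the plugins, so it runs on its own; the helper name is hypothetical, and passing the dictionaries to assemblyai.STT() and silero.VAD.load() is an assumption based on the parameter names used in this guide.

```python
# Single source of truth for both VAD thresholds (0.3 is the value
# recommended in this guide).
SHARED_VAD_THRESHOLD = 0.3

# Keyword arguments intended for assemblyai.STT(...) (assumed usage).
stt_kwargs = {
    "model": "universal-3-5-pro",
    "vad_threshold": SHARED_VAD_THRESHOLD,
}

# Keyword arguments intended for silero.VAD.load(...) (assumed usage).
silero_kwargs = {"activation_threshold": SHARED_VAD_THRESHOLD}


def thresholds_aligned(stt_opts: dict, vad_opts: dict, tol: float = 1e-9) -> bool:
    """Hypothetical guard: True when both thresholds are effectively equal."""
    return abs(stt_opts["vad_threshold"] - vad_opts["activation_threshold"]) <= tol
```

A check like thresholds_aligned(stt_kwargs, silero_kwargs) can run at startup to catch a mismatch before it shows up as delayed barge-in.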
Entity splitting tradeoff
Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.
min_turn_silence too low
- Speculative check fires too early, splitting entities on punctuation.
- Example: User spells out an email address with brief pauses between parts. The speculative check fires at 100ms of silence, and the model adds terminal punctuation to each segment, ending the turn prematurely.
max_turn_silence too low
- Forced turn-end cuts off user mid-thought.
- Example: User pauses longer than 1 second to think mid-sentence. The forced end fires at 1000ms, splitting the utterance into two turns regardless of punctuation.
LLMs downstream can usually piece together split entities, but if your use case involves alphanumeric dictation or entity extraction, consider increasing min_turn_silence and max_turn_silence during those portions of the conversation. You can update configuration mid-stream to raise max_turn_silence temporarily (e.g., to 2000–4000 ms) when expecting entity input, then lower it again afterward.
Increase min_turn_silence or max_turn_silence if users are likely to speak slowly or dictate entities. While this adds latency, it improves accuracy by giving the model more audio context before emitting a transcript and by keeping each entity within a single turn.
Latency
A voice agent feels responsive when the gap between the user finishing and the agent replying is short. Start with the mode preset, the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there.
mode is set at construction time (it can’t be changed mid-session) and influences the defaults of min_turn_silence, vad_threshold, interruption_delay, continuous_partials, and previous_context_n_turns. Any value you set explicitly still wins. Leave it unset to use the server’s default preset. See Optimizing accuracy and latency.
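As a hedged sketch of the preset-plus-overrides pattern, the hypothetical helper below builds keyword arguments intended for assemblyai.STT(). The parameter names come from this guide; the helper itself is not part of the plugin, and the override precedence shown here only mirrors the statement that explicit values win.

```python
# Preset names listed in this guide for the `mode` parameter.
VALID_MODES = ("min_latency", "balanced", "max_accuracy")


def stt_kwargs_for_mode(mode=None, **overrides):
    """Build keyword arguments for the STT constructor (hypothetical helper).

    `mode` is omitted entirely when None so the server's default preset
    applies; explicit overrides are passed alongside it unchanged.
    """
    kwargs = {"model": "universal-3-5-pro"}
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        kwargs["mode"] = mode
    # Explicit values are sent as-is; per this guide they take precedence
    # over whatever defaults the preset would otherwise choose.
    kwargs.update(overrides)
    return kwargs
```

For example, stt_kwargs_for_mode("min_latency", max_turn_silence=1000) keeps the low-latency preset while pinning the forced end-of-turn window for STT-based turn detection.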
From there, fine-tune the individual levers:
- Endpointing delay (STT mode). LiveKit’s endpointing.min_delay is applied on top of AssemblyAI’s own end-of-turn timing and adds up to 500ms by default. Set endpointing={"min_delay": 0} when using turn_detection="stt"; see STT-based turn detection.
- End-of-turn timing. min_turn_silence (speculative check) and max_turn_silence (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities; see Turn detection.
- Time to first partial. interruption_delay controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of 300ms on top of the configured value.
- Sample rate. Use 16 kHz (sample_rate=16000). Higher rates don’t improve accuracy and only add bandwidth.
- Continuous partials. continuous_partials (on by default) emits a partial every ~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
- Skip client-side preprocessing. Don’t run your own noise cancellation before audio reaches the model, because the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side Voice Focus instead.
Latency breakdown
| Stage | Typical | Controlled by |
|---|---|---|
| Network round trip | ~50 ms | — |
| Speech-to-text | ~200–300 ms | model |
| First partial (TTFT) | configured interruption_delay + ~300 ms server min | interruption_delay |
| End of turn (terminal punctuation found) | min_turn_silence (default 100 ms) | min_turn_silence |
| End of turn (no punctuation, forced) | up to max_turn_silence | max_turn_silence |
| LiveKit endpointing (STT mode) | + min_delay (set to 0) | endpointing.min_delay |
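The configurable rows of the table can be combined into a rough budget. The helpers below are a back-of-the-envelope sketch using only the figures stated in this guide (the ~300 ms server minimum and the silence windows); the function names are hypothetical, and network and model time are deliberately left out.

```python
SERVER_MIN_FIRST_PARTIAL_MS = 300  # server-side minimum stated in this guide


def first_partial_ms(interruption_delay: int) -> int:
    """Approximate time to first partial for a configured interruption_delay."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be in the range 0-1000 ms")
    return interruption_delay + SERVER_MIN_FIRST_PARTIAL_MS


def end_of_turn_wait_ms(
    punctuated: bool,
    min_turn_silence: int = 100,
    max_turn_silence: int = 1000,
    endpointing_min_delay_s: float = 0.0,
) -> int:
    """Worst-case silence before the turn is committed in STT mode.

    A punctuated transcript ends at the speculative check; otherwise the
    turn waits for the forced end. LiveKit's min_delay is added on top.
    """
    silence = min_turn_silence if punctuated else max_turn_silence
    return silence + round(endpointing_min_delay_s * 1000)
```

Leaving endpointing.min_delay at its 0.5 s default turns a 100 ms punctuated end of turn into roughly 600 ms, which is the extra latency the first lever in this section removes.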
Accuracy
Universal 3.5 Pro Realtime is accurate out of the box. When you need more (domain vocabulary, proper nouns, noisy audio), reach for these levers. For entity-heavy dictation, also tune turn detection (see Entity splitting tradeoff), and note that the high-level mode preset shifts the overall accuracy/latency balance (use max_accuracy to favor quality).
Prompting
Universal 3.5 Pro Realtime supports a prompt parameter for contextual prompting: a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.
- Start with no prompt: Universal 3.5 Pro Realtime delivers strong accuracy out of the box — only add context if domain-specific vocabulary is being misrecognized.
- Describe the conversation: domain, scenario, or full details — start broad, and add only details your application actually knows (see the three context levels).
- Include known names and identifiers: caller name, account or order IDs, products — detailed context helps the model spell them correctly.
Key terms
Instead of prompt, use keyterms_prompt to boost recognition of specific names, brands, or domain terms.
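A minimal sketch of preparing a key-terms list, assuming the terms come from application data that may contain blanks or duplicates. The helper is hypothetical and only builds keyword arguments intended for assemblyai.STT(); the example terms are placeholders.

```python
def build_keyterms_kwargs(keyterms, model="universal-3-5-pro"):
    """Return STT keyword arguments with a cleaned keyterms_prompt list.

    Strips whitespace, drops empty entries, and removes duplicates while
    preserving the original order.
    """
    stripped = (term.strip() for term in keyterms)
    cleaned = list(dict.fromkeys(term for term in stripped if term))
    return {"model": model, "keyterms_prompt": cleaned}


# Placeholder terms for illustration only.
example_kwargs = build_keyterms_kwargs(["AssemblyAI", " LiveKit ", "AssemblyAI", ""])
```

The same cleaned list can be sent mid-session through update_options(keyterms_prompt=...) when the set of relevant names changes.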
Conversation context
Give the model both sides of the dialog so it transcribes the next user turn more accurately. Universal 3.5 Pro Realtime keeps a short, per-session memory of the conversation from two sources:
- The agent half: what your agent just said, which you push in via the agent_context parameter.
- The user half: prior STT-finalized user turns, carried forward automatically (no configuration needed).
For example, when the agent has just asked "What's your email address?", agent_context lets the model produce "user@assemblyai.com" instead of "user at assemblyai dot com". This has the biggest impact on short replies ("yes", "7pm", single names) and spelled-out entities. See Conversation context for the full reference.
| Parameter | Type | Description |
|---|---|---|
| agent_context | string | Your agent’s most recent spoken reply (what your TTS just said), up to ~1500 characters. Set it at construction time to seed an opening greeting, and update it after each agent reply via update_options(). |
| previous_context_n_turns | integer | How many prior conversation entries are carried forward automatically. Connect-time only. Range 0–100; 0 disables carryover; leave unset for the server default (3). |
agent_context and previous_context_n_turns are supported only on the U3 Pro family (universal-3-5-pro, u3-rt-pro). previous_context_n_turns is set at construction time and cannot be changed with update_options().
Wiring agent context with conversation_item_added
Push your agent’s reply into the session whenever the agent finishes a turn. LiveKit emits a conversation_item_added event for every conversation item; filter for the assistant’s messages and forward the (trimmed) text to the STT plugin via update_options(agent_context=...). No reconnect is required.
A complete runnable example is on GitHub: including-agent-context.
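The wiring described above can be sketched without importing LiveKit. The handler below is duck-typed: it assumes the event exposes item.role and item.text_content, and that the STT object has update_options(agent_context=...). Those attribute names, the helper names, and the choice to keep the trailing characters when trimming are assumptions for illustration; the linked example remains the reference.

```python
MAX_AGENT_CONTEXT_CHARS = 1500  # approximate limit stated for agent_context


def trim_agent_context(text: str, limit: int = MAX_AGENT_CONTEXT_CHARS) -> str:
    """Collapse whitespace and keep at most `limit` trailing characters.

    Keeping the tail is an assumption: the end of the reply is closest to
    what the user answers next.
    """
    collapsed = " ".join(text.split())
    return collapsed[-limit:] if limit > 0 else ""


def make_agent_context_handler(stt):
    """Build a callback that forwards assistant replies to the STT plugin."""

    def on_conversation_item_added(event) -> None:
        item = event.item
        # Assumed event shape: item.role and item.text_content.
        if getattr(item, "role", None) != "assistant":
            return
        text = getattr(item, "text_content", None)
        if text:
            stt.update_options(agent_context=trim_agent_context(text))

    return on_conversation_item_added


# Registration in a real agent (assumed wiring, not executed here):
#   session.on("conversation_item_added", make_agent_context_handler(stt))
```

Returning a closure keeps the handler free of globals, so the same factory works when several sessions each hold their own STT instance.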
Voice focus
Voice Focus isolates the primary speaker and suppresses background noise (chatter, keyboard clicks, fan hum, room echo) server-side, before audio reaches the model. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.
| Parameter | Type | Description |
|---|---|---|
| voice_focus | string | "near-field" for headsets, handsets, and other close-talking mics; "far-field" for conference rooms, laptop mics, and other distant capture. |
| voice_focus_threshold | float | Optional. 0.0–1.0; higher values suppress background audio more aggressively. |
voice_focus and voice_focus_threshold are connect-time only and cannot be changed with update_options(). See Voice Focus for details.
Interruption handling
In voice agent conversations, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) while the agent is speaking. These short fillers can trigger LiveKit’s interruption logic, causing the agent to stop mid-sentence even though the user didn’t intend to interrupt. LiveKit Cloud users should reach for adaptive interruption handling first. Self-hosted deployments can combine the two custom filters described below.
Recommended: Adaptive interruption handling
LiveKit Agents v1.5.0 introduced adaptive interruption handling, a server-side model that classifies overlapping speech as a real interrupt or a backchannel. On a false interrupt, the agent’s TTS resumes from where it left off, so no re-generation is needed. The feature is recommended for LiveKit Cloud users and is included on all Cloud plans.
Self-hosted alternative
If you’re self-hosting LiveKit, can’t use adaptive interruption handling, or need explicit control over which utterances count as filler, the two complementary filters below provide strong guardrails for interruption and barge-in handling. They have non-overlapping failure modes: the backchannel filter stops known fillers before they enter the pipeline, while the buffer-clearing filter catches any short utterance (including unknown fillers or stutters) that slips past. Running both provides the strongest coverage. A working reference implementation is available on GitHub.
Upstream backchannel filter
The backchannel filter intercepts STT events at the stt_node level, before they reach LiveKit’s audio_recognition. It checks each transcript event for known disfluencies and backchannels (“um”, “mhm”, “yeah”, “okay”, etc.) and drops the event entirely. Because the event never enters the pipeline, none of the downstream orchestration (interrupt gates, end-of-turn detection, and preemptive LLM generation) ever reacts to it.
How it works:
The filter is implemented as a mixin class that wraps Agent.stt_node:
- While the agent is speaking (plus a 1-second grace window after speech ends), the filter inspects each STT transcript event
- It strips punctuation and checks whether every token in the transcript matches the BACKCHANNELS set
- Pure-filler transcripts like “mhm” or “yeah okay” are dropped. They never reach LiveKit’s pipeline
- Utterances with any non-filler token (e.g., “yeah I want the suite”) always pass through
The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case.
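The token check at the heart of the filter can be sketched in a few lines. This is not the reference implementation: the set contents, function names, and the way the grace window is passed in are illustrative assumptions, and the stt_node wrapping is omitted.

```python
import string

# Illustrative set; "yes" and "no" are left out on purpose, as noted above.
BACKCHANNELS = {"um", "uh", "mhm", "yeah", "okay"}


def is_pure_backchannel(transcript: str, backchannels=BACKCHANNELS) -> bool:
    """True when every token, after stripping punctuation, is a known filler."""
    tokens = [tok.strip(string.punctuation).lower() for tok in transcript.split()]
    tokens = [tok for tok in tokens if tok]
    return bool(tokens) and all(tok in backchannels for tok in tokens)


def should_drop(
    transcript: str,
    agent_speaking: bool,
    seconds_since_agent_stopped: float,
    grace_s: float = 1.0,
) -> bool:
    """Drop pure fillers while the agent speaks or within the grace window."""
    in_window = agent_speaking or seconds_since_agent_stopped < grace_s
    return in_window and is_pure_backchannel(transcript)
```

An empty transcript is treated as not a backchannel, so it passes through and the rest of the pipeline decides what to do with it.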
Short-utterance buffer clearing
LiveKit accumulates committed FINAL transcripts in a private buffer (_audio_transcript) across the user’s uncommitted turn. The _interrupt_by_audio_activity method checks the running word count against min_words to decide whether to pause TTS.
Without intervention, two consecutive short fillers like “yeah” + “um” sum to two words and trip the interrupt gate, even though each utterance on its own is below threshold.
This filter listens on the user_input_transcribed event and wipes the buffer whenever the agent is speaking and the user input falls below the configured min_words threshold. Each short utterance is evaluated independently rather than against the accumulated total.
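The difference between the accumulated count and the per-utterance count can be shown with a small simulation. It models only the word-count gate described above, assumes the agent is speaking for every utterance, and does not touch LiveKit's private buffer; both function names are hypothetical.

```python
def interrupts_without_clearing(utterances, min_words: int = 2) -> bool:
    """Word-count gate over an accumulating buffer (no filter installed)."""
    buffer = []
    for utterance in utterances:
        buffer.extend(utterance.split())
        if len(buffer) >= min_words:
            return True  # running total tripped the gate
    return False


def interrupts_with_clearing(utterances, min_words: int = 2) -> bool:
    """Same gate, but short utterances wipe the buffer before they add up."""
    buffer = []
    for utterance in utterances:
        words = utterance.split()
        buffer.extend(words)
        if len(words) < min_words:
            buffer.clear()  # judged on its own, then discarded
            continue
        if len(buffer) >= min_words:
            return True
    return False
```

With min_words=2, the sequence "yeah" then "um" interrupts the agent in the first function but not in the second, while "suite please" interrupts in both.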
Wiring both filters into your agent
Three changes are needed:
1. Mix in the backchannel filter on your agent class. Place it before Agent in the class bases so its stt_node runs first.
2. Configure the session with interruption.min_words >= 2. This enables the word-count gate that the buffer-clearing filter depends on.
3. Install the buffer-clearing filter on the session, then start it with your agent.
Expected behavior with both filters active and min_words=2:
| User utterance (during agent speech) | Backchannel filter | Buffer-clearing filter | Result |
|---|---|---|---|
| “mhm” | Dropped | - | Agent continues |
| “um” | Dropped | - | Agent continues |
| “yeah yeah” | Dropped | - | Agent continues |
| Unknown short word | Passes through | Buffer cleared | Agent continues |
| “yeah I’d like the suite” | Passes through | Passes through | Agent interrupts |
| “suite please” | Passes through | Passes through | Agent interrupts |
| “mhm” (1+ second after agent stops) | Passes through | - | Agent responds |
Dynamic configuration
You can update prompt, keyterms_prompt, agent_context, min_turn_silence, max_turn_silence, continuous_partials, and interruption_delay during an active session using update_options, with no reconnect required. This lets you adapt to the conversation stage: boost names while collecting caller details, widen silence windows while a user dictates an email, then tighten them again.
| Conversation stage | Adjustment |
|---|---|
| Caller identification (names, account IDs) | Boost terms with update_options(keyterms_prompt=[...]) |
| Entity dictation (email, phone, address) | Raise max_turn_silence to ~2000–4000 ms, then lower it again afterward |
| After each agent reply | Push agent_context (see Conversation context) |
| Faster barge-in | Lower interruption_delay (see Latency) |
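The stage table can be expressed as a small lookup that feeds update_options. The stage names, the 3000 ms value (chosen from the 2000–4000 ms range above), and the 1000 ms baseline are illustrative assumptions; the STT object is duck-typed and only needs an update_options(**kwargs) method.

```python
DEFAULT_MAX_TURN_SILENCE_MS = 1000    # STT-mode baseline used in this guide
DICTATION_MAX_TURN_SILENCE_MS = 3000  # assumed value inside 2000-4000 ms


def options_for_stage(stage: str, keyterms=None) -> dict:
    """Map a conversation stage to update_options keyword arguments."""
    if stage == "caller_identification":
        return {"keyterms_prompt": list(keyterms or [])}
    if stage == "entity_dictation":
        return {"max_turn_silence": DICTATION_MAX_TURN_SILENCE_MS}
    if stage == "default":
        return {"max_turn_silence": DEFAULT_MAX_TURN_SILENCE_MS}
    raise ValueError(f"unknown stage: {stage!r}")


def apply_stage(stt, stage: str, keyterms=None) -> dict:
    """Push the stage's options to the STT plugin and return what was sent."""
    options = options_for_stage(stage, keyterms)
    stt.update_options(**options)
    return options
```

Calling apply_stage(stt, "entity_dictation") before asking for an email address, then apply_stage(stt, "default") once it has been captured, widens and restores the silence window without reconnecting.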
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Extra latency with turn_detection="stt" | LiveKit’s endpointing min_delay is additive in STT mode | Set endpointing={"min_delay": 0} inside TurnHandlingOptions on AgentSession |
| No interruption handling | Missing VAD | Ensure vad=silero.VAD is set, with activation_threshold equal to vad_threshold (default 0.3) |
| Turn over-segmentation | min_turn_silence too low | Increase from 100 to 200–500 |
| Entities split across turns | max_turn_silence too low | Increase max_turn_silence (e.g., 1500–3500) |
| Latency on non-terminal utterances | max_turn_silence too high | Lower max_turn_silence |
| Mis-heard names, brands, or jargon | No vocabulary hints | Add keyterms_prompt, or supply prompt/agent_context for context |
| Poor accuracy in noisy audio | Background noise or room echo | Enable voice_focus (near-field or far-field) |
Migrating from another STT provider
To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below. Each answer points to the settings that reproduce, and usually improve on, your current behavior.
How are you detecting end-of-turn today?
| Today | Recommended on AssemblyAI |
|---|---|
| Your STT provider’s own end-of-turn model (e.g. Deepgram endpointing) | turn_detection="stt" with min_turn_silence=100, max_turn_silence=1000, endpointing={"min_delay": 0}. AssemblyAI’s punctuation-based end-of-turn replaces it. |
| Silence / VAD only | Prefer turn_detection="stt" for semantic endpointing, or turn_detection="vad" to stay silence-only. Align Silero activation_threshold and vad_threshold at 0.3. |
| LiveKit’s turn-detector model | turn_detection=MultilingualModel(), keep the plugin defaults (min_turn_silence=100, max_turn_silence=100), set endpointing={"min_delay": 0.5, "max_delay": 3.0}, and keep Silero VAD (required). |
Which model and settings are you migrating from?
| What you pass today | AssemblyAI equivalent |
|---|---|
| Current model (Deepgram, ElevenLabs, etc.) | model="universal-3-5-pro" (preview) or "u3-rt-pro" (GA) |
| Overall accuracy/latency tuning | mode="min_latency" / "balanced" / "max_accuracy" — a one-line starting point before fine-tuning |
| Endpointing / silence thresholds | min_turn_silence (speculative end-of-turn) and max_turn_silence (forced end) — see Turn detection |
| Custom vocabulary / keywords | keyterms_prompt=[...]; broader domain context → prompt |
| Formatting / punctuation toggles | On by default — formatted transcripts always (format_turns does not apply) |
| Language(s) | 18 languages with code-switching; pin a language via prompt; language_detection adds language codes |
VAD, interruptions, and infrastructure
| Today | Recommended |
|---|---|
| Silero VAD | Keep it; set activation_threshold=0.3 to match vad_threshold |
| Adaptive vs. VAD-based barge-in | LiveKit Cloud → adaptive interruption handling; self-hosted → the self-hosted filters |
| Telephony / SIP routing | sample_rate=8000 and encoding="pcm_mulaw" for 8 kHz telephony |
| Client-side noise cancellation | Drop it; use server-side Voice Focus instead |
| LiveKit Cloud vs. self-hosting | Cloud unlocks adaptive interruption; self-hosting uses the filters and version pinning |
Parameter cheat sheet
If you’re coming from AssemblyAI’s standard streaming model:
| Change | From | To |
|---|---|---|
| Model | assemblyai.STT() | assemblyai.STT(model="universal-3-5-pro") |
| Turn detection | turn_detection="stt" or EnglishModel() | turn_detection="stt" or MultilingualModel() |
| VAD | Optional | Set vad=silero.VAD.load() to match vad_threshold |
| min_turn_silence | 400 (old default) | 100 (new default) |
| max_turn_silence | 1280 (old default) | 1000 (API default) or 100 (with 3rd-party turn detector) |
| end_of_turn_confidence_threshold | Configurable | Not applicable: Universal 3.5 Pro Realtime uses punctuation-based turn detection |
| endpointing.min_delay (formerly min_endpointing_delay) | Default 0.5 | Set to 0 inside TurnHandlingOptions when using turn_detection="stt" |