Universal 3.5 Pro Realtime on LiveKit

Overview

This guide covers integrating AssemblyAI’s Universal 3.5 Pro Realtime speech-to-text model into a LiveKit voice agent using the Agents framework.

Universal 3.5 Pro Realtime is our flagship next-generation streaming model for voice agents: multilingual and promptable, with conversation context and voice focus. It is available on livekit-agents 1.6+; set model="universal-3-5-pro".

AssemblyAI provides the speech-to-text and turn detection in your LiveKit pipeline. Once you have an agent running, tune it for what matters most to your use case:

Turn detection

Decide when the user is done speaking — modes, defaults, and entity tuning.

Latency

Shorten the gap between the user finishing and the agent replying.

Accuracy

Prompting, key terms, conversation context, and noise handling.

Interruptions

Natural barge-in without false triggers from backchannels.

Quickstart

Get a working, talking agent in a few minutes, then optimize from there.

Install the SDK

Install the plugin and supporting packages (silero, codecs, dotenv) from PyPI:

pip install "livekit-agents[assemblyai,silero,codecs]~=1.6" \
    python-dotenv

If you plan to use LiveKit turn detection with MultilingualModel(), also install the turn detector plugin:

pip install "livekit-plugins-turn-detector~=1.0"

Install the latest livekit-agents from PyPI. Universal 3.5 Pro Realtime, agent_context, and Voice Focus require a recent 1.6 release. Older versions won’t recognize the universal-3-5-pro model and return a validation error.

Set your API keys

Set your API keys in a .env file:

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_key
# Add API keys for your chosen LLM and TTS providers

You can obtain an AssemblyAI API key by signing up for a free account and navigating to the API Keys tab of the dashboard.
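A missing key usually surfaces only when the agent first tries to connect, so it can help to check the environment up front. The following is a minimal sketch, not part of the LiveKit or AssemblyAI SDKs: the helper name missing_env_vars is hypothetical, and the variable list simply mirrors the .env file above.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_ENV_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "ASSEMBLYAI_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` is any mapping of names to values; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Call missing_env_vars() after load_dotenv() and stop early with a clear message if the returned list is not empty. Extend REQUIRED_ENV_VARS with the keys for your LLM and TTS providers.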

Build a minimal agent

The following example uses turn_detection="stt" (recommended). The code comments show what to change if you use MultilingualModel() instead.

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent, TurnHandlingOptions
from livekit.plugins import (
    assemblyai,
    silero,
)
# For MultilingualModel, uncomment the following:
# from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="universal-3-5-pro",
            min_turn_silence=100,
            max_turn_silence=1000,  # When turn_detection="stt", override plugin default of 100.
            # If using MultilingualModel(), the plugin defaults (min: 100, max: 100) work well. Omit min_turn_silence and max_turn_silence above if preferred.
            vad_threshold=0.3,  # Match Silero's activation_threshold
            # continuous_partials is True by default in the LiveKit plugin — steady ~3s partials during long turns.
            # interruption_delay=0,  # Optional: faster first partial (~300ms effective). Default: 500 (~800ms effective).
        ),
        # llm=your_llm_plugin(),  # Add your LLM provider here
        # tts=your_tts_plugin(),  # Add your TTS provider here
        vad=silero.VAD.load(
            activation_threshold=0.3,  # Match AssemblyAI's internal VAD threshold
        ),
        turn_handling=TurnHandlingOptions(
            turn_detection="stt",
            # To use LiveKit's turn detection instead, replace the line above with:
            # turn_detection=MultilingualModel(),
            endpointing={"min_delay": 0},  # Avoid additive delay in STT mode
            # If using MultilingualModel(), set these instead:
            # endpointing={"min_delay": 0.5, "max_delay": 3.0},
        ),
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

For a complete voice agent, you will also need to install LLM and TTS plugins for your chosen providers. See the LiveKit plugins documentation for available options.

Run and test

Start your agent in development mode:

python your_agent_file.py dev

Then test it in the LiveKit Playground:

Go to agents-playground.livekit.io
Connect to your LiveKit Cloud project (same credentials as your .env)
Click Connect: a room will be created, your agent will join, and you can start talking

Parameters reference

Universal 3.5 Pro Realtime parameters

These are the key parameters to tune for LiveKit when using Universal 3.5 Pro Realtime:

model

string

default:"universal-3-5-pro"

The streaming model. Defaults to "universal-3-5-pro", our recommended flagship model.

mode

string

Accuracy/latency preset: "min_latency", "balanced", or "max_accuracy". Sets the defaults for mode-dependent fields (min_turn_silence, vad_threshold, interruption_delay, continuous_partials, previous_context_n_turns); any value you set explicitly still takes precedence. Leave unset for the server’s default preset. Connect-time only — cannot be changed with update_options. Universal-3.5 Pro only. See Optimizing accuracy and latency.

keyterms_prompt

list of strings

List of terms to boost recognition for. Applied automatically alongside any contextual prompt.

prompt

string

Contextual prompt — a natural-language description of what the audio is about (domain, scenario, or full details). Transcription behavior is built in and optimized automatically.

agent_context

string

Your agent’s most recent spoken reply, up to ~1500 characters, used as context for transcribing the next user turn. Can be set at construction time and updated mid-stream with update_options. Universal-3.5 Pro only. See Conversation context.

previous_context_n_turns

integer

How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover. Leave unset for the server default (5). Connect-time only — cannot be changed with update_options. Universal-3.5 Pro only.

agent_context_carryover

boolean

default:"false"

When true, the plugin automatically forwards your agent’s replies to the STT model as agent_context after each turn — no conversation_item_added event wiring required. Requires livekit-agents 1.6.5+. Universal-3.5 Pro only. See Conversation context.

min_turn_silence

integer

default:"100"

Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation to decide whether the turn has ended.

max_turn_silence

integer

default:"100"

Maximum milliseconds of silence before the turn is forced to end, regardless of punctuation. The LiveKit plugin defaults to 100. Set to 1000 when using turn_detection="stt".

vad_threshold

float

default:"0.3"

AssemblyAI’s internal Silero VAD threshold. Universal 3.5 Pro Realtime defaults to 0.3, unlike Universal-Streaming’s 0.4. Align with LiveKit’s Silero activation_threshold for consistent behavior.

voice_focus

string

Server-side noise suppression that isolates the primary speaker. "near-field" for close-talking mics, "far-field" for distant capture. Connect-time only. Universal-3.5 Pro only. See Voice focus.

voice_focus_threshold

float

How aggressively voice_focus suppresses background audio. 0.0–1.0; higher is more aggressive. Connect-time only. Universal-3.5 Pro only.

continuous_partials

boolean

default:"true"

Whether to emit additional partial transcripts during long turns. When enabled (the default on both the API and the LiveKit plugin), partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial, whose timing is controlled by interruption_delay, is unaffected either way. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. See Continuous partials for details.

interruption_delay

integer

default:"500"

How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay: 0 → ~300ms effective, interruption_delay: 500 → ~800ms effective). See Tuning early partial timing for details.

language_detection

boolean

default:"true"

Universal 3.5 Pro Realtime code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages. Defaults to true in the LiveKit plugin, but false when using the API directly.

General STT parameters

These parameters apply to all AssemblyAI streaming models and can remain the same between models:

sample_rate

int

default:"16000"

The sample rate of the audio stream.

encoding

str

default:"pcm_s16le"

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
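To see how sample_rate and encoding translate into upstream bandwidth, the arithmetic can be sketched as follows. This is an illustrative helper, not an SDK function: the name audio_bytes_per_second is hypothetical, and it assumes mono audio unless told otherwise. The bytes-per-sample values follow from the encodings themselves (16-bit PCM is 2 bytes per sample, mu-law is 1 byte per sample).

```python
# Bytes per sample for the two allowed encodings.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # 16-bit signed little-endian PCM
    "pcm_mulaw": 1,  # 8-bit mu-law
}


def audio_bytes_per_second(sample_rate=16000, encoding="pcm_s16le", channels=1):
    """Return the raw audio payload size in bytes per second."""
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    if sample_rate <= 0 or channels <= 0:
        raise ValueError("sample_rate and channels must be positive")
    return sample_rate * BYTES_PER_SAMPLE[encoding] * channels
```

With the defaults (16 kHz, pcm_s16le, mono) this comes to 32,000 bytes per second. Tripling the sample rate to 48 kHz triples the payload.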

Legacy parameters

These parameters apply to the universal-streaming-english and universal-streaming-multilingual AssemblyAI streaming models, but do not affect Universal 3.5 Pro Realtime:

end_of_turn_confidence_threshold

float

default:"0.4"

Confidence threshold for end-of-turn detection. Universal 3.5 Pro Realtime uses punctuation-based turn detection instead.

format_turns

boolean

default:"false"

Whether to return formatted final transcripts. Universal 3.5 Pro Realtime always returns formatted transcripts, so this parameter no longer applies.

Turn detection

In LiveKit, how your agent detects the end of a user’s turn is controlled by the turn_detection parameter inside TurnHandlingOptions, which is passed to AgentSession via the turn_handling argument. Universal 3.5 Pro Realtime uses a punctuation-based turn detection system, which checks for terminal punctuation (. ? !) after periods of silence rather than using a confidence score. This means the min_turn_silence and max_turn_silence parameters you pass to AssemblyAI directly control when transcripts are emitted and when turns end. For more details on how this works, see Configuring turn detection.

When not explicitly provided, the default endpointing parameters for Universal 3.5 Pro Realtime differ on LiveKit versus using AssemblyAI’s API directly:

LiveKit AssemblyAI plugin defaults:
- min_turn_silence=100
- max_turn_silence=100
AssemblyAI API defaults:
- min_turn_silence=100
- max_turn_silence=1000

However, you can always override these by passing your own preferred values explicitly. Misconfiguring these parameters is the most common cause of poor performance — read the recommended values per mode below.

Default parameter differences

Universal 3.5 Pro Realtime’s endpointing is controlled by two AssemblyAI API parameters, min_turn_silence and max_turn_silence, that you pass to the STT plugin. These are separate from LiveKit’s endpointing.min_delay and endpointing.max_delay (set inside TurnHandlingOptions).

Parameter	AssemblyAI API default	LiveKit plugin default	Description
`min_turn_silence`	`100` ms	`100` ms	Silence before a speculative end-of-turn check. If terminal punctuation (`.` `?` `!`) is found, the turn ends. If not, a partial is emitted and the turn continues.
`max_turn_silence`	`1000` ms	`100` ms	Maximum silence before forcing the turn to end, regardless of punctuation.
`continuous_partials`	`true`	`true`	Emit partial transcripts every ~3 seconds during long turns for steady mid-turn updates to downstream consumers. Enabled by default on both the API and the LiveKit plugin.

The LiveKit plugin defaults are optimized for third-party turn detection models, where you want transcripts handed off as fast as possible. When using turn_detection="stt", you should explicitly set max_turn_silence=1000 if you’d like to mimic the behavior of streaming directly to the API without LiveKit.

Tuning endpointing parameters: These are the default values used when no parameters are explicitly provided. You will likely need to experiment with different values depending on your use case:

Increase min_turn_silence when brief pauses cause the speculative EOT check to fire too early, ending turns on terminal punctuation before the user has finished speaking.
Increase max_turn_silence when the forced turn end cuts off users mid-thought or splits entities like phone numbers across turns. A higher value lets the model wait longer before forcing the turn to end when no terminal punctuation has been found.

See the Entity splitting tradeoff section for examples.

STT-based Turn Detection (recommended)

With turn_detection="stt", AssemblyAI’s built-in punctuation-based turn detection determines when the user has finished speaking, and LiveKit uses AssemblyAI’s end_of_turn signals directly to commit the turn. In this mode, we recommend explicitly setting min_turn_silence=100 and max_turn_silence=1000. These are AssemblyAI’s API defaults and provide a good balance of responsiveness and accuracy. The LiveKit plugin defaults to min_turn_silence=100 and max_turn_silence=100, which might be too aggressive for STT-based turn detection.

Recommended starting parameters (set on assemblyai.STT(), not on AgentSession):

Parameter	Recommended value	Description
`min_turn_silence`	`100` ms	Silence duration before a speculative end-of-turn (EOT) check fires.
`max_turn_silence`	`1000` ms	Maximum silence before a turn is forced to end.

How it works:

User speaks → audio streams to AssemblyAI
User pauses for 100ms → AssemblyAI checks for terminal punctuation
If terminal punctuation (. ? !) → turn ends immediately
If no terminal punctuation → partial emitted, turn continues waiting
If silence reaches 1000ms → turn is forced to end regardless of punctuation
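The steps above can be sketched as a small decision function. This is a simplified model for building intuition, not AssemblyAI’s implementation: the function name and return labels are hypothetical, and in the real service the model itself decides whether the transcript ends in terminal punctuation.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def end_of_turn_decision(transcript, silence_ms,
                         min_turn_silence=100, max_turn_silence=1000):
    """Classify the state of a turn after `silence_ms` of silence.

    Returns "wait" (keep listening), "partial" (speculative check fired
    without terminal punctuation), or "end_of_turn".
    """
    if silence_ms >= max_turn_silence:
        # Forced end, regardless of punctuation.
        return "end_of_turn"
    if silence_ms >= min_turn_silence:
        # Speculative check: terminal punctuation ends the turn.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        return "partial"
    return "wait"
```

With the LiveKit plugin defaults (both values at 100), the forced-end branch fires as soon as the speculative check would, which is why max_turn_silence=1000 is recommended in STT mode.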

Endpointing min_delay is additive in STT mode: LiveKit’s endpointing min_delay (default 0.5 seconds) is applied on top of AssemblyAI’s own endpointing. In STT mode, this delay starts after the STT end-of-speech signal, so it adds up to 500 ms of extra latency by default. Set endpointing={"min_delay": 0} inside TurnHandlingOptions to avoid this. AssemblyAI’s own endpointing parameters (min_turn_silence and max_turn_silence) already control the timing, so an additional delay on the LiveKit side only adds latency. See Latency for the full picture.

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},  # Avoid additive delay in STT mode
    ),
    stt=assemblyai.STT(
        model="universal-3-5-pro",
        min_turn_silence=100,   # Silence (ms) before a speculative end-of-turn check
        max_turn_silence=1000,  # Max silence (ms) before forcing the turn to end
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
)

LiveKit turn detection (with `MultilingualModel()`)

LiveKit’s turn detector is a third-party turn detection model that runs on top of STT output to make turn decisions. AssemblyAI’s role is then just to provide transcripts as quickly as possible, while the turn detection model decides when the user is actually done speaking. Use MultilingualModel() rather than EnglishModel(): Universal 3.5 Pro Realtime supports 18 languages, and MultilingualModel() covers all of them. The LiveKit plugin defaults of min_turn_silence=100 and max_turn_silence=100 work well here, because max_turn_silence is lowered to match min_turn_silence so that transcripts are handed off to the turn detection model as fast as possible.

MultilingualModel parameters (set inside TurnHandlingOptions → endpointing, not on the STT plugin):

Parameter	Default	Description
`endpointing.min_delay`	`0.5` s	Time to wait before committing a turn when the model predicts a likely boundary.
`endpointing.max_delay`	`3.0` s	Maximum time to wait when the model predicts the user will continue speaking. Has no effect without a turn detector model.

How it works:

User speaks → audio streams to AssemblyAI
User pauses for 100ms → AssemblyAI emits transcript (final and partial are the same) immediately
LiveKit’s MultilingualModel() evaluates the transcript in conversational context
If the model predicts a likely turn boundary → waits min_delay (0.5s) then commits the turn
If the model predicts the user will continue → waits up to max_delay (3.0s) for more speech
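The last two steps reduce to choosing which endpointing delay applies. A minimal sketch of that choice, assuming the turn detector’s prediction is available as a boolean (the function name is hypothetical and this is not LiveKit’s internal logic):

```python
def seconds_until_commit(likely_turn_boundary, min_delay=0.5, max_delay=3.0):
    """Return how long to wait, in seconds, before committing the turn.

    If the user resumes speaking during the wait, the turn continues and
    the wait starts over; that part is not modeled here.
    """
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("expected 0 <= min_delay <= max_delay")
    # Likely boundary: commit after the short delay.
    # Likely continuation: allow up to the long delay for more speech.
    return min_delay if likely_turn_boundary else max_delay
```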

from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins import assemblyai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
        endpointing={
            "min_delay": 0.5,  # Time (s) to wait before committing a turn when the model is confident
            "max_delay": 3.0,  # Max time (s) to wait when the model is not confident
        },
    ),
    stt=assemblyai.STT(
        model="universal-3-5-pro",
        vad_threshold=0.3,
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,
    ),
)

Other turn detection modes

vad:
- Detect end of turn from speech and silence data alone using Silero VAD.
- Turn boundaries are determined purely by voice activity without semantic context.
- AssemblyAI’s turn detection parameters still control when transcripts are emitted, but it is recommended to leave them at the plugin defaults (min_turn_silence=100, max_turn_silence=100) so transcripts arrive as quickly as possible.
manual:
- Disable automatic turn detection entirely.
- You control turns explicitly using session.commit_user_turn(), session.clear_user_turn(), and session.interrupt().
- See the manual turn control docs for details.

VAD configuration

With turn_detection="stt", AssemblyAI also sends SpeechStarted events that LiveKit uses for barge-in/interruption handling. SpeechStarted is only emitted when the model produces a transcript. Silero VAD is not strictly required in this mode, but it is still recommended because Silero runs locally and can be faster than waiting for AssemblyAI’s SpeechStarted signal. LiveKit respects whichever signal arrives first, so Silero provides faster interruption while AssemblyAI’s signal serves as a reliable backup.

With MultilingualModel(), Silero VAD is required, as it is the only source of START_OF_SPEECH events for interruption in this mode. AssemblyAI’s SpeechStarted event is not used.

Threshold alignment

LiveKit’s Silero VAD defaults to an activation_threshold of 0.5, while AssemblyAI’s vad_threshold defaults to 0.3. For best performance, we recommend setting both to 0.3 and, whenever you adjust one, changing the other to the same value so that transcription and barge-in use a consistent threshold. When the thresholds are mismatched, you get a dead zone: if Silero is at 0.5 and AssemblyAI is at 0.3, AssemblyAI will be actively transcribing speech that LiveKit hasn’t detected yet, delaying interruption. Keeping them aligned eliminates this.

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        model="universal-3-5-pro",
        vad_threshold=0.3,         # AssemblyAI's internal VAD onset
    ),
    vad=silero.VAD.load(
        activation_threshold=0.3,  # Match AssemblyAI's threshold
    ),
)

If you’re in a noisy environment and receiving false speech triggers, raise stt.vad_threshold and vad.activation_threshold together.
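Because the two thresholds live in different constructor calls, they are easy to change independently by accident. One way to guard against that is a startup check like the sketch below. The helper name vad_threshold_mismatch is hypothetical and is not part of either plugin.

```python
import math


def vad_threshold_mismatch(stt_vad_threshold, silero_activation_threshold):
    """Return a warning string if the thresholds differ, else None."""
    if math.isclose(stt_vad_threshold, silero_activation_threshold, abs_tol=1e-9):
        return None
    return (
        f"VAD thresholds are mismatched: AssemblyAI vad_threshold="
        f"{stt_vad_threshold}, Silero activation_threshold="
        f"{silero_activation_threshold}. Set both to the same value."
    )
```

A simpler alternative is to define the threshold once as a constant and pass that constant to both assemblyai.STT() and silero.VAD.load().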

Entity splitting tradeoff

Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.

`min_turn_silence` too low

Speculative check fires too early, splitting entities on punctuation.
Example: User spells out an email address with brief pauses between parts. The speculative check fires at 100ms of silence, and the model adds terminal punctuation to each segment, ending the turn prematurely.

# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."                    → FINAL (100ms pause, check fires, period found → turn ends)
"Smith."                        → FINAL
"At gmail.com."                 → FINAL

# With (min_turn_silence=400, max_turn_silence=1000)
"It's john.smith@gmail.com."    → FINAL (single turn, properly formatted)

`max_turn_silence` too low

Forced turn-end cuts off user mid-thought.
Example: User pauses longer than 1 second to think mid-sentence. The forced end fires at 1000ms, splitting the utterance into two turns regardless of punctuation.

# With (min_turn_silence=100, max_turn_silence=1000)
"I wanted to check on my order from..."  → FINAL (1000ms silence, forced end)
"last Tuesday, order number 4829."     → FINAL (new turn)

# With (min_turn_silence=100, max_turn_silence=2000)
"I wanted to check on my order from last Tuesday, order number 4829."  → FINAL (single turn)

Universal 3.5 Pro Realtime’s formatting is significantly better when it has full context in a single turn. Email addresses, phone numbers, credit card numbers, and physical addresses all benefit from this.

LLMs downstream can usually piece together split entities, but if your use case involves alphanumeric dictation or entity extraction, consider increasing min_turn_silence and max_turn_silence during those portions of the conversation. You can update configuration mid-stream to raise max_turn_silence temporarily (e.g., to 2000–4000 ms) when expecting entity input, then lower it again afterward.
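One way to structure that temporary change is a context manager that raises max_turn_silence on entry and restores it on exit. This is a sketch under stated assumptions: it assumes the STT object exposes update_options() accepting a max_turn_silence keyword, as described above. The names entity_capture_window and RecordingSTT are hypothetical, and RecordingSTT is only a stand-in so the example runs without a live connection.

```python
from contextlib import contextmanager


@contextmanager
def entity_capture_window(stt, *, capture_ms=3000, normal_ms=1000):
    """Raise max_turn_silence while the user dictates an entity.

    `stt` is assumed to have update_options(max_turn_silence=...).
    The normal value is restored even if the body raises.
    """
    stt.update_options(max_turn_silence=capture_ms)
    try:
        yield
    finally:
        stt.update_options(max_turn_silence=normal_ms)


class RecordingSTT:
    """Stand-in for the STT plugin; records update_options calls."""

    def __init__(self):
        self.calls = []

    def update_options(self, **kwargs):
        self.calls.append(kwargs)
```

In an agent, you would wrap the part of the conversation where you ask for an email address or phone number, for example `with entity_capture_window(session.stt): ...`.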

Even when using third-party turn detection, you may want to increase min_turn_silence or max_turn_silence if users are likely to speak slowly or dictate entities. While this adds latency, it improves accuracy by giving the model more audio context before emitting a transcript and keeping the full entity complete within the same turn.

Latency

A voice agent feels responsive when the gap between the user finishing and the agent replying is short. Start with the mode preset — the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there:

stt=assemblyai.STT(
    model="universal-3-5-pro",
    mode="balanced",  # "min_latency" (fastest) · "balanced" · "max_accuracy" (best quality)
)

mode is set at construction time (it can’t be changed mid-session) and influences the defaults of min_turn_silence, vad_threshold, interruption_delay, continuous_partials, and previous_context_n_turns. Any value you set explicitly still wins. Leave it unset to use the server’s default preset. See Optimizing accuracy and latency.

From there, fine-tune the individual levers:

Endpointing delay (STT mode). LiveKit’s endpointing.min_delay is applied on top of AssemblyAI’s own end-of-turn timing and adds up to 500ms by default. Set endpointing={"min_delay": 0} when using turn_detection="stt" — see STT-based turn detection.
End-of-turn timing. min_turn_silence (speculative check) and max_turn_silence (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities — see Turn detection.
Time to first partial. interruption_delay controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of 300ms on top of the configured value.

# Tune first-partial timing for faster barge-in
stt.update_options(
    interruption_delay=0,  # ~300ms effective TTFT
)

Sample rate. Use 16 kHz (sample_rate=16000). Higher rates don’t improve accuracy and only add bandwidth.
Continuous partials. continuous_partials (on by default) emits a partial every ~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
Skip client-side preprocessing. Don’t run your own noise cancellation before audio reaches the model — the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side Voice Focus instead.

Latency breakdown

Stage	Typical	Controlled by
Network round trip	~50 ms	—
Speech-to-text	~200–300 ms	model
First partial (TTFT)	configured `interruption_delay` + ~300 ms server min	`interruption_delay`
End of turn (terminal punctuation found)	`min_turn_silence` (default 100 ms)	`min_turn_silence`
End of turn (no punctuation, forced)	up to `max_turn_silence`	`max_turn_silence`
LiveKit endpointing (STT mode)	+ `min_delay` (set to `0`)	`endpointing.min_delay`
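The configurable rows of the table can be turned into a rough budget calculator. This is an estimate for comparing settings, not a measurement: the function names are hypothetical, the 300 ms constant is the server minimum stated above, and network and model processing time are deliberately left out.

```python
SERVER_MIN_FIRST_PARTIAL_MS = 300  # server-side minimum described above


def first_partial_ms(interruption_delay=500):
    """Effective time to first partial for a given interruption_delay."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be in the range 0-1000")
    return interruption_delay + SERVER_MIN_FIRST_PARTIAL_MS


def end_of_turn_delay_ms(*, punctuated, min_turn_silence=100,
                         max_turn_silence=1000, endpointing_min_delay_s=0.0):
    """Upper-bound delay from end of speech to a committed turn in STT mode.

    `punctuated` says whether the speculative check finds terminal
    punctuation. If not, the turn waits for the forced end.
    """
    silence = min_turn_silence if punctuated else max_turn_silence
    return silence + int(round(endpointing_min_delay_s * 1000))
```

For example, leaving LiveKit’s endpointing min_delay at 0.5 s turns a 100 ms punctuated end of turn into up to 600 ms, which is why the STT-mode examples set it to 0.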

Accuracy

Universal 3.5 Pro Realtime is accurate out of the box. When you need more — domain vocabulary, proper nouns, noisy audio — reach for these levers. For entity-heavy dictation, also tune turn detection (see Entity splitting tradeoff), and note that the high-level mode preset shifts the overall accuracy/latency balance (use max_accuracy to favor quality).

Prompting

Beta feature: Prompting is considered a beta feature for Universal 3.5 Pro Realtime. While it can be a powerful tool for improving accuracy in certain use cases, we recommend starting without a prompt to first establish baseline performance. Once the baseline has been tested, you can add context to further optimize for your use case, such as the language mix to expect (e.g., English and Hindi) or the use case or domain (e.g., medical, legal).

Universal 3.5 Pro Realtime supports a prompt parameter for contextual prompting — a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.

stt=assemblyai.STT(
    model="universal-3-5-pro",
    prompt="Customer support call about an internet service outage.",
)

Tips:

Start with no prompt: Universal 3.5 Pro Realtime delivers strong accuracy out of the box — only add context if domain-specific vocabulary is being misrecognized.
Describe the conversation: domain, scenario, or full details — start broad, and add only details your application actually knows (see the three context levels).
Include known names and identifiers: caller name, account or order IDs, products — detailed context helps the model spell them correctly.

Key terms

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms. It can be used on its own or together with prompt; key terms are applied alongside any contextual prompt:

stt=assemblyai.STT(
    model="universal-3-5-pro",
    keyterms_prompt=["AssemblyAI", "LiveKit", "Universal 3.5 Pro Realtime"],
)

Conversation context

Give the model both sides of the dialog so it transcribes the next user turn more accurately. Universal 3.5 Pro Realtime keeps a short, per-session memory of the conversation from two sources:

The agent half — what your agent just said, which you push in via the agent_context parameter.
The user half — prior STT-finalized user turns, carried forward automatically (no configuration needed).

With the agent’s question in context, the model can anticipate the answer, sharpen entity recognition, and disambiguate similar-sounding words. For example, after your agent asks "What's your email address?", agent_context lets the model produce "user@assemblyai.com" instead of "user at assemblyai dot com". This has the biggest impact on short replies ("yes", "7pm", single names) and spelled-out entities. See Conversation context for the full reference.

Parameter	Type	Description
`agent_context_carryover`	boolean	When `true`, the plugin automatically forwards your agent’s replies to the STT model as `agent_context` after each turn — no event wiring required. Requires `livekit-agents` 1.6.5+. Defaults to `false`.
`agent_context`	string	Your agent’s most recent spoken reply (what your TTS just said), up to ~1500 characters. Set it at construction time to seed an opening greeting, and update it after each agent reply via `update_options()`.
`previous_context_n_turns`	integer	How many prior conversation entries are carried forward automatically. Connect-time only. Range `0`–`100`; `0` disables carryover; leave unset for the server default (5).

agent_context_carryover, agent_context, and previous_context_n_turns are supported only on Universal-3.5 Pro (universal-3-5-pro). previous_context_n_turns is set at construction time and cannot be changed with update_options().

Wiring agent context with `agent_context_carryover` (recommended)

On livekit-agents 1.6.5+, set agent_context_carryover=True on assemblyai.STT(...) and the plugin forwards your agent’s replies to the STT model as agent_context automatically after each turn. No conversation_item_added handler required.

from livekit.agents import AgentSession
from livekit.plugins import assemblyai

session = AgentSession(
    stt=assemblyai.STT(
        model="universal-3-5-pro",
        agent_context_carryover=True,  # Requires livekit-agents 1.6.5+
    ),
    # llm=your_llm_plugin(),
    # tts=your_tts_plugin(),
)

To seed the model with your agent’s opening greeting, also pass agent_context directly on assemblyai.STT(...) at construction time. agent_context_carryover will keep it current after each subsequent reply.

Wiring agent context with `conversation_item_added` (pre-1.6.5)

If you’re on a version of livekit-agents that predates 1.6.5, wire agent context yourself. LiveKit emits a conversation_item_added event for every conversation item; filter for the assistant’s messages and forward the (trimmed) text to the STT plugin via update_options(agent_context=...). No reconnect is required. A complete runnable example is on GitHub: including-agent-context.

from livekit.agents import AgentSession, ConversationItemAddedEvent
from livekit.plugins import assemblyai

AGENT_CONTEXT_MAX_CHARS = 1500

session = AgentSession(
    stt=assemblyai.STT(
        model="universal-3-5-pro",
    ),
    # llm=your_llm_plugin(),
    # tts=your_tts_plugin(),
)


@session.on("conversation_item_added")
def _on_conversation_item_added(ev: ConversationItemAddedEvent) -> None:
    # Only forward the agent's own replies as context for the next user turn.
    if ev.item.type != "message" or ev.item.role != "assistant":
        return

    agent_stt = session.stt
    if not isinstance(agent_stt, assemblyai.STT):
        return

    spoken = ev.item.text_content
    if not spoken:
        return

    # Push the latest reply; earlier turns are carried forward automatically.
    agent_stt.update_options(agent_context=spoken[-AGENT_CONTEXT_MAX_CHARS:])

Voice focus

Voice Focus isolates the primary speaker and suppresses background noise — chatter, keyboard clicks, fan hum, room echo — server-side, before audio reaches the model. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.

Parameter	Type	Description
`voice_focus`	string	`"near-field"` for headsets, handsets, and other close-talking mics; `"far-field"` for conference rooms, laptop mics, and other distant capture.
`voice_focus_threshold`	float	Optional. `0.0`–`1.0`; higher values suppress background audio more aggressively.

Both are connect-time parameters, supported only on Universal-3.5 Pro, and cannot be changed with update_options(). See Voice Focus for details.

stt=assemblyai.STT(
    model="universal-3-5-pro",
    voice_focus="far-field",      # "near-field" for close-talking mics
    voice_focus_threshold=0.5,    # Optional: 0.0–1.0, higher = more aggressive
)

Interruption handling

In voice agent conversations, users often produce backchannel utterances (“mhm”, “yeah”, “um”, “okay”) while the agent is speaking. These short fillers can trigger LiveKit’s interruption logic, causing the agent to stop mid-sentence even though the user didn’t intend to interrupt. LiveKit Cloud users should reach for adaptive interruption handling first. Self-hosted deployments can combine the two custom filters described below.

Recommended: Adaptive interruption handling

LiveKit Agents v1.5.0 introduced adaptive interruption handling, a server-side model that classifies overlapping speech as a real interrupt or a backchannel. On a false interrupt the agent’s TTS resumes from where it left off. No re-generation is needed. The feature is recommended for LiveKit Cloud users and is included on all Cloud plans.

session = AgentSession(
    stt=assemblyai.STT(model="universal-3-5-pro"),
    # llm=your_llm_plugin(),
    # tts=your_tts_plugin(),
    vad=silero.VAD.load(),
    interruption={"mode": "adaptive"},  # default in v1.5.0+
)

See LiveKit’s adaptive interruption handling docs for full requirements and behavior.

Self-hosted alternative

If you’re self-hosting LiveKit, can’t use adaptive interruption handling, or need explicit control over which utterances count as filler, the two complementary filters below provide strong guardrails for interruption and barge-in handling. They have non-overlapping failure modes: the backchannel filter stops known fillers before they enter the pipeline, while the buffer-clearing filter catches any short utterance (including unknown fillers or stutters) that slips past. Running both provides the strongest coverage. A working reference implementation is available on GitHub.

Upstream backchannel filter

The backchannel filter intercepts STT events at the stt_node level, before they reach LiveKit’s audio_recognition. It checks each transcript event for known disfluencies and backchannels (“um”, “mhm”, “yeah”, “okay”, etc.) and drops the event entirely. Because the event never enters the pipeline, none of the downstream orchestration (interrupt gates, end-of-turn detection, and preemptive LLM generation) ever reacts to it.

The filter is implemented as a mixin class that wraps Agent.stt_node:

from __future__ import annotations

import logging
import string
import time
from collections.abc import AsyncIterable

from livekit import rtc
from livekit.agents import Agent, stt
from livekit.agents.voice import ModelSettings


# "yes" / "no" deliberately omitted - in a booking flow a bare "yes"
# is a real confirmation. Edit for your domain.
BACKCHANNELS = frozenset({
    "mhm", "mm", "mmhm", "mmhmm",
    "uh", "uhhuh", "huh",
    "um", "umm", "uhm",
    "er", "erm",
    "hmm", "hm",
    "ah", "oh",
    "yeah", "yep", "yup",
    "okay", "ok",
    "right", "alright", "gotcha",
})

_TRANSCRIPT_TYPES = {
    stt.SpeechEventType.INTERIM_TRANSCRIPT,
    stt.SpeechEventType.PREFLIGHT_TRANSCRIPT,
    stt.SpeechEventType.FINAL_TRANSCRIPT,
}

_PUNCT_STRIP = str.maketrans("", "", string.punctuation)

log = logging.getLogger("backchannel_stt_filter")


def _is_all_backchannel(text: str) -> bool:
    """Return True only when every token is a known backchannel."""
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


class BackchannelSTTFilterMixin:
    """Drop backchannel-only transcripts while the agent is speaking."""

    _FILTER_GRACE_S: float = 1.0
    _last_speaking_at: float = 0.0

    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ):
        async for ev in Agent.default.stt_node(self, audio, model_settings):
            if self._should_drop(ev):
                text = ev.alternatives[0].text if ev.alternatives else ""
                log.info(
                    "event_filtered transcript=%r ev_type=%s agent_state=%s",
                    text, ev.type, self.session.agent_state,
                )
                continue
            yield ev

    def _should_drop(self, ev: stt.SpeechEvent) -> bool:
        now = time.monotonic()
        if self.session.agent_state == "speaking":
            self._last_speaking_at = now
        elif now - self._last_speaking_at > self._FILTER_GRACE_S:
            return False

        if ev.type not in _TRANSCRIPT_TYPES:
            return False

        text = ev.alternatives[0].text if ev.alternatives else ""
        return _is_all_backchannel(text)

How it works:

1. While the agent is speaking (plus a 1-second grace window after speech ends), the filter inspects each STT transcript event.
2. It strips punctuation and checks whether every token in the transcript matches the BACKCHANNELS set.
3. Pure-filler transcripts like “mhm” or “yeah okay” are dropped. They never reach LiveKit’s pipeline.
4. Utterances with any non-filler token (e.g., “yeah I want the suite”) always pass through.
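The drop rule can be exercised without a live session. The sketch below re-implements the decision as a pure function over three inputs, using a trimmed copy of the `BACKCHANNELS` set; `should_drop` and its parameters are illustrative names rather than part of the mixin.

```python
import string

# Trimmed copy of the BACKCHANNELS set, for illustration.
BACKCHANNELS = frozenset({"mhm", "mmhmm", "uhhuh", "um", "yeah", "okay", "ok"})
FILTER_GRACE_S = 1.0
_PUNCT_STRIP = str.maketrans("", "", string.punctuation)


def is_all_backchannel(text: str) -> bool:
    """True only when the transcript is non-empty and every token is a filler."""
    tokens = text.lower().translate(_PUNCT_STRIP).split()
    return bool(tokens) and all(tok in BACKCHANNELS for tok in tokens)


def should_drop(text: str, agent_speaking: bool, secs_since_speaking: float) -> bool:
    """Drop only pure-filler transcripts during speech or the grace window."""
    in_window = agent_speaking or secs_since_speaking <= FILTER_GRACE_S
    return in_window and is_all_backchannel(text)


for text, speaking, gap in [
    ("Mhm.", True, 0.0),                   # pure filler while the agent speaks
    ("Uh-huh, okay!", True, 0.0),          # hyphen is stripped, giving "uhhuh"
    ("yeah I want the suite", True, 0.0),  # contains non-filler tokens
    ("mhm", False, 0.4),                   # inside the grace window
    ("mhm", False, 2.0),                   # agent finished more than 1 s ago
]:
    print(f"{text!r:28} drop={should_drop(text, speaking, gap)}")
```

Stripping punctuation before tokenizing is what lets hyphenated forms such as "uh-huh" match the unhyphenated entries in the set.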

The BACKCHANNELS set is domain-customizable. “yes” and “no” are deliberately excluded because they often represent genuine confirmations in booking or IVR scenarios. Edit the set for your use case.

Short-utterance buffer clearing

LiveKit accumulates committed FINAL transcripts in a private buffer (_audio_transcript) across the user’s uncommitted turn. The _interrupt_by_audio_activity method checks the running word count against min_words to decide whether to pause TTS.

Without intervention, two consecutive short fillers like “yeah” + “um” sum to two words and trip the interrupt gate, even though each utterance on its own is below threshold.

This filter listens on the user_input_transcribed event and wipes the buffer whenever the agent is speaking and the user input falls below the configured min_words threshold. Each short utterance is evaluated independently rather than against the accumulated total.

This filter requires interruption.min_words to be set to 2 or higher. Without it, the word-count gate is disabled and the filter has no effect.
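The arithmetic behind the accumulation problem is easy to reproduce. The following sketch is a simplified model, not LiveKit's implementation: `interrupts` mimics a word-count gate over a running buffer, and the `clear_short` flag stands in for the buffer-clearing filter.

```python
from typing import List

MIN_WORDS = 2  # mirrors interruption.min_words


def interrupts(finals: List[str], clear_short: bool) -> List[bool]:
    """Return, per FINAL transcript, whether the word-count gate would trip."""
    buffer: List[str] = []
    results: List[bool] = []
    for text in finals:
        words = text.split()
        if clear_short and len(words) < MIN_WORDS:
            buffer.clear()  # evaluate this short utterance on its own
            results.append(False)
            continue
        buffer.extend(words)
        results.append(len(buffer) >= MIN_WORDS)
    return results


fillers = ["yeah", "um"]
print(interrupts(fillers, clear_short=False))  # accumulated: the second filler trips the gate
print(interrupts(fillers, clear_short=True))   # cleared: neither filler trips it
```

A genuine two-word request such as "suite please" still reaches the threshold on its own, so clearing short utterances does not block real interruptions in this model.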

from __future__ import annotations

import logging

from livekit.agents import AgentSession


log = logging.getLogger("short_utterance_buffer_filter")


def install_short_utterance_filter(session: AgentSession) -> None:
    """Clear transcript buffers when a short utterance arrives during agent speech."""

    @session.on("user_input_transcribed")
    def _on_user_input_transcribed(ev) -> None:
        word_count = len(ev.transcript.split())
        min_words = session.options.interruption["min_words"]

        if session.agent_state != "speaking":
            return

        if word_count >= min_words:
            return

        activity = getattr(session, "_activity", None)
        recognition = getattr(activity, "_audio_recognition", None) if activity else None
        if recognition is None:
            return

        # Wipe all three transcript buffers so short utterances
        # don't accumulate past the interrupt threshold.
        recognition._audio_transcript = ""
        recognition._audio_interim_transcript = ""
        recognition._audio_preflight_transcript = ""

        # Best-effort: abort any in-flight preemptive LLM call
        # triggered by this short utterance.
        cancel = getattr(activity, "_cancel_preemptive_generation", None)
        if callable(cancel):
            try:
                cancel()
            except Exception:
                log.debug("_cancel_preemptive_generation failed", exc_info=True)

        log.info(
            "buffer_cleared transcript=%r words=%d is_final=%s",
            ev.transcript, word_count, ev.is_final,
        )

This filter accesses LiveKit private APIs (_audio_transcript, _audio_interim_transcript, _audio_preflight_transcript, _cancel_preemptive_generation). Pin your livekit-agents version to avoid breakage on minor updates. Tested with livekit-agents>=1.5.

Wiring both filters into your agent

Three changes are needed:

1. Mix in the backchannel filter on your agent class. Place it before Agent in the class bases so its stt_node runs first:

from filters.backchannel_stt import BackchannelSTTFilterMixin

class MyAgent(BackchannelSTTFilterMixin, Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

2. Configure the session with interruption.min_words >= 2. This enables the word-count gate that the buffer-clearing filter depends on:

session = AgentSession(
    stt=assemblyai.STT(
        model="universal-3-5-pro",
        min_turn_silence=100,
        max_turn_silence=1000,
        vad_threshold=0.3,
    ),
    # llm=your_llm_plugin(),
    # tts=your_tts_plugin(),
    vad=None,  # Recommended: disable VAD so only STT drives interruption
    turn_handling={
        "turn_detection": "stt",
        "endpointing": {"min_delay": 1.0, "max_delay": 4.0},
        "interruption": {
            "enabled": True,
            "resume_false_interruption": True,
            "false_interruption_timeout": 1.5,
            "min_words": 2,  # Required for the buffer-clearing filter
        },
    },
)

3. Install the buffer-clearing filter on the session, then start it with your agent:

from filters.short_utterance_buffer import install_short_utterance_filter

install_short_utterance_filter(session)

await session.start(room=ctx.room, agent=MyAgent())

Setting vad=None with turn_detection="stt" is the recommended setup for both filters. This ensures only STT-based signals drive interruption, avoiding timing races from a competing VAD interrupt path. If you need VAD for faster barge-in, both filters still work: set interruption.min_words to 2 and ensure Silero’s activation_threshold matches vad_threshold.

Expected behavior with both filters active and min_words=2:

User utterance (during agent speech)	Backchannel filter	Buffer-clearing filter	Result
“mhm”	Dropped	-	Agent continues
“um”	Dropped	-	Agent continues
“yeah yeah”	Dropped	-	Agent continues
Unknown short word	Passes through	Buffer cleared	Agent continues
“yeah I’d like the suite”	Passes through	Passes through	Agent interrupts
“suite please”	Passes through	Passes through	Agent interrupts
“mhm” (1+ second after agent stops)	Passes through	-	Agent responds

Dynamic configuration

You can update prompt, keyterms_prompt, agent_context, min_turn_silence, max_turn_silence, continuous_partials, and interruption_delay during an active session using update_options — no reconnect required. This lets you adapt to the conversation stage: boost names while collecting caller details, widen silence windows while a user dictates an email, then tighten them again.

# `stt` here is the assemblyai.STT instance you passed to AgentSession.
# Update one or more options mid-stream
stt.update_options(
    max_turn_silence=3000,  # Increase for entity dictation
)

# Later, reset to default
stt.update_options(
    max_turn_silence=1000,
)

Conversation stage	Adjustment
Caller identification (names, account IDs)	Boost terms with `update_options(keyterms_prompt=[...])`
Entity dictation (email, phone, address)	Raise `max_turn_silence` to ~`2000`–`4000` ms, then lower it again afterward
After each agent reply	Set `agent_context_carryover=True` (1.6.5+) or push `agent_context` manually (see Conversation context)
Faster barge-in	Lower `interruption_delay` (see Latency)
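A stage-to-options mapping keeps these adjustments declarative. In the sketch below, the stage names, the example key terms, and the `apply_stage` helper are assumptions for illustration; only the option names and value ranges come from the table above. `RecordingSTT` is a stand-in so the example runs without a live connection.

```python
from typing import Any, Dict, List

DEFAULT_MAX_TURN_SILENCE_MS = 1000

# Hypothetical stage names mapped to update_options() keyword arguments.
STAGE_OPTIONS: Dict[str, Dict[str, Any]] = {
    "caller_identification": {"keyterms_prompt": ["Priya Raman", "ACC-20417"]},
    "entity_dictation": {"max_turn_silence": 3000},
    "default": {"max_turn_silence": DEFAULT_MAX_TURN_SILENCE_MS},
}


def apply_stage(stt_instance: Any, stage: str) -> Dict[str, Any]:
    """Push the options for `stage` to any object exposing update_options()."""
    options = STAGE_OPTIONS.get(stage, STAGE_OPTIONS["default"])
    stt_instance.update_options(**options)
    return options


class RecordingSTT:
    """Stand-in for assemblyai.STT that records what it is asked to change."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def update_options(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


fake = RecordingSTT()
apply_stage(fake, "entity_dictation")  # widen the window while the user dictates
apply_stage(fake, "default")           # tighten it again afterward
print(fake.calls)
```

In a real agent, the same helper would receive the `assemblyai.STT` instance from the session, called from wherever your conversation logic changes stage.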

For more information, see Updating configuration mid-stream.

Troubleshooting

Issue	Cause	Solution
Extra latency with `turn_detection="stt"`	LiveKit’s endpointing `min_delay` is additive in STT mode	Set `endpointing={"min_delay": 0}` inside `TurnHandlingOptions` on `AgentSession`
No interruption handling	Missing VAD	Ensure `vad=silero.VAD.load()` is set, with `activation_threshold` equal to `vad_threshold` (default `0.3`)
Turn over-segmentation	`min_turn_silence` too low	Increase from `100` to `200`–`500`
Entities split across turns	`max_turn_silence` too low	Increase `max_turn_silence` (e.g., `1500`–`3500`)
Latency on non-terminal utterances	`max_turn_silence` too high	Lower `max_turn_silence`
Mis-heard names, brands, or jargon	No vocabulary hints	Add `keyterms_prompt`, or supply `prompt`/`agent_context` for context
Poor accuracy in noisy audio	Background noise or room echo	Enable `voice_focus` (`near-field` or `far-field`)

Migrating from another STT provider

To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below. Each answer points to the settings that reproduce — and usually improve on — your current behavior.

How are you detecting end-of-turn today?

Today	Recommended on AssemblyAI
Your STT provider’s own end-of-turn model (e.g. Deepgram endpointing)	`turn_detection="stt"` with `min_turn_silence=100`, `max_turn_silence=1000`, `endpointing={"min_delay": 0}`. AssemblyAI’s punctuation-based end-of-turn replaces it.
Silence / VAD only	Prefer `turn_detection="stt"` for semantic endpointing, or `turn_detection="vad"` to stay silence-only. Align Silero `activation_threshold` and `vad_threshold` at `0.3`.
LiveKit’s turn-detector model	`turn_detection=MultilingualModel()`, keep the plugin defaults (`min_turn_silence=100`, `max_turn_silence=100`), set `endpointing={"min_delay": 0.5, "max_delay": 3.0}`, and keep Silero VAD (required).
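The first and last rows of this table can be captured as data so a migration script picks a consistent starting point. This is a sketch under assumptions: the preset keys and the `preset_for` helper are hypothetical labels, and the values are copied from the table above rather than read from the plugin.

```python
from typing import Any, Dict

# Values copied from the table above; the preset keys are hypothetical labels.
TURN_PRESETS: Dict[str, Dict[str, Any]] = {
    "provider_end_of_turn": {
        "turn_detection": "stt",
        "stt": {"min_turn_silence": 100, "max_turn_silence": 1000},
        "endpointing": {"min_delay": 0},
    },
    "livekit_turn_detector": {
        "turn_detection": "MultilingualModel()",
        "stt": {"min_turn_silence": 100, "max_turn_silence": 100},
        "endpointing": {"min_delay": 0.5, "max_delay": 3.0},
        "requires_silero_vad": True,
    },
}


def preset_for(current_setup: str) -> Dict[str, Any]:
    """Look up the recommended starting point for a current setup label."""
    try:
        return TURN_PRESETS[current_setup]
    except KeyError:
        known = sorted(TURN_PRESETS)
        raise ValueError(f"unknown setup {current_setup!r}; expected one of {known}") from None


print(preset_for("livekit_turn_detector")["endpointing"])
```

Treat these as starting values only; the turn detection and latency sections describe how to tune them for your traffic.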

Which model and settings are you migrating from?

What you pass today	AssemblyAI equivalent
Current model (Deepgram, ElevenLabs, etc.)	`model="universal-3-5-pro"` (recommended flagship)
Overall accuracy/latency tuning	`mode="min_latency"` / `"balanced"` / `"max_accuracy"` — a one-line starting point before fine-tuning
Endpointing / silence thresholds	`min_turn_silence` (speculative end-of-turn) and `max_turn_silence` (forced end) — see Turn detection
Custom vocabulary / keywords	`keyterms_prompt=[...]`; broader domain context → `prompt`
Formatting / punctuation toggles	Always on: transcripts are always formatted (`format_turns` does not apply)
Language(s)	18 languages with code-switching; pin a language via `prompt`; `language_detection` adds language codes

VAD, interruptions, and infrastructure

Today	Recommended
Silero VAD	Keep it; set `activation_threshold=0.3` to match `vad_threshold`
Adaptive vs. VAD-based barge-in	LiveKit Cloud → adaptive interruption handling; self-hosted → the self-hosted filters
Telephony / SIP routing	`sample_rate=8000` and `encoding="pcm_mulaw"` for 8 kHz telephony
Client-side noise cancellation	Drop it; use server-side Voice Focus instead
LiveKit Cloud vs. self-hosting	Cloud unlocks adaptive interruption; self-hosting uses the filters and version pinning

Parameter cheat sheet

If you’re coming from AssemblyAI’s standard streaming model:

Change	From	To
Model	`assemblyai.STT()`	`assemblyai.STT(model="universal-3-5-pro")`
Turn detection	`turn_detection="stt"` or `EnglishModel()`	`turn_detection="stt"` or `MultilingualModel()`
VAD	Optional	Set `vad=silero.VAD.load(activation_threshold=0.3)` so it matches `vad_threshold`
`min_turn_silence`	`400` (old default)	`100` (new default)
`max_turn_silence`	`1280` (old default)	`1000` (API default) or `100` (with 3rd-party turn detector)
`end_of_turn_confidence_threshold`	Configurable	Not applicable: Universal 3.5 Pro Realtime uses punctuation-based turn detection
`endpointing.min_delay` (formerly `min_endpointing_delay`)	Default `0.5`	Set to `0` inside `TurnHandlingOptions` when using `turn_detection="stt"`

Migrating a production deployment? Talk to our team.