Universal 3.5 Pro Realtime on Pipecat

Overview

This guide covers integrating AssemblyAI’s Universal 3.5 Pro Realtime speech-to-text model into a Pipecat voice agent. Everything here applies equally to Universal-3.5 Pro Streaming (universal-3-5-pro) — both belong to the same U3 Pro family and share every parameter in this guide, so you can swap the model string without changing anything else.

Universal 3.5 Pro Realtime is our flagship next-generation streaming model for voice agents — multilingual and promptable, with conversation context and voice focus.Available on Pipecat 1.4.0+ — set model="universal-3-5-pro".

AssemblyAI provides the speech-to-text and (optionally) the turn detection in your Pipecat pipeline: Once you have an agent running, tune it for what matters most to your use case:

Turn detection

Decide when the user is done speaking — the two Pipecat modes, defaults, and entity tuning.

Latency

Shorten the gap between the user finishing and the agent replying.

Accuracy

Prompting, key terms, conversation context, and noise handling.

Interruptions

Natural barge-in while the agent is speaking.

Pipecat AssemblyAI STT plugin

View Pipecat’s AssemblyAI STT plugin reference.

Quickstart

Get a working, talking agent in a few minutes, then optimize from there.

Install Pipecat

Install Pipecat with the AssemblyAI, LLM, and TTS extras you need:

pip install "pipecat-ai[assemblyai,openai,cartesia]" python-dotenv

What’s included:

assemblyai: AssemblyAI U3 Pro STT service
openai: OpenAI LLM service (used in the example)
cartesia: Cartesia TTS service (used in the example)

The example uses OpenAI and Cartesia, but you can use any LLM or TTS supported by Pipecat — just swap the extras (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).

Universal 3.5 Pro Realtime, automatic conversation context, and Voice Focus require pipecat-ai 1.4.0+. Older versions won’t recognize the universal-3-5-pro model.

Set your API keys

Set your API keys in a .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key

You can obtain an AssemblyAI API key by signing up for a free account and navigating to the API Keys tab of the dashboard.

Build a minimal agent

The example below uses Pipecat-controlled turn detection (the default). Pay attention to the comments for switching to AssemblyAI’s built-in turn detection, and note that the assistant aggregator at the end of the pipeline is what enables automatic conversation context.

import os

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import PipelineParams, PipelineWorker
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.workers.runner import WorkerRunner

load_dotenv()

transport_params = {
    "daily": lambda: DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    "webrtc": lambda: TransportParams(audio_in_enabled=True, audio_out_enabled=True),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    stt = AssemblyAISTTService(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        settings=AssemblyAISTTService.Settings(
            model="universal-3-5-pro",
            min_turn_silence=100,
            # max_turn_silence is auto-synced to min_turn_silence in Pipecat mode.
            # vad_threshold=0.3,            # Align with your local VAD's threshold
            # continuous_partials=True,     # Default — steady ~3s partials during long turns
            # interruption_delay=0,         # Optional: faster first partial (~300ms effective)
        ),
        vad_force_turn_endpoint=True,  # Pipecat mode (default).
        # Set False to use AssemblyAI's built-in turn detection (universal-3-5-pro / universal-3-5-pro only):
        # vad_force_turn_endpoint=False,
    )

    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"])
    tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"])

    context = LLMContext()
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
    )

    pipeline = Pipeline(
        [
            transport.input(),     # Transport user input
            stt,                   # STT
            user_aggregator,       # User responses
            llm,                   # LLM
            tts,                   # TTS
            transport.output(),    # Transport bot output
            assistant_aggregator,  # Assistant responses → automatic conversation context
        ]
    )

    worker = PipelineWorker(pipeline, params=PipelineParams(enable_metrics=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message(
            {"role": "system", "content": "You are a helpful voice assistant. Keep replies brief and speakable."}
        )
        await worker.queue_frames([LLMRunFrame()])

    runner = WorkerRunner(handle_sigint=runner_args.handle_sigint)
    await runner.add_workers(worker)
    await runner.run()


async def bot(runner_args: RunnerArguments):
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()

Two complete, runnable examples live in the Pipecat repo: voice-assemblyai.py (Pipecat turn detection) and voice-assemblyai-turn-detection.py (AssemblyAI’s built-in turn detection).

Run and test

Run the agent directly with local audio:

python your_agent.py

Speak into your microphone after hearing the greeting. For WebRTC or Daily testing, see Running your agent.

Parameters reference

Universal 3.5 Pro Realtime parameters

These are the key parameters to tune. Set them inside AssemblyAISTTService.Settings(...). They apply to the whole U3 Pro family (universal-3-5-pro and universal-3-5-pro).

model

str

default:"universal-3-5-pro"

The streaming model. "universal-3-5-pro" is the recommended flagship model; the plugin currently defaults to "universal-3-5-pro", so set model explicitly. Both belong to the U3 Pro family and share every parameter below.

mode

str

Accuracy/latency preset: "min_latency", "balanced", or "max_accuracy". Sets sensible defaults for mode-dependent fields; any value you set explicitly still takes precedence. The server defaults to "balanced". Construction-time only. U3 Pro family only. See Optimizing accuracy and latency.

keyterms_prompt

list[str]

List of terms to boost recognition for. Used on its own, your terms are appended to the default prompt automatically. Can’t be set in the same request as prompt — see Key terms to combine boosting with a custom prompt.

prompt

str

Contextual prompt — a natural-language description of what the audio is about (domain, scenario, or full details). Can’t be set in the same request as keyterms_prompt; fold the terms into the prompt text instead (see Key terms). Prompting is currently a beta feature: see Prompting for more information.

agent_context

str

Context carryover seed — your agent’s most recent spoken reply, up to ~1500 characters, used to transcribe the next user turn more accurately. Set it at construction time to seed an opening greeting; later turns are fed automatically. U3 Pro family only. See Conversation context.

previous_context_n_turns

int

default:"5"

How many prior conversation entries are carried forward automatically. Range 0–100; 0 disables carryover entirely (including the automatic agent_context feed). Construction-time only; leave unset for the server default (5 on universal-3-5-pro; 3 on older u3-rt-pro). U3 Pro family only.

min_turn_silence

int

default:"100"

Milliseconds of silence before a speculative end-of-turn check. When the check fires, the model looks for terminal punctuation (. ? !) to decide whether the turn has ended. (Formerly min_end_of_turn_silence_when_confident, deprecated but still supported with a warning.)

max_turn_silence

int

default:"1000"

Maximum silence before the turn is forced to end, regardless of punctuation. Auto-synced to min_turn_silence in Pipecat mode; respected as configured in AssemblyAI’s built-in turn detection mode.

vad_threshold

float

default:"0.3"

AssemblyAI’s internal VAD threshold (0.0–1.0) for classifying audio frames as silence. Align with your local VAD’s activation threshold to avoid a “dead zone” where AssemblyAI transcribes speech your VAD hasn’t detected yet.

voice_focus

str

Server-side noise suppression that isolates the primary speaker. "near-field" for close-talking mics, "far-field" for distant capture. Construction-time only. U3 Pro family only. See Voice focus.

voice_focus_threshold

float

How aggressively voice_focus suppresses background audio. 0.0–1.0; higher is more aggressive. Only takes effect when voice_focus is set. Construction-time only. U3 Pro family only.

continuous_partials

bool

default:"True"

Whether to emit additional partial transcripts during long turns at a steady ~3 second cadence. When enabled (default on both the API and this plugin), additional partials covering the full turn transcript are emitted approximately every 3 seconds while speech continues. When disabled, only one early partial is emitted near turn start. The first partial (at 750ms) is unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need frequent updates during long, uninterrupted turns. See Continuous partials for details.

interruption_delay

int

default:"500"

How soon the first partial transcript is emitted during a turn, in milliseconds. Range: 0–1000. Lower values produce faster time to first token (TTFT) for barge-in and speculative inference; higher values produce more confident first partials. The server adds a minimum of 300ms on top of the configured value (interruption_delay=0 → ~300ms effective, interruption_delay=500 → ~800ms effective). See Tuning early partial timing for details.

language_detection

bool

Universal 3.5 Pro Realtime code-switches natively between supported languages. This parameter controls whether language_code and language_confidence are included in turn messages.

speaker_labels

bool

default:"False"

Enable speaker diarization. See Speaker diarization.

General parameters

These apply across models and Pipecat setups. api_key, vad_force_turn_endpoint, should_interrupt, and speaker_format are passed directly to AssemblyAISTTService(...), not inside Settings.

api_key

str

required

Your AssemblyAI API key.

vad_force_turn_endpoint

bool

default:"True"

True for Pipecat mode (VAD + Smart Turn controls turns); False for AssemblyAI’s built-in turn detection (universal-3-5-pro / universal-3-5-pro only). See Turn detection.

should_interrupt

bool

default:"True"

Whether the user starting to speak interrupts the bot. Only applies in AssemblyAI’s built-in turn detection mode (vad_force_turn_endpoint=False).

speaker_format

str

Template string for formatting speaker labels (e.g., "[{speaker}] {text}"). Used with speaker_labels.

sample_rate

int

default:"16000"

The sample rate of the audio stream.

encoding

str

default:"pcm_s16le"

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

Legacy parameters

These apply to the universal-streaming-english and universal-streaming-multilingual models, but do not affect Universal 3.5 Pro Realtime or universal-3-5-pro:

end_of_turn_confidence_threshold

float

Confidence threshold for end-of-turn detection. The U3 Pro family uses punctuation-based turn detection instead, so this parameter has no effect.

format_turns

bool

default:"True"

Whether to return formatted final transcripts. The U3 Pro family always returns formatted transcripts, so this parameter no longer applies.

Turn detection

In Pipecat, you choose which component decides when the user is done speaking with the vad_force_turn_endpoint flag on AssemblyAISTTService. The U3 Pro family uses a punctuation-based end-of-turn system: after a period of silence, the model checks for terminal punctuation (. ? !) rather than a confidence score. For more on how this works, see Configuring turn detection.

The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.

Pipecat mode (default, recommended)

When to use: Most voice agent applications requiring responsive interruptions.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
    ),
    vad_force_turn_endpoint=True,  # Default (Pipecat mode)
)

How it works:

VAD + the Smart Turn analyzer control when the user is done speaking.
A ForceEndpoint message is sent to AssemblyAI on VAD silence detection.
max_turn_silence is automatically synchronized with min_turn_silence.
Best for low-latency, responsive voice agents.

AssemblyAI’s built-in turn detection

When to use: When you want AssemblyAI’s punctuation-based turn detection to control turn endings, configured through the settings below.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Now respected independently
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection
)

How it works:

User speaks → audio streams to AssemblyAI.
User pauses for min_turn_silence (e.g., 100ms) → the model checks for terminal punctuation.
If terminal punctuation (. ? !) is found → the turn ends immediately.
If not → a partial is emitted and the turn continues waiting.
If silence reaches max_turn_silence (e.g., 1000ms) → the turn is forced to end regardless.

In this mode all timing parameters are respected as configured, the service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame, and SpeechStarted events drive fast barge-in. Only available with universal-3-5-pro / universal-3-5-pro (other models require Pipecat mode).

Entity splitting tradeoff

Lower min_turn_silence and max_turn_silence values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.

`min_turn_silence` too low

The speculative check fires too early, splitting entities on punctuation:

# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."                    → FINAL (100ms pause, check fires, period found → turn ends)
"Smith."                        → FINAL
"At gmail.com."                 → FINAL

# With (min_turn_silence=400, max_turn_silence=1000)
"It's john.smith@gmail.com."    → FINAL (single turn, properly formatted)

`max_turn_silence` too low

The forced turn-end cuts off the user mid-thought:

# With (min_turn_silence=100, max_turn_silence=1000)
"I wanted to check on my order from..."  → FINAL (1000ms silence, forced end)
"last Tuesday, order number 4829."       → FINAL (new turn)

# With (min_turn_silence=100, max_turn_silence=2000)
"I wanted to check on my order from last Tuesday, order number 4829."  → FINAL (single turn)

Universal 3.5 Pro Realtime’s formatting is significantly better when it has full context in a single turn — email addresses, phone numbers, credit card numbers, and physical addresses all benefit. If your use case involves alphanumeric dictation, raise max_turn_silence during those portions of the conversation (e.g., to 2000–4000 ms) using dynamic configuration, then lower it again afterward. In Pipecat mode, raise min_turn_silence (which max_turn_silence follows) for the same effect.

Latency

A voice agent feels responsive when the gap between the user finishing and the agent replying is short. Start with the mode preset — the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        mode="balanced",  # "min_latency" (fastest) · "balanced" · "max_accuracy" (best quality)
    ),
)

mode is set at construction time (it can’t be changed mid-session) and influences the defaults of the levers below. Any value you set explicitly still wins. Leave it unset to use the server’s default preset. See Optimizing accuracy and latency. From there, fine-tune the individual levers:

End-of-turn timing. min_turn_silence (speculative check) and max_turn_silence (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities — see Turn detection.
Time to first partial. interruption_delay controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of 300ms on top of the configured value.
Sample rate. Use 16 kHz (sample_rate=16000). Higher rates don’t improve accuracy and only add bandwidth.
Continuous partials. continuous_partials (on by default) emits a partial every ~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
Skip client-side preprocessing. Don’t run your own noise cancellation before audio reaches the model — the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side Voice Focus instead.

Latency breakdown

Stage	Typical	Controlled by
Network round trip	~50 ms	—
Speech-to-text	~200–300 ms	model
First partial (TTFT)	configured `interruption_delay` + ~300 ms server min	`interruption_delay`
End of turn (terminal punctuation found)	`min_turn_silence` (default 100 ms)	`min_turn_silence`
End of turn (no punctuation, forced)	up to `max_turn_silence`	`max_turn_silence`

Accuracy

Universal 3.5 Pro Realtime is accurate out of the box. When you need more — domain vocabulary, proper nouns, noisy audio — reach for these levers. For entity-heavy dictation, also tune turn detection (see Entity splitting tradeoff), and note that the high-level mode preset shifts the overall accuracy/latency balance (use max_accuracy to favor quality).

Prompting

Beta featurePrompting is considered a beta feature for Universal 3.5 Pro Realtime.While it can be a powerful tool for improving accuracy in certain use cases, we recommend starting without a prompt to first establish baseline performance. Once the baseline has been tested, you can add context to further optimize for your use case (e.g., language mix to expect, use case or domain).

Universal 3.5 Pro Realtime supports a prompt parameter for contextual prompting — a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        prompt="Customer support call about an internet service outage.",
    ),
)

Key terms

Use keyterms_prompt to boost recognition of specific names, brands, or domain terms. On its own, your terms are appended to the default prompt automatically — so you get boosting and prompting together:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)

You can’t pass prompt and keyterms_prompt in the same request — doing so raises a validation error. You don’t have to give up term boosting to use a contextual prompt, though. Either:

Pass keyterms_prompt on its own — your terms are appended to the default prompt automatically, or
Fold the terms into a custom prompt, e.g. end it with "Make sure to boost the words Xiomara, Saoirse, Pipecat in the audio."

Conversation context

Give the model both sides of the dialog so it transcribes the next user turn more accurately. Universal 3.5 Pro Realtime keeps a short, per-session memory of the conversation from two sources:

The agent half — what your agent just said.
The user half — prior STT-finalized user turns.

With the agent’s question in context, the model can anticipate the answer, sharpen entity recognition, and disambiguate similar-sounding words. For example, after your agent asks "What's your email address?", the model can produce "user@assemblyai.com" instead of "user at assemblyai dot com". This has the biggest impact on short replies ("yes", "7pm", single names) and spelled-out entities. See Conversation context for the full reference.

In Pipecat, conversation context is automatic — no event wiring required. As long as your pipeline includes the standard LLM context aggregator (the assistant_aggregator from LLMContextAggregatorPair), Pipecat broadcasts an LLMContextAssistantTurnFrame when each bot turn completes, and AssemblyAISTTService feeds that reply to the model as agent_context automatically. Just use a U3 Pro family model on pipecat-ai 1.4.0+.

Parameter	Type	Description
`agent_context`	str	Your agent’s most recent spoken reply, up to ~1500 characters. Set it at construction time to seed an opening greeting; subsequent replies are fed automatically.
`previous_context_n_turns`	int	How many prior conversation entries are carried forward automatically. Range `0`–`100`; `0` disables carryover entirely. Construction-time only; server default is `5` on `universal-3-5-pro` (`3` on older `u3-rt-pro`).

Seeding the opening greeting

The automatic feed kicks in once your agent completes its first turn. To give the model context for the user’s very first reply (the answer to your greeting), set agent_context at construction time:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        # Seed the opening line; later turns are fed automatically by the aggregator.
        agent_context="Hi! Thanks for calling Acme. What's the email on your account?",
        # previous_context_n_turns=5,  # Default on universal-3-5-pro. Set 0 to disable carryover entirely.
    ),
)

Manual control with `update_agent_context()`

If your pipeline doesn’t use the standard LLM context aggregator, or you want explicit control over what the model sees, push the agent’s reply yourself. This is a live update — no reconnect required:

# Whenever your agent finishes speaking:
await stt.update_agent_context("Your account is past due. Would you like to pay now?")

agent_context, previous_context_n_turns, and update_agent_context() are supported only on the U3 Pro family (universal-3-5-pro, universal-3-5-pro). Values are clipped to ~1500 characters and re-seeded automatically on reconnect. Setting previous_context_n_turns=0 disables the automatic feed as well.

Voice focus

Voice Focus isolates the primary speaker and suppresses background noise — chatter, keyboard clicks, fan hum, room echo — server-side, before audio reaches the model. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.

Parameter	Type	Description
`voice_focus`	str	`"near-field"` for headsets, handsets, and other close-talking mics; `"far-field"` for conference rooms, laptop mics, and other distant capture.
`voice_focus_threshold`	float	Optional. `0.0`–`1.0`; higher values suppress background audio more aggressively.

Both are construction-time parameters on the U3 Pro family. See Voice Focus for details.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        voice_focus="far-field",    # "near-field" for close-talking mics
        voice_focus_threshold=0.5,  # Optional: 0.0–1.0, higher = more aggressive
    ),
)

Interruption handling

Barge-in — the user interrupting while the agent is speaking — is handled by Pipecat, and the signals that drive it depend on your turn detection mode.

Pipecat mode (vad_force_turn_endpoint=True). Pipecat’s local VAD and the Smart Turn analyzer detect the user starting to speak and interrupt the bot’s TTS. AssemblyAI also emits SpeechStarted events as a backstop.
AssemblyAI’s built-in turn detection (vad_force_turn_endpoint=False). The service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame and uses AssemblyAI’s SpeechStarted events for fast barge-in. Set should_interrupt=False (constructor argument) to disable barge-in entirely in this mode.

{"type": "SpeechStarted", "timestamp": 14400, "confidence": 0.79}

On detection, Pipecat stops TTS playback and switches to listening. To reduce false interruptions from short backchannels ("mhm", "yeah", "okay"), keep your VAD threshold aligned with vad_threshold and lean on Pipecat’s Smart Turn analyzer, which evaluates whether speech is a genuine turn rather than a filler.

Dynamic configuration

Update settings mid-conversation by queueing an STTUpdateSettingsFrame with a settings delta — adapt to the conversation stage as it unfolds. See stt-assemblyai.py for a complete working example.

from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Update keyterms during the conversation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            keyterms_prompt=["NewName", "NewCompany"],
        )
    )
)

# Widen the silence window during entity dictation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            min_turn_silence=200,
            max_turn_silence=3000,  # Respected in AssemblyAI's built-in turn detection mode
        )
    )
)

agent_context is the only setting applied live. Changing any other setting via STTUpdateSettingsFrame reconnects the AssemblyAI session to apply it (a brief interruption). To push conversation context without a reconnect, use the dedicated stt.update_agent_context(...) method — see Conversation context.

Conversation stage	Adjustment
Caller identification (names, account IDs)	Boost terms with `keyterms_prompt`
Entity dictation (email, phone, address)	Raise `max_turn_silence` to ~`2000`–`4000` ms, then lower it again afterward
After each agent reply	Automatic — or push `agent_context` via `update_agent_context()`
Faster barge-in	Lower `interruption_delay`

For more information, see Updating configuration mid-stream.

Speaker diarization

Identify different speakers in multi-party conversations.

Basic diarization

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
)

Speaker labels (e.g., "A", "B", "C") are included in final transcripts.

With custom formatting

Format transcripts with speaker labels for LLM context:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
    speaker_format="<{speaker}>{text}</{speaker}>",
)

Format options:

Style	Format string
XML	`<{speaker}>{text}</{speaker}>`
Markdown	`{speaker}: {text}`
Bracket	`[{speaker}] {text}`

Running your agent

Development mode (local audio)

python your_agent.py

Speak into your microphone after hearing the greeting.

Production with Daily

For production deployments, use the Daily transport for WebRTC-based real-time audio/video. Your agent joins a Daily room as a participant and handles audio I/O through Daily’s infrastructure.

Telephony with Telnyx

When bridging phone calls through Pipecat (e.g., via Telnyx), the audio is 8 kHz, not 16 kHz. Match the transport sample rates:

transport = TelnyxTransport(
    # ...
    audio_in_sample_rate=8000,
    audio_out_sample_rate=8000,
)

Troubleshooting

Issue	Cause	Solution
`universal-3-5-pro` not recognized	`pipecat-ai` older than 1.4.0	Upgrade: `pip install -U "pipecat-ai[assemblyai]"`
Turn over-segmentation	`min_turn_silence` too low	Increase from `100` to `200`–`500`
Entities split across turns	`max_turn_silence` too low (AssemblyAI mode)	Increase `max_turn_silence` (e.g., `1500`–`3500`); in Pipecat mode, raise `min_turn_silence`
Latency on non-terminal utterances	`max_turn_silence` too high	Lower `max_turn_silence`
Conversation context has no effect	Non-U3-Pro model, or `previous_context_n_turns=0`	Use a U3 Pro family model and leave `previous_context_n_turns` unset (or > 0)
Mid-session setting change drops audio	Reconnect on a non-`agent_context` setting change	Expected — only `agent_context` updates live; use `update_agent_context()` for context
Mis-heard names, brands, or jargon	No vocabulary hints	Add `keyterms_prompt`, or supply `prompt`/`agent_context` for context
Poor accuracy in noisy audio	Background noise or room echo	Enable `voice_focus` (`near-field` or `far-field`)

Migrating from another STT provider

To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below.

How are you detecting end-of-turn today?

Today	Recommended on AssemblyAI
Your STT provider’s own end-of-turn model	AssemblyAI’s built-in turn detection: `vad_force_turn_endpoint=False` with `min_turn_silence=100`, `max_turn_silence=1000`.
Silence / VAD only, with your own turn logic	Pipecat mode (`vad_force_turn_endpoint=True`, default). VAD + Smart Turn decide turns; AssemblyAI returns finals ASAP.
You want the framework to own turn-taking	Pipecat mode (default) — Pipecat’s Smart Turn analyzer makes the turn decision.

Which model and settings are you migrating from?

What you pass today	AssemblyAI equivalent
Current model (Deepgram, ElevenLabs, etc.)	`model="universal-3-5-pro"` (recommended flagship) or `"universal-3-5-pro"`
Overall accuracy/latency tuning	`mode="min_latency"` / `"balanced"` / `"max_accuracy"` — a one-line starting point before fine-tuning
Endpointing / silence thresholds	`min_turn_silence` (speculative end-of-turn) and `max_turn_silence` (forced end)
Custom vocabulary / keywords	`keyterms_prompt=[...]`; broader domain context → `prompt`
Provider-side conversation context	Automatic — include the LLM context aggregator; seed greetings via `agent_context`
Formatting / punctuation toggles	On by default — formatted transcripts always (`format_turns` does not apply)
Telephony / SIP routing	`sample_rate=8000` and `encoding="pcm_mulaw"` for 8 kHz telephony
Client-side noise cancellation	Drop it; use server-side Voice Focus instead

Migrating a production deployment? Talk to our team.

Speech model comparison

Interested in using a different model?

Feature	U3 Pro family (`universal-3-5-pro`, `universal-3-5-pro`)	universal-streaming-english	universal-streaming-multilingual
Turn Detection Modes
Pipecat mode (VAD + Smart Turn)	✅	✅	✅
AssemblyAI turn detection mode	✅	❌	❌
Turn Detection Parameters
`min_turn_silence`	✅	✅	✅
`max_turn_silence`	✅	✅	✅
`end_of_turn_confidence_threshold`	❌	✅ (1.0)	✅ (1.0)
`continuous_partials`	✅	❌	❌
`interruption_delay`	✅	❌	❌
Advanced Features
Keyterms boosting	✅	✅	✅
Custom prompting (beta)	✅	❌	❌
Conversation context (carryover)	✅	❌	❌
Voice Focus	✅	❌	❌
Speaker diarization	✅	✅	✅
Dynamic parameter updates	✅	✅	✅
Language Support
Multilingual code switching	✅	❌	✅
Language detection	✅	❌	✅

Legend:

✅ Fully supported and recommended
❌ Not supported / Not used

The U3 Pro family is recommended for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.

The end_of_turn_confidence_threshold parameter is not used with the U3 Pro family (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.

​Overview

Turn detection

Latency

Accuracy

Interruptions

Pipecat AssemblyAI STT plugin

​Quickstart

​Parameters reference

​Universal 3.5 Pro Realtime parameters

​General parameters

​Legacy parameters

​Turn detection

​Pipecat mode (default, recommended)

​AssemblyAI’s built-in turn detection

​Entity splitting tradeoff

​min_turn_silence too low

​max_turn_silence too low

​Latency

​Latency breakdown

​Accuracy

​Prompting

​Key terms

​Conversation context

​Seeding the opening greeting

​Manual control with update_agent_context()

​Voice focus

​Interruption handling

​Dynamic configuration

​Speaker diarization

​Basic diarization

​With custom formatting

​Running your agent

​Development mode (local audio)

​Production with Daily

​Telephony with Telnyx

​Troubleshooting

​Migrating from another STT provider

​How are you detecting end-of-turn today?

​Which model and settings are you migrating from?

​Speech model comparison

Overview

Quickstart

Parameters reference

Universal 3.5 Pro Realtime parameters

General parameters

Legacy parameters

Turn detection

Pipecat mode (default, recommended)

AssemblyAI’s built-in turn detection

Entity splitting tradeoff

`min_turn_silence` too low

`max_turn_silence` too low

Latency

Latency breakdown

Accuracy

Prompting

Key terms

Conversation context

Seeding the opening greeting

Manual control with `update_agent_context()`

Voice focus

Interruption handling

Dynamic configuration

Speaker diarization

Basic diarization

With custom formatting

Running your agent

Development mode (local audio)

Production with Daily

Telephony with Telnyx

Troubleshooting

Migrating from another STT provider

How are you detecting end-of-turn today?

Which model and settings are you migrating from?

Speech model comparison