# Universal 3.5 Pro Realtime on Pipecat

export const ModelBadges = ({models}) => {
  return <div className="flex flex-wrap gap-2 -mt-3 mb-3 not-prose">
      {models.map(model => <span key={model} className="inline-flex items-center rounded-full bg-green-500/15 px-2.5 py-0.5 text-xs font-mono text-green-700 dark:text-green-400 ring-1 ring-inset ring-green-500/30">
          {model}
        </span>)}
    </div>;
};

## Overview

<ModelBadges models={["universal-3-5-pro"]} />

This guide covers integrating AssemblyAI's **Universal 3.5 Pro Realtime** speech-to-text model into a [Pipecat](https://docs.pipecat.ai/) voice agent. Everything here applies equally to **Universal-3.5 Pro Streaming** (`universal-3-5-pro`) — both belong to the same U3 Pro family and share every parameter in this guide, so you can swap the `model` string without changing anything else.

<Note>
  **Universal 3.5 Pro Realtime is our flagship next-generation streaming model for voice agents** — multilingual and promptable, with [conversation context](#conversation-context) and [voice focus](#voice-focus).

  Available on **Pipecat 1.4.0+** — set `model="universal-3-5-pro"`.
</Note>

AssemblyAI provides the speech-to-text and (optionally) the turn detection in your Pipecat pipeline:

```mermaid theme={null}
flowchart LR
  U["User audio"] --> STT["AssemblyAI STT<br/>Universal 3.5 Pro Realtime"]
  STT --> TD["Turn detection<br/>Pipecat (VAD + Smart Turn)<br/>or AssemblyAI"]
  TD --> LLM["LLM"]
  LLM --> TTS["TTS"]
  TTS --> U
```

Once you have an agent running, tune it for what matters most to your use case:

<CardGroup cols={2}>
  <Card title="Turn detection" icon="comments" href="#turn-detection">
    Decide when the user is done speaking — the two Pipecat modes, defaults, and entity tuning.
  </Card>

  <Card title="Latency" icon="gauge-high" href="#latency">
    Shorten the gap between the user finishing and the agent replying.
  </Card>

  <Card title="Accuracy" icon="bullseye" href="#accuracy">
    Prompting, key terms, conversation context, and noise handling.
  </Card>

  <Card title="Interruptions" icon="hand" href="#interruption-handling">
    Natural barge-in while the agent is speaking.
  </Card>
</CardGroup>

<Card
  title="Pipecat AssemblyAI STT plugin"
  icon={
<svg
  className="card-logo-icon card-logo-icon-pipecat"
  viewBox="0 0 332 332"
  fill="none"
  xmlns="http://www.w3.org/2000/svg"
  aria-hidden="true"
>
  <path d="M45.7718 70.7701C50.4477 69.0096 55.7252 70.3307 59.0204 74.0864L101.936 123.001H230.064L272.98 74.0864C276.275 70.3307 281.552 69.0096 286.228 70.7701C290.904 72.5306 294 77.0042 294 82.0005V190H332V214H270V113.873L244.52 142.915C242.242 145.512 238.955 147.001 235.5 147.001H96.5C93.0452 147.001 89.7581 145.512 87.4796 142.915L62 113.873V214H0V190H38V82.0005C38 77.0042 41.0958 72.5306 45.7718 70.7701Z" fill="currentColor" />
  <path d="M270 238.001H332V262.001H270V238.001Z" fill="currentColor" />
  <path d="M0 238.001H62V262.001H0V238.001Z" fill="currentColor" />
  <path d="M128 198.001C128 206.837 120.837 214.001 112 214.001C103.163 214.001 96 206.837 96 198.001C96 189.164 103.163 182.001 112 182.001C120.837 182.001 128 189.164 128 198.001Z" fill="currentColor" />
  <path d="M236 198.001C236 206.837 228.837 214.001 220 214.001C211.163 214.001 204 206.837 204 198.001C204 189.164 211.163 182.001 220 182.001C228.837 182.001 236 189.164 236 198.001Z" fill="currentColor" />
</svg>
}
  href="https://docs.pipecat.ai/server/services/stt/assemblyai"
>
  View Pipecat's AssemblyAI STT plugin reference.
</Card>

## Quickstart

Get a working, talking agent in a few minutes, then optimize from there.

<Steps>
  <Step title="Install Pipecat">
    Install Pipecat with the AssemblyAI, LLM, and TTS extras you need:

    ```bash theme={null}
    pip install "pipecat-ai[assemblyai,openai,cartesia]" python-dotenv
    ```

    **What's included:**

    * `assemblyai`: AssemblyAI U3 Pro STT service
    * `openai`: OpenAI LLM service (used in the example)
    * `cartesia`: Cartesia TTS service (used in the example)

    <Tip>
      The example uses OpenAI and Cartesia, but you can use any LLM or TTS supported
      by Pipecat — just swap the extras (e.g.,
      `pipecat-ai[assemblyai,anthropic,elevenlabs]`).
    </Tip>

    <Warning>
      Universal 3.5 Pro Realtime, automatic [conversation context](#conversation-context),
      and [Voice Focus](#voice-focus) require **`pipecat-ai` 1.4.0+**. Older versions
      won't recognize the `universal-3-5-pro` model.
    </Warning>
  </Step>

  <Step title="Set your API keys">
    Set your API keys in a `.env` file:

    ```env theme={null}
    ASSEMBLYAI_API_KEY=your_assemblyai_key
    OPENAI_API_KEY=your_openai_key
    CARTESIA_API_KEY=your_cartesia_key
    ```

    <Tip>
      You can obtain an AssemblyAI API key by signing up
      [here](https://www.assemblyai.com/dashboard/signup) and navigating to the [API
      Keys tab](https://www.assemblyai.com/dashboard/home) of the dashboard.
    </Tip>
  </Step>

  <Step title="Build a minimal agent">
    The example below uses Pipecat-controlled turn detection (the default). The comments on `vad_force_turn_endpoint` show how to switch to AssemblyAI's built-in turn detection. The **assistant aggregator** at the end of the pipeline is what enables automatic [conversation context](#conversation-context).

    ```python expandable theme={null}
    import os

    from dotenv import load_dotenv

    from pipecat.audio.vad.silero import SileroVADAnalyzer
    from pipecat.frames.frames import LLMRunFrame
    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.worker import PipelineParams, PipelineWorker
    from pipecat.processors.aggregators.llm_context import LLMContext
    from pipecat.processors.aggregators.llm_response_universal import (
        LLMContextAggregatorPair,
        LLMUserAggregatorParams,
    )
    from pipecat.runner.types import RunnerArguments
    from pipecat.runner.utils import create_transport
    from pipecat.services.assemblyai.stt import AssemblyAISTTService
    from pipecat.services.cartesia.tts import CartesiaTTSService
    from pipecat.services.openai.llm import OpenAILLMService
    from pipecat.transports.base_transport import BaseTransport, TransportParams
    from pipecat.transports.daily.transport import DailyParams
    from pipecat.workers.runner import WorkerRunner

    load_dotenv()

    transport_params = {
        "daily": lambda: DailyParams(audio_in_enabled=True, audio_out_enabled=True),
        "webrtc": lambda: TransportParams(audio_in_enabled=True, audio_out_enabled=True),
    }


    async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
        stt = AssemblyAISTTService(
            api_key=os.environ["ASSEMBLYAI_API_KEY"],
            settings=AssemblyAISTTService.Settings(
                model="universal-3-5-pro",
                min_turn_silence=100,
                # max_turn_silence is auto-synced to min_turn_silence in Pipecat mode.
                # vad_threshold=0.3,            # Align with your local VAD's threshold
                # continuous_partials=True,     # Default — steady ~3s partials during long turns
                # interruption_delay=0,         # Optional: faster first partial (~300ms effective)
            ),
            vad_force_turn_endpoint=True,  # Pipecat mode (default).
            # Set False to use AssemblyAI's built-in turn detection (U3 Pro family only):
            # vad_force_turn_endpoint=False,
        )

        llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"])
        tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"])

        context = LLMContext()
        user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
            context,
            user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
        )

        pipeline = Pipeline(
            [
                transport.input(),     # Transport user input
                stt,                   # STT
                user_aggregator,       # User responses
                llm,                   # LLM
                tts,                   # TTS
                transport.output(),    # Transport bot output
                assistant_aggregator,  # Assistant responses → automatic conversation context
            ]
        )

        worker = PipelineWorker(pipeline, params=PipelineParams(enable_metrics=True))

        @transport.event_handler("on_client_connected")
        async def on_client_connected(transport, client):
            context.add_message(
                {"role": "system", "content": "You are a helpful voice assistant. Keep replies brief and speakable."}
            )
            await worker.queue_frames([LLMRunFrame()])

        runner = WorkerRunner(handle_sigint=runner_args.handle_sigint)
        await runner.add_workers(worker)
        await runner.run()


    async def bot(runner_args: RunnerArguments):
        transport = await create_transport(runner_args, transport_params)
        await run_bot(transport, runner_args)


    if __name__ == "__main__":
        from pipecat.runner.run import main

        main()
    ```

    <Tip>
      Two complete, runnable examples live in the Pipecat repo:
      [voice-assemblyai.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/voice/voice-assemblyai.py)
      (Pipecat turn detection) and
      [voice-assemblyai-turn-detection.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/voice/voice-assemblyai-turn-detection.py)
      (AssemblyAI's built-in turn detection).
    </Tip>
  </Step>

  <Step title="Run and test">
    Run the agent directly with local audio:

    ```bash theme={null}
    python your_agent.py
    ```

    Speak into your microphone after hearing the greeting. For WebRTC or Daily testing, see [Running your agent](#running-your-agent).
  </Step>
</Steps>
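
The quickstart reads its keys with `os.environ[...]`, which raises a bare `KeyError` when a variable is missing. As a minimal sketch (the helper names `missing_keys` and `require_keys` are hypothetical, not part of Pipecat), a pre-flight check can report every missing key at once:

```python
# Key names taken from the .env example in the quickstart.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def require_keys(env, required=REQUIRED_KEYS):
    """Raise a single, readable error listing every missing variable."""
    missing = missing_keys(env, required)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Call `require_keys(os.environ)` right after `load_dotenv()` so a misconfigured `.env` file fails before any service is constructed.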

## Parameters reference

### **Universal 3.5 Pro Realtime** parameters

These are the key parameters to tune. Set them inside `AssemblyAISTTService.Settings(...)`. They apply to the whole U3 Pro family, including `universal-3-5-pro`.

<ParamField path="model" type="str" default="universal-3-5-pro">
  The streaming model. `"universal-3-5-pro"` is the recommended flagship model.
  Set `model` explicitly rather than relying on the plugin default. All U3 Pro
  family models share every parameter below.
</ParamField>

<ParamField path="mode" type="str">
  Accuracy/latency preset: `"min_latency"`, `"balanced"`, or `"max_accuracy"`.
  Sets sensible defaults for mode-dependent fields; any value you set explicitly
  still takes precedence. The server defaults to `"balanced"`. Construction-time
  only. U3 Pro family only. See [Optimizing accuracy and
  latency](/streaming/getting-started/optimizing-accuracy-and-latency).
</ParamField>

<ParamField path="keyterms_prompt" type="list[str]">
  List of terms to boost recognition for. Used on its own, your terms are
  appended to the default prompt automatically. Can't be set in the same request
  as `prompt` — see [Key terms](#key-terms) to combine boosting with a custom
  prompt.
</ParamField>

<ParamField path="prompt" type="str">
  Contextual prompt — a natural-language description of what the audio is about
  (domain, scenario, or full details). Can't be set in the same request as
  `keyterms_prompt`; fold the terms into the prompt text instead (see
  [Key terms](#key-terms)). **Prompting is currently a beta feature**: see
  [Prompting](/streaming/prompting-and-keyterms) for more information.
</ParamField>

<ParamField path="agent_context" type="str">
  Context carryover seed — your agent's most recent spoken reply, up to \~1500
  characters, used to transcribe the next user turn more accurately. Set it at
  construction time to seed an opening greeting; later turns are fed
  automatically. U3 Pro family only. See [Conversation
  context](#conversation-context).
</ParamField>

<ParamField path="previous_context_n_turns" type="int" default="3">
  How many prior conversation entries are carried forward automatically. Range
  `0`–`100`; `0` disables carryover entirely (including the automatic
  `agent_context` feed). Construction-time only; leave unset for the server
  default (`3`). U3 Pro family only.
</ParamField>

<ParamField path="min_turn_silence" type="int" default="100">
  Milliseconds of silence before a speculative end-of-turn check. When the check
  fires, the model looks for terminal punctuation (`.` `?` `!`) to decide whether
  the turn has ended. (Formerly `min_end_of_turn_silence_when_confident`,
  deprecated but still supported with a warning.)
</ParamField>

<ParamField path="max_turn_silence" type="int" default="1000">
  Maximum silence before the turn is forced to end, regardless of punctuation.
  **Auto-synced to `min_turn_silence` in Pipecat mode**; respected as configured
  in AssemblyAI's built-in turn detection mode.
</ParamField>

<ParamField path="vad_threshold" type="float" default="0.3">
  AssemblyAI's internal VAD threshold (`0.0`–`1.0`) for classifying audio frames
  as silence. Align with your local VAD's activation threshold to avoid a "dead
  zone" where AssemblyAI transcribes speech your VAD hasn't detected yet.
</ParamField>

<ParamField path="voice_focus" type="str">
  Server-side noise suppression that isolates the primary speaker.
  `"near-field"` for close-talking mics, `"far-field"` for distant capture.
  Construction-time only. U3 Pro family only. See [Voice focus](#voice-focus).
</ParamField>

<ParamField path="voice_focus_threshold" type="float">
  How aggressively `voice_focus` suppresses background audio. `0.0`–`1.0`; higher
  is more aggressive. Only takes effect when `voice_focus` is set.
  Construction-time only. U3 Pro family only.
</ParamField>

<ParamField path="continuous_partials" type="bool" default="True">
  Whether to emit additional partial transcripts during long turns at a steady
  \~3 second cadence. When enabled (default on both the API and this plugin),
  additional partials covering the full turn transcript are emitted
  approximately every 3 seconds while speech continues. When disabled, only one
  early partial is emitted near turn start. The first partial (at 750ms) is
  unaffected. Useful when downstream consumers (LLMs, UI, eager inference) need
  frequent updates during long, uninterrupted turns. See
  [Continuous partials](/streaming/getting-started/transcribe-streaming-audio)
  for details.
</ParamField>

<ParamField path="interruption_delay" type="int" default="500">
  How soon the first partial transcript is emitted during a turn, in
  milliseconds. Range: `0`–`1000`. Lower values produce faster time to first
  token (TTFT) for barge-in and speculative inference; higher values produce
  more confident first partials. The server adds a minimum of 300ms on top of
  the configured value (`interruption_delay=0` → \~300ms effective,
  `interruption_delay=500` → \~800ms effective). See
  [Tuning early partial timing](/streaming/getting-started/transcribe-streaming-audio)
  for details.
</ParamField>

<ParamField path="language_detection" type="bool">
  **Universal 3.5 Pro Realtime** code-switches natively between supported
  languages. This parameter controls whether `language_code` and
  `language_confidence` are included in turn messages.
</ParamField>

<ParamField path="speaker_labels" type="bool" default="False">
  Enable speaker diarization. See [Speaker diarization](#speaker-diarization).
</ParamField>

### General parameters

These apply across models and Pipecat setups. `api_key`, `vad_force_turn_endpoint`, `should_interrupt`, and `speaker_format` are passed directly to `AssemblyAISTTService(...)`, not inside `Settings`.

<ParamField path="api_key" type="str" required>
  Your AssemblyAI API key.
</ParamField>

<ParamField path="vad_force_turn_endpoint" type="bool" default="True">
  `True` for Pipecat mode (VAD + Smart Turn controls turns); `False` for
  AssemblyAI's built-in turn detection (U3 Pro family only, e.g. `universal-3-5-pro`).
  See [Turn detection](#turn-detection).
</ParamField>

<ParamField path="should_interrupt" type="bool" default="True">
  Whether the user starting to speak interrupts the bot. Only applies in
  AssemblyAI's built-in turn detection mode (`vad_force_turn_endpoint=False`).
</ParamField>

<ParamField path="speaker_format" type="str">
  Template string for formatting speaker labels (e.g., `"[{speaker}] {text}"`).
  Used with `speaker_labels`.
</ParamField>

<ParamField path="sample_rate" type="int" default="16000">
  The sample rate of the audio stream.
</ParamField>

<ParamField path="encoding" type="str" default="pcm_s16le">
  The encoding of the audio stream. Allowed values: `pcm_s16le`, `pcm_mulaw`.
</ParamField>

### Legacy parameters

These apply to the `universal-streaming-english` and `universal-streaming-multilingual` models, but **do not affect the U3 Pro family (`universal-3-5-pro`)**:

<ParamField path="end_of_turn_confidence_threshold" type="float">
  Confidence threshold for end-of-turn detection. The U3 Pro family uses
  punctuation-based turn detection instead, so this parameter has no effect.
</ParamField>

<ParamField path="format_turns" type="bool" default="True">
  Whether to return formatted final transcripts. The U3 Pro family always
  returns formatted transcripts, so this parameter no longer applies.
</ParamField>

## Turn detection

In Pipecat, you choose **which component decides when the user is done speaking** with the `vad_force_turn_endpoint` flag on `AssemblyAISTTService`. The U3 Pro family uses a **punctuation-based** end-of-turn system: after a period of silence, the model checks for terminal punctuation (`.` `?` `!`) rather than a confidence score. For more on how this works, see [Configuring turn detection](/streaming/getting-started/transcribe-streaming-audio).

<Note>
  The `vad_force_turn_endpoint` parameter controls which turn detection mode is
  used. It defaults to `True` (Pipecat mode), which sends a `ForceEndpoint`
  message to AssemblyAI when the local VAD detects silence. Set it to `False` to
  use AssemblyAI's built-in turn detection instead. Choosing the right mode is
  critical for balancing responsiveness and turn accuracy in your voice agent.
</Note>

### Pipecat mode (default, recommended)

**When to use:** Most voice agent applications requiring responsive interruptions.

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
    ),
    vad_force_turn_endpoint=True,  # Default (Pipecat mode)
)
```

**How it works:**

* VAD + the Smart Turn analyzer control when the user is done speaking.
* A `ForceEndpoint` message is sent to AssemblyAI on VAD silence detection.
* `max_turn_silence` is **automatically synchronized** with `min_turn_silence`.
* Best for low-latency, responsive voice agents.
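
The synchronization rule above can be pictured with a small sketch. This is only a model of the documented behavior, and `effective_turn_silence` is a hypothetical helper, not a Pipecat or AssemblyAI function:

```python
def effective_turn_silence(min_turn_silence, max_turn_silence, vad_force_turn_endpoint=True):
    """Return the (min, max) silence values in ms that take effect in each mode."""
    if vad_force_turn_endpoint:
        # Pipecat mode: max_turn_silence follows min_turn_silence,
        # so any configured max_turn_silence is overridden.
        return min_turn_silence, min_turn_silence
    # Built-in turn detection: both values are respected as configured.
    return min_turn_silence, max_turn_silence
```

In practice this means that tuning `max_turn_silence` only has an effect once `vad_force_turn_endpoint=False`.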

### AssemblyAI's built-in turn detection

**When to use:** When you want AssemblyAI's punctuation-based turn detection to control turn endings, configured through the settings below.

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Now respected independently
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection
)
```

**How it works:**

1. User speaks → audio streams to AssemblyAI.
2. User pauses for `min_turn_silence` (e.g., `100ms`) → the model checks for terminal punctuation.
3. If terminal punctuation (`.` `?` `!`) is found → the turn ends immediately.
4. If not → a partial is emitted and the turn continues waiting.
5. If silence reaches `max_turn_silence` (e.g., `1000ms`) → the turn is forced to end regardless.

In this mode all timing parameters are respected as configured, the service emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`, and `SpeechStarted` events drive fast barge-in. This mode is only available with the U3 Pro family (`universal-3-5-pro`); other models require Pipecat mode.
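
The five steps above can be sketched as a small decision function. This is a simplified model for reasoning about the settings, not AssemblyAI's implementation; the function name and return labels are hypothetical:

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_decision(silence_ms, transcript, min_turn_silence=100, max_turn_silence=1000):
    """Classify a turn after `silence_ms` of trailing silence.

    Returns "end_of_turn", "partial", or "listening".
    """
    if silence_ms >= max_turn_silence:
        # Step 5: forced end, regardless of punctuation.
        return "end_of_turn"
    if silence_ms >= min_turn_silence:
        # Steps 2-4: speculative check on terminal punctuation.
        if transcript.rstrip().endswith(TERMINAL_PUNCTUATION):
            return "end_of_turn"
        return "partial"
    # Not enough silence yet: keep streaming audio.
    return "listening"
```

For example, `turn_decision(100, "It's John.")` ends the turn at the first short pause, which is the entity-splitting risk that arises when `min_turn_silence` is set too low.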

### Entity splitting tradeoff

Lower `min_turn_silence` and `max_turn_silence` values produce faster transcripts but can split entities or utterances across turns. The two parameters affect this differently.

#### `min_turn_silence` too low

The speculative check fires too early, splitting entities on punctuation:

```text theme={null}
# With (min_turn_silence=100, max_turn_silence=1000)
"It's John."                    → FINAL (100ms pause, check fires, period found → turn ends)
"Smith."                        → FINAL
"At gmail.com."                 → FINAL

# With (min_turn_silence=400, max_turn_silence=1000)
"It's john.smith@gmail.com."    → FINAL (single turn, properly formatted)
```

#### `max_turn_silence` too low

The forced turn-end cuts off the user mid-thought:

```text theme={null}
# With (min_turn_silence=100, max_turn_silence=1000)
"I wanted to check on my order from..."  → FINAL (1000ms silence, forced end)
"last Tuesday, order number 4829."       → FINAL (new turn)

# With (min_turn_silence=100, max_turn_silence=2000)
"I wanted to check on my order from last Tuesday, order number 4829."  → FINAL (single turn)
```

<Note>
  **Universal 3.5 Pro Realtime's** formatting is significantly better when it has
  full context in a single turn — email addresses, phone numbers, credit card
  numbers, and physical addresses all benefit. If your use case involves
  alphanumeric dictation, raise `max_turn_silence` during those portions of the
  conversation (e.g., to `2000`–`4000` ms) using [dynamic
  configuration](#dynamic-configuration), then lower it again afterward. In
  Pipecat mode, raise `min_turn_silence` (which `max_turn_silence` follows) for
  the same effect.
</Note>
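
One way to organize this is a lookup from conversation stage to turn-silence values. The sketch below is illustrative only: the stage names, the helper, and the specific values (picked from the ranges mentioned above) are assumptions, and it only builds the settings delta rather than sending it.

```python
# Hypothetical stage names and values, chosen for illustration only.
DEFAULT_SILENCE = {"min_turn_silence": 100, "max_turn_silence": 1000}
DICTATION_SILENCE = {"min_turn_silence": 400, "max_turn_silence": 3000}
DICTATION_STAGES = {"collect_email", "collect_phone", "collect_address"}


def turn_silence_for_stage(stage, pipecat_mode=True):
    """Return the turn-silence settings to apply for a conversation stage."""
    base = DICTATION_SILENCE if stage in DICTATION_STAGES else DEFAULT_SILENCE
    settings = dict(base)
    if pipecat_mode:
        # In Pipecat mode max_turn_silence follows min_turn_silence,
        # so only min_turn_silence is worth changing.
        settings.pop("max_turn_silence")
    return settings
```

The returned dictionary is the delta you would apply through [dynamic configuration](#dynamic-configuration) when the agent enters or leaves a dictation stage.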

## Latency

A voice agent feels responsive when the gap between the user finishing and the agent replying is short.

Start with the **`mode` preset** — the highest-level dial for the accuracy/latency trade-off. It sets sensible defaults for the fine-grained levers below, so you can pick a target and tune from there:

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        mode="balanced",  # "min_latency" (fastest) · "balanced" · "max_accuracy" (best quality)
    ),
)
```

`mode` is set at construction time (it can't be changed mid-session) and influences the defaults of the levers below. Any value you set explicitly still wins. Leave it unset to use the server's default preset. See [Optimizing accuracy and latency](/streaming/getting-started/optimizing-accuracy-and-latency).

From there, fine-tune the individual levers:

* **End-of-turn timing.** `min_turn_silence` (speculative check) and `max_turn_silence` (forced end) directly control how soon a turn ends. Lower is faster but risks splitting entities — see [Turn detection](#turn-detection).
* **Time to first partial.** `interruption_delay` controls how soon the first partial is emitted, which drives faster barge-in and speculative inference. The server adds a minimum of `300ms` on top of the configured value.
* **Sample rate.** Use 16 kHz (`sample_rate=16000`). Higher rates don't improve accuracy and only add bandwidth.
* **Continuous partials.** `continuous_partials` (on by default) emits a partial every \~3 seconds during long turns. Leave it on for steady mid-turn updates, or disable it if you only need a single early partial.
* **Skip client-side preprocessing.** Don't run your own noise cancellation before audio reaches the model — the artifacts it introduces usually hurt accuracy more than the original noise. Use server-side [Voice Focus](#voice-focus) instead.

### Latency breakdown

| Stage                                    | Typical                                               | Controlled by        |
| ---------------------------------------- | ----------------------------------------------------- | -------------------- |
| Network round trip                       | \~50 ms                                               | —                    |
| Speech-to-text                           | \~200–300 ms                                          | model                |
| First partial (TTFT)                     | configured `interruption_delay` + \~300 ms server min | `interruption_delay` |
| End of turn (terminal punctuation found) | `min_turn_silence` (default 100 ms)                   | `min_turn_silence`   |
| End of turn (no punctuation, forced)     | up to `max_turn_silence`                              | `max_turn_silence`   |
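
The first-partial row is simple arithmetic, sketched here with a hypothetical helper. It assumes the approximately 300 ms server minimum and the `0`–`1000` range described in this guide:

```python
SERVER_MIN_MS = 300  # approximate server-side minimum added to interruption_delay


def effective_first_partial_ms(interruption_delay=500):
    """Return the approximate delay in ms before the first partial is emitted."""
    if not 0 <= interruption_delay <= 1000:
        raise ValueError("interruption_delay must be between 0 and 1000 ms")
    return interruption_delay + SERVER_MIN_MS
```

With the default of `500`, the first partial arrives after roughly 800 ms; setting `interruption_delay=0` brings that down to roughly 300 ms.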

## Accuracy

**Universal 3.5 Pro Realtime** is accurate out of the box. When you need more — domain vocabulary, proper nouns, noisy audio — reach for these levers. For entity-heavy dictation, also tune turn detection (see [Entity splitting tradeoff](#entity-splitting-tradeoff)), and note that the high-level [`mode` preset](#latency) shifts the overall accuracy/latency balance (use `max_accuracy` to favor quality).

### Prompting

<Warning>
  **Beta feature**

  Prompting is considered a beta feature for **Universal 3.5 Pro Realtime**.

  While it can be a powerful tool for improving accuracy in certain use cases,
  **we recommend starting without a `prompt` to first establish baseline
  performance.** Once the baseline has been tested, you can add context to
  further optimize for your use case (e.g., language mix to expect, use case or
  domain).
</Warning>

**Universal 3.5 Pro Realtime** supports a `prompt` parameter for [contextual prompting](/streaming/prompting-and-keyterms) — a description of what the audio is about. Transcription behavior (verbatim output, punctuation, turn detection) is built in and optimized automatically; the prompt carries context, not instructions.

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        prompt="Customer support call about an internet service outage.",
    ),
)
```

### Key terms

Use `keyterms_prompt` to boost recognition of specific names, brands, or domain terms. On its own, your terms are appended to the default prompt automatically — so you get boosting and prompting together:

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)
```

<Note>
  You can't pass `prompt` and `keyterms_prompt` in the same request — doing so
  raises a validation error. You don't have to give up term boosting to use a
  contextual prompt, though. Either:

  * Pass **`keyterms_prompt` on its own** — your terms are appended to the
    default prompt automatically, or
  * Fold the terms into a custom **`prompt`**, e.g. end it with
    `"Make sure to boost the words Xiomara, Saoirse, Pipecat in the audio."`
</Note>
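
To avoid the validation error in code that accepts both inputs, you could route them through one helper. This is a minimal sketch; `prompt_settings` is a hypothetical function, and the boosting sentence simply reuses the wording suggested above:

```python
def prompt_settings(prompt=None, keyterms=None):
    """Return keyword arguments containing `prompt` or `keyterms_prompt`, never both."""
    terms = [term for term in (keyterms or []) if term]
    if prompt and terms:
        # Both supplied: fold the terms into the prompt text.
        boosted = f"{prompt.rstrip()} Make sure to boost the words {', '.join(terms)} in the audio."
        return {"prompt": boosted}
    if prompt:
        return {"prompt": prompt}
    if terms:
        # Terms only: let them be appended to the default prompt.
        return {"keyterms_prompt": terms}
    return {}
```

The result can be unpacked into the settings, for example `AssemblyAISTTService.Settings(model="universal-3-5-pro", **prompt_settings(prompt, keyterms))`.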

### Conversation context

Give the model both sides of the dialog so it transcribes the next user turn more accurately. **Universal 3.5 Pro Realtime** keeps a short, per-session memory of the conversation from two sources:

* **The agent half** — what your agent just said.
* **The user half** — prior STT-finalized user turns.

With the agent's question in context, the model can anticipate the answer, sharpen entity recognition, and disambiguate similar-sounding words. For example, after your agent asks `"What's your email address?"`, the model can produce `"user@assemblyai.com"` instead of `"user at assemblyai dot com"`. This has the biggest impact on short replies (`"yes"`, `"7pm"`, single names) and spelled-out entities. See [Conversation context](/streaming/universal-3-pro/context-carryover) for the full reference.

<Note>
  **In Pipecat, conversation context is automatic — no event wiring required.**
  As long as your pipeline includes the standard LLM context aggregator (the
  `assistant_aggregator` from `LLMContextAggregatorPair`), Pipecat broadcasts an
  `LLMContextAssistantTurnFrame` when each bot turn completes, and
  `AssemblyAISTTService` feeds that reply to the model as `agent_context`
  automatically. Just use a U3 Pro family model on `pipecat-ai` 1.4.0+.
</Note>

| Parameter                  | Type | Description                                                                                                                                                             |
| -------------------------- | ---- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `agent_context`            | str  | Your agent's most recent spoken reply, up to \~1500 characters. Set it at construction time to seed an opening greeting; subsequent replies are fed automatically.      |
| `previous_context_n_turns` | int  | How many prior conversation entries are carried forward automatically. Range `0`–`100`; `0` disables carryover entirely. Construction-time only; server default is `3`. |

#### Seeding the opening greeting

The automatic feed kicks in once your agent completes its first turn. To give the model context for the user's *very first* reply (the answer to your greeting), set `agent_context` at construction time:

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        # Seed the opening line; later turns are fed automatically by the aggregator.
        agent_context="Hi! Thanks for calling Acme. What's the email on your account?",
        # previous_context_n_turns=3,  # Default. Set 0 to disable carryover entirely.
    ),
)
```

#### Manual control with `update_agent_context()`

If your pipeline doesn't use the standard LLM context aggregator, or you want explicit control over what the model sees, push the agent's reply yourself. This is a **live update — no reconnect required**:

```python theme={null}
# Whenever your agent finishes speaking:
await stt.update_agent_context("Your account is past due. Would you like to pay now?")
```

<Note>
  `agent_context`, `previous_context_n_turns`, and `update_agent_context()` are
  supported only on the U3 Pro family (`universal-3-5-pro`). Values
  are clipped to \~1500 characters and re-seeded automatically on reconnect.
  Setting `previous_context_n_turns=0` disables the automatic feed as well.
</Note>
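
When you push context manually, you may want to decide for yourself which part of a long reply survives the limit. The sketch below keeps the end of the reply, on the assumption that the agent's closing question matters most. This guide does not specify how the automatic clipping chooses what to keep, and `clip_agent_context` is a hypothetical helper:

```python
AGENT_CONTEXT_LIMIT = 1500  # approximate limit stated in this guide


def clip_agent_context(reply, limit=AGENT_CONTEXT_LIMIT):
    """Collapse whitespace and keep at most `limit` trailing characters of `reply`."""
    text = " ".join(reply.split())
    if len(text) <= limit:
        return text
    tail = text[-limit:]
    # If the cut landed mid-word, drop the leading partial word.
    if text[-limit - 1] != " " and " " in tail:
        tail = tail.split(" ", 1)[1]
    return tail
```

You could then call `await stt.update_agent_context(clip_agent_context(reply))` whenever your agent finishes speaking.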

### Voice focus

Voice Focus isolates the primary speaker and suppresses background noise — chatter, keyboard clicks, fan hum, room echo — **server-side, before audio reaches the model**. Use it instead of client-side noise cancellation, which tends to introduce artifacts that hurt accuracy more than the noise itself.

| Parameter               | Type  | Description                                                                                                                                      |
| ----------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `voice_focus`           | str   | `"near-field"` for headsets, handsets, and other close-talking mics; `"far-field"` for conference rooms, laptop mics, and other distant capture. |
| `voice_focus_threshold` | float | Optional. `0.0`–`1.0`; higher values suppress background audio more aggressively.                                                                |

Both are construction-time parameters on the U3 Pro family. See [Voice Focus](/streaming/voice-focus) for details.

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        voice_focus="far-field",    # "near-field" for close-talking mics
        voice_focus_threshold=0.5,  # Optional: 0.0–1.0, higher = more aggressive
    ),
)
```
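
If your agent serves several device types, it can help to pick the mode from what you know about the microphone and to validate the threshold before constructing the service. This is a sketch only: the microphone labels and `voice_focus_settings` are hypothetical names, and the returned keys simply mirror the two parameters in the table above.

```python
from typing import Optional

# Hypothetical device labels, grouped the way the parameter table describes them.
NEAR_FIELD_MICS = {"headset", "handset"}
FAR_FIELD_MICS = {"conference_room", "laptop"}


def voice_focus_settings(mic: str, threshold: Optional[float] = None) -> dict:
    """Build the Voice Focus keyword arguments for a given microphone type."""
    if mic in NEAR_FIELD_MICS:
        mode = "near-field"
    elif mic in FAR_FIELD_MICS:
        mode = "far-field"
    else:
        raise ValueError(f"unknown microphone type: {mic!r}")

    settings = {"voice_focus": mode}
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("voice_focus_threshold must be between 0.0 and 1.0")
        settings["voice_focus_threshold"] = threshold
    return settings


print(voice_focus_settings("laptop", 0.5))
```

The resulting dictionary can be merged into the keyword arguments you already pass to `AssemblyAISTTService.Settings(...)`.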

## Interruption handling

Barge-in — the user interrupting while the agent is speaking — is handled by Pipecat, and the signals that drive it depend on your turn detection mode.

* **Pipecat mode (`vad_force_turn_endpoint=True`).** Pipecat's local VAD and the Smart Turn analyzer detect the user starting to speak and interrupt the bot's TTS. AssemblyAI also emits `SpeechStarted` events as a backstop.
* **AssemblyAI's built-in turn detection (`vad_force_turn_endpoint=False`).** The service emits `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` and uses AssemblyAI's `SpeechStarted` events for fast barge-in. Set `should_interrupt=False` (constructor argument) to disable barge-in entirely in this mode.

```json theme={null}
{"type": "SpeechStarted", "timestamp": 14400, "confidence": 0.79}
```

On detection, Pipecat stops TTS playback and switches to listening. To reduce false interruptions from short backchannels (`"mhm"`, `"yeah"`, `"okay"`), keep the threshold of Pipecat's local VAD aligned with AssemblyAI's `vad_threshold`, and lean on Pipecat's Smart Turn analyzer, which evaluates whether speech is a genuine turn or just filler.
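
Pipecat consumes `SpeechStarted` events for you, so you normally never parse them. If you are logging or debugging barge-in, the sketch below shows how the event shape above could be inspected. `is_barge_in` and its `min_confidence` cutoff are hypothetical and are not service parameters.

```python
import json


def is_barge_in(raw_message: str, min_confidence: float = 0.5) -> bool:
    """Return True if the message is a SpeechStarted event above the cutoff."""
    message = json.loads(raw_message)
    if message.get("type") != "SpeechStarted":
        return False
    return float(message.get("confidence", 0.0)) >= min_confidence


sample = '{"type": "SpeechStarted", "timestamp": 14400, "confidence": 0.79}'
print(is_barge_in(sample))                      # True
print(is_barge_in(sample, min_confidence=0.9))  # False
```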

## Dynamic configuration

Update settings mid-conversation by queueing an `STTUpdateSettingsFrame` with a settings delta — adapt to the conversation stage as it unfolds. See [stt-assemblyai.py](https://github.com/pipecat-ai/pipecat/blob/main/examples/update-settings/stt/stt-assemblyai.py) for a complete working example.

```python expandable theme={null}
from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Update keyterms during the conversation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            keyterms_prompt=["NewName", "NewCompany"],
        )
    )
)

# Widen the silence window during entity dictation
await worker.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTService.Settings(
            min_turn_silence=200,
            max_turn_silence=3000,  # Respected in AssemblyAI's built-in turn detection mode
        )
    )
)
```

<Warning>
  **`agent_context` is the only setting applied live.** Changing any other
  setting via `STTUpdateSettingsFrame` reconnects the AssemblyAI session to
  apply it (a brief interruption). To push conversation context without a
  reconnect, use the dedicated `stt.update_agent_context(...)` method — see
  [Conversation context](#conversation-context).
</Warning>

| Conversation stage                         | Adjustment                                                                    |
| ------------------------------------------ | ----------------------------------------------------------------------------- |
| Caller identification (names, account IDs) | Boost terms with `keyterms_prompt`                                            |
| Entity dictation (email, phone, address)   | Raise `max_turn_silence` to \~`2000`–`4000` ms, then lower it again afterward |
| After each agent reply                     | Automatic — or push `agent_context` via `update_agent_context()`              |
| Faster barge-in                            | Lower `interruption_delay`                                                    |

For more information, see [Updating configuration mid-stream](/streaming/updating-configuration-mid-stream).
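
Since every change other than `agent_context` triggers a reconnect, it is worth sending a delta only when the conversation stage actually changes. The sketch below keeps a per-stage table and returns nothing when the stage is unchanged. The stage names, the `StageTuner` class, the sample key terms, and the `100`/`1000` baseline values are illustrative assumptions, not part of Pipecat or AssemblyAI.

```python
from typing import Optional

# Hypothetical stage names mapped to the settings discussed in the table above.
STAGE_DELTAS = {
    "caller_identification": {"keyterms_prompt": ["Acme", "Jane Doe"]},
    "entity_dictation": {"min_turn_silence": 200, "max_turn_silence": 3000},
    "default": {"min_turn_silence": 100, "max_turn_silence": 1000},
}


class StageTuner:
    """Track the current stage so repeated calls do not cause extra reconnects."""

    def __init__(self) -> None:
        self._current: Optional[str] = None

    def delta_for(self, stage: str) -> Optional[dict]:
        """Return the settings to apply, or None if the stage has not changed."""
        if stage == self._current:
            return None
        self._current = stage
        return dict(STAGE_DELTAS[stage])


tuner = StageTuner()
first = tuner.delta_for("entity_dictation")   # settings to send
repeat = tuner.delta_for("entity_dictation")  # None: already applied
after = tuner.delta_for("default")            # lower the window again

# A non-None delta could be sent as in the example above, for instance:
#   STTUpdateSettingsFrame(delta=AssemblyAISTTService.Settings(**first))
```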

## Speaker diarization

Identify different speakers in multi-party conversations.

### Basic diarization

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
)
```

Speaker labels (e.g., `"A"`, `"B"`, `"C"`) are included in final transcripts.

### With custom formatting

Format transcripts with speaker labels for LLM context:

```python theme={null}
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        speaker_labels=True,
    ),
    speaker_format="<{speaker}>{text}</{speaker}>",
)
```

**Format options:**

| Style    | Format string                   |
| -------- | ------------------------------- |
| XML      | `<{speaker}>{text}</{speaker}>` |
| Markdown | `**{speaker}**: {text}`         |
| Bracket  | `[{speaker}] {text}`            |
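
To preview what each style produces, the templates can be filled with plain Python string formatting. This assumes the `{speaker}` and `{text}` placeholders behave like `str.format` fields; how the service substitutes them internally is not stated here, and `render` is a hypothetical helper.

```python
FORMATS = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "**{speaker}**: {text}",
    "bracket": "[{speaker}] {text}",
}


def render(style: str, speaker: str, text: str) -> str:
    """Fill one of the format strings from the table with a speaker and text."""
    return FORMATS[style].format(speaker=speaker, text=text)


for style in FORMATS:
    print(f"{style}: {render(style, 'A', 'Hello there.')}")
# xml: <A>Hello there.</A>
# markdown: **A**: Hello there.
# bracket: [A] Hello there.
```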

## Running your agent

### Development mode (local audio)

```bash theme={null}
python your_agent.py
```

Speak into your microphone after hearing the greeting.

### Production with Daily

For production deployments, use the [Daily transport](https://docs.pipecat.ai/server/services/transport/daily) for WebRTC-based real-time audio/video. Your agent joins a Daily room as a participant and handles audio I/O through Daily's infrastructure.

### Telephony with Telnyx

When bridging phone calls through Pipecat (e.g., via Telnyx), the audio is 8 kHz, not 16 kHz. Match the transport sample rates:

```python theme={null}
transport = TelnyxTransport(
    # ...
    audio_in_sample_rate=8000,
    audio_out_sample_rate=8000,
)
```
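
The 8 kHz versus 16 kHz difference also changes how many bytes each audio chunk carries, which is a quick way to spot a mismatched sample rate in logs. A rough sketch of the arithmetic, assuming 20 ms chunks (a common telephony framing, not something this guide specifies), 8-bit mu-law for telephony, and 16-bit linear PCM otherwise; `chunk_bytes` is a hypothetical helper.

```python
def chunk_bytes(sample_rate: int, bytes_per_sample: int, chunk_ms: int = 20) -> int:
    """Bytes in one mono audio chunk of `chunk_ms` milliseconds."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample


print(chunk_bytes(8000, 1))   # 160  (8 kHz, 8-bit mu-law)
print(chunk_bytes(16000, 2))  # 640  (16 kHz, 16-bit PCM)
```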

## Troubleshooting

| Issue                                  | Cause                                             | Solution                                                                                     |
| -------------------------------------- | ------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| `universal-3-5-pro` not recognized     | `pipecat-ai` older than 1.4.0                     | Upgrade: `pip install -U "pipecat-ai[assemblyai]"`                                           |
| Turn over-segmentation                 | `min_turn_silence` too low                        | Increase from `100` to `200`–`500`                                                           |
| Entities split across turns            | `max_turn_silence` too low (AssemblyAI mode)      | Increase `max_turn_silence` (e.g., `1500`–`3500`); in Pipecat mode, raise `min_turn_silence` |
| Latency on non-terminal utterances     | `max_turn_silence` too high                       | Lower `max_turn_silence`                                                                     |
| Conversation context has no effect     | Non-U3-Pro model, or `previous_context_n_turns=0` | Use a U3 Pro family model and leave `previous_context_n_turns` unset (or > 0)                |
| Mid-session setting change drops audio | Reconnect on a non-`agent_context` setting change | Expected — only `agent_context` updates live; use `update_agent_context()` for context       |
| Mis-heard names, brands, or jargon     | No vocabulary hints                               | Add `keyterms_prompt`, or supply `prompt`/`agent_context` for context                        |
| Poor accuracy in noisy audio           | Background noise or room echo                     | Enable `voice_focus` (`near-field` or `far-field`)                                           |

## Migrating from another STT provider

To balance accuracy, latency, turn-taking, and interruption handling, map your current setup to AssemblyAI using the questions below.

### How are you detecting end-of-turn today?

| Today                                        | Recommended on AssemblyAI                                                                                                   |
| -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| Your STT provider's own end-of-turn model    | AssemblyAI's built-in turn detection: `vad_force_turn_endpoint=False` with `min_turn_silence=100`, `max_turn_silence=1000`. |
| Silence / VAD only, with your own turn logic | Pipecat mode (`vad_force_turn_endpoint=True`, default). VAD + Smart Turn decide turns; AssemblyAI returns finals ASAP.      |
| You want the framework to own turn-taking    | Pipecat mode (default) — Pipecat's Smart Turn analyzer makes the turn decision.                                             |
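
The table above can be read as a small lookup. The sketch below encodes it, assuming three hypothetical labels for your current setup; only the returned option names and values come from the table, and you would pass them wherever your service configuration already sets these options.

```python
def turn_detection_settings(current_setup: str) -> dict:
    """Map a current end-of-turn approach to the recommended options."""
    if current_setup == "provider_end_of_turn_model":
        # AssemblyAI's built-in turn detection
        return {
            "vad_force_turn_endpoint": False,
            "min_turn_silence": 100,
            "max_turn_silence": 1000,
        }
    if current_setup in {"vad_only", "framework_owned"}:
        # Pipecat mode (the default): VAD + Smart Turn decide turns
        return {"vad_force_turn_endpoint": True}
    raise ValueError(f"unknown setup: {current_setup!r}")


print(turn_detection_settings("provider_end_of_turn_model"))
```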

### Which model and settings are you migrating from?

| What you pass today                        | AssemblyAI equivalent                                                                                 |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------- |
| Current model (Deepgram, ElevenLabs, etc.) | `model="universal-3-5-pro"` (recommended flagship)                                                    |
| Overall accuracy/latency tuning            | `mode="min_latency"` / `"balanced"` / `"max_accuracy"` — a one-line starting point before fine-tuning |
| Endpointing / silence thresholds           | `min_turn_silence` (speculative end-of-turn) and `max_turn_silence` (forced end)                      |
| Custom vocabulary / keywords               | `keyterms_prompt=[...]`; broader domain context → `prompt`                                            |
| Provider-side conversation context         | Automatic — include the LLM context aggregator; seed greetings via `agent_context`                    |
| Formatting / punctuation toggles           | On by default — formatted transcripts always (`format_turns` does not apply)                          |
| Telephony / SIP routing                    | `sample_rate=8000` and `encoding="pcm_mulaw"` for 8 kHz telephony                                     |
| Client-side noise cancellation             | Drop it; use server-side [Voice Focus](#voice-focus) instead                                          |

Migrating a production deployment? [Talk to our team](https://www.assemblyai.com/contact/sales).

## Speech model comparison

Interested in using a different model?

| Feature                            | U3 Pro family <br />(`universal-3-5-pro`)                      | universal-streaming-english | universal-streaming-multilingual |
| ---------------------------------- | -------------------------------------------------------------- | --------------------------- | -------------------------------- |
| ***Turn Detection Modes***         |                                                                |                             |                                  |
| Pipecat mode (VAD + Smart Turn)    | ✅                                                              | ✅                           | ✅                                |
| AssemblyAI turn detection mode     | ✅                                                              | ❌                           | ❌                                |
| ***Turn Detection Parameters***    |                                                                |                             |                                  |
| `min_turn_silence`                 | ✅                                                              | ✅                           | ✅                                |
| `max_turn_silence`                 | ✅                                                              | ✅                           | ✅                                |
| `end_of_turn_confidence_threshold` | ❌                                                              | ✅ (1.0)                     | ✅ (1.0)                          |
| `continuous_partials`              | ✅                                                              | ❌                           | ❌                                |
| `interruption_delay`               | ✅                                                              | ❌                           | ❌                                |
| ***Advanced Features***            |                                                                |                             |                                  |
| Keyterms boosting                  | ✅                                                              | ✅                           | ✅                                |
| Custom prompting (beta)            | ✅                                                              | ❌                           | ❌                                |
| Conversation context (carryover)   | ✅                                                              | ❌                           | ❌                                |
| Voice Focus                        | ✅                                                              | ❌                           | ❌                                |
| Speaker diarization                | ✅                                                              | ✅                           | ✅                                |
| Dynamic parameter updates          | ✅                                                              | ✅                           | ✅                                |
| ***Language Support***             |                                                                |                             |                                  |
| Multilingual code switching        | ✅                                                              | ❌                           | ✅                                |
| Language detection                 | ✅                                                              | ❌                           | ✅                                |

**Legend:**

* ✅ Fully supported and recommended
* ❌ Not supported / Not used

<Note>
  **The U3 Pro family is recommended** for all new voice agent implementations.
  The universal-streaming models are maintained for backward compatibility but
  lack the optimizations and features specifically designed for real-time
  conversational AI.
</Note>

<Warning>
  The `end_of_turn_confidence_threshold` parameter is **not used** with the U3
  Pro family (it won't affect behavior). For universal-streaming models, Pipecat
  automatically sets it to `1.0` in Pipecat mode to disable semantic turn
  detection and ensure fast responses. You don't need to configure this
  parameter manually.
</Warning>
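
When one codebase targets more than one model, a small pre-flight check can flag settings that the comparison table marks as unsupported or unused. This is a sketch derived from the table, not an official validator: `unsupported_settings` is hypothetical, and the mapping from feature rows to parameter names (for example, "Custom prompting" to `prompt`) is an assumption based on the parameters named elsewhere in this guide.

```python
# Settings the comparison table lists as U3 Pro family only (assumed mapping).
U3_PRO_ONLY = {
    "continuous_partials",
    "interruption_delay",
    "prompt",
    "agent_context",
    "previous_context_n_turns",
    "voice_focus",
    "voice_focus_threshold",
}
U3_PRO_MODELS = {"universal-3-5-pro"}


def unsupported_settings(model: str, settings: dict) -> list:
    """Return the setting names that the table says have no effect on `model`."""
    if model in U3_PRO_MODELS:
        # Not used by the U3 Pro family, per the warning above.
        ignored = {"end_of_turn_confidence_threshold"}
    else:
        ignored = U3_PRO_ONLY
    return sorted(name for name in settings if name in ignored)


print(unsupported_settings(
    "universal-streaming-english",
    {"voice_focus": "far-field", "min_turn_silence": 200},
))  # ['voice_focus']
```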
