Universal-3 Pro on Pipecat

Overview

This guide covers integrating AssemblyAI’s Universal-3 Pro (u3-rt-pro) streaming speech-to-text model into Pipecat voice agents.

Universal-3 Pro for streaming is optimized for real-time transcription of short utterances (typically under 10 seconds), with efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy of AssemblyAI's streaming models, with native multilingual code switching, strong entity accuracy, and prompting support.

Universal-3 Pro streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names, all with time-to-complete-transcript latency under 300 ms.

Key features

  • Two-mode turn detection: Choose between Pipecat-controlled (VAD + Smart Turn) or AssemblyAI’s built-in turn detection (STT-based)
  • Keyterms boosting: Improve recognition of specific words and names
  • Dynamic parameter updates: Change configuration mid-conversation without reconnection
  • Speaker diarization: Identify and label different speakers in multi-party conversations

Examples

Complete working examples are available in the Pipecat repository.

You can run any example directly as long as your API keys are saved in a .env file:

$ python 07o-interruptible-assemblyai.py

The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.

Installation

Install Pipecat with all required dependencies:

$ pip install "pipecat-ai[assemblyai,openai,cartesia]"

What’s included:

  • assemblyai — AssemblyAI U3-Pro STT service
  • openai — OpenAI LLM service (used in the examples)
  • cartesia — Cartesia TTS service (used in the examples)

The examples use OpenAI and Cartesia, but you can use any LLM or TTS service supported by Pipecat; just swap the extras in the install command (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).

Authentication

Set your API keys in a .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key

You can obtain an API key by signing up for an AssemblyAI account.
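The Pipecat examples typically load this file with python-dotenv at startup. Conceptually, that loading step is equivalent to this minimal stdlib sketch (load_env is illustrative, not part of Pipecat):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=value lines into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and malformed lines
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Real environment variables take precedence over .env values
            os.environ.setdefault(key.strip(), value.strip())
```

In practice, `from dotenv import load_dotenv; load_dotenv()` handles this for you.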

Two-mode turn detection

Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI's U3-Pro model.

Pipecat-controlled turn detection (VAD + Smart Turn)

When to use: Most voice agent applications requiring responsive interruptions.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
    ),
    vad_force_turn_endpoint=True,  # Default — Pipecat mode
)

How it works:

  • VAD + Smart Turn analyzer controls when the user is done speaking
  • ForceEndpoint message sent to AssemblyAI on VAD silence detection
  • max_turn_silence automatically synchronized with min_turn_silence
  • Best for low-latency, responsive voice agents

AssemblyAI’s built-in turn detection (STT mode)

When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the connection parameters — see Configuring turn detection to understand how it works.

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        max_turn_silence=1000,  # Now respected
    ),
    vad_force_turn_endpoint=False,  # AssemblyAI's built-in turn detection (STT mode)
)

How it works:

  • AssemblyAI’s built-in turn detection controls when the user is done speaking
  • All timing parameters are respected as configured
  • Emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame
  • Uses SpeechStarted events for fast barge-in
  • Only available with u3-rt-pro (other models require Pipecat mode)

AssemblyAI’s built-in turn detection uses the STT model’s understanding of speech patterns to determine turn boundaries, rather than relying on local VAD silence detection.
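To make the difference between the two modes concrete, here is a plain-Python illustration of which signal decides the turn boundary in each case (turn_ended is a hypothetical sketch, not part of the Pipecat API):

```python
def turn_ended(vad_force_turn_endpoint: bool,
               local_vad_detects_silence: bool,
               server_detects_end_of_turn: bool) -> bool:
    """Illustrates which signal ends the user's turn in each mode."""
    if vad_force_turn_endpoint:
        # Pipecat mode: local VAD silence triggers a ForceEndpoint
        # message, so the local signal decides the turn boundary.
        return local_vad_detects_silence
    # STT mode: AssemblyAI's model decides the turn boundary.
    return server_detects_end_of_turn
```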

Keyterms boosting

Improve recognition of specific words or names:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        min_turn_silence=100,
        keyterms_prompt=["Xiomara", "Saoirse", "Pipecat", "AssemblyAI"],
    ),
)

Dynamic parameter updates

Change configuration mid-conversation without reconnection. See 55d-update-settings-assemblyai-stt.py for a complete working example.

from pipecat.frames.frames import STTUpdateSettingsFrame
from pipecat.services.assemblyai.stt import AssemblyAISTTSettings
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

# Update keyterms during conversation
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTSettings(
            connection_params=AssemblyAIConnectionParams(
                keyterms_prompt=["NewName", "NewCompany"]
            )
        )
    )
)

# Update silence thresholds
await task.queue_frame(
    STTUpdateSettingsFrame(
        delta=AssemblyAISTTSettings(
            connection_params=AssemblyAIConnectionParams(
                min_turn_silence=200,
                max_turn_silence=3000,  # Only respected in AssemblyAI's built-in turn detection (STT mode)
            )
        )
    )
)

Speaker diarization

Identify different speakers in multi-party conversations.

Basic diarization

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,
    ),
)

Speaker labels (e.g., "A", "B", "C") are included in final transcripts and logged.

With custom formatting

Format transcripts with speaker labels for LLM context:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,
    ),
    speaker_format="<Speaker {speaker}>{text}</Speaker {speaker}>",
)

Format options:

Style      Format string
XML        <{speaker}>{text}</{speaker}>
Markdown   **{speaker}**: {text}
Bracket    [{speaker}] {text}
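These templates are ordinary Python format strings with {speaker} and {text} placeholders, so each final transcript line is rendered with a simple substitution. For instance:

```python
# The three format styles from the table above
formats = {
    "xml": "<{speaker}>{text}</{speaker}>",
    "markdown": "**{speaker}**: {text}",
    "bracket": "[{speaker}] {text}",
}

# Each final transcript is rendered through the chosen template
line = formats["bracket"].format(speaker="A", text="Hi, how can I help?")
# → "[A] Hi, how can I help?"
```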

Daily transport

For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.

Parameters reference

U3-Pro specific parameters

speech_model (str, default: "u3-rt-pro")

The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro).

min_turn_silence (int, default: 100)

Milliseconds of silence before ending a turn when model is confident. Set to 100 for best latency. (Formerly min_end_of_turn_silence_when_confident, which is deprecated but still supported with a warning.)

max_turn_silence (int, default: 1000)

Maximum silence before forced turn end. Auto-synced in Pipecat mode; respected in AssemblyAI’s built-in turn detection (STT mode).
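The interplay between the two silence thresholds can be sketched as follows (end_of_turn is a hypothetical illustration of the decision logic, not AssemblyAI's actual implementation):

```python
def end_of_turn(silence_ms: int, model_confident: bool,
                min_turn_silence: int = 100,
                max_turn_silence: int = 1000) -> bool:
    """Sketch of how the silence thresholds interact in STT mode."""
    # When the model is confident the utterance is complete,
    # the turn ends after only min_turn_silence ms of silence.
    if model_confident and silence_ms >= min_turn_silence:
        return True
    # Otherwise the turn end is forced once max_turn_silence is reached.
    return silence_ms >= max_turn_silence
```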

keyterms_prompt (list[str])

List of terms to boost recognition for. Cannot be used with prompt.

speaker_labels (bool, default: False)

Enable speaker diarization.

prompt (str)

Custom transcription instructions. Cannot be used with keyterms_prompt. Prompting is currently a beta feature — see Prompting for more information.

General parameters

api_key (str, required)

Your AssemblyAI API key.

vad_force_turn_endpoint (bool, default: True)

True for Pipecat mode; False for AssemblyAI’s built-in turn detection (STT mode).

speaker_format (str)

Template string for formatting speaker labels (e.g., "[{speaker}] {text}").

Running your agent

Development mode (local audio)

$ python your_agent.py

Speak into your microphone after hearing the greeting.

Production with Daily

Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.

Speech model comparison

Interested in using a different model?

Feature                            u3-rt-pro   universal-streaming-english   universal-streaming-multilingual

Turn Detection Modes
Pipecat mode (VAD + Smart Turn)    ✅           ✅                             ✅
AssemblyAI turn detection mode     ✅           ❌                             ❌

Turn Detection Parameters
min_turn_silence                   ✅           ❌                             ❌
max_turn_silence                   ✅           ❌                             ❌
end_of_turn_confidence_threshold   ❌           ✅ (1.0)                       ✅ (1.0)

Advanced Features
Keyterms boosting                  ✅           ❌                             ❌
Custom prompting (beta)            ✅           ❌                             ❌
Speaker diarization                ✅           ❌                             ❌
Dynamic parameter updates          ✅           ❌                             ❌

Language Support
Multilingual code switching        ✅           ❌                             ✅
Language detection                 ✅           ❌                             ✅

Legend:

  • ✅ Fully supported and recommended
  • ❌ Not supported / Not used

u3-rt-pro is the recommended model for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.

The end_of_turn_confidence_threshold parameter is not used with u3-rt-pro (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.