Universal-3 Pro on Pipecat
Overview
This guide covers integrating AssemblyAI’s Universal-3 Pro (u3-rt-pro) streaming speech-to-text model into Pipecat voice agents.
Universal-3 Pro for streaming is optimized for real-time audio utterances typically under 10 seconds, with special efficiencies built in for low-latency turn detection and voice agent workflows. It provides the highest accuracy with native multilingual code switching, entity accuracy, and prompting support.
Universal-3 Pro streaming delivers exceptional entity and alphanumeric accuracy, including credit card numbers, cell phone numbers, email addresses, physical addresses, and names — all with sub-300 ms time-to-complete-transcript latency.
Key features
- Two-mode turn detection: Choose between Pipecat-controlled (VAD + Smart Turn) or AssemblyAI’s built-in turn detection (STT-based)
- Keyterms boosting: Improve recognition of specific words and names
- Dynamic parameter updates: Change configuration mid-conversation without reconnection
- Speaker diarization: Identify and label different speakers in multi-party conversations
Examples
Complete working examples are available in the Pipecat repository:
- 07o-interruptible-assemblyai.py — Interruptible voice agent with U3-Pro using Pipecat-based turn detection (VAD + Smart Turn)
- 07o-interruptible-assemblyai-stt.py — Interruptible voice agent with U3-Pro using AssemblyAI’s built-in turn detection (STT-based)
You can run any example directly as long as your API keys are saved in a .env file:
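For example, from the examples directory of the Pipecat repository:

```shell
python 07o-interruptible-assemblyai.py
```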
The vad_force_turn_endpoint parameter controls which turn detection mode is used. It defaults to True (Pipecat mode), which sends a ForceEndpoint message to AssemblyAI when the local VAD detects silence. Set it to False to use AssemblyAI’s built-in turn detection instead. Choosing the right mode is critical for balancing responsiveness and turn accuracy in your voice agent.
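The behavior can be sketched in plain Python. This is an illustrative sketch of the dispatch logic described above, not Pipecat's actual implementation; `on_vad_silence` and `send_force_endpoint` are hypothetical names:

```python
def on_vad_silence(vad_force_turn_endpoint: bool, send_force_endpoint) -> str:
    """Called when the local VAD detects silence (illustrative only)."""
    if vad_force_turn_endpoint:
        # Pipecat mode (default): the local VAD + Smart Turn decide the turn
        # is over and tell AssemblyAI to finalize immediately.
        send_force_endpoint()
        return "pipecat-mode: ForceEndpoint sent"
    # STT mode: do nothing here; AssemblyAI's built-in turn detection
    # decides when the turn ends.
    return "stt-mode: deferring to AssemblyAI"

sent = []
result = on_vad_silence(True, lambda: sent.append("ForceEndpoint"))
```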
Installation
Install Pipecat with all required dependencies:
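For example, using the extras named in this guide (the exact package spec may vary with your environment):

```shell
pip install "pipecat-ai[assemblyai,openai,cartesia]"
```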
What’s included:
- `assemblyai` — AssemblyAI U3-Pro STT service
- `openai` — OpenAI LLM service (used in the examples)
- `cartesia` — Cartesia TTS service (used in the examples)
The examples use OpenAI and Cartesia, but you can use any LLM or TTS you want that’s supported by Pipecat. Just swap out the extras in the install command (e.g., pipecat-ai[assemblyai,anthropic,elevenlabs]).
Authentication
Set your API keys in a .env file:
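A minimal `.env` for the examples might look like the following; the variable names match the services used in the examples, and the values are placeholders:

```
ASSEMBLYAI_API_KEY=your-assemblyai-key
OPENAI_API_KEY=your-openai-key
CARTESIA_API_KEY=your-cartesia-key
```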
You can obtain an AssemblyAI API key by signing up for an AssemblyAI account.
Two-mode turn detection
Within Pipecat, you have two distinct approaches to turn detection with AssemblyAI’s U3-Pro model.
Pipecat mode (default, recommended)
When to use: Most voice agent applications requiring responsive interruptions.
How it works:
- VAD + Smart Turn analyzer controls when the user is done speaking
- `ForceEndpoint` message sent to AssemblyAI on VAD silence detection
- `max_turn_silence` automatically synchronized with `min_turn_silence`
- Best for low-latency, responsive voice agents
AssemblyAI’s built-in turn detection (STT mode)
When to use: When you want AssemblyAI’s built-in turn detection to control turn endings. This mode is configurable within the connection parameters — see Configuring turn detection to understand how it works.
How it works:
- AssemblyAI’s built-in turn detection controls when the user is done speaking
- All timing parameters are respected as configured
- Emits `UserStartedSpeakingFrame`/`UserStoppedSpeakingFrame`
- Uses `SpeechStarted` events for fast barge-in
- Only available with `u3-rt-pro` (other models require Pipecat mode)
AssemblyAI’s built-in turn detection uses the STT model’s understanding of speech patterns to determine turn boundaries, rather than relying on local VAD silence detection.
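The timing parameters described in the parameters reference below (`min_turn_silence`, `max_turn_silence`) bound this decision. The following is an illustrative sketch of how such thresholds can interact, not AssemblyAI's actual algorithm:

```python
def turn_ended(silence_ms: int, confident: bool,
               min_turn_silence: int, max_turn_silence: int) -> bool:
    """Illustrative sketch: silence thresholds bounding a turn-end decision."""
    if confident:
        # The model believes the utterance is complete: end after the
        # shorter threshold for low latency.
        return silence_ms >= min_turn_silence
    # Otherwise wait for the hard silence cap before forcing a turn end.
    return silence_ms >= max_turn_silence

ended_fast = turn_ended(150, confident=True, min_turn_silence=100, max_turn_silence=2000)
ended_slow = turn_ended(150, confident=False, min_turn_silence=100, max_turn_silence=2000)
```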
Keyterms boosting
Improve recognition of specific words or names:
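As a sketch, keyterms are passed as a list of strings. The `keyterms_prompt` name comes from the parameters reference below; the mutual exclusion with `prompt` is modeled here in plain Python for illustration, not via Pipecat's actual API:

```python
def build_stt_config(keyterms=None, prompt=None) -> dict:
    """Illustrative helper: keyterms_prompt and prompt are mutually exclusive."""
    if keyterms and prompt:
        raise ValueError("keyterms_prompt cannot be used together with prompt")
    config = {}
    if keyterms:
        config["keyterms_prompt"] = keyterms  # e.g., product and people names
    if prompt:
        config["prompt"] = prompt
    return config

config = build_stt_config(keyterms=["Pipecat", "AssemblyAI", "Cartesia"])
```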
Dynamic parameter updates
Change configuration mid-conversation without reconnection. See 55d-update-settings-assemblyai-stt.py for a complete working example.
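Conceptually, an update carries only the changed keys and is merged into the active settings while the connection stays open. A plain-dict sketch of that merge, not Pipecat's actual frame API:

```python
# Active STT settings at some point mid-conversation (illustrative values).
active = {"min_turn_silence": 100, "keyterms_prompt": []}

# An update carries only the keys that change; no reconnection is needed.
update = {"keyterms_prompt": ["Universal-3", "Pipecat"]}
active.update(update)
```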
Speaker diarization
Identify different speakers in multi-party conversations.
Basic diarization
Speaker labels (e.g., "A", "B", "C") are included in final transcripts and logged.
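For illustration, diarized final transcripts can be grouped by speaker label ("A", "B", ... as described above). The data shape here is hypothetical, not the service's actual output format:

```python
finals = [
    {"speaker": "A", "text": "What's my balance?"},
    {"speaker": "B", "text": "One moment while I check."},
    {"speaker": "A", "text": "Thanks."},
]

# Group each speaker's turns for downstream use.
by_speaker = {}
for turn in finals:
    by_speaker.setdefault(turn["speaker"], []).append(turn["text"])
```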
With custom formatting
Format transcripts with speaker labels for LLM context:
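Using the template form shown in the parameters reference (`"[{speaker}] {text}"`), formatting is plain `str.format` substitution:

```python
def format_turn(template: str, speaker: str, text: str) -> str:
    """Render one final transcript with its speaker label for LLM context."""
    return template.format(speaker=speaker, text=text)

line = format_turn("[{speaker}] {text}", "A", "Hi, how can I help you today?")
# → "[A] Hi, how can I help you today?"
```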
Format options:
Daily transport
For production deployments, use the Daily transport for WebRTC-based real-time audio/video communication.
Parameters reference
U3-Pro specific parameters
The speech model to use. Defaults to "u3-rt-pro" (Universal-3 Pro).
`min_turn_silence`: Milliseconds of silence before ending a turn when the model is confident. Set to 100 for best latency. (Formerly `min_end_of_turn_silence_when_confident`, which is deprecated but still supported with a warning.)
`max_turn_silence`: Maximum silence in milliseconds before a turn end is forced. Auto-synced in Pipecat mode; respected as configured in AssemblyAI’s built-in turn detection (STT mode).
`keyterms_prompt`: List of terms to boost recognition for. Cannot be used with `prompt`.
Enable speaker diarization.
`prompt`: Custom transcription instructions. Cannot be used with `keyterms_prompt`. Prompting is currently a beta feature — see Prompting for more information.
General parameters
Your AssemblyAI API key.
`vad_force_turn_endpoint`: `True` for Pipecat mode; `False` for AssemblyAI’s built-in turn detection (STT mode).
Template string for formatting speaker labels (e.g., `"[{speaker}] {text}"`).
Running your agent
Development mode (local audio)
Speak into your microphone after hearing the greeting.
Production with Daily
Deploy to Daily.co rooms using the Daily transport. Your agent joins as a participant and handles audio I/O through Daily’s infrastructure.
Speech model comparison
Interested in using a different model?
Legend:
- ✅ Fully supported and recommended
- ❌ Not supported / Not used
u3-rt-pro is the recommended model for all new voice agent implementations. The universal-streaming models are maintained for backward compatibility but lack the optimizations and features specifically designed for real-time conversational AI.
The end_of_turn_confidence_threshold parameter is not used with u3-rt-pro (it won’t affect behavior). For universal-streaming models, Pipecat automatically sets it to 1.0 in Pipecat mode to disable semantic turn detection and ensure fast responses. You don’t need to configure this parameter manually.