Best Practices for Building Voice Agents
Introduction
AssemblyAI’s Universal-Streaming model represents a breakthrough in speech-to-text technology specifically designed for conversational AI in voice agents. Unlike traditional streaming models that force developers to choose between speed and reliability, Universal-Streaming delivers immutable transcripts in ~300ms with industry-leading accuracy.
What is Universal-Streaming?
Universal-Streaming is AssemblyAI’s speech-to-text model purpose-built for real-time conversational AI. It’s the first streaming model to deliver:
- Immutable transcripts that never change once received (no retroactive edits)
- ~300ms latency for natural conversation flow
- Intelligent turn detection combining acoustic and semantic analysis
- Industry-leading accuracy trained specifically for conversational speech patterns
- Unlimited concurrent streams with transparent, usage-based pricing
Rather than forcing you to choose between speed and reliability, Universal-Streaming delivers both, enabling natural conversations without awkward pauses or mid-sentence interruptions.
Key innovation: While other models send partial transcripts that constantly change (causing downstream processing issues), Universal-Streaming’s immutable transcripts arrive as “finals” from the start. This enables pre-emptive LLM processing while users are still speaking, dramatically reducing response latency.
Why AssemblyAI for Voice Agents?
Voice agents require critical capabilities that traditional speech-to-text solutions struggle to provide:
Speed without sacrificing accuracy
- Immutable transcripts in ~300ms enable instant LLM processing
- No waiting for “final” transcripts that never arrive
- Pre-emptive generation while users are still speaking
Natural conversation flow
- Intelligent turn detection understands context, not just silence
- No more awkward long pauses or mid-sentence interruptions
- Configurable for your specific use case
Production-ready at scale
- Unlimited concurrent streams from day one
- No capacity planning or overage fees
- Pre-built integrations with LiveKit, Pipecat, Vapi
Transparent pricing
- $0.15/hour based on session duration
- Optional keyterms prompting: +$0.04/hour
- No hidden costs or surprise bills
Universal-Streaming addresses the fundamental challenges of voice agents: delivering both speed and reliability while maintaining natural conversation flow and transparent costs.
What Languages and Features Does Universal-Streaming Support?
Language Support
Universal-Streaming supports two modes:
English-only mode (default)
- Full feature support including keyterms prompting
- Optimized for English conversations
- Best performance and lowest latency
Multilingual mode (beta)
- Supports: English, Spanish, French, German, Italian, Portuguese
- Automatic language detection and code-switching
- Maintains context across language changes
- Note: Keyterms prompting not currently supported
To enable multilingual mode, set `"language": "multi"` in your connection parameters.
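For example, a minimal sketch of the connection parameters (names per this guide; verify against the current API reference):

```python
# Illustrative connection parameters for multilingual mode (beta)
params = {
    "sample_rate": 16000,
    "language": "multi",  # automatic language detection + code-switching
}
```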
Available Features
Core Streaming Features:
- Turn-based immutable transcripts (no retroactive edits)
- Real-time partial transcripts during speech
- Word-level timestamps and confidence scores
- Configurable endpointing (semantic + acoustic detection)
- Force endpoint capability for manual turn control
- Built-in VAD (Voice Activity Detection)
Accuracy Enhancements:
- Keyterms Prompting (English-only): Up to 100 custom terms per session
  - 21% better accuracy on domain-specific terminology
  - Word-level and turn-level boosting
  - Cost: +$0.04/hour
Audio Processing:
- PCM16 and Mu-law encoding support
- Configurable sample rates (16kHz recommended)
- Single-channel audio
- Automatic noise handling
- Background noise robustness
Text Processing:
- Optional text formatting (punctuation, capitalization, ITN)
  - Not recommended for voice agents - adds ~200ms latency with no LLM benefit
  - Useful for displaying transcripts to end users
Important Limitations:
- Timestamps have wide variance in accuracy - use for relative timing only
- 50ms minimum audio chunk size, 1000ms maximum
Coming Soon (Public Roadmap)
- Multi-region support: EU deployment for lower latency in Europe
- Self-hosted deployment: Docker containers for on-premise use
- Enhanced audio handling: Improved performance with background noise and low-quality audio
- Speaker diarization: Real-time speaker identification (limited utility for most voice agents)
Check our public roadmap for current development status.
How Can I Get Started with Universal-Streaming?
Basic Setup and Terminal Logging
Below is a script that connects to Universal-Streaming and logs all JSON responses to your terminal.
This script will:
- Connect to Universal-Streaming WebSocket API
- Capture audio from your microphone
- Log every JSON message received (partial and final transcripts)
- Highlight key fields like `utterance`, `end_of_turn`, and `transcript`
- Show when utterances are complete and ready for LLM processing
- Show when turns are predicted to have ended so your voice agent can respond
Installation Requirements
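The sketch below assumes Python with the `websocket-client` and `pyaudio` packages:

```bash
pip install websocket-client pyaudio
```

And here is a minimal sketch of the script described above, assuming the v3 WebSocket endpoint (`wss://streaming.assemblyai.com/v3/ws`) and the message fields named in this guide - treat it as a starting point and verify details against the current API reference:

```python
import json
import threading

import pyaudio
import websocket  # pip install websocket-client

API_KEY = "<YOUR_API_KEY>"
SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 800  # 50ms of 16kHz PCM16 audio: the minimum chunk size

URL = f"wss://streaming.assemblyai.com/v3/ws?sample_rate={SAMPLE_RATE}"


def on_open(ws):
    """Start a background thread that streams microphone audio."""
    def stream_audio():
        mic = pyaudio.PyAudio().open(
            format=pyaudio.paInt16,  # PCM16, single channel
            channels=1,
            rate=SAMPLE_RATE,
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
        )
        while ws.keep_running:
            # Raw PCM16 bytes go out as binary WebSocket frames
            ws.send(mic.read(FRAMES_PER_BUFFER), websocket.ABNF.OPCODE_BINARY)

    threading.Thread(target=stream_audio, daemon=True).start()


def on_message(ws, message):
    """Log every JSON message and highlight the fields voice agents need."""
    msg = json.loads(message)
    print(json.dumps(msg, indent=2))

    if msg.get("utterance"):
        # Utterance finalized: safe to start pre-emptive LLM processing
        print(f">>> UTTERANCE COMPLETE: {msg['utterance']}")
    if msg.get("end_of_turn"):
        # Turn predicted to have ended: your voice agent can respond now
        print(f">>> END OF TURN: {msg.get('transcript')}")


ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": API_KEY},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()
```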
How Do I Build a Voice Agent with AssemblyAI?
AssemblyAI provides speech-to-text only today - you’ll need additional providers for a complete voice agent:
Complete Voice Agent Stack
- Speech-to-Text (STT): AssemblyAI Universal-Streaming
- Large Language Model (LLM): OpenAI, Anthropic, Gemini, Cerebras, etc.
- Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
- Orchestration: LiveKit, Pipecat, Vapi, or custom build
Pre-Built Integrations
LiveKit Agents (Recommended Quick Start)
LiveKit provides the fastest path to a working voice agent with AssemblyAI:
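A minimal sketch, assuming the `livekit-agents` framework with its `assemblyai`, `openai`, and `cartesia` plugins (swap in your own LLM/TTS providers; check LiveKit's docs for the current interface):

```python
from livekit.agents import AgentSession
from livekit.plugins import assemblyai, cartesia, openai

session = AgentSession(
    stt=assemblyai.STT(),  # Universal-Streaming with voice-agent defaults
    llm=openai.LLM(),      # any supported LLM plugin works here
    tts=cartesia.TTS(),    # any supported TTS plugin works here
)
```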
Pipecat by Daily
Pipecat is an open-source framework for conversational AI that allows for maximum customizability in your voice agent; see the full example linked under Integration Resources below.
Vapi
Vapi is a developer platform that handles voice agent backend infrastructure:
- Go to Assistants tab in Vapi dashboard
- Select your assistant → Transcriber tab
- Choose “Assembly AI” as provider
- Toggle on “Universal Streaming API”
- Disable “Format Turns” for best latency
See our Vapi integration guide for detailed setup.
Integration Resources and Full Examples
- Building a Voice Agent with LiveKit and AssemblyAI
- Building a Voice Agent with Pipecat and AssemblyAI
How Do I Optimize for Latency with Universal-Streaming?
Fastest Latency Configuration
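A sketch of low-latency connection parameters assembled from the recommendations below (the values are illustrative starting points, not verified defaults):

```python
low_latency_params = {
    "sample_rate": 16000,
    "format_turns": False,                          # skip formatting (~200ms saved)
    "end_of_turn_confidence_threshold": 0.4,        # end turns on lower confidence
    "min_end_of_turn_silence_when_confident": 160,  # ms of trailing silence
    "max_turn_silence": 1280,                       # ms, acoustic fallback
}
```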
Latency Optimization Best Practices
1. Never use Text Formatting for Voice Agents
Why? LLMs don’t need formatting - raw text works perfectly. Formatting adds ~200ms latency with zero benefit for voice agents. The LLM processes `"hello world"` exactly the same as `"Hello, world!"`.
When NOT to disable formatting:
- Displaying transcripts to end users (captions, transcription apps)
- Recording/archiving conversations for human review
- Any scenario where humans read the transcript directly
2. Grab the `utterance` Field to Process Immutable Partials for Pre-emptive Generation
This is especially powerful when using external turn detection models. By default, LiveKit and Pipecat leverage this configuration for pre-emptive generation.
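A sketch of the handler logic, with `start_llm_draft` as a hypothetical helper:

```python
def on_message(msg):
    # A non-empty utterance is immutable: start the LLM before the turn ends
    if msg.get("utterance"):
        start_llm_draft(msg["utterance"])  # hypothetical pre-generation helper
```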
3. Use Aggressive Turn Detection
We recommend testing the end-of-turn confidence threshold for your use case; see our turn detection guide for details. For example:
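An illustrative "Aggressive" preset built from the tuning knobs explained later in this guide (values are starting points to test, not verified defaults):

```python
aggressive_params = {
    "end_of_turn_confidence_threshold": 0.4,        # trigger on lower confidence
    "min_end_of_turn_silence_when_confident": 160,  # ms
    "max_turn_silence": 400,                        # ms, acoustic fallback
}
```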
4. Optimize Audio Pipeline
- Use 16kHz sample rate (balances quality and bandwidth)
- Send smaller audio chunks (512-1024 bytes)
- Minimize buffering in your audio capture
- Keep network latency low (use same region as AssemblyAI servers when possible)
5. Leverage Built-in VAD
AssemblyAI includes VAD in the model. You can rely on this built-in VAD instead of running your own by adjusting `min_end_of_turn_silence_when_confident`, though this increases latency until the silence threshold has passed. This is most useful for custom builds where a voice agent orchestrator is not managing VAD for you.
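For example (value illustrative):

```python
params = {
    # Require more trailing silence before a confident end-of-turn fires,
    # letting the built-in VAD stand in for an external one
    "min_end_of_turn_silence_when_confident": 560,  # ms; higher = safer, slower
}
```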
Voice Agent Specific Tips
- Use `end_of_turn` with VAD: Combine the `end_of_turn` parameter with your VAD to determine whether the user is continuing to speak or you can safely interrupt
- Skip unnecessary features: Avoid `format_turns` and wait for unformatted transcripts only for the fastest response
- Monitor `utterance`: Use this to pre-emptively generate LLM responses while confirming a turn has ended
- Monitor `end_of_turn_confidence`: Use this to fine-tune your interruption logic
- Configure for your use case: Use Aggressive, Balanced, or Conservative presets based on your needs
Latency vs. Accuracy Trade-offs
In general, more aggressive endpointing lowers response latency but raises the risk of ending a turn while the user is mid-thought; more conservative settings wait longer for higher-confidence turn ends. Test the presets against real traffic to find the right balance for your use case.
How Can I Use Turn Detection in My Voice Agent?
Understanding Turn Detection
Universal-Streaming combines two methods to detect turns:
Semantic Detection (Primary)
- Neural network analyzes meaning and context
- Triggers when `end_of_turn_confidence` > `end_of_turn_confidence_threshold`
- Understands natural speech patterns
- Handles “um”, pauses, and incomplete thoughts
- Minimum silence: `min_end_of_turn_silence_when_confident` (default 400ms)
Acoustic Detection (Fallback)
- Traditional silence-based VAD
- Triggers after `max_turn_silence` duration (default 1280ms)
- Ensures reliability for edge cases
- Works even when semantic model has low confidence
When either method detects an end-of-turn, the model returns `end_of_turn=True` in the response.
Configuration Examples
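Illustrative Balanced and Conservative presets to complement the Aggressive example shown earlier (values are starting points to test, not verified defaults):

```python
balanced_params = {
    "end_of_turn_confidence_threshold": 0.7,
    "min_end_of_turn_silence_when_confident": 400,  # ms (default)
    "max_turn_silence": 1280,                       # ms (default)
}

conservative_params = {
    "end_of_turn_confidence_threshold": 0.85,
    "min_end_of_turn_silence_when_confident": 800,  # ms
    "max_turn_silence": 3000,                       # ms
}
```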
Parameter Explanations:
- `end_of_turn_confidence_threshold`: Raise to wait for higher confidence before ending a turn; lower for faster responses
- `min_end_of_turn_silence_when_confident`: Amount of silence required after the model is confident the turn has ended
- `max_turn_silence`: Maximum silence before forcing turn end (acoustic fallback)
Advanced Turn Control
Disable Model-Based Detection
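One way to sketch this is to make the semantic trigger unreachable so that only the acoustic fallback ends turns (check the API reference for a dedicated switch):

```python
acoustic_only_params = {
    "end_of_turn_confidence_threshold": 1.0,  # semantic trigger never fires
    "max_turn_silence": 1280,                 # ms; silence alone ends the turn
}
```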
Force Manual Turn Ending
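A sketch of ending the current turn on demand, e.g. after a push-to-talk release; the message type follows the force-endpoint capability listed earlier, but verify its exact name in the API reference:

```python
import json

ws.send(json.dumps({"type": "ForceEndpoint"}))  # manually close the turn
```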
Monitoring Turn Detection
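A sketch for logging the signals that drive turn detection so you can tune thresholds against real traffic:

```python
def on_message(msg):
    if msg.get("type") == "Turn":
        print(
            f"confidence={msg.get('end_of_turn_confidence', 0.0):.2f} "
            f"end_of_turn={msg.get('end_of_turn')} "
            f"transcript={msg.get('transcript')!r}"
        )
```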
How Can I Use Immutable Partials for Pre-emptive Generation?
Understanding the Utterance Field
The `utterance` field in Universal-Streaming provides the complete finalized text when an utterance ends, even if the turn hasn’t ended yet. This is crucial for pre-emptive LLM generation:
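Consider this illustrative message (fields per this guide; real payloads carry additional fields):

```json
{
  "type": "Turn",
  "transcript": "hi my name is sonny",
  "utterance": "hi my name is sonny",
  "end_of_turn": false,
  "end_of_turn_confidence": 0.42
}
```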
Key Insight: The utterance appeared even though `end_of_turn=false`. This means you can start processing the LLM response before the turn officially ends, saving 200-500ms of latency.
Pre-emptive Generation Strategy
When you receive a non-empty `utterance` field, you can immediately start LLM processing:
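A sketch of the strategy with asyncio; `llm_complete` and `speak` are hypothetical helpers:

```python
import asyncio

draft_task = None

async def handle(msg):
    global draft_task
    if msg.get("utterance"):
        # The user may keep talking; any earlier draft is now stale
        if draft_task:
            draft_task.cancel()
        draft_task = asyncio.create_task(llm_complete(msg["utterance"]))
    if msg.get("end_of_turn") and draft_task:
        # Turn confirmed: the response is already (mostly) computed
        await speak(await draft_task)
        draft_task = None
```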
When Utterance Field Appears
The `utterance` field appears in these scenarios (illustrative messages below):
- End of utterance, NOT end of turn (most useful for pre-generation)
- End of utterance AND end of turn (standard completion)
- During partial transcripts (`utterance` is empty, not useful)
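Illustrative messages for each scenario (shapes per this guide):

```jsonc
// 1. End of utterance, NOT end of turn - start pre-generation now
{ "utterance": "hi my name is sonny", "end_of_turn": false }

// 2. End of utterance AND end of turn - standard completion
{ "utterance": "i am a voice agent", "end_of_turn": true }

// 3. Partial transcript - utterance is empty, keep waiting
{ "transcript": "i am a", "utterance": "", "end_of_turn": false }
```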
Benefits of Using Utterance Field
- Maximum Speed: Start LLM processing before turn ends
- Natural Conversations: Handle pauses without waiting for silence
- Reduced Latency: Save 200-500ms by pre-computing responses
- Better UX: Agent responds instantly when user actually stops
Real-World Example
Voice agent platforms like LiveKit and Pipecat automatically use the utterance field for pre-emptive generation with optimal settings configured. If you’re using these platforms, this logic is already implemented for you.
How Do I Process Messages from the Universal-Streaming API?
Message Sequence Flow
Universal-Streaming sends messages in a specific sequence as users speak. Here’s a complete example of someone saying “Hi my name is Sonny. I am a voice agent.”
1. Session Initialization
2. Partial Transcripts (during “Hi my name is Sonny”)
Words progressively finalize:
3. End of Utterance (Key moment for pre-emptive generation!)
4. Formatted Final (if `format_turns=true`)
5. New Turn Continues (“I am a voice agent”)
Sometimes the turn ends but the user keeps speaking:
6. Utterance Completes BUT Turn Doesn’t End (Pre-emptive opportunity!)
7. Turn Finally Ends
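A condensed, illustrative version of that sequence (field names per this guide; real payloads include more fields, such as word-level details):

```jsonc
// 1. Session initialization
{ "type": "Begin", "id": "<session-id>", "expires_at": 1730000000 }

// 2. Partials while "Hi my name is Sonny" is spoken
{ "type": "Turn", "transcript": "hi my", "utterance": "", "end_of_turn": false }

// 3. End of utterance - pre-emptive generation can start
{ "type": "Turn", "transcript": "hi my name is sonny", "utterance": "hi my name is sonny", "end_of_turn": false }

// 4. Formatted final (only if format_turns=true)
{ "type": "Turn", "transcript": "Hi, my name is Sonny.", "turn_is_formatted": true, "end_of_turn": true }

// 5-6. New turn; the utterance completes but the turn stays open
{ "type": "Turn", "transcript": "i am a voice agent", "utterance": "i am a voice agent", "end_of_turn": false }

// 7. Turn finally ends
{ "type": "Turn", "transcript": "i am a voice agent", "end_of_turn": true }
```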
Processing Strategy
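A sketch of a handler implementing this strategy; `start_llm_draft` and `respond_with_draft` are hypothetical helpers:

```python
def process(msg):
    if msg.get("type") != "Turn":
        return  # Begin / Termination messages need no transcript handling
    if msg.get("turn_is_formatted"):
        return  # formatted finals add latency; voice agents skip them
    if msg.get("utterance"):
        start_llm_draft(msg["utterance"])  # pre-emptive generation
    if msg.get("end_of_turn"):
        respond_with_draft()  # speak the queued reply
```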
Key Fields to Watch
- `utterance`: finalized text, the trigger for pre-emptive generation
- `end_of_turn`: the model predicts the user has finished speaking
- `end_of_turn_confidence`: signal for tuning interruption logic
- `transcript`: the running text for the current turn
Best Practices Summary
Voice Agents - Speed Above All:
- Grab the `utterance` parameter immediately for the fastest response
- Never use `format_turns` (adds latency with no LLM benefit)
- Use `end_of_turn` + VAD to determine safe interruption points
- Configure Aggressive/Balanced/Conservative presets for your use case
- Process unformatted transcripts only
If you’re using LiveKit or Pipecat, this message processing logic is already optimized and configured for you automatically. You don’t need to implement it yourself.
How Can I Improve the Accuracy of My Transcription?
Using Keyterms Prompting
Keyterms Prompting boosts accuracy for domain-specific vocabulary—product names, people, technical terms:
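A sketch of a keyterms configuration (parameter name per AssemblyAI's keyterms prompting feature; verify in the API reference):

```python
params = {
    "keyterms_prompt": [
        "AssemblyAI", "Universal-Streaming", "LiveKit", "Pipecat",
    ],
}
```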
Keyterms Best Practices
DO Include:
- Proper names and people (“Dr. Sarah Chen”, “Keanu Reeves”)
- Product names (“MacBook Pro M3”, “iPhone 15 Pro”, “Baconator”)
- Technical terminology (“Kubernetes”, “PostgreSQL”, “OAuth”, “React Native”)
- Company-specific jargon (“Q3 roadmap”, “OKRs”, “CSAT score”)
- Menu items or SKUs (“Grande Latte”, “SKU-12345”)
- Domain-specific acronyms (“HIPAA”, “SOC 2”, “GDPR”)
- Brand names (“AssemblyAI”, “LiveKit”, “Pipecat”)
- Up to 50 characters per term
DON’T Include:
- Common English words (“hello”, “information”, “important”)
- Single letters or very short terms (“a”, “it”, “is”)
- More than 100 terms total
- Generic phrases (“thank you”, “have a nice day”)
- Punctuation or special characters
Performance Impact
With Keyterms Prompting enabled:
- 21% better accuracy on domain-specific terms
- No impact on streaming latency
- Additional cost: $0.04/hour
- Two-stage boosting: word-level during streaming + turn-level after completion
How Keyterms Works
Word-level boosting (during streaming)
- Model biased during inference to recognize keyterms
- Happens in real-time as words are emitted
- Provides immediate accuracy improvements
Turn-level boosting (after turn ends)
- Additional post-processing pass with full context
- Only available if
format_turns=true
- For voice agents: skip this by keeping
format_turns=false
Example: Restaurant Ordering Bot
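A hypothetical keyterms list for a restaurant ordering bot:

```python
restaurant_keyterms = [
    "Baconator", "Grande Latte", "Frosty",
    "combo meal", "extra pickles", "SKU-12345",
]
```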
Example: Healthcare Application
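A hypothetical keyterms list for a healthcare application:

```python
healthcare_keyterms = [
    "HIPAA", "prior authorization", "telehealth",
    "metformin", "hypertension", "Dr. Sarah Chen",
]
```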
How Does the Unlimited Concurrency Work?
Unlimited Concurrent Streaming Sessions
Universal-Streaming provides genuine unlimited concurrent streams with:
- No hard caps on simultaneous connections
- No overage fees for spike traffic
- No pre-purchased capacity requirements
- Automatic scaling from 5 to 50,000+ streams
How It Works
Pricing Model:
- $0.15/hour based on total session duration
- Pay only for actual usage, not capacity
- Volume discounts available for scale
- Optional keyterms: +$0.04/hour
Rate Limits:
- Free users: 5 new streams per minute
- Pay-as-you-go: 100 new streams per minute
- Automatic scaling: When using 70%+ of limit, capacity increases 10% every 60 seconds
- Example: From 100 → 146 new streams/min in 5 minutes (610 total concurrent)
Scaling Behavior: Whenever you use 70% or more of your current limit, the rate limit for new sessions automatically increases by 10% every 60 seconds. Compounding from a 100 streams/min baseline, that is 100 × 1.1^4 ≈ 146 new streams per minute within 5 minutes, and the per-minute allowances over those five minutes (100 + 110 + 121 + 133 + 146) add up to 610 concurrent streams, with an unlimited ceiling as your usage grows.
These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.
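A quick worked version of that ramp, assuming sustained usage above 70% of the limit and sessions that outlive the five-minute window:

```python
limit, total = 100.0, 0
for minute in range(1, 6):
    total += int(limit)  # new streams opened this minute
    print(f"minute {minute}: {int(limit)}/min, {total} concurrent")
    limit *= 1.10        # +10% allowance for the next minute
# minute 5: 146/min and 610 concurrent (100 + 110 + 121 + 133 + 146)
```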
No Hidden Limits
What AssemblyAI Does:
- Instant scaling without configuration
- Same performance at any scale
- No degradation under load
- No surprise bills
The unlimited concurrency means you can:
- Handle viral moments without preparation
- Scale globally without regional limits
- Build platforms without usage anxiety
- Focus on your product, not capacity planning
- Handle flash sales, breaking news, or unexpected traffic spikes
Real-world example: If you suddenly need to scale from 100 to 1,000 concurrent streams for a product launch, the system automatically adjusts within minutes. No manual intervention, no pre-planning, no emergency capacity requests.