Best Practices for Building Voice Agents
Introduction
AssemblyAI’s Universal-Streaming model represents a breakthrough in speech-to-text technology specifically designed for conversational AI in voice agents. Unlike traditional streaming models that force developers to choose between speed and reliability, Universal-Streaming delivers immutable transcripts in ~300ms with industry-leading accuracy.
What is Universal-Streaming?
Universal-Streaming is AssemblyAI’s speech-to-text model purpose-built for real-time conversational AI. It’s the first streaming model to deliver:
- Immutable transcripts that never change once received (no retroactive edits)
- ~300ms latency for natural conversation flow
- Intelligent turn detection combining acoustic and semantic analysis
- Industry-leading accuracy trained specifically for conversational speech patterns
- Unlimited concurrent streams with transparent, usage-based pricing
Rather than forcing you to choose between speed and reliability, Universal-Streaming delivers both, enabling natural conversations without awkward pauses or mid-sentence interruptions.
Key innovation: While other models send partial transcripts that constantly change (causing downstream processing issues), Universal-Streaming’s immutable transcripts arrive as “finals” from the start. This enables pre-emptive LLM processing while users are still speaking, dramatically reducing response latency.
Why AssemblyAI for Voice Agents?
Voice agents require critical capabilities that traditional speech-to-text solutions struggle to provide:
Speed without sacrificing accuracy
- Immutable transcripts in ~300ms enable instant LLM processing
- No waiting for “final” transcripts that never arrive
- Pre-emptive generation while users are still speaking
Natural conversation flow
- Intelligent turn detection understands context, not just silence
- No more awkward long pauses or mid-sentence interruptions
- Configurable for your specific use case
Production-ready at scale
- Unlimited concurrent streams from day one
- No capacity planning or overage fees
- Pre-built integrations with LiveKit, Pipecat, Vapi
Transparent pricing
- $0.15/hour based on session duration
- Optional keyterms prompting: +$0.04/hour
- No hidden costs or surprise bills
Universal-Streaming addresses the fundamental challenges of voice agents: delivering both speed and reliability while maintaining natural conversation flow and transparent costs.
What Languages and Features Does Universal-Streaming Support?
Language Support
Universal-Streaming supports two modes:
English-only mode (default)
- Full feature support including keyterms prompting
- Optimized for English conversations
- Best performance and lowest latency
Multilingual mode (beta)
- Supports: English, Spanish, French, German, Italian, Portuguese
- Automatic language detection and code-switching
- Maintains context across language changes
- Note: Keyterms prompting not currently supported
To enable multilingual mode, set `"language": "multi"` in your connection parameters.
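For example, a minimal sketch of the connection parameters (names per this guide; verify against the current API reference):

```python
# Illustrative connection parameters for multilingual mode (beta)
params = {
    "sample_rate": 16000,
    "language": "multi",  # automatic language detection + code-switching
}
```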
Available Features
Core Streaming Features:
- Turn-based immutable transcripts (no retroactive edits)
- Real-time partial transcripts during speech
- Word-level timestamps and confidence scores
- Configurable endpointing (semantic + acoustic detection)
- Force endpoint capability for manual turn control
- Built-in VAD (Voice Activity Detection)
Accuracy Enhancements:
- Keyterms Prompting (English-only): Up to 100 custom terms per session
  - 21% better accuracy on domain-specific terminology
  - Word-level and turn-level boosting
  - Cost: +$0.04/hour
Audio Processing:
- PCM16 and Mu-law encoding support
- Configurable sample rates (16kHz recommended)
- Single-channel audio
- Automatic noise handling
- Background noise robustness
Text Processing:
- Optional text formatting (punctuation, capitalization, ITN)
  - Not recommended for voice agents - adds ~200ms latency with no LLM benefit
  - Useful for displaying transcripts to end users
Important Limitations:
- Timestamps have wide variance in accuracy - use for relative timing only
- 50ms minimum audio chunk size, 1000ms maximum
Coming Soon (Public Roadmap)
- Multi-region support: EU deployment for lower latency in Europe
- Self-hosted deployment: Docker containers for on-premise use
- Enhanced audio handling: Improved performance with background noise and low-quality audio
- Speaker diarization: Real-time speaker identification (limited utility for most voice agents)
Check our public roadmap for current development status.
How Can I Get Started with Universal-Streaming?
Basic Setup and Terminal Logging
Below is a script that connects to Universal-Streaming and logs all JSON responses to your terminal.
This script will:
- Connect to Universal-Streaming WebSocket API
- Capture audio from your microphone
- Log every JSON message received (partial and final transcripts)
- Highlight key fields like `utterance`, `end_of_turn`, and `transcript`
- Show when utterances are complete and ready for LLM processing
- Show when turns are predicted to have ended so your voice agent can respond
Installation Requirements
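The sketch below assumes Python with the `websocket-client` and `pyaudio` packages:

```bash
pip install websocket-client pyaudio
```

And here is a minimal sketch of the script described above, assuming the v3 WebSocket endpoint (`wss://streaming.assemblyai.com/v3/ws`) and the message fields named in this guide - treat it as a starting point and verify details against the current API reference:

```python
import json
import threading

import pyaudio
import websocket  # pip install websocket-client

API_KEY = "<YOUR_API_KEY>"
SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 800  # 50ms of 16kHz PCM16 audio: the minimum chunk size

URL = f"wss://streaming.assemblyai.com/v3/ws?sample_rate={SAMPLE_RATE}"


def on_open(ws):
    """Start a background thread that streams microphone audio."""
    def stream_audio():
        mic = pyaudio.PyAudio().open(
            format=pyaudio.paInt16,  # PCM16, single channel
            channels=1,
            rate=SAMPLE_RATE,
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
        )
        while ws.keep_running:
            # Raw PCM16 bytes go out as binary WebSocket frames
            ws.send(mic.read(FRAMES_PER_BUFFER), websocket.ABNF.OPCODE_BINARY)

    threading.Thread(target=stream_audio, daemon=True).start()


def on_message(ws, message):
    """Log every JSON message and highlight the fields voice agents need."""
    msg = json.loads(message)
    print(json.dumps(msg, indent=2))

    if msg.get("utterance"):
        # Utterance finalized: safe to start pre-emptive LLM processing
        print(f">>> UTTERANCE COMPLETE: {msg['utterance']}")
    if msg.get("end_of_turn"):
        # Turn predicted to have ended: your voice agent can respond now
        print(f">>> END OF TURN: {msg.get('transcript')}")


ws = websocket.WebSocketApp(
    URL,
    header={"Authorization": API_KEY},
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()
```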
How Do I Build a Voice Agent with AssemblyAI?
AssemblyAI provides speech-to-text only today - you’ll need additional providers for a complete voice agent:
Complete Voice Agent Stack
- Speech-to-Text (STT): AssemblyAI Universal-Streaming
- Large Language Model (LLM): OpenAI, Anthropic, Gemini, Cerebras, etc.
- Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
- Orchestration: LiveKit, Pipecat, Vapi, or custom build
Pre-Built Integrations
LiveKit Agents (Recommended Quick Start)
LiveKit provides the fastest path to a working voice agent with AssemblyAI:
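A minimal sketch, assuming the `livekit-agents` framework with its `assemblyai`, `openai`, and `cartesia` plugins (swap in your own LLM/TTS providers; check LiveKit's docs for the current interface):

```python
from livekit.agents import AgentSession
from livekit.plugins import assemblyai, cartesia, openai

session = AgentSession(
    stt=assemblyai.STT(),  # Universal-Streaming with voice-agent defaults
    llm=openai.LLM(),      # any supported LLM plugin works here
    tts=cartesia.TTS(),    # any supported TTS plugin works here
)
```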
Pipecat by Daily
Pipecat is an open-source framework for conversational AI that allows for maximum customizability in your voice agent; see the full example linked under Integration Resources below.
Vapi
Vapi is a developer platform that handles voice agent backend infrastructure:
- Go to Assistants tab in Vapi dashboard
- Select your assistant → Transcriber tab
- Choose “Assembly AI” as provider
- Toggle on “Universal Streaming API”
- Disable “Format Turns” for best latency
See our Vapi integration guide for detailed setup.
Integration Resources and Full Examples
- Building a Voice Agent with LiveKit and AssemblyAI
- Building a Voice Agent with Pipecat and AssemblyAI
How Do I Optimize for Latency with Universal-Streaming?
Fastest Latency Configuration
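A sketch of low-latency connection parameters assembled from the recommendations below (the values are illustrative starting points, not verified defaults):

```python
low_latency_params = {
    "sample_rate": 16000,
    "format_turns": False,                          # skip formatting (~200ms saved)
    "end_of_turn_confidence_threshold": 0.4,        # end turns on lower confidence
    "min_end_of_turn_silence_when_confident": 160,  # ms of trailing silence
    "max_turn_silence": 1280,                       # ms, acoustic fallback
}
```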
Latency Optimization Best Practices
1. Never use Text Formatting for Voice Agents
Why? LLMs don’t need formatting - raw text works perfectly. Formatting adds ~200ms latency with zero benefit for voice agents. The LLM processes `"hello world"` exactly the same as `"Hello, world!"`.
When NOT to disable formatting:
- Displaying transcripts to end users (captions, transcription apps)
- Recording/archiving conversations for human review
- Any scenario where humans read the transcript directly
2. Grab the `utterance` Field to Process Immutable Partials for Pre-emptive Generation
This is especially powerful when using external turn detection models. By default, LiveKit and Pipecat leverage this configuration for pre-emptive generation.
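A sketch of the handler logic, with `start_llm_draft` as a hypothetical helper:

```python
def on_message(msg):
    # A non-empty utterance is immutable: start the LLM before the turn ends
    if msg.get("utterance"):
        start_llm_draft(msg["utterance"])  # hypothetical pre-generation helper
```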
3. Use Aggressive Turn Detection
We recommend testing the end-of-turn confidence threshold for your use case; see our turn detection guide for details. For example:
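An illustrative "Aggressive" preset built from the tuning knobs explained later in this guide (values are starting points to test, not verified defaults):

```python
aggressive_params = {
    "end_of_turn_confidence_threshold": 0.4,        # trigger on lower confidence
    "min_end_of_turn_silence_when_confident": 160,  # ms
    "max_turn_silence": 400,                        # ms, acoustic fallback
}
```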
4. Optimize Audio Pipeline
- Use 16kHz sample rate (balances quality and bandwidth)
- Send smaller audio chunks (512-1024 bytes)
- Minimize buffering in your audio capture
- Keep network latency low (use same region as AssemblyAI servers when possible)
5. Leverage Built-in VAD
AssemblyAI includes VAD in the model. You can rely on this built-in VAD instead of running your own by adjusting `min_end_of_turn_silence_when_confident`, though this increases latency until the silence threshold has passed. This is most useful for custom builds where a voice agent orchestrator is not managing VAD for you.
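For example (value illustrative):

```python
params = {
    # Require more trailing silence before a confident end-of-turn fires,
    # letting the built-in VAD stand in for an external one
    "min_end_of_turn_silence_when_confident": 560,  # ms; higher = safer, slower
}
```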
Voice Agent Specific Tips
- Use `end_of_turn` with VAD: Combine the `end_of_turn` parameter with your VAD to determine whether the user is continuing to speak or you can safely interrupt
- Skip unnecessary features: Avoid `format_turns` and wait for unformatted transcripts only for the fastest response
- Monitor `utterance`: Use this to pre-emptively generate LLM responses while confirming a turn has ended
- Monitor `end_of_turn_confidence`: Use this to fine-tune your interruption logic
- Configure for your use case: Use Aggressive, Balanced, or Conservative presets based on your needs
Latency vs. Accuracy Trade-offs
In general, more aggressive endpointing lowers response latency but raises the risk of ending a turn while the user is mid-thought; more conservative settings wait longer for higher-confidence turn ends. Test the presets against real traffic to find the right balance for your use case.
How Can I Use Turn Detection in My Voice Agent?
Understanding Turn Detection
Universal-Streaming combines two methods to detect turns:
Semantic Detection (Primary)
- Neural network analyzes meaning and context
- Triggers when `end_of_turn_confidence` > `end_of_turn_confidence_threshold`
- Understands natural speech patterns
- Handles “um”, pauses, and incomplete thoughts
- Minimum silence: `min_end_of_turn_silence_when_confident` (default 400ms)
Acoustic Detection (Fallback)
- Traditional silence-based VAD
- Triggers after `max_turn_silence` duration (default 1280ms)
- Ensures reliability for edge cases
- Works even when semantic model has low confidence
When either method detects an end-of-turn, the model returns `end_of_turn=True` in the response.
Configuration Examples
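Illustrative Balanced and Conservative presets to complement the Aggressive example shown earlier (values are starting points to test, not verified defaults):

```python
balanced_params = {
    "end_of_turn_confidence_threshold": 0.7,
    "min_end_of_turn_silence_when_confident": 400,  # ms (default)
    "max_turn_silence": 1280,                       # ms (default)
}

conservative_params = {
    "end_of_turn_confidence_threshold": 0.85,
    "min_end_of_turn_silence_when_confident": 800,  # ms
    "max_turn_silence": 3000,                       # ms
}
```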
Parameter Explanations:
- `end_of_turn_confidence_threshold`: Raise to wait for higher confidence before ending a turn; lower for faster responses
- `min_end_of_turn_silence_when_confident`: Amount of silence required after the model is confident the turn has ended
- `max_turn_silence`: Maximum silence before forcing turn end (acoustic fallback)
Advanced Turn Control
Disable Model-Based Detection
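One way to sketch this is to make the semantic trigger unreachable so that only the acoustic fallback ends turns (check the API reference for a dedicated switch):

```python
acoustic_only_params = {
    "end_of_turn_confidence_threshold": 1.0,  # semantic trigger never fires
    "max_turn_silence": 1280,                 # ms; silence alone ends the turn
}
```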
Force Manual Turn Ending
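A sketch of ending the current turn on demand, e.g. after a push-to-talk release; the message type follows the force-endpoint capability listed earlier, but verify its exact name in the API reference:

```python
import json

ws.send(json.dumps({"type": "ForceEndpoint"}))  # manually close the turn
```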
Monitoring Turn Detection
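A sketch for logging the signals that drive turn detection so you can tune thresholds against real traffic:

```python
def on_message(msg):
    if msg.get("type") == "Turn":
        print(
            f"confidence={msg.get('end_of_turn_confidence', 0.0):.2f} "
            f"end_of_turn={msg.get('end_of_turn')} "
            f"transcript={msg.get('transcript')!r}"
        )
```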
How Can I Use Immutable Partials for Pre-emptive Generation?
Understanding the Utterance Field
The `utterance` field in Universal-Streaming provides the complete finalized text when an utterance ends, even if the turn hasn’t ended yet. This is crucial for pre-emptive LLM generation:
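Consider this illustrative message (fields per this guide; real payloads carry additional fields):

```json
{
  "type": "Turn",
  "transcript": "hi my name is sonny",
  "utterance": "hi my name is sonny",
  "end_of_turn": false,
  "end_of_turn_confidence": 0.42
}
```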
Key Insight: The utterance appeared even though `end_of_turn=false`. This means you can start processing the LLM response before the turn officially ends, saving 200-500ms of latency.
Pre-emptive Generation Strategy
When you receive a non-empty `utterance` field, you can immediately start LLM processing:
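A sketch of the strategy with asyncio; `llm_complete` and `speak` are hypothetical helpers:

```python
import asyncio

draft_task = None

async def handle(msg):
    global draft_task
    if msg.get("utterance"):
        # The user may keep talking; any earlier draft is now stale
        if draft_task:
            draft_task.cancel()
        draft_task = asyncio.create_task(llm_complete(msg["utterance"]))
    if msg.get("end_of_turn") and draft_task:
        # Turn confirmed: the response is already (mostly) computed
        await speak(await draft_task)
        draft_task = None
```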
When Utterance Field Appears
The `utterance` field appears in these scenarios (illustrative messages below):
- End of utterance, NOT end of turn (most useful for pre-generation)
- End of utterance AND end of turn (standard completion)
- During partial transcripts (`utterance` is empty, not useful)
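Illustrative messages for each scenario (shapes per this guide):

```jsonc
// 1. End of utterance, NOT end of turn - start pre-generation now
{ "utterance": "hi my name is sonny", "end_of_turn": false }

// 2. End of utterance AND end of turn - standard completion
{ "utterance": "i am a voice agent", "end_of_turn": true }

// 3. Partial transcript - utterance is empty, keep waiting
{ "transcript": "i am a", "utterance": "", "end_of_turn": false }
```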
Benefits of Using Utterance Field
- Maximum Speed: Start LLM processing before turn ends
- Natural Conversations: Handle pauses without waiting for silence
- Reduced Latency: Save 200-500ms by pre-computing responses
- Better UX: Agent responds instantly when user actually stops
Real-World Example
Voice agent platforms like LiveKit and Pipecat automatically use the utterance field for pre-emptive generation with optimal settings configured. If you’re using these platforms, this logic is already implemented for you.
How Do I Process Messages from the Universal-Streaming API?
Message Sequence Flow
Universal-Streaming sends messages in a specific sequence as users speak. Here’s a complete example of someone saying “Hi my name is Sonny. I am a voice agent.”
1. Session Initialization
2. Partial Transcripts (during “Hi my name is Sonny”)
Words progressively finalize:
3. End of Utterance (Key moment for pre-emptive generation!)
4. Formatted Final (if `format_turns=true`)
5. New Turn Continues (“I am a voice agent”)
Sometimes the turn ends but the user keeps speaking:
6. Utterance Completes BUT Turn Doesn’t End (Pre-emptive opportunity!)
7. Turn Finally Ends
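A condensed, illustrative version of that sequence (field names per this guide; real payloads include more fields, such as word-level details):

```jsonc
// 1. Session initialization
{ "type": "Begin", "id": "<session-id>", "expires_at": 1730000000 }

// 2. Partials while "Hi my name is Sonny" is spoken
{ "type": "Turn", "transcript": "hi my", "utterance": "", "end_of_turn": false }

// 3. End of utterance - pre-emptive generation can start
{ "type": "Turn", "transcript": "hi my name is sonny", "utterance": "hi my name is sonny", "end_of_turn": false }

// 4. Formatted final (only if format_turns=true)
{ "type": "Turn", "transcript": "Hi, my name is Sonny.", "turn_is_formatted": true, "end_of_turn": true }

// 5-6. New turn; the utterance completes but the turn stays open
{ "type": "Turn", "transcript": "i am a voice agent", "utterance": "i am a voice agent", "end_of_turn": false }

// 7. Turn finally ends
{ "type": "Turn", "transcript": "i am a voice agent", "end_of_turn": true }
```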
Processing Strategy
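A sketch of a handler implementing this strategy; `start_llm_draft` and `respond_with_draft` are hypothetical helpers:

```python
def process(msg):
    if msg.get("type") != "Turn":
        return  # Begin / Termination messages need no transcript handling
    if msg.get("turn_is_formatted"):
        return  # formatted finals add latency; voice agents skip them
    if msg.get("utterance"):
        start_llm_draft(msg["utterance"])  # pre-emptive generation
    if msg.get("end_of_turn"):
        respond_with_draft()  # speak the queued reply
```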
Key Fields to Watch
- `utterance`: finalized text, the trigger for pre-emptive generation
- `end_of_turn`: the model predicts the user has finished speaking
- `end_of_turn_confidence`: signal for tuning interruption logic
- `transcript`: the running text for the current turn
Best Practices Summary
Voice Agents - Speed Above All:
- Grab the `utterance` parameter immediately for the fastest response
- Never use `format_turns` (adds latency with no LLM benefit)
- Use `end_of_turn` + VAD to determine safe interruption points
- Configure Aggressive/Balanced/Conservative presets for your use case
- Process unformatted transcripts only
If you’re using LiveKit or Pipecat, this message processing logic is already optimized and configured for you automatically. You don’t need to implement it yourself.
How Can I Improve the Accuracy of My Transcription?
Using Keyterms Prompting
Keyterms Prompting boosts accuracy for domain-specific vocabulary—product names, people, technical terms:
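A sketch of a keyterms configuration (parameter name per AssemblyAI's keyterms prompting feature; verify in the API reference):

```python
params = {
    "keyterms_prompt": [
        "AssemblyAI", "Universal-Streaming", "LiveKit", "Pipecat",
    ],
}
```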
Keyterms Best Practices
DO Include:
- Proper names and people (“Dr. Sarah Chen”, “Keanu Reeves”)
- Product names (“MacBook Pro M3”, “iPhone 15 Pro”, “Baconator”)
- Technical terminology (“Kubernetes”, “PostgreSQL”, “OAuth”, “React Native”)
- Company-specific jargon (“Q3 roadmap”, “OKRs”, “CSAT score”)
- Menu items or SKUs (“Grande Latte”, “SKU-12345”)
- Domain-specific acronyms (“HIPAA”, “SOC 2”, “GDPR”)
- Brand names (“AssemblyAI”, “LiveKit”, “Pipecat”)
- Up to 50 characters per term
DON’T Include:
- Common English words (“hello”, “information”, “important”)
- Single letters or very short terms (“a”, “it”, “is”)
- More than 100 terms total
- Generic phrases (“thank you”, “have a nice day”)
- Punctuation or special characters
Performance Impact
With Keyterms Prompting enabled:
- 21% better accuracy on domain-specific terms
- No impact on streaming latency
- Additional cost: $0.04/hour
- Two-stage boosting: word-level during streaming + turn-level after completion
How Keyterms Works
Word-level boosting (during streaming)
- Model biased during inference to recognize keyterms
- Happens in real-time as words are emitted
- Provides immediate accuracy improvements
Turn-level boosting (after turn ends)
- Additional post-processing pass with full context
- Only available if
format_turns=true
- For voice agents: skip this by keeping
format_turns=false
Example: Restaurant Ordering Bot
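A hypothetical keyterms list for a restaurant ordering bot:

```python
restaurant_keyterms = [
    "Baconator", "Grande Latte", "Frosty",
    "combo meal", "extra pickles", "SKU-12345",
]
```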
Example: Healthcare Application
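A hypothetical keyterms list for a healthcare application:

```python
healthcare_keyterms = [
    "HIPAA", "prior authorization", "telehealth",
    "metformin", "hypertension", "Dr. Sarah Chen",
]
```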
How Does the Unlimited Concurrency Work?
Unlimited Concurrent Streaming Sessions
Universal-Streaming provides genuine unlimited concurrent streams with:
- No hard caps on simultaneous connections
- No overage fees for spike traffic
- No pre-purchased capacity requirements
- Automatic scaling from 5 to 50,000+ streams
How It Works
Pricing Model:
- $0.15/hour based on total session duration
- Pay only for actual usage, not capacity
- Volume discounts available for scale
- Optional keyterms: +$0.04/hour
Rate Limits:
- Free users: 5 new streams per minute
- Pay-as-you-go: 100 new streams per minute
- Automatic scaling: When using 70%+ of limit, capacity increases 10% every 60 seconds
- Example: From 100 → 146 new streams/min in 5 minutes (610 total concurrent)
Scaling Behavior: Whenever you use 70% or more of your current limit, the rate limit for new sessions automatically increases by 10% every 60 seconds. Compounding from a 100 streams/min baseline, that is 100 × 1.1^4 ≈ 146 new streams per minute within 5 minutes, and the per-minute allowances over those five minutes (100 + 110 + 121 + 133 + 146) add up to 610 concurrent streams, with an unlimited ceiling as your usage grows.
These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.
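A quick worked version of that ramp, assuming sustained usage above 70% of the limit and sessions that outlive the five-minute window:

```python
limit, total = 100.0, 0
for minute in range(1, 6):
    total += int(limit)  # new streams opened this minute
    print(f"minute {minute}: {int(limit)}/min, {total} concurrent")
    limit *= 1.10        # +10% allowance for the next minute
# minute 5: 146/min and 610 concurrent (100 + 110 + 121 + 133 + 146)
```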
No Hidden Limits
What AssemblyAI Does:
- Instant scaling without configuration
- Same performance at any scale
- No degradation under load
- No surprise bills
The unlimited concurrency means you can:
- Handle viral moments without preparation
- Scale globally without regional limits
- Build platforms without usage anxiety
- Focus on your product, not capacity planning
- Handle flash sales, breaking news, or unexpected traffic spikes
Real-world example: If you suddenly need to scale from 100 to 1,000 concurrent streams for a product launch, the system automatically adjusts within minutes. No manual intervention, no pre-planning, no emergency capacity requests.