Best Practices for building Voice Agents

Introduction

AssemblyAI’s Universal-Streaming model is a breakthrough in speech-to-text technology, purpose-built for conversational AI voice agents. Unlike traditional streaming models that force developers to choose between speed and reliability, Universal-Streaming delivers immutable transcripts in ~300ms with industry-leading accuracy.

What is Universal-Streaming?

Universal-Streaming is AssemblyAI’s speech-to-text model purpose-built for real-time conversational AI. It’s the first streaming model to deliver:

  • Immutable transcripts that never change once received (no retroactive edits)
  • ~300ms latency for natural conversation flow
  • Intelligent turn detection combining acoustic and semantic analysis
  • Industry-leading accuracy trained specifically for conversational speech patterns
  • Unlimited concurrent streams with transparent, usage-based pricing

Unlike traditional streaming models that force you to choose between speed and reliability, Universal-Streaming provides both—enabling natural conversations without awkward pauses or mid-sentence interruptions.

Key innovation: While other models send partial transcripts that constantly change (causing downstream processing issues), Universal-Streaming’s immutable transcripts arrive as “finals” from the start. This enables pre-emptive LLM processing while users are still speaking, dramatically reducing response latency.

Why AssemblyAI for Voice Agents?

Voice agents require critical capabilities that traditional speech-to-text solutions struggle to provide:

Speed without sacrificing accuracy

  • Immutable transcripts in ~300ms enable instant LLM processing
  • No waiting for “final” transcripts that never arrive
  • Pre-emptive generation while users are still speaking

Natural conversation flow

  • Intelligent turn detection understands context, not just silence
  • No more awkward long pauses or mid-sentence interruptions
  • Configurable for your specific use case

Production-ready at scale

  • Unlimited concurrent streams from day one
  • No capacity planning or overage fees
  • Pre-built integrations with LiveKit, Pipecat, Vapi

Transparent pricing

  • $0.15/hour based on session duration
  • Optional keyterms prompting: +$0.04/hour
  • No hidden costs or surprise bills

Universal-Streaming addresses the fundamental challenges of voice agents: delivering both speed and reliability while maintaining natural conversation flow and transparent costs.

What Languages and Features Does Universal-Streaming Support?

Language Support

Universal-Streaming supports two modes:

English-only mode (default)

  • Full feature support including keyterms prompting
  • Optimized for English conversations
  • Best performance and lowest latency

Multilingual mode (beta)

  • Supports: English, Spanish, French, German, Italian, Portuguese
  • Automatic language detection and code-switching
  • Maintains context across language changes
  • Note: Keyterms prompting not currently supported

To enable multilingual mode, set "language": "multi" in connection parameters.
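
A minimal connection-parameter sketch (following the CONNECTION_PARAMS pattern used in the examples later in this guide):

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "language": "multi",   # enable multilingual mode (beta)
    "format_turns": False
    # Note: keyterms_prompt is not supported in multilingual mode
}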

Available Features

Core Streaming Features:

  • Turn-based immutable transcripts (no retroactive edits)
  • Real-time partial transcripts during speech
  • Word-level timestamps and confidence scores
  • Configurable endpointing (semantic + acoustic detection)
  • Force endpoint capability for manual turn control
  • Built-in VAD (Voice Activity Detection)

Accuracy Enhancements:

  • Keyterms Prompting (English-only): Up to 100 custom terms per session
    • 21% better accuracy on domain-specific terminology
    • Word-level and turn-level boosting
    • Cost: +$0.04/hour

Audio Processing:

  • PCM16 and Mu-law encoding support
  • Configurable sample rates (16kHz recommended)
  • Single-channel audio
  • Automatic noise handling
  • Background noise robustness

Text Processing:

  • Optional text formatting: punctuation, capitalization, and inverse text normalization (ITN)
  • Not recommended for voice agents - adds ~200ms latency with no LLM benefit
  • Useful for displaying transcripts to end users

Important Limitations:

  • Timestamps have wide variance in accuracy - use for relative timing only
  • 50ms minimum audio chunk size, 1000ms maximum (see the byte-size sketch below)
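
For PCM16 mono audio, chunk duration maps directly to byte count, so you can verify your chunks fall inside the 50-1000ms window. A quick sketch of the arithmetic, assuming the recommended 16kHz, 16-bit, single-channel audio:

SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 2     # PCM16 = 2 bytes per sample
CHANNELS = 1             # single-channel audio

def chunk_duration_ms(num_bytes):
    samples = num_bytes / (BYTES_PER_SAMPLE * CHANNELS)
    return 1000 * samples / SAMPLE_RATE

print(chunk_duration_ms(1600))   # 50.0  -> minimum allowed chunk
print(chunk_duration_ms(32000))  # 1000.0 -> maximum allowed chunk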

Coming Soon (Public Roadmap)

  • Multi-region support: EU deployment for lower latency in Europe
  • Self-hosted deployment: Docker containers for on-premise use
  • Enhanced audio handling: Improved performance with background noise and low-quality audio
  • Speaker diarization: Real-time speaker identification (limited utility for most voice agents)

Check our public roadmap for current development status.

How Can I Get Started with Universal-Streaming?

Basic Setup and Terminal Logging

Here’s a complete script that connects to Universal-Streaming and logs all JSON responses to your terminal.

This script will:

  • Connect to Universal-Streaming WebSocket API
  • Capture audio from your microphone
  • Log every JSON message received (partial and final transcripts)
  • Highlight key fields like utterance, end_of_turn, and transcript
  • Show when utterances are complete and ready for LLM processing
  • Show when turns are predicted to have ended so your voice agent can respond
import asyncio
import json
import websockets
from urllib.parse import urlencode
import pyaudio
import threading
from queue import Queue

# Configuration
API_KEY = "YOUR_API_KEY_HERE"  # Replace with your AssemblyAI API key
SAMPLE_RATE = 16000
CHUNK_SIZE = 1024

# WebSocket connection parameters
CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "format_turns": False,  # CRITICAL: Unformatted for fastest response (~200ms savings)
    "end_of_turn_confidence_threshold": 0.4,  # Balanced turn detection
    "min_end_of_turn_silence_when_confident": 160,  # ms after confident EOT
    "max_turn_silence": 1280  # Acoustic fallback threshold
}

# Build WebSocket URL
API_ENDPOINT_BASE = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

# Audio queue for thread-safe audio handling
audio_queue = Queue()

def audio_callback(in_data, frame_count, time_info, status):
    """Callback for audio stream - adds audio to queue"""
    audio_queue.put(in_data)
    return (None, pyaudio.paContinue)

async def send_audio(websocket):
    """Send audio data to WebSocket"""
    print("📤 Starting audio sender...")
    while True:
        if not audio_queue.empty():
            audio_data = audio_queue.get()
            await websocket.send(audio_data)
        else:
            await asyncio.sleep(0.01)

async def receive_transcripts(websocket):
    """Receive and log all transcripts"""
    print("📥 Starting transcript receiver...")
    while True:
        try:
            message = await websocket.recv()
            data = json.loads(message)

            # Log complete JSON response
            print("\n" + "=" * 50)
            print("📝 RECEIVED MESSAGE:")
            print(json.dumps(data, indent=2))

            # Highlight important fields
            if data.get("type") == "Turn":
                print("\n🔍 KEY FIELDS:")
                print(f"  Turn Order: {data.get('turn_order')}")
                print(f"  Transcript: '{data.get('transcript')}'")
                print(f"  End of Turn: {data.get('end_of_turn')}")
                print(f"  EOT Confidence: {data.get('end_of_turn_confidence', 0):.3f}")
                print(f"  Utterance: {data.get('utterance')}")

                # KEY INSIGHT: Utterance field enables pre-emptive generation
                if data.get("utterance"):
                    print("\n✅ UTTERANCE AVAILABLE - Ready for LLM processing!")
                    print("  💡 Start generating LLM response now, don't wait for end_of_turn")

                # KEY INSIGHT: end_of_turn signals when to respond
                if data.get("end_of_turn"):
                    print("\n🎯 END OF TURN - User finished speaking, agent can respond")

        except websockets.exceptions.ConnectionClosed:
            print("❌ Connection closed")
            break
        except Exception as e:
            print(f"❌ Error: {e}")

async def main():
    """Main function to coordinate streaming"""
    print("🚀 Universal-Streaming Terminal Logger")
    print(f"📡 Connecting to: {API_ENDPOINT_BASE}")
    print(f"🔧 Configuration: {json.dumps(CONNECTION_PARAMS, indent=2)}")
    print("-" * 50)

    # Set up audio stream
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK_SIZE,
        stream_callback=audio_callback
    )

    # Connect to WebSocket with auth header
    headers = {"Authorization": API_KEY}

    try:
        async with websockets.connect(API_ENDPOINT, extra_headers=headers) as websocket:
            print("✅ Connected to Universal-Streaming!")
            print("🎤 Start speaking... (Press Ctrl+C to stop)\n")

            # Start audio stream
            stream.start_stream()

            # Run send and receive concurrently
            await asyncio.gather(
                send_audio(websocket),
                receive_transcripts(websocket)
            )

    except KeyboardInterrupt:
        print("\n👋 Stopping...")
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
        print("✅ Cleaned up resources")

if __name__ == "__main__":
    asyncio.run(main())

Installation Requirements

$ pip install websockets pyaudio

How Do I Build a Voice Agent with AssemblyAI?

AssemblyAI provides speech-to-text only today - you’ll need additional providers for a complete voice agent:

Complete Voice Agent Stack

  1. Speech-to-Text (STT): AssemblyAI Universal-Streaming
  2. Large Language Model (LLM): OpenAI, Anthropic, Gemini, Cerebras, etc.
  3. Text-to-Speech (TTS): Rime, Cartesia, ElevenLabs, etc.
  4. Orchestration: LiveKit, Pipecat, Vapi, or custom build

Pre-Built Integrations

LiveKit Agents (Recommended Quick Start)

LiveKit provides the fastest path to a working voice agent with AssemblyAI:

# LiveKit + AssemblyAI Quick Start
from livekit import agents
from livekit.plugins import assemblyai, openai, rime

async def create_voice_agent():
    # Initialize AssemblyAI STT
    stt = assemblyai.STT(
        api_key="your_assemblyai_key",
        end_of_turn_confidence_threshold=0.4,
        min_end_of_turn_silence_when_confident=160,
        format_turns=False  # CRITICAL: Faster without formatting
    )

    # Add LLM
    llm = openai.LLM(api_key="your_openai_key")

    # Add TTS
    tts = rime.TTS(api_key="your_rime_key")

    # Create agent
    agent = agents.VoiceAssistant(
        stt=stt,
        llm=llm,
        tts=tts,
        turn_detection="stt"  # Use AssemblyAI's turn detection
    )

    return agent

Pipecat by Daily

Pipecat is an open-source framework for conversational AI that allows for maximum customizability in your voice agent:

from pipecat.services.assemblyai import AssemblyAISTTService, AssemblyAIConnectionParams
from pipecat.services.openai import OpenAILLMService
from pipecat.services.rime import RimeTTSService

# Configure services
stt = AssemblyAISTTService(
    api_key="your_key",
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.4,
        min_end_of_turn_silence_when_confident=160,  # ms after confident EOT
        format_turns=False  # CRITICAL: Faster without formatting
    ),
    vad_force_turn_endpoint=False,  # Rely on AssemblyAI's EOT, not VAD
)
llm = OpenAILLMService(api_key="your_key")
tts = RimeTTSService(api_key="your_key")

Vapi

Vapi is a developer platform that handles voice agent backend infrastructure:

  1. Go to Assistants tab in Vapi dashboard
  2. Select your assistant → Transcriber tab
  3. Choose “Assembly AI” as provider
  4. Toggle on “Universal Streaming API”
  5. Disable “Format Turns” for best latency

See our Vapi integration guide for detailed setup.

Integration Resources and Full Examples

How Do I Optimize for Latency with Universal-Streaming?

Fastest Latency Configuration

# Maximum speed configuration
latency_optimized_params = {
    "sample_rate": 16000,
    "encoding": "pcm_s16le",

    # CRITICAL latency optimizations
    "format_turns": False,  # NEVER use for voice agents - saves ~200ms

    # Aggressive turn detection for quick responses
    "end_of_turn_confidence_threshold": 0.4,  # Lower = faster detection
    "min_end_of_turn_silence_when_confident": 160,  # Minimal silence required
    "max_turn_silence": 800,  # Faster fallback

    # Audio optimization
    "chunk_size": 512  # Smaller chunks = lower latency
}

Latency Optimization Best Practices

1. Never use Text Formatting for Voice Agents

latency_optimized_params = {
    "format_turns": False  # Saves ~200ms per response
}

Why? LLMs don’t need formatting - raw text works perfectly. Formatting adds ~200ms latency with zero benefit for voice agents. The LLM processes "hello world" exactly the same as "Hello, world!".

When NOT to disable formatting:

  • Displaying transcripts to end users (captions, transcription apps)
  • Recording/archiving conversations for human review
  • Any scenario where humans read the transcript directly

2. Use the utterance Field to Process Immutable Partials for Pre-emptive Generation

if data.get("utterance"):  # Complete utterance ready
    # Start LLM immediately - don't wait for end_of_turn
    asyncio.create_task(generate_response(data["utterance"]))

This is especially powerful when using external turn detection models. By default, LiveKit and Pipecat leverage this configuration for pre-emptive generation.

3. Use Aggressive Turn Detection

# For rapid back-and-forth (customer service, quick confirmations)
latency_optimized_params = {
    "end_of_turn_confidence_threshold": 0.4,
    "min_end_of_turn_silence_when_confident": 160
}

Note that we recommend testing the end of turn confidence for your use case. You can find our guide on turn detection here.

4. Optimize Audio Pipeline

  • Use 16kHz sample rate (balances quality and bandwidth)
  • Send smaller audio chunks (512-1024 bytes)
  • Minimize buffering in your audio capture
  • Keep network latency low (use same region as AssemblyAI servers when possible)

5. Leverage Built-in VAD

AssemblyAI includes VAD in the model. You can rely on Universal-Streaming’s endpointing in place of a separate VAD by adjusting min_end_of_turn_silence_when_confident, though responses then wait until that silence threshold has passed. This approach is best suited to custom builds where a voice agent orchestrator is not managing VAD for you.
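
A minimal sketch of this substitution for a custom pipeline with no separate VAD; the silence values below are illustrative (taken from the presets later in this guide), not recommendations:

# Let Universal-Streaming's endpointing stand in for a local VAD
vad_substitute_params = {
    "end_of_turn_confidence_threshold": 0.4,
    "min_end_of_turn_silence_when_confident": 560,  # longer silence gate replaces client-side VAD
    "max_turn_silence": 1280                        # acoustic fallback still applies
}
# Respond only when a Turn message arrives with end_of_turn=True.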

Voice Agent Specific Tips

  • Use end_of_turn with VAD: Combine the end_of_turn flag with your VAD to determine whether the user is still speaking or whether you can safely interrupt (see the sketch after this list)
  • Skip unnecessary features: Leave format_turns disabled and process only unformatted transcripts for the fastest response
  • Monitor utterance: Use this to pre-emptively generate LLM responses while confirming a turn has ended
  • Monitor end_of_turn_confidence: Use this to fine-tune your interruption logic
  • Configure for your use case: Use Aggressive, Balanced, or Conservative presets based on your needs
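
A rough sketch of that interruption check. The helpers passed in (vad_active, agent_speaking, respond, interrupt) are hypothetical stand-ins for your own VAD and playback controls, not part of the AssemblyAI API, and the 0.2 confidence cutoff is only a starting point to tune:

def handle_turn_message(data, vad_active, agent_speaking, respond, interrupt):
    """Decide whether the agent may take the floor on each Turn message."""
    eot = data.get("end_of_turn", False)
    confidence = data.get("end_of_turn_confidence", 0)

    if eot and not vad_active():
        # Model and local VAD agree the user is done: safe to respond
        respond(data.get("transcript", ""))
    elif agent_speaking() and vad_active() and confidence < 0.2:
        # User kept talking while the agent speaks and the model is not
        # confident the turn ended: treat it as a barge-in and stop TTS
        interrupt()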

Latency vs. Accuracy Trade-offs

Configuration | TTFT   | End-to-End | Best For
Aggressive    | ~200ms | ~500ms     | Quick confirmations, IVR
Balanced      | ~300ms | ~800ms     | Most conversations
Conservative  | ~400ms | ~1200ms    | Complex instructions

How Can I Use Turn Detection in My Voice Agent?

Understanding Turn Detection

Universal-Streaming uses dual detection methods for turn detection:

Semantic Detection (Primary)

  • Neural network analyzes meaning and context
  • Triggers when end_of_turn_confidence > end_of_turn_confidence_threshold
  • Understands natural speech patterns
  • Handles “um”, pauses, incomplete thoughts
  • Minimum silence: min_end_of_turn_silence_when_confident (default 400ms)

Acoustic Detection (Fallback)

  • Traditional silence-based VAD
  • Triggers after max_turn_silence duration (default 1280ms)
  • Ensures reliability for edge cases
  • Works even when semantic model has low confidence

When either method detects an end-of-turn, the model returns end_of_turn=True in the response.
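
Restated as a compact sketch using the parameter names above (a simplified model of the behavior, not the actual implementation):

def turn_has_ended(eot_confidence, silence_ms, params):
    # Semantic detection: model is confident AND minimum silence observed
    semantic = (
        eot_confidence > params["end_of_turn_confidence_threshold"]
        and silence_ms >= params["min_end_of_turn_silence_when_confident"]
    )
    # Acoustic fallback: enough silence regardless of model confidence
    acoustic = silence_ms >= params["max_turn_silence"]
    return semantic or acoustic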

Configuration Examples

# Aggressive - Fast customer service
# Best for: IVR, order confirmations, yes/no questions
aggressive_params = {
    "end_of_turn_confidence_threshold": 0.3,  # Lower threshold = faster detection
    "min_end_of_turn_silence_when_confident": 160,  # Less silence needed
    "max_turn_silence": 800  # Quick acoustic fallback
}

# Balanced - Natural conversations (DEFAULT)
# Best for: Customer support, tech support, general conversations
balanced_params = {
    "end_of_turn_confidence_threshold": 0.4,
    "min_end_of_turn_silence_when_confident": 400,
    "max_turn_silence": 1280
}

# Conservative - Medical dictation, complex instructions
# Best for: Healthcare, legal, detailed technical support
conservative_params = {
    "end_of_turn_confidence_threshold": 0.5,  # Higher threshold = more patient
    "min_end_of_turn_silence_when_confident": 560,  # More silence before confirming
    "max_turn_silence": 2000  # Longer acoustic fallback
}

Parameter Explanations:

  • end_of_turn_confidence_threshold: Raise to wait for higher confidence before ending turn; lower for faster responses
  • min_end_of_turn_silence_when_confident: Amount of silence required after model is confident turn has ended
  • max_turn_silence: Maximum silence before forcing turn end (acoustic fallback)

Advanced Turn Control

Disable Model-Based Detection

# Use VAD-only mode (silence-based only)
params = {"end_of_turn_confidence_threshold": 1.0}

# Use an external turn detection model (fastest possible)
params = {"end_of_turn_confidence_threshold": 0.01}

Force Manual Turn Ending

# Send ForceEndpoint event to immediately end current turn
await websocket.send(json.dumps({
    "type": "ForceEndpoint"
}))

Monitoring Turn Detection

def analyze_turn_detection(data):
    eot_confidence = data.get("end_of_turn_confidence", 0)

    if eot_confidence > 0.7:
        print("High confidence turn end - user definitely finished")
    elif eot_confidence > 0.4:
        print("Moderate confidence - may continue speaking")
    else:
        print("Low confidence - likely mid-sentence or thinking")

How Can I Use Immutable Partials for Pre-emptive Generation?

Understanding the Utterance Field

The utterance field in Universal-Streaming provides the complete finalized text when an utterance ends, even if the turn hasn’t ended yet. This is crucial for pre-emptive LLM generation:

# Example message showing utterance field usage
{
    "turn_order": 1,
    "turn_is_formatted": false,
    "end_of_turn": false,                 # Turn NOT ended yet
    "transcript": "i am a voice",
    "end_of_turn_confidence": 0.454,
    "utterance": "I am a voice agent.",   # Complete utterance for pre-generation!
    "type": "Turn"
}

Key Insight: The utterance appeared even though end_of_turn=false. This means you can start processing the LLM response before the turn officially ends, saving 200-500ms of latency.

Pre-emptive Generation Strategy

When you receive a non-empty utterance field, you can immediately start LLM processing:

async def handle_streaming_message(data):
    utterance_text = data.get("utterance")
    transcript = data.get("transcript")
    is_final = data.get("end_of_turn")
    turn_order = data.get("turn_order")

    # Check for utterance completion (pre-emptive opportunity)
    if utterance_text:
        # FASTEST: Start LLM processing on complete utterance
        # even though turn hasn't ended
        print(f"🚀 Pre-emptive generation for: '{utterance_text}'")

        # Start async LLM call immediately
        llm_task = asyncio.create_task(
            generate_llm_response(utterance_text, turn_order)
        )

        # Cache for later use
        utterance_cache[turn_order] = {
            "utterance": utterance_text,
            "llm_task": llm_task
        }

    elif is_final:
        # Turn has ended - use pre-computed response if available
        if turn_order in utterance_cache:
            # Response already computing or ready!
            llm_response = await utterance_cache[turn_order]["llm_task"]
            print("✅ Using pre-computed response (saved 200-500ms!)")
            return llm_response
        else:
            # Fallback: generate response now
            return await generate_llm_response(transcript, turn_order)

async def generate_llm_response(text, turn_order):
    """Generate LLM response for utterance"""
    response = await llm.complete(
        prompt=f"Respond to: {text}",
        metadata={"turn_order": turn_order}
    )
    return response

When Utterance Field Appears

The utterance field appears in these scenarios:

  1. End of utterance, NOT end of turn (most useful for pre-generation):

{
    "end_of_turn": false,
    "utterance": "Hi my name is sonny",   // Complete utterance ready
    "end_of_turn_confidence": 0.454       // Not confident enough to end turn
}

  2. End of utterance AND end of turn (standard completion):

{
    "end_of_turn": true,
    "utterance": "Hi my name is sonny",
    "end_of_turn_confidence": 0.5005      // Confident enough to end turn
}

  3. During partial transcripts (empty, not useful):

{
    "end_of_turn": false,
    "utterance": "",                      // Empty during partials
    "transcript": "hi my"
}

Benefits of Using Utterance Field

  1. Maximum Speed: Start LLM processing before turn ends
  2. Natural Conversations: Handle pauses without waiting for silence
  3. Reduced Latency: Save 200-500ms by pre-computing responses
  4. Better UX: Agent responds instantly when user actually stops

Real-World Example

# Voice agent with pre-emptive generation
class FastVoiceAgent:
    def __init__(self):
        self.utterance_cache = {}

    async def process_stream(self, data):
        turn_order = data.get("turn_order")
        utterance = data.get("utterance")

        # Log for debugging
        print(f"Turn {turn_order}: utterance='{utterance}' eot={data.get('end_of_turn')}")

        if utterance and not data.get("end_of_turn"):
            # Pre-emptive path: ~300ms head start!
            print(f"⚡ Starting pre-emptive generation for turn {turn_order}")
            self.utterance_cache[turn_order] = asyncio.create_task(
                self.generate_response(utterance)
            )

        if data.get("end_of_turn"):
            # Turn ended - use cached or generate now
            if turn_order in self.utterance_cache:
                response = await self.utterance_cache[turn_order]
                print("🎯 Pre-computed response ready instantly!")
            else:
                response = await self.generate_response(data.get("transcript"))
                print("⏱️ Generated response on-demand")

            return response

Voice agent platforms like LiveKit and Pipecat automatically use the utterance field for pre-emptive generation with optimal settings configured. If you’re using these platforms, this logic is already implemented for you.

How Do I Process Messages from the Universal-Streaming API?

Message Sequence Flow

Universal-Streaming sends messages in a specific sequence as users speak. Here’s a complete example of someone saying “Hi my name is Sonny. I am a voice agent.”

1. Session Initialization

{
    "type": "Begin",
    "id": "de5d9927-73a6-4be8-b52d-b4c07be37e6b",
    "expires_at": 1759796682
}

2. Partial Transcripts (during “Hi my name is Sonny”)

{
    "type": "Turn",
    "turn_order": 0,
    "turn_is_formatted": false,
    "end_of_turn": false,
    "transcript": "",                 // Empty initially
    "utterance": "",                  // Empty during partials
    "end_of_turn_confidence": 0.017,
    "words": [
        {
            "text": "hi",
            "word_is_final": false    // Still processing
        }
    ]
}

Words progressively finalize:

{
    "transcript": "hi my name is",    // Growing transcript
    "utterance": "",                  // Still empty - not end of utterance yet
    "end_of_turn": false,
    "words": [
        {"text": "hi", "word_is_final": true},
        {"text": "my", "word_is_final": true},
        {"text": "name", "word_is_final": true},
        {"text": "is", "word_is_final": true},
        {"text": "sonny", "word_is_final": false}   // Latest word
    ]
}

3. End of Utterance (Key moment for pre-emptive generation!)

{
    "type": "Turn",
    "turn_order": 0,
    "turn_is_formatted": false,
    "end_of_turn": true,                      // Turn also ends in this case
    "transcript": "hi my name is sonny",
    "utterance": "Hi my name is sonny",       // ⚡ COMPLETE UTTERANCE!
    "end_of_turn_confidence": 0.5005,
    "words": [/* all words with word_is_final: true */]
}

4. Formatted Final (if format_turns=true)

{
    "turn_order": 0,
    "turn_is_formatted": true,
    "end_of_turn": true,
    "transcript": "Hi, my name is Sonny.",    // Formatted with punctuation
    "utterance": ""                           // Empty after formatting
}

5. New Turn Continues (“I am a voice agent”)

Sometimes the turn ends but the user keeps speaking:

{
    "turn_order": 1,                  // New turn number
    "end_of_turn": false,
    "transcript": "i am a voice",
    "utterance": ""                   // Empty during partials
}

6. Utterance Completes BUT Turn Doesn’t End (Pre-emptive opportunity!)

{
    "turn_order": 1,
    "end_of_turn": false,                  // ⚠️ Turn NOT ended
    "transcript": "i am a voice",
    "utterance": "I am a voice agent.",    // ⚡ Complete utterance ready!
    "end_of_turn_confidence": 0.454        // Not confident enough to end
}

7. Turn Finally Ends

{
    "turn_order": 1,
    "end_of_turn": true,                   // Now turn ends
    "transcript": "i am a voice agent",
    "utterance": "",                       // Empty when turn ends
    "end_of_turn_confidence": 0.751
}

Processing Strategy

class TranscriptProcessor:
    def __init__(self):
        self.current_turn = 0
        self.pre_computed_responses = {}

    async def process_message(self, data):
        msg_type = data.get("type")

        if msg_type == "Begin":
            print(f"Session started: {data.get('id')}")

        elif msg_type == "Turn":
            return await self.process_turn(data)

        elif msg_type == "Termination":
            print(f"Session ended: {data.get('message')}")

    async def process_turn(self, data):
        turn_order = data.get("turn_order")
        transcript = data.get("transcript")
        utterance = data.get("utterance")
        is_final = data.get("end_of_turn")
        eot_confidence = data.get("end_of_turn_confidence", 0)

        # Track turn changes
        if turn_order != self.current_turn:
            self.current_turn = turn_order
            print(f"\n🔄 New turn #{turn_order}")

        # Monitor progress
        print(f"  Confidence: {eot_confidence:.3f} | Transcript: '{transcript}'")

        # KEY: Check for utterance completion (pre-emptive opportunity)
        if utterance:
            print(f"  ⚡ UTTERANCE COMPLETE: '{utterance}' - Starting pre-generation!")

            # Start LLM processing immediately
            self.pre_computed_responses[turn_order] = asyncio.create_task(
                self.generate_response(utterance)
            )

        # Handle turn completion
        if is_final:
            print(f"  ✅ TURN COMPLETE: '{transcript}'")

            # Use pre-computed response if available
            if turn_order in self.pre_computed_responses:
                response = await self.pre_computed_responses[turn_order]
                print("  🎯 Using pre-computed response (instant!)")
            else:
                response = await self.generate_response(transcript)
                print("  ⏱️ Generating response now")

            return response

    async def generate_response(self, text):
        # Your LLM call
        return f"Response to: {text}"

Key Fields to Watch

Field                  | Purpose                 | When to Act
utterance              | Complete utterance text | Non-empty = start pre-generation
end_of_turn            | Turn completion flag    | true = process final response
end_of_turn_confidence | Turn ending confidence  | Monitor for debugging
word_is_final          | Word finalization       | All true = utterance ending soon
turn_order             | Turn counter            | Changes = new speaking turn

Best Practices Summary

Voice Agents - Speed Above All:

  • Grab the utterance field immediately for the fastest response
  • Never use format_turns (adds latency with no LLM benefit)
  • Use end_of_turn + VAD to determine safe interruption points
  • Configure Aggressive/Balanced/Conservative presets for your use case
  • Process unformatted transcripts only

If you’re using LiveKit or Pipecat, this message processing logic is already optimized and configured for you automatically. You don’t need to implement it yourself.

How Can I Improve the Accuracy of My Transcription?

Using Keyterms Prompting

Keyterms Prompting boosts accuracy for domain-specific vocabulary—product names, people, technical terms:

# Include keyterms in connection parameters
keyterms = [
    "AssemblyAI",
    "Universal-Streaming",
    "Baconator",
    "Dr. Rodriguez",
    "iPhone 15 Pro",
    "NASDAQ",
    "PostgreSQL"
]

CONNECTION_PARAMS = {
    "keyterms_prompt": json.dumps(keyterms)  # Up to 100 terms
}

# Build URL with keyterms
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

Keyterms Best Practices

DO Include:

  • Proper names and people (“Dr. Sarah Chen”, “Keanu Reeves”)
  • Product names (“MacBook Pro M3”, “iPhone 15 Pro”, “Baconator”)
  • Technical terminology (“Kubernetes”, “PostgreSQL”, “OAuth”, “React Native”)
  • Company-specific jargon (“Q3 roadmap”, “OKRs”, “CSAT score”)
  • Menu items or SKUs (“Grande Latte”, “SKU-12345”)
  • Domain-specific acronyms (“HIPAA”, “SOC 2”, “GDPR”)
  • Brand names (“AssemblyAI”, “LiveKit”, “Pipecat”)
  • Up to 50 characters per term

DON’T Include:

  • Common English words (“hello”, “information”, “important”)
  • Single letters or very short terms (“a”, “it”, “is”)
  • More than 100 terms total
  • Generic phrases (“thank you”, “have a nice day”)
  • Punctuation or special characters

Performance Impact

With Keyterms Prompting enabled:

  • 21% better accuracy on domain-specific terms
  • No impact on streaming latency
  • Additional cost: $0.04/hour
  • Two-stage boosting: word-level during streaming + turn-level after completion

How Keyterms Works

Word-level boosting (during streaming)

  • Model biased during inference to recognize keyterms
  • Happens in real-time as words are emitted
  • Provides immediate accuracy improvements

Turn-level boosting (after turn ends)

  • Additional post-processing pass with full context
  • Only available if format_turns=true
  • For voice agents: skip this by keeping format_turns=false

Example: Restaurant Ordering Bot

# Restaurant-specific keyterms
restaurant_terms = [
    # Menu items
    "Baconator",
    "Frosty",
    "Dave's Single",
    "Biggie Bag",
    "Spicy Chicken Sandwich",

    # Customizations
    "Extra pickles",
    "No mayo",
    "Light ice",
    "Well done",
    "On the side",

    # Sizes
    "Venti",
    "Grande",
    "Tall",
    "Large combo",

    # Brands/Products
    "Coca-Cola",
    "Dr Pepper"
]

params = {
    "keyterms_prompt": json.dumps(restaurant_terms)
}

Example: Healthcare Application

# Medical terminology
medical_terms = [
    # Medications
    "Lisinopril",
    "Metformin",
    "Atorvastatin",
    "Levothyroxine",

    # Conditions
    "Hypertension",
    "Type 2 Diabetes",
    "Hyperlipidemia",
    "Chronic Obstructive Pulmonary Disease",

    # Procedures
    "Colonoscopy",
    "Echocardiogram",
    "MRI scan",

    # Staff names
    "Dr. Rodriguez",
    "Nurse Patel",
    "Dr. Sarah Chen"
]

params = {
    "keyterms_prompt": json.dumps(medical_terms)
}

How Does the Unlimited Concurrency Work?

Unlimited Concurrent Streaming Sessions

Universal-Streaming provides genuine unlimited concurrent streams with:

  • No hard caps on simultaneous connections
  • No overage fees for spike traffic
  • No pre-purchased capacity requirements
  • Automatic scaling from 5 to 50,000+ streams

How It Works

Pricing Model:

  • $0.15/hour based on total session duration
  • Pay only for actual usage, not capacity
  • Volume discounts available for scale
  • Optional keyterms: +$0.04/hour

Rate Limits:

  • Free users: 5 new streams per minute
  • Pay-as-you-go: 100 new streams per minute
  • Automatic scaling: When using 70%+ of limit, capacity increases 10% every 60 seconds
  • Example: From 100 → 146 new streams/min in 5 minutes (610 total concurrent)

Scaling Behavior: Any time you are using 70% or more of your current limit, your new-session rate limit automatically scales up by 10% every 60 seconds. Within 5 minutes of sustained usage, that takes you from 100 to 146 new streams per minute (610 total concurrent streams), and the ceiling keeps rising as your usage grows.

These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.
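
The growth math behind that example, sketched in a few lines (assuming a 100 streams/minute baseline, sustained usage that triggers the 10% bump every 60 seconds, and sessions that stay open for the full window):

limit = 100.0            # pay-as-you-go baseline: new streams per minute
total_concurrent = 0

for minute in range(1, 6):
    total_concurrent += int(limit)   # streams opened this minute stay connected
    print(f"minute {minute}: limit ≈ {int(limit)}/min, concurrent ≈ {total_concurrent}")
    limit *= 1.10                    # +10% after each 60s of sustained 70%+ usage

# minute 5: limit ≈ 146/min, concurrent ≈ 610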

No Hidden Limits

What AssemblyAI Does:

  • Instant scaling without configuration
  • Same performance at any scale
  • No degradation under load
  • No surprise bills

The unlimited concurrency means you can:

  • Handle viral moments without preparation
  • Scale globally without regional limits
  • Build platforms without usage anxiety
  • Focus on your product, not capacity planning
  • Handle flash sales, breaking news, or unexpected traffic spikes

Real-world example: If you suddenly need to scale from 100 to 1,000 concurrent streams for a product launch, the system automatically adjusts within minutes. No manual intervention, no pre-planning, no emergency capacity requests.


Additional Resources