Building a production-ready voice agent: The developer's guide to real-time speech-to-text
Real-time speech-to-text for voice agents with sub-300ms latency, immutable transcripts, and turn detection. Learn WebSocket setup, prompting, and testing.



Building production voice agents requires specialized real-time speech-to-text that handles the unique demands of conversational AI. Unlike batch transcription that processes complete recordings, voice agents need immediate text delivery with sub-300ms latency, perfect accuracy on critical entities like email addresses, and smart turn detection that knows when users finish speaking.
This guide covers the technical requirements for production-ready voice agents, from choosing between Universal Streaming and Universal-3 Pro Streaming models to implementing WebSocket architecture and testing your complete pipeline. You'll learn how to achieve the speed, accuracy, and reliability that make voice conversations feel natural—plus practical integration patterns with popular orchestration frameworks like Pipecat and LiveKit.
How real-time speech-to-text enables natural voice conversations
Real-time speech-to-text converts spoken words into text as someone talks. This means your voice agent gets the words immediately instead of waiting for the person to finish their entire sentence.
Why does this matter? Human conversations flow fast—we expect responses within 300 milliseconds or the chat feels broken. When you call customer service and there's a long pause after you speak, that's what slow speech-to-text feels like.
Voice agents need three things that regular transcription doesn't: immutable transcripts that don't change after they're delivered, smart turn detection to know when users finish talking, and perfect accuracy on important details like email addresses.
The Universal-3 Pro Streaming model is designed specifically for voice agents. It is optimized for short audio clips under 10 seconds and the quick turn-taking that makes conversations feel natural.
Technical requirements for production voice agents
Your voice agent needs four things to work well: speed under 300ms, perfect accuracy on critical words, smart turn detection, and transcripts that never change once delivered.
You'll choose between two models. Universal Streaming gives you the fastest speed at the lowest cost for basic voice agents. Universal-3 Pro handles complex accuracy needs like domain-specific words and multilingual switching.
Why latency matters for voice agents
Humans respond to each other in about 200 milliseconds. Your voice agent's entire process—hearing audio, converting to text, understanding meaning, and responding—must happen within this window.
Here's what happens in your voice agent pipeline:
- Audio capture from microphone
- WebSocket transport to your server
- Speech-to-text processing
- Turn detection (knowing when to respond)
- LLM thinking and response generation
- Text-to-speech conversion back to audio
Universal-3 Pro Streaming influences response latency through the `min_turn_silence` parameter. For the `u3-rt-pro` model, the default is 100ms (the 400ms default applies to the older Universal Streaming model). This parameter determines the silence duration before an end-of-turn check is performed. Turn detection also uses `max_turn_silence` (default 1200ms) as a fallback that forces a turn to end after longer silence.
With the 100ms setting, you'll typically see 300–500ms total time. That's slightly slower than Universal Streaming but the accuracy gains make it worth it for most voice agents.
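To see how those pipeline stages consume the response window, it helps to sketch a latency budget. The per-stage numbers below are illustrative assumptions, not benchmarks; substitute your own measurements.

```python
# Illustrative latency budget for a voice agent pipeline (milliseconds).
# These stage values are assumptions for illustration, not measured figures.
budget_ms = {
    "audio_transport": 30,    # WebSocket hop to the server
    "stt_processing": 90,     # streaming speech-to-text
    "turn_detection": 100,    # min_turn_silence wait before end-of-turn check
    "llm_first_token": 150,   # time to first LLM token
    "tts_first_audio": 80,    # time to first synthesized audio
}

total = sum(budget_ms.values())
print(f"Estimated time to first response audio: {total}ms")  # 450ms
```

Summing per-stage budgets like this makes it obvious which stage to attack first when the 95th percentile creeps up.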
Critical token accuracy vs general word accuracy
Word Error Rate tells you how many words are right overall. But voice agents live or die on specific critical words. When someone says "john.smith@company.com" and your agent hears "johnsmith@company.calm," you've lost important information.
Critical tokens that break voice agents:
- Email addresses: Even small mistakes make them useless
- Phone numbers: Wrong digits mean you can't call back
- Product codes: Mistakes lead to wrong orders
- Short responses: "Yes," "no," "mmhmm" often get dropped completely
Universal-3 Pro Streaming specifically improves accuracy on these problem areas. It catches short affirmative words that traditional models miss and handles complex alphanumeric strings better.
Improved medical terminology recognition helps healthcare voice agents understand medication names correctly. Better handling of accented speech means your global customers get the same quality of service regardless of how they speak English.
Prompt-controlled turn detection
Turn detection decides when someone finishes talking. Fire too early and you interrupt mid-thought. Wait too long and conversations drag.
Most systems just wait for silence, but that fails when people pause to think or say "um." Universal-3 Pro uses a smarter approach—it looks at punctuation patterns in the text to understand when sentences are complete.
Universal-3 Pro Streaming includes default prompting that already handles punctuation-based turn detection intelligently. The model uses a default prompt similar to this pattern:
Rules: 1) Always include punctuation in output. 2) Use period/question mark ONLY for complete sentences. 3) Use comma for mid-sentence pauses. 4) Use no punctuation for incomplete trailing speech. 5) Filler words (um, uh, so, like) indicate speaker will continue.
This default behavior reduces interruptions by giving the model clear guidance about when sentences are actually finished. You only need to provide a custom prompt if you want to customize or extend this default behavior for your specific use case.
For Pipecat users, you can let Pipecat handle turn detection while AssemblyAI provides fast, accurate transcripts. Set `vad_force_turn_endpoint=True` to use this approach.
Why transcript immutability matters
Meeting transcription can change words after the fact; there, revisions are merely cosmetic. Voice agents can't tolerate this because transcripts trigger actions immediately.
When your voice agent hears "schedule appointment" and creates a calendar entry, but then the transcript changes to "cancel appointment," you've got a serious problem. The appointment is already scheduled but the transcript says to cancel it.
Problems with changing transcripts:
- Duplicate actions: Creating then canceling the same thing
- Data conflicts: Your database doesn't match what was said
- Complex code: You need extra logic to handle changes
Universal-3 Pro Streaming delivers immutable transcripts. Each Turn event is final when you receive it—no revision handling required.
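Because each Turn event is final on delivery, your handler can trigger actions the moment `end_of_turn` arrives, with no rollback logic. A minimal sketch of an act-once dispatcher, assuming a per-turn sequence field named `turn_order` (check the API reference for the exact event schema):

```python
# Act exactly once per finalized turn; ignore partials and repeats.
# The turn_order field is assumed here as a per-turn sequence number.
handled_turns = set()

def handle_turn(event):
    """Return the transcript to act on, or None if nothing should happen."""
    if not event.get("end_of_turn"):
        return None                      # partial: wait for the final version
    turn_id = event.get("turn_order")
    if turn_id in handled_turns:
        return None                      # already acted on this turn
    handled_turns.add(turn_id)
    return event.get("transcript", "")   # safe to hand to the LLM

print(handle_turn({"turn_order": 0, "end_of_turn": False, "transcript": "schedule an"}))
print(handle_turn({"turn_order": 0, "end_of_turn": True, "transcript": "schedule an appointment"}))
```

Since transcripts never revise, the `handled_turns` set is all the state you need; there is no "what if the words change later" branch.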
Real-time STT architecture and integration patterns
You'll implement real-time speech-to-text using WebSocket connections for streaming audio. Most production voice agents use orchestration frameworks that handle the technical details for you, but understanding the underlying architecture helps you debug issues.
WebSocket streaming implementation
WebSocket connections let you send audio chunks continuously and receive text back immediately. You need to send 16kHz, 16-bit PCM mono audio in small chunks—typically 20-50ms of audio per message.
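The chunk-size arithmetic follows directly from the audio format: frames per chunk multiplied by two bytes per 16-bit sample.

```python
SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono

def chunk_bytes(duration_ms):
    """Bytes of PCM audio in one chunk of the given duration."""
    frames = SAMPLE_RATE * duration_ms // 1000
    return frames * BYTES_PER_SAMPLE * CHANNELS

print(chunk_bytes(20))  # 640 bytes per 20ms message
print(chunk_bytes(50))  # 1600 bytes per 50ms message
```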
Here's what your implementation looks like with Universal-3 Pro Streaming:
```python
import json
import pyaudio    # used by the microphone capture thread (not shown here)
import threading  # used to run audio capture alongside the socket
import websocket
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT_BASE = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

def on_message(ws, message):
    data = json.loads(message)
    if data.get("type") == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)
        if end_of_turn:
            # Final transcript — send to LLM
            print(f"Final: {transcript}")
        else:
            # Partial — can start pre-emptive LLM generation
            print(f"Partial: {transcript}")
    elif data.get("type") == "SpeechStarted":
        # User started speaking — handle barge-in
        print("Speech detected — interrupt agent if speaking")

ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_message=on_message,
)
ws.run_forever()
```
Universal-3 Pro only sends partial results during silence periods, not after every word like Universal Streaming. This works perfectly for voice agents since you only want to act on complete thoughts anyway. For multi-speaker scenarios, you can enable streaming diarization using `speaker_labels: true` to identify different speakers in real-time.
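The example above only handles the receive side. Here is a sketch of the sending side: capturing 16kHz mono PCM from the microphone with PyAudio and streaming it in 50ms binary messages. The `ws_app` argument stands for the `websocket.WebSocketApp` from the receiving example; the loop structure is illustrative.

```python
# Stream microphone audio to the socket in 50ms chunks. Run this in its own
# thread so receiving transcripts is never blocked.
import threading

SAMPLE_RATE = 16000
CHUNK_MS = 50
FRAMES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000  # 800 frames -> 1600 bytes

def stream_microphone(ws_app):
    import pyaudio  # imported here so the sketch loads without audio hardware
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,   # 16-bit PCM
        channels=1,               # mono
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_CHUNK,
    )
    try:
        while ws_app.keep_running:
            data = stream.read(FRAMES_PER_CHUNK, exception_on_overflow=False)
            ws_app.send(data, opcode=2)  # opcode 2 = binary WebSocket frame
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()

# Start streaming once the socket is open, e.g. from an on_open callback:
# threading.Thread(target=stream_microphone, args=(ws_app,), daemon=True).start()
```

Running capture on a daemon thread keeps the `run_forever()` receive loop responsive; the thread exits naturally when `keep_running` flips to false on disconnect.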
Prompting for voice agent accuracy
Universal-3 Pro Streaming accepts a system prompt as a core parameter. This is how you customize accuracy for your specific use case without maintaining separate word lists.
Three ways to use prompting:
- Entity correction: Include your brand names, product codes, and technical terms directly in the prompt
- Format control: Tell the model how to format phone numbers, emails, and addresses
- Turn detection tuning: Use the Strict Rules pattern alongside your domain instructions
Keep your prompts focused and shorter than you'd use for batch processing. The model works on short audio clips, so long prompts compete with the actual speech for attention.
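A sketch of building a compact prompt that combines the Strict Rules turn-detection pattern with domain terms and format instructions. The connection parameter name (`prompt`) and the domain terms are assumptions for illustration; check the current API reference for the exact field name.

```python
# Build a short system prompt: turn-detection rules + domain vocabulary +
# format control. The "prompt" parameter name is an assumption.
from urllib.parse import urlencode

DOMAIN_TERMS = ["Acme Cloud", "SKU-4417", "NebulaDB"]  # hypothetical examples

prompt = (
    "Rules: 1) Always include punctuation. 2) Period/question mark ONLY for "
    "complete sentences. 3) Comma for mid-sentence pauses. 4) No punctuation "
    "for incomplete trailing speech. "
    f"Expect these terms: {', '.join(DOMAIN_TERMS)}. "
    "Format phone numbers as digits only."
)

params = urlencode({"speech_model": "u3-rt-pro", "prompt": prompt})
print(f"wss://streaming.assemblyai.com/v3/ws?{params}")
```

Keeping the whole prompt to a few sentences follows the guidance above: on short audio clips, a long prompt competes with the speech itself for attention.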
Integration with orchestration frameworks
Most production voice agents use frameworks that handle WebSocket connections, audio formatting, and state management automatically. These frameworks let you focus on your voice agent logic instead of low-level speech-to-text API details.
You can find best practices for getting started with these frameworks in the AssemblyAI documentation.
Testing and production readiness
Voice agent testing differs from regular transcription testing. You need to check speed across your entire pipeline, test accuracy with your specific vocabulary, and validate turn detection with real speech patterns.
Testing latency and accuracy
Measure your complete pipeline from audio input to final action, not just API response time. Track percentiles (50th, 75th, 95th) because the worst cases matter more than averages for user experience.
Critical areas to test:
- Long entities: Email addresses, phone numbers, account numbers with your formatting
- Short responses: "Yes," "no," "okay," "got it"—these get dropped frequently
- Domain words: Your brand names, product codes, industry terminology
- Natural speech: Include thinking pauses, trailing sentences, and filler words
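For percentile tracking, a simple nearest-rank computation over per-turn latency samples is enough to expose the tail behavior that averages hide. The sample values below are made up for illustration.

```python
# Nearest-rank percentiles over end-to-end latency samples (milliseconds).
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [210, 250, 240, 280, 300, 320, 260, 290, 700, 310]  # example data

for p in (50, 75, 95):
    print(f"p{p}: {percentile(latencies_ms, p)}ms")
```

Note how a single 700ms outlier dominates the 95th percentile while barely moving the average; that is exactly why worst-case tracking matters for conversational feel.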
Test with realistic background noise levels. When the speaker's audio is quiet, the system can pick up ambient sounds and interpret them as speech, causing false triggers.
Create test sets that mirror your actual use cases. If you're building a customer service agent, test with frustrated customers speaking quickly. If it's a healthcare agent, include medical terminology and accented speech patterns.
Monitoring and debugging in production
Track four key metrics that directly impact user experience:
Essential monitoring:
- Speed percentiles: Track 50th, 75th, and 95th percentile latency per conversation
- Entity accuracy: Monitor success rates by type (emails vs phone numbers vs product codes)
- Turn detection rates: Measure false positives (interrupting users) and false negatives (waiting too long)
- Connection health: WebSocket drops, reconnection frequency, error rates
Common debugging patterns:
- High latency at worst cases: Check your `min_turn_silence` setting and network path
- Too many interruptions: Review your prompt against the Strict Rules pattern
- Missing entities: Add missing domain terms to your system prompt
- Connection issues: Monitor WebSocket error codes and implement proper reconnection logic
Set up alerts for latency above 500ms and accuracy drops below your baseline. These indicate problems that directly hurt user experience.
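For the reconnection logic mentioned above, capped exponential backoff is the standard pattern. This sketch is generic: the `connect` callable and its `ConnectionError` failure mode are placeholders for your actual WebSocket client.

```python
# Retry a connection with capped exponential backoff.
import time

def reconnect_with_backoff(connect, max_attempts=5, base_delay=0.5, cap=8.0,
                           sleep=time.sleep):
    """Retry connect(), doubling the wait each attempt up to a cap."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # surface the failure after the final attempt
            sleep(delay)
            delay = min(delay * 2, cap)
```

Injecting `sleep` as a parameter keeps the backoff schedule testable without real waiting; in production you would also add jitter so many clients don't reconnect in lockstep.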
Final words
Building voice agents requires specialized real-time speech-to-text that handles the unique demands of conversational AI. Your implementation needs sub-300ms latency through proper `min_turn_silence` tuning, exceptional accuracy on critical tokens including short affirmative responses, prompt-controlled turn detection using explicit punctuation rules, and immutable transcripts that never change after delivery.
The choice between Universal Streaming and Universal-3 Pro Streaming depends on your specific needs—basic voice agents benefit from Universal Streaming's speed and cost efficiency, while complex applications requiring entity accuracy, domain prompting, or multilingual support need Universal-3 Pro's advanced capabilities. AssemblyAI's Voice AI platform provides both options with immutable transcripts, prompt-controlled turn detection, industry-leading entity accuracy, and auto-scaling concurrency.
Frequently asked questions
How do I decide between Universal Streaming and Universal-3 Pro Streaming for my voice agent?
Choose Universal Streaming for basic voice agents where speed and cost matter most, and your vocabulary is predictable. Pick Universal-3 Pro when you need perfect accuracy on emails, phone numbers, and short responses, when you want to use prompts for domain-specific terms, or when handling multiple languages in the same conversation.
What makes real-time speech-to-text different from regular batch transcription for voice agents?
Real-time processing delivers text as people speak with responses under 300ms, enabling natural conversation flow. Batch transcription waits for complete audio files and takes seconds—too slow for voice agents that need immediate understanding to respond naturally.
How do I use prompts to reduce interruptions in my voice agent?
Apply the Strict Rules prompt pattern that explicitly defines punctuation usage—periods only for complete sentences, commas for pauses, no punctuation for trailing speech. Include filler words like "um" and "uh" as continuation indicators. This approach reduces false interruptions compared to generic semantic prompts.
What's the best way to test speech-to-text latency for voice agents?
Measure end-to-end timing from audio input to your agent's action trigger, not just API response time. Track 95th percentile latency since worst-case delays determine user experience. Start with `min_turn_silence: 100` for lowest latency and test how it affects turn detection quality with your specific audio conditions.
Why does my voice agent keep interrupting users mid-sentence?
Your system likely relies only on silence detection instead of understanding sentence completion. Universal-3 Pro's prompt-controlled turn detection uses punctuation patterns to identify complete thoughts. Implement the Strict Rules prompting pattern and tune `min_turn_silence` to balance responsiveness with accuracy.
Which voice agent frameworks work with real-time speech-to-text APIs?
Pipecat, LiveKit, and Vapi offer native integrations that handle WebSocket connections and audio formatting automatically. Pipecat provides two modes—one where Pipecat controls turn detection and one where the speech-to-text API handles it directly. Each framework includes documentation and examples for integration.






