Insights & Use Cases
March 18, 2026

Building a production-ready voice agent: The developer's guide to real-time speech-to-text

Real-time speech-to-text for voice agents with sub-300ms latency, immutable transcripts, and turn detection. Learn WebSocket setup, prompts, and testing.


Building production voice agents requires specialized real-time speech-to-text that handles the unique demands of conversational AI. Unlike batch transcription that processes complete recordings, voice agents need immediate text delivery with sub-300ms latency, perfect accuracy on critical entities like email addresses, and smart turn detection that knows when users finish speaking.

This guide covers the technical requirements for production-ready voice agents, from choosing between Universal Streaming and Universal-3 Pro Streaming models to implementing WebSocket architecture and testing your complete pipeline. You'll learn how to achieve the speed, accuracy, and reliability that make voice conversations feel natural—plus practical integration patterns with popular orchestration frameworks like Pipecat and LiveKit.

How real-time speech-to-text enables natural voice conversations

Real-time speech-to-text converts spoken words into text as someone talks. This means your voice agent gets the words immediately instead of waiting for the person to finish their entire sentence.

Why does this matter? Human conversations flow fast—we expect responses within 300 milliseconds or the chat feels broken. When you call customer service and there's a long pause after you speak, that's what slow speech-to-text feels like.

Voice agents need three things that regular transcription doesn't: immutable transcripts that don't change after they're delivered, smart turn detection to know when users finish talking, and perfect accuracy on important details like email addresses.

| Aspect | Real-time STT | Batch STT |
|---|---|---|
| Processing timing | As you speak | After recording ends |
| Speed | Under 300ms | Several seconds |
| When transcripts change | Never | Can be revised later |
| Best for | Voice agents, live calls | Meeting notes, subtitles |

The Universal-3 Pro Streaming model is built specifically for voice agents. It processes short audio segments under 10 seconds and is optimized for the quick turn-taking that makes conversations feel natural.

Technical requirements for production voice agents

Your voice agent needs four things to work well: speed under 300ms, perfect accuracy on critical words, smart turn detection, and transcripts that never change once delivered.

You'll choose between two models. Universal Streaming gives you the fastest speed at the lowest cost for basic voice agents. Universal-3 Pro handles complex accuracy needs like domain-specific words and multilingual switching.

| What You Need | Target | Why It Matters | How to Check |
|---|---|---|---|
| Speed | Under 300ms total | Keeps conversation natural | Time from audio to action |
| Critical word accuracy | Over 95% correct | Business logic depends on it | Test with your vocabulary |
| Turn detection | Under 25% false interruptions | Prevents cutting people off | Test with natural speech |
| Stable transcripts | No changes after delivery | Prevents duplicate actions | Check for revision events |

Why latency matters for voice agents

Humans respond to each other in about 200 milliseconds. Your voice agent's entire process—hearing audio, converting to text, understanding meaning, and responding—must happen within this window.

Here's what happens in your voice agent pipeline:

  • Audio capture from microphone
  • WebSocket transport to your server
  • Speech-to-text processing
  • Turn detection (knowing when to respond)
  • LLM thinking and response generation
  • Text-to-speech conversion back to audio
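To see how these stages add up against the ~300ms expectation, it can help to sketch an explicit latency budget. The stage names and millisecond values below are illustrative assumptions, not measured figures:

```python
# Hypothetical latency budget for a voice agent pipeline.
# Stage names and values are illustrative, not benchmarks.
LATENCY_BUDGET_MS = {
    "audio_capture": 20,
    "websocket_transport": 30,
    "speech_to_text": 90,
    "turn_detection": 100,   # e.g. the min_turn_silence window
    "llm_first_token": 200,
    "tts_first_audio": 150,
}

total = sum(LATENCY_BUDGET_MS.values())
print(f"Total voice-to-voice latency: {total}ms")
for stage, ms in LATENCY_BUDGET_MS.items():
    print(f"  {stage}: {ms}ms ({ms / total:.0%})")
```

Laying the budget out this way makes it obvious which stages dominate, and where shaving milliseconds actually moves the total.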

Universal-3 Pro Streaming influences response latency through the `min_turn_silence` parameter. For the `u3-rt-pro` model, the default is 100ms (the 400ms default applies to the older Universal Streaming model). This parameter determines the silence duration before an end-of-turn check is performed. Turn detection also uses `max_turn_silence` (default 1200ms) as a fallback that forces a turn to end after longer silence.

With the 100ms setting, you'll typically see 300–500ms total time. That's slightly slower than Universal Streaming but the accuracy gains make it worth it for most voice agents.

Critical token accuracy vs general word accuracy

Word Error Rate tells you how many words are right overall. But voice agents live or die on specific critical words. When someone says "john.smith@company.com" and your agent hears "johnsmith@company.calm," you've lost important information.

Critical tokens that break voice agents:

  • Email addresses: Even small mistakes make them useless
  • Phone numbers: Wrong digits mean you can't call back
  • Product codes: Mistakes lead to wrong orders
  • Short responses: "Yes," "no," "mmhmm" often get dropped completely
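Because a single wrong character makes an entity useless, it's worth measuring critical-token accuracy with exact matching rather than word error rate. Here's a minimal sketch for emails, using a deliberately simplified regex (a production checker would cover phone numbers and product codes too):

```python
import re

# Sketch: exact-match accuracy on critical entities.
# The email pattern is simplified for illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def entity_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference entities reproduced exactly in the hypothesis."""
    ref_entities = EMAIL.findall(reference)
    hyp_entities = EMAIL.findall(hypothesis)
    if not ref_entities:
        return 1.0
    hits = sum(1 for e in ref_entities if e in hyp_entities)
    return hits / len(ref_entities)

ref = "Reach me at john.smith@company.com"
hyp = "Reach me at johnsmith@company.calm"
print(entity_accuracy(ref, hyp))  # 0.0 -- the mangled email fails exact match
```

Note that this transcript would score well on word error rate while still being useless to your business logic, which is exactly why entity-level checks matter.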

Universal-3 Pro Streaming specifically improves accuracy on these problem areas. It catches short affirmative words that traditional models miss and handles complex alphanumeric strings better.

The medical terminology improvement helps healthcare voice agents understand medication names correctly. The accented speech improvement means your global customers get the same quality service regardless of how they speak English.

Prompt-controlled turn detection

Turn detection decides when someone finishes talking. Fire too early and you interrupt mid-thought. Wait too long and conversations drag.

Most systems just wait for silence, but that fails when people pause to think or say "um." Universal-3 Pro uses a smarter approach—it looks at punctuation patterns in the text to understand when sentences are complete.

Universal-3 Pro Streaming includes default prompting that already handles punctuation-based turn detection intelligently. The model uses a default prompt similar to this pattern:

Rules: 1) Always include punctuation in output. 2) Use period/question mark ONLY for complete sentences. 3) Use comma for mid-sentence pauses. 4) Use no punctuation for incomplete trailing speech. 5) Filler words (um, uh, so, like) indicate speaker will continue.

This default behavior reduces interruptions by giving the model clear guidance about when sentences are actually finished. You only need to provide a custom prompt if you want to customize or extend this default behavior for your specific use case.

For Pipecat users, you can let Pipecat handle turn detection while AssemblyAI provides fast, accurate transcripts. Set `vad_force_turn_endpoint=True` to use this approach.

Why transcript immutability matters

Meeting transcription can change words after the fact—it's just cosmetic. Voice agents can't handle this because transcripts trigger actions immediately.

When your voice agent hears "schedule appointment" and creates a calendar entry, but then the transcript changes to "cancel appointment," you've got a serious problem. The appointment is already scheduled but the transcript says to cancel it.

Problems with changing transcripts:

  • Duplicate actions: Creating then canceling the same thing
  • Data conflicts: Your database doesn't match what was said
  • Complex code: You need extra logic to handle changes

Universal-3 Pro Streaming delivers immutable transcripts. Each Turn event is final when you receive it—no revision handling required.

Real-time STT architecture and integration patterns

You'll implement real-time speech-to-text using WebSocket connections for streaming audio. Most production voice agents use orchestration frameworks that handle the technical details for you, but understanding the underlying architecture helps you debug issues.

WebSocket streaming implementation

WebSocket connections let you send audio chunks continuously and receive text back immediately. You need to send 16kHz, 16-bit PCM mono audio in small chunks—typically 20-50ms of audio per message.
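A quick sanity check on chunk sizing: at 16kHz, 16-bit mono, every millisecond of audio is 32 bytes, so the 20-50ms chunks above work out to a few hundred to a couple thousand bytes per message:

```python
# Sanity-check message sizes for 16kHz, 16-bit PCM mono audio.
SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono

def chunk_bytes(duration_ms: int) -> int:
    """Payload size in bytes for one audio message of the given duration."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000

print(chunk_bytes(20))  # 640 bytes per 20ms message
print(chunk_bytes(50))  # 1600 bytes per 50ms message
```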

Here's what your implementation looks like with Universal-3 Pro Streaming:

import json
import pyaudio
import websocket
import threading
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"
SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 800  # 50ms of 16kHz audio per message

CONNECTION_PARAMS = {
    "sample_rate": SAMPLE_RATE,
    "speech_model": "u3-rt-pro",
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}

API_ENDPOINT_BASE = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE}?{urlencode(CONNECTION_PARAMS)}"

def on_message(ws, message):
    data = json.loads(message)

    if data.get("type") == "Turn":
        transcript = data.get("transcript", "")
        end_of_turn = data.get("end_of_turn", False)

        if end_of_turn:
            # Final transcript — send to LLM
            print(f"Final: {transcript}")
        else:
            # Partial — can start pre-emptive LLM generation
            print(f"Partial: {transcript}")

    elif data.get("type") == "SpeechStarted":
        # User started speaking — handle barge-in
        print("Speech detected — interrupt agent if speaking")

def on_open(ws):
    # Stream 16kHz, 16-bit PCM mono microphone audio in 50ms chunks
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )

    def stream_audio():
        while ws.keep_running:
            chunk = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
            ws.send(chunk, websocket.ABNF.OPCODE_BINARY)

    threading.Thread(target=stream_audio, daemon=True).start()

def on_error(ws, error):
    print(f"WebSocket error: {error}")

ws = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": API_KEY},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()


Universal-3 Pro only sends partial results during silence periods, not after every word like Universal Streaming. This works perfectly for voice agents since you only want to act on complete thoughts anyway. For multi-speaker scenarios, you can enable streaming diarization using `speaker_labels: true` to identify different speakers in real-time.

Start streaming real-time transcripts in minutes

Use the u3-rt-pro WebSocket example with your own API key. Create an account to get credentials and begin streaming audio with turn formatting configured.

Get API key

Prompting for voice agent accuracy

Universal-3 Pro Streaming accepts a system prompt as a core parameter. This is how you customize accuracy for your specific use case without maintaining separate word lists.

Three ways to use prompting:

  • Entity correction: Include your brand names, product codes, and technical terms directly in the prompt
  • Format control: Tell the model how to format phone numbers, emails, and addresses
  • Turn detection tuning: Use the Strict Rules pattern alongside your domain instructions

Keep your prompts focused and shorter than you'd use for batch processing. The model works on short audio clips, so long prompts compete with the actual speech for attention.
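As a concrete sketch, a custom prompt can extend the default punctuation rules with your own vocabulary and formatting instructions. The query-parameter name for the prompt (`prompt` below) is an assumption for illustration; check the streaming API reference for the exact connection parameter:

```python
from urllib.parse import urlencode

# Sketch: combining the Strict Rules pattern with domain vocabulary.
# The "prompt" parameter name is an assumption -- verify it against
# the streaming API reference before relying on it.
DOMAIN_PROMPT = (
    "Rules: 1) Always include punctuation in output. "
    "2) Use period/question mark ONLY for complete sentences. "
    "3) Use comma for mid-sentence pauses. "
    "4) Use no punctuation for incomplete trailing speech. "
    "5) Filler words (um, uh, so, like) indicate speaker will continue. "
    "Vocabulary: AssemblyAI, Pipecat, LiveKit, u3-rt-pro. "
    "Format phone numbers as digits only."
)

params = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "prompt": DOMAIN_PROMPT,
}
url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"
print(url[:60])
```

Keeping the domain additions to a short vocabulary list and one or two formatting rules follows the guidance above: the prompt should stay small enough not to compete with the speech itself.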

Integration with orchestration frameworks

Most production voice agents use frameworks that handle WebSocket connections, audio formatting, and state management automatically. These frameworks let you focus on your voice agent logic instead of low-level speech-to-text API details.

You can find best practices for getting started with these frameworks in our documentation.

Testing and production readiness

Voice agent testing differs from regular transcription testing. You need to check speed across your entire pipeline, test accuracy with your specific vocabulary, and validate turn detection with real speech patterns.

Testing latency and accuracy

Measure your complete pipeline from audio input to final action, not just API response time. Track percentiles (50th, 75th, 95th) because the worst cases matter more than averages for user experience.
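A simple nearest-rank percentile over per-turn latency samples is enough to get started; the sample values below are illustrative:

```python
import math

# Sketch: nearest-rank percentiles over end-to-end latency samples.
def percentile(samples: list[float], pct: float) -> float:
    """Smallest sample value covering pct percent of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative per-turn latencies (ms) from one conversation
latencies_ms = [210, 245, 260, 280, 310, 320, 340, 390, 420, 510]
for pct in (50, 75, 95):
    print(f"p{pct}: {percentile(latencies_ms, pct)}ms")
```

In this sample the p95 is well above the median, which is exactly the gap that averages hide and percentiles expose.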

Critical areas to test:

  • Long entities: Email addresses, phone numbers, account numbers with your formatting
  • Short responses: "Yes," "no," "okay," "got it"—these get dropped frequently
  • Domain words: Your brand names, product codes, industry terminology
  • Natural speech: Include thinking pauses, trailing sentences, and filler words

Test with realistic background noise levels. In quiet environments, sensitive audio capture can pick up ambient sounds and interpret them as speech, causing false triggers.

Create test sets that mirror your actual use cases. If you're building a customer service agent, test with frustrated customers speaking quickly. If it's a healthcare agent, include medical terminology and accented speech patterns.

Validate prompts and transcription quality quickly

Experiment with punctuation rules, entity formatting, and short responses in a no-code environment. Iterate on your approach before full integration.

Try playground

Monitoring and debugging in production

Track four key metrics that directly impact user experience:

Essential monitoring:

  • Speed percentiles: Track 50th, 75th, and 95th percentile latency per conversation
  • Entity accuracy: Monitor success rates by type (emails vs phone numbers vs product codes)
  • Turn detection rates: Measure false positives (interrupting users) and false negatives (waiting too long)
  • Connection health: WebSocket drops, reconnection frequency, error rates

Common debugging patterns:

  • High latency at worst cases: Check min_turn_silence setting and network path
  • Too many interruptions: Review your prompt against the Strict Rules pattern
  • Missing entities: Add missing domain terms to your system prompt
  • Connection issues: Monitor WebSocket error codes and implement proper reconnection logic

Set up alerts for latency above 500ms and accuracy drops below your baseline. These indicate problems that directly hurt user experience.
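Those alert thresholds can be encoded as a small check that runs against each monitoring window. The function name and baseline value here are illustrative:

```python
# Sketch of the alert thresholds above (names and baseline are illustrative).
def should_alert(p95_latency_ms: float, entity_accuracy: float,
                 baseline_accuracy: float = 0.95) -> list[str]:
    """Return a list of alert messages; empty means all metrics are healthy."""
    alerts = []
    if p95_latency_ms > 500:
        alerts.append(f"latency p95 {p95_latency_ms:.0f}ms above 500ms")
    if entity_accuracy < baseline_accuracy:
        alerts.append(f"entity accuracy {entity_accuracy:.0%} below baseline")
    return alerts

print(should_alert(620, 0.91))   # both thresholds breached
print(should_alert(300, 0.97))   # healthy window, no alerts
```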

Final words

Building voice agents requires specialized real-time speech-to-text that handles the unique demands of conversational AI. Your implementation needs sub-300ms latency through proper min_turn_silence tuning, exceptional accuracy on critical tokens including short affirmative responses, prompt-controlled turn detection using explicit punctuation rules, and immutable transcripts that never change after delivery.

The choice between Universal Streaming and Universal-3 Pro Streaming depends on your specific needs—basic voice agents benefit from Universal Streaming's speed and cost efficiency, while complex applications requiring entity accuracy, domain prompting, or multilingual support need Universal-3 Pro's advanced capabilities. AssemblyAI's Voice AI platform provides both options with immutable transcripts, prompt-controlled turn detection, industry-leading entity accuracy, and auto-scaling concurrency.

Build your production voice agent today

Access Universal Streaming and Universal-3 Pro Streaming from one API. Sign up to get your key and ship real-time speech-to-text with immutable transcripts.

Get API key

Frequently asked questions

How do I decide between Universal Streaming and Universal-3 Pro Streaming for my voice agent?

Choose Universal Streaming for basic voice agents where speed and cost matter most, and your vocabulary is predictable. Pick Universal-3 Pro when you need perfect accuracy on emails, phone numbers, and short responses, when you want to use prompts for domain-specific terms, or when handling multiple languages in the same conversation.

What makes real-time speech-to-text different from regular batch transcription for voice agents?

Real-time processing delivers text as people speak with responses under 300ms, enabling natural conversation flow. Batch transcription waits for complete audio files and takes seconds—too slow for voice agents that need immediate understanding to respond naturally.

How do I use prompts to reduce interruptions in my voice agent?

Apply the Strict Rules prompt pattern that explicitly defines punctuation usage—periods only for complete sentences, commas for pauses, no punctuation for trailing speech. Include filler words like "um" and "uh" as continuation indicators. This approach reduces false interruptions compared to generic semantic prompts.

What's the best way to test speech-to-text latency for voice agents?

Measure end-to-end timing from audio input to your agent's action trigger, not just API response time. Track 95th percentile latency since worst-case delays determine user experience. Start with min_turn_silence: 100 for lowest latency and test how it affects turn detection quality with your specific audio conditions.

Why does my voice agent keep interrupting users mid-sentence?

Your system likely relies only on silence detection instead of understanding sentence completion. Universal-3 Pro's prompt-controlled turn detection uses punctuation patterns to identify complete thoughts. Implement the Strict Rules prompting pattern and tune min_turn_silence to balance responsiveness with accuracy.

Which voice agent frameworks work with real-time speech-to-text APIs?

Pipecat, LiveKit, and Vapi offer native integrations that handle WebSocket connections and audio formatting automatically. Pipecat provides two modes—one where Pipecat controls turn detection and one where the speech-to-text API handles it directly. Each framework includes documentation and examples for integration.
