What streaming speech-to-text model is best for voice agents and why?
Real-time speech-to-text for voice agents explained: learn how streaming models work, compare latency and accuracy, and choose the right API for production.



Real-time speech-to-text converts spoken words into written text within a few hundred milliseconds as audio streams—not after a recording ends, but as you speak. This is different from batch transcription, which waits for a complete audio file before processing begins. Real-time systems process audio continuously in small chunks, delivering text almost immediately.
This article covers how real-time speech-to-text works, when to use it over batch processing, and how to evaluate whether a system will perform in real-world conditions. Whether you're building live captions, a meeting transcription tool, or a voice agent, understanding the mechanics behind streaming transcription helps you make better decisions about accuracy, latency, and infrastructure before you write a single line of code.
What is real-time speech-to-text?
Real-time speech-to-text is transcription where audio is processed in continuous small chunks—typically 20 to 100 milliseconds each—so text results begin appearing within a few hundred milliseconds of speech. This means you see words on screen almost as you say them, rather than waiting for a recording to finish.
Two types of results come back from a real-time system (a minimal handling sketch follows the list):
- Partial results: Initial guesses that appear instantly as audio streams in, updating as the model hears more context—the word "to" might become "two" once surrounding words arrive
- Final results: Confirmed transcript output once the model has enough context to make accurate, stable predictions
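As a rough illustration of how a client consumes those two result types, here is a minimal sketch in Python. The message fields ("type", "text") and the display object are assumptions made for the example, not any particular provider's schema.

```python
import json

def handle_message(raw_message: str, display) -> None:
    """Route one streaming result to the UI (message schema is hypothetical)."""
    msg = json.loads(raw_message)
    if msg.get("type") == "partial":
        # Partials are unstable guesses: overwrite the in-progress line
        # so the text keeps refining as more audio arrives.
        display.update_current_line(msg["text"])
    elif msg.get("type") == "final":
        # Finals are stable: commit the line and start a fresh one.
        display.commit_line(msg["text"])
```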
The delay between speaking and seeing text on screen is called latency. Low latency is the defining characteristic of real-time transcription—and it's what makes voice-driven applications feel responsive rather than sluggish.
How streaming speech recognition works
A streaming speech-to-text system runs three stages simultaneously: capturing audio, processing it for predictions, and identifying speakers. Understanding each stage helps you see why some systems are faster or more accurate than others.
Audio capture and streaming
Your microphone captures sound waves and converts them to digital audio in continuous small chunks. Unlike recording to a file, the connection stays open for the entire session, creating a live pipeline where audio flows up and text flows back down.
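To make that pipeline concrete, here is a minimal sketch of the sending side: 50-millisecond PCM chunks read from a WAV file, paced like a live microphone, and pushed over a persistent WebSocket. The endpoint URL is a placeholder, and real providers also require authentication and an audio-format handshake.

```python
import asyncio
import wave

import websockets  # pip install websockets


async def stream_audio(path: str, url: str = "wss://example.com/v1/stream") -> None:
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * 0.05)  # 50 ms of audio
        async with websockets.connect(url) as ws:
            while True:
                chunk = wav.readframes(frames_per_chunk)
                if not chunk:
                    break
                await ws.send(chunk)       # audio flows up the open connection
                await asyncio.sleep(0.05)  # pace chunks like live capture


asyncio.run(stream_audio("meeting.wav"))
```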
Audio quality at this stage affects everything that follows. A few factors that shape quality:
- Microphone type: A headset mic close to your mouth captures cleaner audio than a laptop mic picking up room noise
- Environment: Background noise, echo, and reverberation all reduce transcription accuracy downstream
- Audio format: Most streaming APIs accept raw, uncompressed PCM as well as compressed codecs like Opus, which reduce bandwidth without significantly degrading quality
Real-time processing and partial results
The AI model starts predicting words from the very first audio chunk—it doesn't wait for a pause or sentence break. Two types of AI models work together to produce those predictions:
- Acoustic model: Maps sound patterns to phonemes, the basic building blocks of speech (like the "k," "æ," and "t" sounds in "cat")
- Language model: Uses surrounding context to resolve ambiguity and predict likely word sequences, deciding whether you said "their," "there," or "they're" based on what came before
This combination is why high-quality streaming models handle accented speech, background noise, and technical vocabulary better than lower-quality ones. The model makes educated guesses that get smarter with every additional chunk of audio it receives.
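A deliberately toy sketch of that idea: two hypotheses that sound identical, re-ranked by how plausible the word sequence is in context. Real systems use neural acoustic and language models rather than a lookup table; the scores below are made up purely for illustration.

```python
# Two acoustically identical hypotheses; only context separates them.
context_scores = {
    ("i", "sent", "it", "to", "you"): 0.92,   # likely word sequence
    ("i", "sent", "it", "two", "you"): 0.03,  # sounds the same, unlikely
}

def pick_hypothesis(hypotheses):
    # The language model breaks the tie the acoustic model cannot.
    return max(hypotheses, key=lambda h: context_scores.get(h, 0.0))

print(" ".join(pick_hypothesis(list(context_scores))))  # i sent it to you
```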
Speaker identification and timestamps
Speaker diarization is the process of identifying who said what in real time. The system labels each segment with a speaker identifier—"Speaker A," "Speaker B"—so the transcript records which person said which words. Word-level timestamps come with each result, making the transcript searchable and allowing captions to sync precisely with audio playback.
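As an illustration of what a diarized, word-timestamped segment might look like, and how it maps onto a caption line, here is a small sketch. Field names are invented for the example; every provider uses its own schema.

```python
segment = {
    "speaker": "A",
    "text": "let's review the numbers",
    "words": [
        {"word": "let's",   "start_ms": 1200, "end_ms": 1420},
        {"word": "review",  "start_ms": 1430, "end_ms": 1780},
        {"word": "the",     "start_ms": 1790, "end_ms": 1880},
        {"word": "numbers", "start_ms": 1890, "end_ms": 2350},
    ],
}

def caption_line(seg: dict) -> str:
    """Turn one segment into a timestamped, speaker-labeled caption."""
    start_s = seg["words"][0]["start_ms"] / 1000
    return f"[{start_s:06.2f}s] Speaker {seg['speaker']}: {seg['text']}"

print(caption_line(segment))  # [001.20s] Speaker A: let's review the numbers
```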
Multichannel audio, where each speaker has a separate audio channel, produces more reliable speaker identification than single-channel audio where all voices are mixed together. For phone calls or video conferences, single-channel diarization is standard—and modern models handle it well, especially when speakers take clear turns.
Real-time vs. batch processing
The choice between real-time and batch processing comes down to one question: does someone need to read or act on the text while the conversation is still happening?
If yes, use real-time. If you're working with a completed recording, batch is simpler and typically more accurate—because the model has access to the full audio context before making any predictions. Real-time models are working with limited future context, which introduces a small accuracy trade-off. Modern streaming models have closed this gap significantly, but it still exists for complex audio.
So if you're building a voice agent that needs to respond during a conversation, batch processing simply won't work—you can't wait for the call to end before generating a transcript. But if you're transcribing recorded interviews for a research team, batch gives you better accuracy with less infrastructure complexity.
What are the main use cases for real-time speech-to-text?
Four applications drive most real-time speech-to-text deployments: live captions and accessibility, meeting transcription, voice assistants and real-time commands, and voice agents.
Live captions and accessibility
Live captions need the lowest latency of any real-time use case—text must appear in sync with speech or the experience becomes unusable. For people who are deaf or hard of hearing, accuracy errors don't just create minor confusion; they can block meaning entirely. Platforms like Zoom and Google Meet now integrate live captioning natively, making real-time transcription a baseline expectation rather than an advanced feature.
Meeting transcription and collaboration
Real-time meeting transcription creates a timestamped, searchable record of everything said—without anyone taking manual notes. Teams can search for specific topics, pull out action items, or review past decisions days or weeks later.
The practical value compounds over time. A searchable transcript library means a new team member can review six months of product decisions in an afternoon. Many transcription tools integrate directly with Slack or Microsoft Teams, so notes appear where teams already work rather than in yet another separate system.
Voice assistants and real-time commands
Voice assistants depend on real-time speech-to-text as their foundation—the system transcribes your command before anything else can happen. This is the most latency-sensitive use case. Users expect a near-instant response, and any noticeable delay breaks the interaction.
Enterprise applications extend this beyond consumer assistants:
- Surgeons dictate procedure notes without touching a keyboard
- Warehouse workers look up inventory while their hands stay on equipment
- Field technicians pull up repair guides using only their voice
Voice agents
Voice agents represent the fastest-growing use case for streaming speech-to-text—and the most demanding. Unlike a simple voice command, a voice agent holds a full conversation: it listens, understands, reasons, and responds with natural speech. The speech-to-text layer is where conversation quality is won or lost.
Here's the thing: if the transcription is wrong, the LLM responds to the wrong thing. An agent that mishears "I need to cancel my order" as "I need to scan my order" sends the conversation off a cliff. Accurate transcription of names, account numbers, and accented speech determines whether the agent responds to what was actually said—not what it guessed.
Voice agents also need capabilities that other use cases don't:
- Turn detection: The system needs to know when you're done talking versus pausing to think. Cut someone off mid-sentence and the experience feels broken. Sit in dead air for three seconds and it feels equally wrong; a naive version is sketched after this list.
- Interruption handling: When a user interrupts the agent, it should stop talking and listen—immediately. This "barge-in" behavior is what makes conversations feel natural rather than robotic.
- Entity accuracy: Email addresses, phone numbers, dates, and product names need to be transcribed correctly the first time. There's no "did you mean?" in a voice conversation—the agent just acts on what it heard.
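The naive approach to turn detection, sketched below, is a fixed silence timer over voice-activity-detection output. It illustrates exactly the trade-off described above: a short threshold cuts people off mid-thought, a long one leaves dead air, which is why production systems layer a contextual model on top. Thresholds and structure here are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class NaiveTurnDetector:
    silence_threshold_ms: int = 700  # how long a pause counts as "done talking"
    silence_ms: int = 0

    def on_frame(self, frame_ms: int, is_speech: bool) -> bool:
        """Feed one frame of VAD output; returns True when the turn ends."""
        if is_speech:
            self.silence_ms = 0
            return False
        self.silence_ms += frame_ms
        return self.silence_ms >= self.silence_threshold_ms
```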
For developers building voice agents, choosing the right streaming speech-to-text model isn't just a technical decision—it's a product decision that shapes every conversation your users will have.
How to evaluate real-time speech-to-text accuracy and performance
Two metrics define whether a real-time speech-to-text system works in practice: accuracy and latency. A fast but inaccurate system is just as problematic as an accurate but slow one—and evaluating them together, not separately, gives you a realistic picture of how a system will perform.
Word Error Rate (WER) is the standard accuracy metric. It measures the percentage of words that are incorrectly transcribed, inserted, or deleted. Lower is better. WER varies significantly depending on audio conditions, so a model that scores well on clean audio can still struggle with noisy phone calls or heavy accents.
But WER alone doesn't tell you whether a model will work for voice agents. You also need to evaluate entity accuracy—how well the model handles business-critical tokens like names, account numbers, email addresses, and domain-specific terminology. A model with 5% WER that consistently botches phone numbers is worse for a voice agent than a model with 7% WER that gets those entities right.
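For a concrete feel for the metric, here is a minimal WER implementation: word-level edit distance divided by the number of reference words. Evaluation toolkits also normalize casing, punctuation, and number formatting before scoring, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call five five five", "call five nine five"))  # 0.25
```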
Latency breaks into three specific measurements (a client-side measurement sketch follows the list):
- Time to First Token (TTFT): How quickly the first partial result appears after you start speaking—this determines how "live" the experience feels to users
- Real-Time Factor (RTF): How fast the system processes audio relative to its duration. An RTF below 1.0 means the system processes audio faster than real time, which is the minimum requirement for any live application
- Time to Complete Turn (TTCT): End-to-end conversational responsiveness for voice agents—measures the full cycle from when a speaker finishes talking to when the agent begins its response, critical for natural conversation flow
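Here is the client-side measurement sketch referenced above. It times TTFT from the first audio chunk sent to the first partial received; RTF is only meaningful when audio is pushed faster than real time rather than paced like a live microphone. Structure and names are illustrative, not tied to any provider.

```python
import time

class LatencyTracker:
    """Rough client-side timing; real tests should average many utterances."""

    def __init__(self) -> None:
        self.sent_first_chunk_at = None
        self.first_partial_at = None

    def on_first_chunk_sent(self) -> None:
        self.sent_first_chunk_at = time.monotonic()

    def on_partial_received(self) -> None:
        if self.first_partial_at is None:
            self.first_partial_at = time.monotonic()

    def ttft_seconds(self) -> float:
        return self.first_partial_at - self.sent_first_chunk_at

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # Below 1.0 means the system keeps up with (or outruns) live audio.
    return processing_seconds / audio_seconds
```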
For voice agents specifically, you should also evaluate turn detection quality and interruption handling. These aren't traditional metrics you'll find on a benchmark page, but they make or break the conversational experience. A system with great WER but clunky turn detection—one that cuts users off or waits too long to respond—will frustrate users faster than a slightly less accurate system with smooth conversational flow.
Benchmarks only tell part of the story. A model that performs well on a clean dataset can still struggle with your specific vocabulary, audio setup, or speaker accents. Before committing to a provider, run tests using audio from your actual application—not vendor demos.
How to get started with real-time speech-to-text
Three paths exist for implementing real-time speech-to-text, depending on whether you're building something custom, need transcription to work immediately, or are building a voice agent.
Cloud APIs for custom applications
Cloud APIs are the primary path for developers. They handle all infrastructure complexity—streaming connections, model inference, partial result management—so you focus on your application logic. Most providers use WebSocket connections, which are persistent two-way connections that stay open during the entire session, letting audio flow up and text results flow back down continuously. SDKs are typically available for Python, JavaScript, Go, Java, and C#, though you can also connect directly via WebSocket without a library.
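A full-duplex session typically looks like the sketch below: one task streams audio up while another reads JSON results down, over the same connection. The URL and message fields are placeholders; consult your provider's docs for the real connection parameters and authentication.

```python
import asyncio
import json

import websockets  # pip install websockets


async def transcribe_live(url: str, audio_chunks) -> None:
    """audio_chunks is any async iterator yielding small PCM frames."""
    async with websockets.connect(url) as ws:

        async def send_audio():
            async for chunk in audio_chunks:
                await ws.send(chunk)         # audio flows up

        async def receive_text():
            async for raw in ws:             # text flows back down
                msg = json.loads(raw)
                print(msg.get("type"), msg.get("text"))

        await asyncio.gather(send_audio(), receive_text())
```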
Ready-to-use applications
Ready-to-use applications are the right starting point if you don't need custom integration. Apps like Otter.ai for meeting transcription or Google's Live Transcribe for accessibility work immediately with no code required. These tools are also useful for testing transcription quality before building anything—you can assess whether a provider's accuracy meets your needs before writing a single line of integration code.
Voice Agent API for conversational AI
If you're building a voice agent, there's a third path that eliminates the biggest headache in voice AI development: stitching together separate speech-to-text, LLM, and text-to-speech providers. AssemblyAI's Voice Agent API combines all three into a single WebSocket connection—you stream audio in and get audio back.
Instead of managing three separate providers, three invoices, and three debugging surfaces, you get one API, one bill ($4.50/hr flat rate covering STT, LLM reasoning, and voice generation), and one place for logs and observability. The API includes built-in turn detection, voice activity detection, and interruption handling—features you'd otherwise need to implement yourself.
It's a standard JSON API with no SDK required. Most developers get a working voice agent running the same afternoon they start. The entire API reference can be read in 10 minutes, and it works natively with coding agents like Claude Code—copy the docs, paste them in, and build.
The right path depends on how much control you need over the underlying system and how tightly transcription needs to fit into your product.
Why speech-to-text accuracy is the foundation of voice agent quality
For voice agents, the speech-to-text model isn't just one component in the stack—it's the foundation everything else depends on. If the STT is wrong, the LLM responds to the wrong thing. Every downstream decision—reasoning, tool calls, response generation—is only as good as what the agent actually heard.
This is why purpose-built streaming models matter more for voice agents than for any other use case. A model built specifically for conversational audio handles the things that trip up general-purpose models: overlapping speech, background noise, accented speakers, and the short, fragmented utterances that are common in real conversations. It also handles entity recognition—accurately transcribing email addresses, phone numbers, and account numbers that a voice agent needs to act on.
AssemblyAI's Universal-3 Pro Streaming model—ranked #1 on the Hugging Face Open ASR Leaderboard—is purpose-built for this. The entire pipeline is designed around getting the input right: accurate transcription feeds into better LLM reasoning, which produces better responses. Turn detection combines acoustic signals with a contextual model so the agent knows when you're pausing to think versus when you're done talking. Interruption handling works immediately—when you barge in, the agent stops and listens.
For teams that want to skip the multi-vendor complexity entirely, AssemblyAI's Voice Agent API wraps this into a single connection: streaming speech-to-text, LLM reasoning, and natural voice generation through one WebSocket at $4.50/hr. One bill, one log stream, one number to work with when you're modeling costs. When something needs debugging, there's one place to look.
Final words
Real-time speech-to-text works by streaming audio in small chunks, generating partial results that refine into a final transcript with timestamps and speaker labels attached. Accuracy, latency, and real-world robustness are the three metrics that determine whether a system works for your specific use case—and the only reliable way to measure them is with your own audio.
For voice agents, the stakes are higher. Every transcription error compounds—the LLM reasons about the wrong input, generates the wrong response, and the conversation derails. Choosing a streaming model purpose-built for conversational audio, with strong entity accuracy and intelligent turn detection, is the single most impactful decision you'll make when building a voice agent.
AssemblyAI's Universal-3 Pro Streaming model delivers the accuracy and speed that voice agents demand, with support for English, Spanish, Portuguese, French, German, and Italian plus native code-switching. For teams that want the full stack in one API, the Voice Agent API handles speech understanding, LLM reasoning, and voice generation through a single WebSocket—$4.50/hr, no separate providers to manage.
Frequently asked questions about real-time speech-to-text
What streaming speech-to-text model is best for voice agents?
The best streaming speech-to-text model for voice agents combines low latency, high entity accuracy, and intelligent turn detection. AssemblyAI's Universal-3 Pro Streaming model ranks #1 on the Hugging Face Open ASR Leaderboard and is purpose-built for conversational audio—delivering accurate transcription of names, account numbers, and accented speech that voice agents depend on to respond correctly.
How does the Voice Agent API differ from using separate STT, LLM, and TTS providers?
AssemblyAI's Voice Agent API combines streaming speech-to-text, LLM reasoning, and text-to-speech into a single WebSocket connection at $4.50/hr flat. This replaces three separate providers, three invoices, and three debugging surfaces with one API. It includes built-in turn detection, interruption handling, and tool calling—features you'd otherwise need to build and maintain yourself.
How accurate is real-time speech-to-text in noisy environments?
Accuracy decreases in noisy environments, but modern streaming models trained on diverse audio conditions minimize this impact. For voice agents specifically, models purpose-built for conversational audio—like those trained on phone call data and real-world recordings—maintain significantly better accuracy than general-purpose models. Testing with your own audio before selecting a provider gives a more reliable picture than benchmark scores alone.
What happens when multiple speakers talk at the same time?
Most systems continue transcribing the dominant voice when speakers overlap, which can cause portions of quieter speech to be missed or merged. Speaker diarization works most reliably when speakers take clear turns rather than frequently talking over each other. For voice agents, this is less of a concern since conversations are typically two-party with natural turn-taking.
What latency should I target for a voice agent application?
Voice agents need end-to-end response times under one second for conversations to feel natural. This means the speech-to-text component should deliver results in under 300ms (Time to First Token), with the full turn cycle—from when the user stops speaking to when the agent starts responding—completing in under one second (Time to Complete Turn). Any noticeable delay breaks the conversational flow and frustrates users.
What programming languages have real-time speech-to-text SDKs?
Most cloud providers offer SDKs for Python, JavaScript, Go, Java, and C#, with WebSocket-based streaming as the standard connection method. AssemblyAI's Voice Agent API uses a standard JSON API that requires no SDK at all—you can connect directly via WebSocket from any language, and the simplicity means coding agents like Claude Code can scaffold a working integration from the docs alone.
