Best API for building a speech-to-speech voice agent in 2026
A developer's guide to the speech-to-speech voice agent APIs available in 2026—what each one is best at, how they compare on accuracy, latency, and pricing, and how to choose between a single-API approach and a chained STT-LLM-TTS pipeline.



A speech-to-speech voice agent API replaces the three separate components most teams used to wire together—streaming speech-to-text, a language model, and text-to-speech—with a single API that takes audio in and returns audio out. In 2026, that category has gone from "interesting demo" to "default way to ship a production voice agent," and the gap between providers is now measurable in latency, accuracy, and what they let you do with tool calls.
This guide compares the speech-to-speech voice agent APIs developers actually pick from in 2026, what each one is best at, and how to choose between a true speech-to-speech API and a chained STT-LLM-TTS pipeline. We'll cover AssemblyAI's Voice Agent API, OpenAI Realtime, Google Gemini Live, Deepgram, ElevenLabs Conversational AI, Retell, Bland, and Hume, plus where Vapi and Pipecat fit if you'd rather orchestrate the components yourself—covered in our orchestration tools comparison.
What is a speech-to-speech voice agent API?
A speech-to-speech voice agent API is a single API endpoint—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response, with everything in between (transcription, reasoning, tool calls, voice synthesis) hidden behind one connection. You send mic audio in. You get the agent's voice back. You don't manage three providers, three sets of API keys, or three sets of latency budgets.
That's the practical definition. Under the hood, there are two architectural patterns:
- Chained (cascading) speech-to-speech APIs: Internally pipe streaming STT → LLM → streaming TTS, but expose a single API. The advantage is you can swap each layer for best-in-class models. AssemblyAI's Voice Agent API is the leading example.
- Native speech-to-speech models: A single model trained end-to-end on audio that takes audio tokens in and emits audio tokens out, in some cases with no intermediate text. OpenAI Realtime, Google Gemini Live, and Hume's EVI fall here. The pitch is lower latency and richer audio understanding (laughter, tone). The trade-off is less transparency, narrower language support, and weaker text reasoning than a frontier text LLM.
Both expose the same developer surface—one connection, audio in/audio out—so the choice is about which trade-offs match your application.
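To make the chained pattern concrete, here is a minimal sketch of three swappable stages behind a single entry point. The class name, the stage signatures, and the stand-in stages are all hypothetical and chosen for illustration; no provider's actual interface is implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChainedVoiceAgent:
    """Toy chained agent: three swappable stages, one audio-in/audio-out call."""
    transcribe: Callable[[bytes], str]   # STT stage: audio -> text
    respond: Callable[[str], str]        # LLM stage: text -> reply text
    synthesize: Callable[[str], bytes]   # TTS stage: reply text -> audio

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any provider.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = ChainedVoiceAgent(fake_stt, fake_llm, fake_tts)
reply_audio = agent.handle_turn(b"my order number is 42")
print(reply_audio)
```

Swapping one layer means passing a different callable, while the caller still sees only `handle_turn`. A native model would instead collapse the three stages into one opaque call.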
Best speech-to-speech voice agent APIs in 2026
A few things stand out in 2026. End-to-end latency under one second is now table stakes, not a differentiator—every provider on this list will get you there with a reasonable network. What separates them is speech accuracy on real-world audio (phone calls, accents, alphanumerics), how tool calling behaves under load, and whether the pricing model survives contact with a real customer base.
In our Voice Agent Report, 76% of respondents rated speech-to-text accuracy as the single most important non-negotiable when building voice agents—above latency, cost, and integration capabilities. That finding maps directly to what we see in the comparison data: the accuracy gap between providers on real-world entities (phone numbers, emails, confirmation codes) is where production agents succeed or fail.
How to choose the best speech-to-speech voice agent API
The voice agent you ship depends on four decisions. Get any of them wrong and the agent feels off, even if the demo was great.
1. Speech-to-text accuracy on your actual audio
Most providers benchmark on studio audio. Your users are on phones, in cars, in drive-thrus, and rattling off order numbers and email addresses. The two accuracy metrics that actually matter:
- Alphanumeric accuracy: How well the model captures phone numbers, confirmation codes, emails, order IDs. This is where the gap between providers shows up most clearly. In head-to-head testing, AssemblyAI's Universal-3 Pro Streaming delivers a 16.7% alphanumeric missed error rate, compared to 23.3% for OpenAI and 25.5% for Deepgram. That's the difference between capturing "RX-7704132" correctly on the first try and hearing "dash seven seven zero four one three two." AssemblyAI's Universal-3 Pro Streaming also delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. This is the single most under-measured metric in voice agent demos.
- Entity accuracy on proper nouns: Company names, people's names, drug names, product titles. If your agent writes "Corel" instead of "Coral" into the CRM, the lead is unreachable.
Native speech-to-speech models like OpenAI Realtime and Gemini Live were trained more on clean conversational audio than on telephony, which shows up the moment you put them on a Twilio call.
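One way to measure this on your own recordings is to score each reference entity as captured or missed. The sketch below assumes a deliberately crude definition: an entity is missed when its normalized form does not appear in the normalized transcript. That definition is an assumption for illustration and is not necessarily how the figures quoted above were computed.

```python
import re


def normalize(text: str) -> str:
    """Uppercase and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def missed_entity_rate(samples: list[tuple[list[str], str]]) -> float:
    """samples holds (reference entities, transcript returned by the API) per call."""
    total = 0
    missed = 0
    for entities, transcript in samples:
        haystack = normalize(transcript)
        for entity in entities:
            total += 1
            if normalize(entity) not in haystack:
                missed += 1
    return missed / total if total else 0.0


# Hypothetical test data: hand-labeled entities and the transcripts to score.
samples = [
    (["RX-7704132"], "my code is RX 7704132"),
    (["RX-7704132"], "my code is dash seven seven zero four one three two"),
    (["555-0142", "Coral"], "call Corel at 555 0142"),
]
rate = missed_entity_rate(samples)
print(f"missed entity rate: {rate:.1%}")
```

Substring matching can produce false hits across word boundaries, so treat this as a first-pass filter and spot-check the results by hand.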
2. Turn-taking and interruption handling
Poor turn detection is the most common reason voice agents feel unnatural. The agent either talks over the user or sits in awkward silence. The best implementations handle turn detection at the model level, not as an afterthought bolted on with a fixed silence timer.
AssemblyAI's Universal-3 Pro Streaming includes acoustic turn detection built directly into the model, with semantic endpointing that combines acoustic pauses with intent signals. It uses a semantic + neural network + VAD approach rather than basic silence-based VAD. Retell ships its own proprietary turn-taking model. OpenAI Realtime's server VAD is competent, but its configurable timeouts still trip up agents on calls with hesitant speakers. Deepgram relies on traditional VAD only, without the semantic or neural layers.
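The difference between a fixed silence timer and an endpointer that also looks at what was said can be shown with a toy example. The word list and thresholds below are invented for illustration. A rule-based check stands in for the semantic signal here, and it is not how any of the providers above implement turn detection.

```python
# Hypothetical trailing words that suggest the caller is mid-thought.
HESITATION_ENDINGS = {"and", "but", "so", "um", "uh", "is", "the", "my"}


def fixed_timer_end_of_turn(silence_ms: int, threshold_ms: int = 700) -> bool:
    """Basic approach: the turn ends once silence exceeds one fixed threshold."""
    return silence_ms >= threshold_ms


def looks_unfinished(partial_transcript: str) -> bool:
    """Crude stand-in for a semantic signal: does the text trail off?"""
    words = partial_transcript.lower().rstrip(".,!? ").split()
    return bool(words) and words[-1] in HESITATION_ENDINGS


def combined_end_of_turn(
    silence_ms: int,
    partial_transcript: str,
    short_ms: int = 400,
    long_ms: int = 1500,
) -> bool:
    """Wait longer when the text looks unfinished, respond sooner when it does not."""
    limit = long_ms if looks_unfinished(partial_transcript) else short_ms
    return silence_ms >= limit


hesitant = "my confirmation code is"
finished = "I'd like to cancel my order."
print(fixed_timer_end_of_turn(800), combined_end_of_turn(800, hesitant))
print(fixed_timer_end_of_turn(500), combined_end_of_turn(500, finished))
```

In the first case the fixed timer cuts off a caller who paused before reading a code. In the second it leaves dead air after a sentence that is clearly complete.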
3. Tool calling reliability
Real voice agents don't just talk—they book the appointment, look up the order, charge the card. That means the underlying LLM has to call tools mid-conversation, fast enough that the silence doesn't become obvious.
The bar to clear: tool calls under 500ms round-trip, structured outputs that don't hallucinate parameters, and the ability to call multiple tools in a single turn. But there's a UX dimension most teams overlook: what happens while the tool call is executing? AssemblyAI's Voice Agent API generates intermediate speech during tool execution—the agent says something like "Let me look that up for you" rather than going silent. Both OpenAI Realtime and Deepgram go silent during tool calls, which creates an awkward dead-air gap that makes users wonder if the connection dropped.
AssemblyAI's Voice Agent API exposes a clean function-calling surface that routes through the underlying model with structured-output guarantees. OpenAI Realtime supports tool calling natively. Some orchestration platforms add their own retry and validation logic on top.
If your agent's job is "capture data and put it somewhere"—booking a meeting, qualifying a lead, taking an order, scheduling a callback—tool calling reliability is what decides whether the agent actually does its job.
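If you are orchestrating tool calls yourself, the filler-speech behavior described above can be approximated with a timeout: start the tool, and if it has not returned within a budget, speak a transition phrase while it finishes. This is a minimal sketch using `asyncio`. The function name, the `say` callback, and the lookup functions are hypothetical, and the 0.5-second default simply mirrors the 500ms bar mentioned above.

```python
import asyncio
from typing import Awaitable, Callable


async def call_tool_with_filler(
    tool: Callable[[], Awaitable[str]],
    say: Callable[[str], None],
    filler_after_s: float = 0.5,
    filler_text: str = "Let me look that up for you.",
) -> str:
    """Run the tool; if it is still running after the budget, speak a filler."""
    task = asyncio.ensure_future(tool())
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        say(filler_text)      # cover the silence while the tool keeps running
    return await task         # asyncio.wait does not cancel the task on timeout


async def slow_lookup() -> str:
    await asyncio.sleep(0.05)  # stand-in for a slow backend call
    return "Order 1042 ships Tuesday."


async def fast_lookup() -> str:
    return "Order 1042 ships Tuesday."


spoken: list[str] = []
result = asyncio.run(
    call_tool_with_filler(slow_lookup, spoken.append, filler_after_s=0.01)
)
print(spoken, result)
```

A fast tool returns before the budget expires, so no filler is spoken and the agent answers directly.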
4. Pricing model and unit economics
This is the trap most teams fall into during pilots. Per-minute pricing looks cheap until you're running 500 simultaneous calls during a support spike. Per-token audio pricing (OpenAI Realtime) is unpredictable because audio output tokens cost 10–20x as much as text tokens and a chatty TTS voice burns through your budget.
A few patterns:
- Flat hourly pricing: AssemblyAI's Voice Agent API at $4.50/hour covers STT, LLM inference, TTS, and tool calling. One bill, one line of math to model what a 5-minute call costs. No separate meters for audio in, audio out, text in, text out. No concurrency commitments. Easy to forecast.
- Per-minute, all-in: Retell, Bland, ElevenLabs Conversational AI. Predictable, but adds up at scale.
- Flat hourly with concurrency commitments: Deepgram's voice agent API is also ~$4.50/hour, but requires concurrency-metered billing—meaning you're committing to a certain number of simultaneous sessions. That changes the economics at scale.
- Per-token audio: OpenAI Realtime. ~$18/hour with 30+ billing event types. Best for low-volume; hard to forecast at scale.
- Pass-through + platform fee: Vapi, LiveKit. You pay each underlying provider plus a platform fee—flexible but more accounting overhead.
Forecast what 100 hours of conversation actually costs across the providers you're considering. The spread between pricing models is real, especially once you stop being charged for demo calls and start being charged for production.
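As a rough sketch of that forecast, the arithmetic below uses only rates quoted in this guide: $4.50/hour flat, roughly $18/hour for per-token audio, and the $0.07 to $0.30 per-minute range cited in the FAQ. Platform fees, concurrency commitments, and volume discounts are ignored, so treat the output as a starting point and not as a quote.

```python
HOURS = 100


def flat_hourly(rate_per_hour: float, hours: float) -> float:
    """Cost when billing is a flat rate per hour of session time."""
    return rate_per_hour * hours


def per_minute(rate_per_minute: float, hours: float) -> float:
    """Cost when billing is per minute of conversation."""
    return rate_per_minute * 60 * hours


forecast = {
    "flat hourly ($4.50/hour)": flat_hourly(4.50, HOURS),
    "per-token audio (~$18/hour)": flat_hourly(18.00, HOURS),
    "per-minute, low end ($0.07/min)": per_minute(0.07, HOURS),
    "per-minute, high end ($0.30/min)": per_minute(0.30, HOURS),
}

for label, cost in forecast.items():
    print(f"{label}: ${cost:,.2f} for {HOURS} hours")

# The "one line of math" for a single 5-minute call at the flat rate.
five_minute_call = flat_hourly(4.50, 5 / 60)
print(f"5-minute call at $4.50/hour: ${five_minute_call:.3f}")
```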
AssemblyAI Voice Agent API: one WebSocket, flat-rate, built on Universal-3 Pro Streaming
AssemblyAI's Voice Agent API is a single WebSocket that takes user audio in and streams agent audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the connection. It replaces three separate providers with one bill, one set of logs, and one set of latency variables to tune.
What makes it work as a speech-to-speech voice agent API:
- Speech accuracy that survives phone audio. The STT layer is Universal-3 Pro Streaming, the same model trusted by enterprise voice agent teams for production deployments—307ms P50 latency, native 8kHz mulaw support, immutable transcripts, and a 16.7% alphanumeric missed error rate that's measurably better than OpenAI (23.3%) and Deepgram (25.5%). When the STT is this accurate, the whole conversation is better because the agent is actually responding to what was said.
- Tool calling that doesn't go silent. Define your tools, the model calls them, results stream back into the conversation. Unlike OpenAI Realtime and Deepgram, the agent generates intermediate speech during tool execution—natural transition phrases like "Let me check on that"—so your users never hear dead air. Useful for the lead-qualification, appointment-setting, and structured-data-capture use cases where voice agents have the strongest product-market fit today.
- Mid-session updates without reconnecting. Update the system prompt, voice, tools, and VAD settings mid-conversation with a JSON message—no reconnection, no redeployment. OpenAI Realtime only supports updating prompt and tools. Deepgram supports prompt and voice only. AssemblyAI is the only provider that lets you update all four mid-session.
- Session resumption. If the WebSocket drops, reconnect within 30 seconds and pick up where the conversation left off. Context is preserved. Neither OpenAI Realtime nor Deepgram offers session resumption—a dropped connection means starting over.
- Flat-rate pricing. $4.50/hour of session time, no per-token audio surprises, no per-provider invoices, no concurrency commitments. This includes STT, LLM, TTS, turn detection, and tool calling.
- One API to learn. The Voice Agent API is one WebSocket. You don't wire together a streaming STT WebSocket, an LLM HTTP endpoint, a TTS streaming connection, and your own turn-detection logic. The plumbing is in the API.
- Built for production. Unlimited concurrency, session resumption, structured logs per session, and the same SOC 2 / BAA-eligible infrastructure that already runs AssemblyAI's speech-to-text platform.
Where it fits in the landscape: AssemblyAI's Voice Agent API is the choice when speech accuracy decides whether the agent ships. If your agent is taking phone calls, capturing structured data, or operating in a regulated industry where you need a BAA, this is the speech-to-speech voice agent API to build on.
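On the client side, session resumption comes down to retrying the connection before the window closes. The sketch below shows one way to structure that retry loop with exponential backoff. The `connect` callable is a hypothetical stand-in for whatever reconnect call your client makes, and only the 30-second window comes from the description above.

```python
import time
from typing import Callable

RESUME_WINDOW_S = 30.0  # resumption window described above


def try_resume(
    connect: Callable[[], bool],
    dropped_at: float,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    base_delay_s: float = 0.5,
) -> bool:
    """Retry `connect` with exponential backoff until the window expires."""
    delay = base_delay_s
    while now() - dropped_at < RESUME_WINDOW_S:
        if connect():
            return True                      # resumed inside the window
        remaining = RESUME_WINDOW_S - (now() - dropped_at)
        if remaining <= 0:
            break
        sleep(min(delay, remaining))         # never sleep past the deadline
        delay *= 2
    return False                             # window closed: start a new session


class FakeClock:
    """Deterministic clock so the example runs instantly."""
    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


clock = FakeClock()
attempts: list[float] = []


def flaky_connect() -> bool:
    attempts.append(clock.now())
    return len(attempts) >= 3                # succeeds on the third attempt


resumed = try_resume(flaky_connect, dropped_at=0.0, now=clock.now, sleep=clock.sleep)
print(resumed, attempts)
```

When `try_resume` returns `False`, the caller should fall back to opening a fresh session and re-establishing context.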
When to use a chained pipeline instead
A speech-to-speech voice agent API is the right answer for most teams in 2026. But there are three cases where chaining the layers yourself still wins:
- You need a specific LLM: A frontier text LLM like Claude or Gemini that isn't exposed inside any S2S API yet. Most S2S APIs let you choose, but if you need a model that isn't on the list, chain it yourself.
- You need a specific TTS voice: A cloned voice, a specific accent, or a non-standard language model. Most S2S APIs let you bring your own TTS, but if you need fine control, a chained pipeline is more flexible.
- You have regulated data residency: Some industries require every layer to run in your VPC. A chained, self-hosted pipeline (with Bland for the orchestration, or fully self-built) is the only path.
If you're chaining, the layer that decides whether the agent works is still the streaming STT. Choosing the best streaming speech-to-text model for voice agents comes down to the same accuracy and latency criteria covered above.
Common use cases for speech-to-speech voice agents
The pattern in 2026 is consistent: speech-to-speech voice agents work best on high-volume, structured calls where the agent's job is to capture or look up data rather than reason open-endedly. The teams shipping production agents converge on these use cases:
- Lead qualification and outbound sales discovery: Ask BANT questions, book qualified meetings, sync to the CRM. Turn-taking quality is the differentiator.
- Appointment scheduling and confirmations: Medical offices, salons, service businesses. Alphanumeric accuracy on dates, times, and confirmation codes is non-negotiable.
- Food ordering and reservations: High-accuracy data capture on menu items, special requests, payment info.
- Customer support tier-1 deflection: Order status, account questions, basic troubleshooting. Best paired with explicit escalation paths. See our guide to Voice AI for customer service.
- Insurance verification and benefits lookup: Getting plan numbers, group IDs, and member info right the first time—the same accuracy bar that drives voice agents in healthcare.
- Outbound reminders and surveys: Post-visit follow-ups, payment reminders, satisfaction surveys.
The common thread across all of these: the agent is capturing or retrieving specific data, the conversation has a predictable structure, and the cost of a transcription error is concrete. That's where a speech-to-speech voice agent API earns its keep over a human agent or an IVR.
How to evaluate a speech-to-speech voice agent API before you commit
Demos are unreliable. Vendor benchmarks are unreliable. Here's the evaluation loop teams actually use before signing a contract:
1. Record 50 real or representative calls for your use case, including accents, background noise, alphanumeric content, and interruptions.
2. Run them through each API's playground or trial. Measure word error rate (WER) on the alphanumeric tokens specifically: phone numbers, confirmation codes, emails, dollar amounts. General WER is misleading. Look at the alphanumeric missed error rates: AssemblyAI sits at 16.7%, OpenAI at 23.3%, Deepgram at 25.5%. Run your own audio to see how those numbers hold on your data.
3. Time the turn-taking. Mark every "caller-stops-speaking" moment and measure how long until the agent starts responding. Sub-800ms is the threshold for natural-feeling conversation. Pay attention to how each provider handles turn detection, because semantic + neural approaches outperform basic VAD on hesitant or accented speakers.
4. Test tool calling under load. Define three real tools and have the agent call them mid-conversation. Measure round-trip time and error rate. Also note whether the agent speaks naturally during tool execution or goes silent. This makes a bigger UX difference than most teams expect.
5. Read every transcript. You'll catch prompt failures, silently wrong transcriptions, and hallucinated tool parameters that you'd never notice by listening.
Most teams skip step 2 and ship with a model that fumbles confirmation codes silently. Don't.
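The turn-taking measurement is easy to script once the timestamps are marked. This is a minimal sketch, assuming you have hand-labeled pairs of "caller stops speaking" and "agent starts responding" times in seconds. The sample values are made up, and the 800ms threshold is the one named above.

```python
from statistics import median

NATURAL_THRESHOLD_MS = 800  # threshold named in the evaluation loop above


def turn_latencies_ms(turns: list[tuple[float, float]]) -> list[int]:
    """turns holds (caller_stop_s, agent_start_s) pairs from an annotated call."""
    return [round((agent_start - caller_stop) * 1000)
            for caller_stop, agent_start in turns]


def summarize(latencies: list[int]) -> dict[str, float]:
    """Median, worst case, and share of turns under the natural-feel threshold."""
    within = sum(1 for ms in latencies if ms < NATURAL_THRESHOLD_MS)
    return {
        "p50_ms": median(latencies),
        "max_ms": max(latencies),
        "share_under_threshold": within / len(latencies),
    }


# Hypothetical annotations from one recorded call.
turns = [(3.20, 3.75), (9.10, 9.72), (15.40, 16.55), (21.00, 21.48)]
latencies = turn_latencies_ms(turns)
summary = summarize(latencies)
print(latencies, summary)
```

Reporting the worst case alongside the median matters because a single slow turn is often what callers remember.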
Final words
The right speech-to-speech voice agent API in 2026 depends less on the marketing material and more on what your agent has to actually hear. If your users are on phones, capturing structured data, or operating in regulated environments, the bar is speech accuracy first, latency second, and pricing predictability third—in that order. The chained-architecture S2S APIs (with AssemblyAI's Voice Agent API as the leading example for accuracy-critical use cases) tend to outperform native speech-to-speech models on real-world telephony, even when the native models look better in studio-audio demos.
For most teams shipping a production voice agent this year, the AssemblyAI Voice Agent API is the right starting point. One WebSocket, Universal-3 Pro Streaming for the parts that matter, and flat-rate pricing of $4.50/hour that you can forecast. Teams that need finer control over the stack can drop our Streaming Speech-to-Text product into their existing voice agent orchestrator.
Frequently asked questions
What is a speech-to-speech voice agent API?
A speech-to-speech voice agent API is a single API—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response. It hides the streaming speech-to-text, language model, tool calling, and text-to-speech behind one connection, so developers don't have to manage three separate providers, three API keys, or three latency budgets to ship a voice agent.
What is the best speech-to-speech voice agent API in 2026?
The best speech-to-speech voice agent API in 2026 is AssemblyAI's Voice Agent API for production deployments where speech accuracy matters—it's a single WebSocket built on Universal-3 Pro Streaming with 307ms P50 latency, native phone-audio support, tool calling, and flat $4.50/hour pricing. In our Voice Agent Report, 76% of builders rated transcription accuracy as the most important non-negotiable, and AssemblyAI delivers the lowest alphanumeric missed error rate (16.7%) compared to OpenAI (23.3%) and Deepgram (25.5%). OpenAI Realtime is competitive for browser-first demos. Retell is competitive for phone-first agents prioritizing turn-taking naturalness. The right choice depends on whether your users are on phones, what data the agent has to capture, and how predictable you need pricing to be.
How does a speech-to-speech voice agent API differ from chaining STT, LLM, and TTS yourself?
A speech-to-speech voice agent API gives you one API endpoint that takes audio in and returns audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the API. Chaining the layers yourself gives you full control over each component—choice of LLM, choice of TTS voice, on-prem deployment—but you own the plumbing: the WebSocket bridge, turn detection logic, retry handling, and three separate provider relationships. Most teams in 2026 default to a speech-to-speech voice agent API and only chain when they need a specific LLM, voice, or data residency setup.
Which speech-to-speech voice agent API is cheapest?
AssemblyAI's Voice Agent API at $4.50/hour flat-rate is the most predictable and one of the lowest unit costs in the category—one bill, no concurrency commitments, and you can model what a 5-minute call costs in one line of math. Per-minute APIs like Retell and ElevenLabs Conversational AI typically land between $0.07 and $0.30 per minute depending on tier, which works out to ~$4.20–$18/hour. Deepgram's voice agent API is also ~$4.50/hour but requires concurrency-metered billing, which changes the economics at scale. OpenAI Realtime runs ~$18/hour with per-token billing across 30+ event types—cheaper for low-volume but significantly more expensive and less predictable at scale.
Can I use a speech-to-speech voice agent API with Twilio?
Most speech-to-speech voice agent APIs can be bridged to Twilio Voice with a WebSocket server that forwards Twilio's 8kHz mulaw audio into the speech-to-speech API and streams the agent's audio response back as mulaw frames for Twilio to play. The cleanest setup uses an API that accepts mulaw natively at 8kHz—AssemblyAI's Voice Agent API and Universal-3 Pro Streaming both support this without resampling, which saves latency. Some providers like Retell ship a Twilio adapter directly.
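When the downstream API expects linear PCM and not mulaw, the bridge has to decode each Twilio frame before forwarding it. The sketch below decodes G.711 mu-law bytes to 16-bit PCM in pure Python. It assumes Twilio's Media Streams message shape, in which a `media` event carries a base64-encoded `payload`. The function names are hypothetical, and resampling from 8kHz is left out.

```python
import base64
import json
import struct


def mulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF                      # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def twilio_media_to_pcm16(message: str) -> bytes:
    """Turn one Twilio media message into little-endian PCM16 bytes."""
    event = json.loads(message)
    if event.get("event") != "media":
        return b""                     # ignore start, stop, and other events
    mulaw = base64.b64decode(event["media"]["payload"])
    samples = [mulaw_byte_to_pcm16(b) for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


# A tiny synthetic frame: silence, then the two extreme sample values.
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(bytes([0xFF, 0x00, 0x80])).decode("ascii")},
})
pcm = twilio_media_to_pcm16(frame)
print(struct.unpack("<3h", pcm))
```

An API that accepts 8kHz mulaw natively lets you skip this step and forward the decoded base64 payload as-is.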
Do speech-to-speech voice agent APIs support multiple languages?
Yes, but coverage varies widely. AssemblyAI's Voice Agent API launched with 6 streaming languages (English, Spanish, French, German, Italian, Portuguese) with native code-switching, and language coverage is expanding. OpenAI Realtime supports around 50 languages but has hallucination and language-switching issues mid-call. Google Gemini Live covers 30+. If you need a specific language combination, test with real audio in those languages before you commit, because accuracy in a given language can differ significantly between studio benchmarks and real-world phone audio.
How do I evaluate which speech-to-speech voice agent API is best for my use case?
Record 50 representative calls for your use case, run them through each API's playground or trial, and measure four things: word error rate on the entities that matter (phone numbers, confirmation codes, names, emails), end-to-end turn-taking latency, tool call round-trip time, and unit cost at your expected volume. General WER and marketing benchmarks are misleading—the only evaluation that predicts production behavior is the one that uses your audio, your tools, and your scale.

