May 19, 2026

How the Voice Agent API pipeline works, from audio in to audio out

A component-by-component walkthrough of the Voice Agent API pipeline for developers who need to understand the internals before trusting it in production.

Devon Malloy

Staff Growth Manager

Voice Agent API

A pipeline walkthrough for the developer who has been burned by magic APIs before.

"Managed API" is one of the most abused phrases in developer tooling.

It usually means: we made some decisions, we're not going to tell you what they are, and when something goes wrong on a customer call at 2am you'll stare at a 500 error with no idea what component failed. The abstraction was convenient in the demo. It's now your problem.

The Voice Agent API isn't that—or at least, it's trying very hard not to be. Every component in the pipeline has a name. Every decision surfaces. The observability is designed so that when a call goes wrong, you know exactly what happened and in what order. This post is the tour: what's inside, how it fits together, where you can reach in and adjust things, and—equally important—what the API honestly doesn't do yet.

If you're the kind of developer who needs to understand the internals before you trust something with a production customer call, this is written for you. For a higher-level introduction, see our voice agents guide.

The pipeline, named

Voice AI feels like magic when it works and feels like chaos when it doesn't. The reason it usually feels like chaos: there are six distinct processing stages between the moment a user speaks and the moment they hear a response—and in a DIY stack, those six stages run on infrastructure from three or four different vendors, log to four different places, and fail in ways that are nearly impossible to correlate.

Here's what those stages are, in order, in the Voice Agent API:

1. Voice focus and noise cancellation

Before transcription starts, the audio gets cleaned. Background noise (HVAC hum, browser tab audio, a crowded open office) is removed at the signal level. This isn't cosmetic. Commodity STT models degrade meaningfully on noisy audio, so the right place to handle noise is before it reaches the recognizer.

2. Universal-3 Pro Streaming STT

This is AssemblyAI's core recognition layer, running in streaming mode so transcripts arrive with low latency as the user speaks. For a deeper look at how streaming recognition works under the hood, see our primer on real-time speech recognition. A few things distinguish it from commodity alternatives:

Entity accuracy. This is the metric that matters most in voice agent use cases. When a user says "set my appointment for March 15th" or "my account number is 4471," getting the entity right is the whole job. Universal-3 Pro's missed entity rate is 16.7%, versus 25.5% for Deepgram Nova-3—a 35% relative improvement on the thing that actually breaks your downstream logic. If you're new to the metric, our piece on word error rate explains why entity-level accuracy often matters more than overall WER.

Promptability. You can pass keyterms—product names, jargon, proper nouns—and the model weights recognition toward them. This is the difference between "the customer said 'Zendesk'" and "the customer said 'Zen desk'" on a support call.

Multilingual support. Six languages, with more on the roadmap. Not separate models—the same Universal-3 Pro handles language switching within a conversation.
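The 35% figure in the entity accuracy point follows directly from the two quoted rates. Here is a minimal sketch of that arithmetic, plus one plausible way to measure a missed entity rate on your own test calls. The function names and the case-insensitive substring matching rule are assumptions for illustration; they are not AssemblyAI's published benchmark methodology.

```python
def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate reduces baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


def missed_entity_rate(reference_entities: list[str], transcript: str) -> float:
    """Share of reference entities that do not appear in the transcript.

    Assumption: an entity counts as found if it appears as a
    case-insensitive substring. Real benchmarks normalize more carefully.
    """
    if not reference_entities:
        return 0.0
    text = transcript.lower()
    missed = sum(1 for entity in reference_entities if entity.lower() not in text)
    return missed / len(reference_entities)


# Rates quoted in the text, as percentages.
improvement = relative_improvement(baseline_rate=25.5, new_rate=16.7)
print(f"{improvement:.1%}")  # 34.5%, which the text rounds to 35%

# Toy check: one of the two entities is transcribed wrongly.
rate = missed_entity_rate(["March 15th", "4471"], "set it for March 15th, account 4417")
print(rate)  # 0.5
```

A single transposed digit is enough to count as a miss here, which is the point: downstream logic that looks up account 4417 fails even though the word error rate barely moves.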

3. Turn detection and VAD

This is the hardest problem in voice AI and the one most systems get wrong.

The naive implementation is silence detection: wait for N milliseconds of quiet and assume the user is done. This works until someone pauses to think ("I'd like to... uh... cancel my subscription"), and then your agent interrupts mid-sentence, and the call experience falls apart.
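To make the failure concrete, here is a minimal sketch of the naive approach. It assumes audio has already been split into fixed-length frames with a per-frame speech flag; the function name and frame layout are illustrative, not part of the API.

```python
from typing import Optional


def naive_end_of_turn(frames: list[bool], frame_ms: int, silence_ms: int) -> Optional[int]:
    """Return the index of the frame where the turn is declared over, or None.

    frames[i] is True when frame i contains speech. The turn ends as soon
    as silence_ms of consecutive non-speech has accumulated after speech.
    """
    heard_speech = False
    quiet_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            quiet_ms = 0
        elif heard_speech:
            quiet_ms += frame_ms
            if quiet_ms >= silence_ms:
                return i
    return None


# 100 ms frames: "I'd like to" (0.5 s), a 0.7 s thinking pause,
# "cancel my subscription" (1.0 s), then 1.0 s of real end-of-turn silence.
frames = [True] * 5 + [False] * 7 + [True] * 10 + [False] * 10

cut = naive_end_of_turn(frames, frame_ms=100, silence_ms=600)
print(cut)  # 10: fires inside the thinking pause, before the user finishes
```

Raising the threshold to 800 ms avoids this particular interruption, but it also adds 800 ms of dead air to every turn. Silence duration alone cannot separate the two cases.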

The Voice Agent API handles this differently. Turn detection uses a model trained to distinguish "pause to think" from "done speaking," factoring in prosody, sentence structure, and conversational context—not just silence duration. You can configure the min_silence and max_silence thresholds for your use case, but the model does the heavy lifting.
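The text names min_silence and max_silence but does not spell out how they interact with the model. One plausible reading, sketched below, is that min_silence is the earliest point at which the model's judgment is trusted and max_silence is a hard cap. The field names with units, the default values, the confidence cutoff, and the end_of_turn_probability input are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnConfig:
    min_silence_ms: int = 300    # assumed: never end a turn before this much quiet
    max_silence_ms: int = 1500   # assumed: always end a turn after this much quiet
    confidence: float = 0.7      # hypothetical cutoff on the model's score


def should_end_turn(silence_ms: int, end_of_turn_probability: float, cfg: TurnConfig) -> bool:
    """Decide whether the user has finished speaking.

    end_of_turn_probability stands in for the turn detection model's output.
    """
    if silence_ms < cfg.min_silence_ms:
        return False   # too early to decide
    if silence_ms >= cfg.max_silence_ms:
        return True    # hard cap, regardless of the model
    return end_of_turn_probability >= cfg.confidence


cfg = TurnConfig()
# "I'd like to... uh...": 700 ms of quiet, but the sentence is unfinished.
print(should_end_turn(700, 0.15, cfg))   # False
# "...cancel my subscription.": 400 ms of quiet, sentence is complete.
print(should_end_turn(400, 0.92, cfg))   # True
# A long silence ends the turn even when the model is unsure.
print(should_end_turn(1600, 0.15, cfg))  # True
```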

More recently, the pipeline added a multimodal interruption detection model. This handles one of the harder edge cases: backchannels. When a user says "uh-huh," "yeah," or "amazing" while the agent is talking, older systems treat it as an interruption and cut the agent off. That's wrong—those are signals of engagement, not intent to speak. The model classifies these correctly, so your agent doesn't stutter every time a user acknowledges something.
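What matters to your agent is the decision that follows the classification: keep talking, or stop and listen. The sketch below uses a toy phrase list as a stand-in for the classifier. The real model is described as multimodal, so treat the lexical rule and every name here as hypothetical.

```python
from enum import Enum


class UserAudio(Enum):
    BACKCHANNEL = "backchannel"
    INTERRUPTION = "interruption"


# Toy stand-in: a real classifier would also use audio features and context.
BACKCHANNEL_PHRASES = {"uh-huh", "yeah", "amazing", "mm-hmm", "right", "okay"}


def classify_overlap(transcript: str) -> UserAudio:
    """Classify speech heard while the agent is talking (lexical toy version)."""
    normalized = transcript.strip().lower().rstrip(".!,")
    if normalized in BACKCHANNEL_PHRASES:
        return UserAudio.BACKCHANNEL
    return UserAudio.INTERRUPTION


def agent_should_stop_speaking(transcript: str) -> bool:
    """Stop playback only for a real interruption."""
    return classify_overlap(transcript) is UserAudio.INTERRUPTION


print(agent_should_stop_speaking("Uh-huh."))                      # False
print(agent_should_stop_speaking("Wait, that's the wrong date"))  # True
```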

4. The LLM Gateway

The LLM Gateway is the routing and orchestration layer that sits between the conversation and the language model. It handles:

Prompt management. Your system prompt passes through here and gets injected into each LLM call. This is where your agent's persona, instructions, and guardrails live.

Tool calling. Define tools via JSON Schema, and the gateway handles structured output parsing and routing. If the user asks to reschedule an appointment, the gateway parses the LLM's tool call and fires the right function. This is standard JSON throughout—no proprietary format to learn.

Live configuration updates. This is one of the more useful architectural decisions: you can update the system prompt, tool definitions, or turn detection settings mid-conversation without reconnecting. A user escalates from a self-service flow to a billing dispute? Update the system prompt for the new context. No dropped connection, no re-authentication, no awkward pause.
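For the tool calling item above, here is a rough sketch of what a JSON Schema tool definition and the routing step can look like. The wrapper keys ("name", "description", "parameters"), the shape of the tool call payload, and the reschedule_appointment function are assumptions for illustration, not the documented wire format.

```python
import json

# A tool described with JSON Schema (assumed wrapper shape).
RESCHEDULE_TOOL = {
    "name": "reschedule_appointment",
    "description": "Move an existing appointment to a new date.",
    "parameters": {
        "type": "object",
        "properties": {
            "appointment_id": {"type": "string"},
            "new_date": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["appointment_id", "new_date"],
    },
}


def reschedule_appointment(appointment_id: str, new_date: str) -> dict:
    """Hypothetical business function; a real one would call your scheduler."""
    return {"appointment_id": appointment_id, "new_date": new_date, "status": "rescheduled"}


# Map each tool name to its schema and the function that implements it.
TOOLS = {RESCHEDULE_TOOL["name"]: (RESCHEDULE_TOOL, reschedule_appointment)}


def dispatch_tool_call(raw_call: str) -> dict:
    """Parse a tool call emitted as JSON and route it to the matching handler."""
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    schema, handler = TOOLS[name]
    missing = [key for key in schema["parameters"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return handler(**args)


result = dispatch_tool_call(
    '{"name": "reschedule_appointment",'
    ' "arguments": {"appointment_id": "apt_42", "new_date": "2026-03-15"}}'
)
print(result["status"])  # rescheduled
```

Checking required arguments before calling the handler keeps a malformed tool call from turning into a confusing exception deep inside your business logic.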

5. TTS—voice generation

The audio path closes with text-to-speech: the LLM's response is converted to audio and streamed back to the user. The voices are tuned for conversational cadence, meaning back-and-forth dialogue rather than narration or presentation.

One honest note here, from AssemblyAI's own internal team: the TTS layer isn't the strongest component in the stack. The explicit position is that world-class STT with good-enough TTS still delivers better voice agent outcomes than the reverse. When the agent actually understands what the customer said—gets the entity right, handles the accent, catches the product name—the quality of the downstream TTS matters less. The bottleneck in most failed voice interactions is comprehension, not voice quality. That said, TTS improvements are in progress, and this is a clearly acknowledged area of investment.

6. Session management

The pipeline includes session resumption with a 30-second reconnect window. If a user loses connectivity briefly—a mobile handoff, a flaky Wi-Fi transition—the session context is preserved and the conversation can continue without starting over. You wouldn't build this yourself on a first pass, and it matters for production reliability.
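To see why this is easy to skip on a first pass, here is a minimal sketch of the bookkeeping a reconnect window implies. Only the 30-second figure comes from the text; the SessionStore class, its method names, and the in-memory storage are illustrative assumptions about how such a feature could work, not a description of the API's internals.

```python
import time
from typing import Callable, Optional

RECONNECT_WINDOW_S = 30.0  # from the text: 30-second reconnect window


class SessionStore:
    """Keeps conversation context for sessions that dropped recently."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._parked: dict[str, tuple[float, list[str]]] = {}

    def park(self, session_id: str, context: list[str]) -> None:
        """Called when the connection drops: remember context and the drop time."""
        self._parked[session_id] = (self._clock(), context)

    def resume(self, session_id: str) -> Optional[list[str]]:
        """Return the saved context if the client is back in time, else None."""
        entry = self._parked.pop(session_id, None)
        if entry is None:
            return None
        dropped_at, context = entry
        if self._clock() - dropped_at > RECONNECT_WINDOW_S:
            return None  # window expired: the conversation starts over
        return context


# Fake clock so the example runs instantly.
now = [0.0]
store = SessionStore(clock=lambda: now[0])

store.park("sess_1", ["user: I'd like to reschedule"])
now[0] = 12.0  # client reconnects 12 s after the drop
resumed = store.resume("sess_1")

store.park("sess_2", ["user: hello"])
now[0] = 50.0  # client reconnects 38 s after the drop at t=12
expired = store.resume("sess_2")

print(resumed is not None, expired is None)  # True True
```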

Where you can customize

A common failure mode of managed APIs is the hidden parameter: the thing that's configurable but buried in a footnote. Here's a direct accounting of what's yours to control.

| You control | The API manages | Updatable live (no reconnect) |
| --- | --- | --- |
| System prompt and all LLM instructions | STT inference and streaming | System prompt |
| Tool definitions (JSON Schema) | Turn detection model execution | Tool definitions |
| Voice selection | Interruption classification (backchannel vs. intent to speak) | Turn detection settings |
| Turn detection thresholds (min_silence, max_silence) | LLM routing and request formation | Keyterms |
| Keyterms for STT bias, language setting, session configuration | TTS synthesis, audio streaming, echo cancellation, audio format handling | n/a |

The honest framing: if you want to swap in a different LLM provider at the infrastructure level, or run your own fine-tuned TTS model, this isn't the right path. The Voice Agent API is designed for builders who want to configure the conversation—not the infra. If you need full infrastructure portability, the Streaming Speech-to-Text API is the right layer to build on. The full reference lives in our Voice Agents docs and the streaming STT docs.

Observability: one place to look

The most underrated feature of the Voice Agent API isn't a model or a configuration option. It's the fact that the entire event sequence flows through one conversation view.

When a call fails in a DIY stack, the debug process looks like this: check the STT vendor's dashboard for the transcript, check your LLM provider's logs for the completion, check your TTS provider's logs for the synthesis, then try to stitch together what actually happened in the right order. If the failure happened at the seam between two providers—which is where most failures live—you may never find it.

With the Voice Agent API, the full event sequence lives in one place: speech input, transcript output, LLM request, LLM response, TTS generation, audio output. A failed turn doesn't require cross-provider log correlation. Open the session view, scroll to the turn that failed, and see the full chain. Speaker diarization data sits alongside the transcript for multi-party calls.
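As a rough illustration of why a single ordered event stream helps, the sketch below walks one turn's events in pipeline order and reports the first stage that failed or never happened. The Event fields and stage identifiers are hypothetical; only the six-step sequence itself comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Stage names follow the event sequence listed in the text.
STAGES = ["speech_input", "transcript_output", "llm_request",
          "llm_response", "tts_generation", "audio_output"]


@dataclass
class Event:
    turn: int
    stage: str
    ok: bool
    detail: str = ""


def first_failure(events: list[Event], turn: int) -> Optional[Event]:
    """Return the earliest failed or missing stage for a turn, in pipeline order."""
    by_stage = {e.stage: e for e in events if e.turn == turn}
    for stage in STAGES:
        event = by_stage.get(stage)
        if event is None:
            return Event(turn, stage, ok=False, detail="no event recorded")
        if not event.ok:
            return event
    return None


log = [
    Event(1, "speech_input", True),
    Event(1, "transcript_output", True, "my account number is 4471"),
    Event(1, "llm_request", True),
    Event(1, "llm_response", False, "tool call timed out"),
]

failure = first_failure(log, turn=1)
print(failure.stage, "-", failure.detail)  # llm_response - tool call timed out
```

In a multi-vendor stack the same question requires joining logs from separate systems on timestamps, which is exactly where seam failures get lost.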

Session transcripts are available in the dashboard. You can inspect any conversation, see the transcript with timestamps, and trace what happened at each stage. For teams building support or sales agents where conversation quality is a business metric, this isn't just a debugging convenience—it's a data asset for downstream Speech understanding workflows like sentiment, topic, and summary extraction.

What the API doesn't do

A good abstraction earns trust by naming its limits. Here's what the Voice Agent API doesn't do in its current form:

LLM provider portability. You're using the LLM Gateway, not a custom model endpoint. You can't swap in a fine-tuned model hosted at your own inference endpoint in V1. This is a deliberate scoping decision—the gateway handles routing, prompt injection, and tool call parsing in a way that's tightly integrated with the rest of the pipeline. Custom model endpoints are on the roadmap.

Cloned voices. Voice cloning—training a TTS voice on a specific speaker—isn't available in V1. It's on the roadmap for teams that need brand-specific voice consistency.

Compliance-certified infrastructure integration. If your organization has already built or certified a specific transcription or storage infrastructure—BAA-backed deployment, SOC 2 audited, integrated with your DLP tooling—the Voice Agent API won't slot into that infrastructure in V1. (AssemblyAI is a business associate under HIPAA and offers a standard Business Associate Addendum for customers processing PHI; today that BAA path runs through our Speech-to-Text and Streaming STT products.) That's a use case for the Streaming STT layer, where you control the full stack.

The positioning here is clear: use the Voice Agent API when you're starting fresh, latency is your top priority, and you'd rather ship fast than configure everything yourself. If you have significant existing infrastructure requirements or need full control of every component, the Streaming STT API is the right starting point. Teams running voice agents, meeting intelligence, and AI notetakers often mix both layers.

What $4.50/hr is actually buying you

The pipeline described above runs at $4.50 per agent hour. That's the all-in price: STT, turn detection, interruption handling, LLM routing, TTS, session management, noise cancellation. See full pricing for breakdowns by product.
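A quick back-of-the-envelope calculation shows what that rate means per call. The sketch assumes cost scales linearly with connected time; the text does not state billing granularity or minimums, and the call volume below is a made-up example.

```python
RATE_PER_AGENT_HOUR = 4.50  # USD, from the text


def cost_usd(agent_minutes: float) -> float:
    """Cost of the given number of agent minutes at the flat hourly rate."""
    return agent_minutes * RATE_PER_AGENT_HOUR / 60


# A single 4-minute support call.
print(f"${cost_usd(4):.2f}")  # $0.30

# Hypothetical volume: 10,000 calls a month at 4 minutes each.
monthly = cost_usd(10_000 * 4)
print(f"${monthly:,.2f}")  # $3,000.00
```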

What it buys you isn't opacity. It's not a black box that happens to answer phone calls. It's a six-stage pipeline with named components, configurable behavior, live update capability, and a single observability layer that shows you exactly what happened on every turn of every conversation.

The alternative, building this pipeline yourself, is well understood at this point. The component costs alone approach $4.50/hr before you add the engineering time to integrate them, the operational time to maintain them, and the debugging time when they fail at seams you don't control. The reliability and concurrency characteristics AssemblyAI has built into this pipeline took significant infrastructure investment to get right. You can build it yourself, but you would be starting from zero.

Good abstractions earn trust by being transparent about what they're abstracting. This one is. The components are named. The decisions surface. The tradeoffs are acknowledged. That's the right foundation for production voice infrastructure—not magic, just engineering that you can see.

Frequently asked questions

What is the Voice Agent API?

The Voice Agent API is AssemblyAI's managed pipeline for building real-time voice agents. It combines noise cancellation, the Universal-3 Pro Streaming model, turn detection, an LLM Gateway, and text-to-speech into a single, observable stack accessed through one connection. See the Voice Agents docs for the full API surface.

How does the Voice Agent API differ from the Streaming STT API?

The Streaming STT API exposes only the recognition layer, so you bring your own turn detection, LLM, and TTS. The Voice Agent API bundles the full conversational pipeline—STT, turn detection, LLM routing, TTS, and session management—at $4.50 per agent hour, optimized for fast time to production.

Can I change the system prompt or tools mid-conversation?

Yes. The LLM Gateway accepts live configuration updates for the system prompt, tool definitions, turn detection settings, and keyterms. Changes take effect immediately in the current session—no reconnect, no re-authentication.
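As a sketch of what a mid-conversation update could look like from the client side, the helper below builds a JSON message restricted to the four live-updatable settings. The "session.update" message type, the "config" envelope, and the field names are hypothetical; check the Voice Agents docs for the real format.

```python
import json

# Settings the text lists as updatable without reconnecting.
LIVE_UPDATABLE = {"system_prompt", "tools", "turn_detection", "keyterms"}


def build_update_message(**changes) -> str:
    """Serialize a mid-conversation config change (hypothetical message shape)."""
    unknown = set(changes) - LIVE_UPDATABLE
    if unknown:
        raise ValueError(f"not updatable live: {sorted(unknown)}")
    return json.dumps({"type": "session.update", "config": changes})


message = build_update_message(
    system_prompt="You are now handling a billing dispute. Be precise about amounts.",
    keyterms=["chargeback", "proration"],
)
print(message)
# In a real client this string would be sent over the already-open connection.
```

Rejecting anything outside the live-updatable set on the client side surfaces mistakes early, for example trying to change the voice mid-call when that setting is fixed at session start.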

Which languages does the Voice Agent API support?

The Universal-3 Pro Streaming model currently supports six languages with in-conversation language switching, and additional languages are on the roadmap. Check the documentation for the current list.

Can I use the Voice Agent API for healthcare or PHI workloads?

AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing PHI. Today, BAA-backed deployments run through our Speech-to-Text and Streaming STT products; for Voice Agent API healthcare use cases, contact sales.

How is observability handled across the pipeline?

Every event—speech input, transcript output, LLM request and response, TTS generation, audio output—flows through a single conversation view in the dashboard. Session transcripts include timestamps for each stage, so failed turns can be inspected without correlating logs across vendors.
