May 21, 2026

Why AssemblyAI voice agents are built differently

Why AssemblyAI built its Voice Agent API as a unified pipeline designed for coding agents — not a multi-vendor stack behind a visual UI.

Devon Malloy

Staff Growth Manager

Voice Agent API

Most voice agent tools were built around a specific assumption: that the right way to make something accessible is to put a UI in front of it. Fill in a form, pick a voice from a dropdown, click save. That assumption made sense when it was formed. It's no longer obviously correct.

AssemblyAI built its Voice Agent API around a different assumption: that the best way to build a voice agent is through a coding agent, not a visual interface, and that a well-designed API should make that path faster than any form ever could.

This post explains what that means in practice: what a voice agent is, how the pipeline works, where the industry defaulted to the wrong architecture, and what AssemblyAI chose to do differently.

What a voice agent is

A voice agent is software that can hold a spoken conversation in real time. You speak; it listens, decides what to say, and speaks back. The conversation continues turn by turn.

A voice agent is easy to confuse with three other things. It's not a chatbot with audio added: a chatbot exchanges text asynchronously, so latency and interruption handling never come up as problems. It's not an IVR ("press 1 for billing"): IVR systems navigate rigid menus and respond to button presses, not natural speech. And it's not a transcription service: transcription converts audio to text and stops, while a voice agent uses transcription as one step in a pipeline that also reasons and speaks back.

Where AI voice agents get used: appointment scheduling, lead qualification, clinical intake, customer support, sales training. In every case, the agent holds a conversation so a human doesn't have to, and it does it well enough that the experience feels natural.

The pipeline—what happens during every turn

Every voice agent runs the same sequence on every turn. Understanding it is what makes every other design decision legible.

Listen. The user's audio is captured and converted to text in real time via streaming speech-to-text (STT). The agent doesn't wait for the speaker to finish; it transcribes audio as it arrives.
Detect. The agent determines the user has finished speaking. This is turn detection, and it's harder than it sounds. People pause mid-sentence, trail off, use filler words. The agent has to reliably distinguish "thinking pause" from "done talking" in under a second, across accents and noise conditions.
Reason. The transcript is passed to a language model, which reads the full conversation history and generates a response. This is where the agent's behavior lives: its memory, its role, its ability to call tools like "look up this account" or "book this time slot."
Speak. The response is converted to audio via text-to-speech (TTS) and streamed back. The agent begins speaking before the full response is generated, which is how you get sub-second latency.
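The four steps above can be sketched as a toy turn loop. This is a minimal illustration, not how any real system implements them: the STT output is simulated as (word, pause) pairs, turn detection is a naive silence threshold, the "language model" is a stub that echoes the user, and the "audio" is just encoded text. All function names and the 0.8-second threshold are assumptions made for the example.

```python
from typing import Iterable, Iterator

# (word, pause_after_seconds): a stand-in for streaming STT output plus silence timing.
TimedWord = tuple[str, float]


def listen_and_detect(words: Iterable[TimedWord], end_of_turn_pause: float = 0.8) -> str:
    """Listen + Detect: collect words until a pause reaches the end-of-turn threshold."""
    heard: list[str] = []
    for word, pause_after in words:
        heard.append(word)
        if pause_after >= end_of_turn_pause:
            break  # naive rule: a long enough silence ends the turn
    return " ".join(heard)


def reason(history: list[str]) -> Iterator[str]:
    """Reason: stub language model that streams its reply token by token."""
    reply = f"You said: {history[-1]}"
    yield from reply.split()


def speak(tokens: Iterable[str]) -> Iterator[bytes]:
    """Speak: turn each token into an 'audio' chunk as soon as it arrives."""
    for token in tokens:
        yield token.encode("utf-8")  # placeholder for synthesized audio


def run_turn(words: Iterable[TimedWord], history: list[str]) -> tuple[str, list[bytes]]:
    """Run one full turn and record both sides of it in the history."""
    user_text = listen_and_detect(words)
    history.append(user_text)
    # speak() consumes reason() lazily, so the first chunk exists before the reply is complete.
    audio = list(speak(reason(history)))
    history.append(b" ".join(audio).decode("utf-8"))
    return user_text, audio
```

Because `speak` pulls tokens from `reason` one at a time, the first audio chunk is available before the reply has finished generating, which is the streaming property the Speak step describes. A fixed silence threshold is exactly the kind of rule that confuses a thinking pause with the end of a turn, which is why real turn detection is harder than this.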

The hard problems are in the handoffs. Interruption handling requires stopping TTS output mid-stream, discarding any response already in flight, and resuming transcription, all simultaneously, without any of those systems having direct visibility into what the others are doing. These coordination problems don't exist in text-based AI. They're what make voice genuinely difficult and what most voice agent stacks spend the most engineering time on.
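To make the interruption problem concrete, here is a small simulation using Python's asyncio. It is a sketch under stated assumptions: playback is faked with `asyncio.sleep`, the user's speech is faked with a timer, and the names `speak` and `run_with_barge_in` are invented for this example. It shows only the core move, which is cancelling the speaking task the moment user speech is detected and dropping whatever had not yet been played.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Play queued TTS chunks one at a time (sleep stands in for audio playback)."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def run_with_barge_in(chunks: list[str], user_speaks_after: float) -> tuple[list[str], bool]:
    """Speak a reply, but stop immediately if the user starts talking."""
    played: list[str] = []
    user_spoke = asyncio.Event()
    # Simulate the user starting to talk partway through the agent's reply.
    asyncio.get_running_loop().call_later(user_speaks_after, user_spoke.set)

    speaking = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(user_spoke.wait())
    await asyncio.wait({speaking, barge_in}, return_when=asyncio.FIRST_COMPLETED)

    interrupted = not speaking.done()
    if interrupted:
        speaking.cancel()  # stop TTS mid-stream; unplayed chunks are discarded
        try:
            await speaking
        except asyncio.CancelledError:
            pass
    else:
        barge_in.cancel()  # the reply finished before the user spoke
    return played, interrupted
```

Calling `asyncio.run(run_with_barge_in(["a", "b", "c", "d", "e"], 0.06))` cuts the reply off after the first couple of chunks. In this toy version everything lives in one process and one event loop, so the cancellation is trivial; the difficulty described above comes from doing the same thing across separate services that cannot see each other's state.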

How most of the industry built the pipeline—and why it creates problems

The default approach to building a voice agent is to wire together three separate vendors: one for STT, one for an LLM, and one for TTS. Each component is independently best-in-class. The tradeoff is that coordinating them falls on the developer. When choosing an STT API for voice agents, this coordination cost should be a primary consideration.

That coordination cost is real and ongoing. Four log streams when something breaks. Four billing relationships with different pricing models and different ways of calculating a "turn." Four sets of API credentials to rotate. Four release cycles producing breaking changes on someone else's schedule. And critically, four independent systems that don't share context, which means turn detection and interruption handling have to be built as glue code that sits between components with separate internal clocks, separate streaming formats, and no shared state.

If each of those four dependencies runs at 99.9% availability, the assembled pipeline runs at roughly 99.6% (0.999^4 ≈ 0.996) before any code of yours has run. That gap is a structural property of the architecture, not a quality problem at any individual vendor.
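The arithmetic behind that figure is easy to check. A minimal sketch, assuming every component is required on every turn and that failures are independent (both simplifications): availability multiplies across serial dependencies. The 99.6% figure corresponds to four dependencies at 99.9% each; with exactly three, the same formula gives roughly 99.7%.

```python
def serial_availability(per_component: float, components: int) -> float:
    """Availability of a chain in which every component must be up at once."""
    return per_component ** components


def yearly_downtime_hours(availability: float) -> float:
    """Expected hours of downtime in a 365-day year at the given availability."""
    return (1.0 - availability) * 24 * 365


if __name__ == "__main__":
    for n in (1, 3, 4):
        a = serial_availability(0.999, n)
        print(f"{n} component(s): {a:.2%} available, "
              f"about {yearly_downtime_hours(a):.1f} hours of downtime per year")
```

Each additional serial dependency adds to the expected downtime even when every individual service meets its own target.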

The visual builder tools that sit on top of this architecture hide the complexity behind forms, but they don't reduce it. The coordination burden still exists; it just moves to a place the builder can't see or change.

What AssemblyAI did differently, and why

AssemblyAI built the Voice Agent API as a unified pipeline, not a set of wired-together components. With Universal-3 Pro Streaming as the speech recognition foundation, turn detection, the language model, and voice generation all share context inside the same system. The coordination that would otherwise live in your glue code happens at the infrastructure level instead.

The practical consequence: one WebSocket connection, not four vendor integrations. One log stream when something breaks. One billing relationship at a flat rate per hour. One team accountable for the full pipeline, which means one place to report a problem and one place to look for the fix.

That unification also made a smaller API surface possible. The OpenAI Realtime API (a well-designed product) exposes more than 30 event types, because building on it requires explicit handling of the handoffs between its internal components. The Voice Agent API exposes approximately six. That isn't a limitation on what you can build. It's evidence that the coordination has moved inside the system rather than being distributed to developers.

The simplicity of the surface is what makes the coding agent workflow work. When Claude Code, Cursor, or Copilot reads the Voice Agent API setup prompt and produces a working integration, the output is short, readable, and easy to change, because there isn't much to write. You get a session configuration, a WebSocket connection, and an event loop. The rest is your product logic.
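The shape of "a session configuration, a WebSocket connection, and an event loop" can be sketched without a network. Everything specific here is hypothetical: the configuration fields, the event type names (`transcript`, `agent_audio`, `interrupted`, `error`), and the `run_session` helper are placeholders invented for illustration, not the documented Voice Agent API schema. The connection is abstracted as an iterable of incoming messages and a `send` callable, so a real WebSocket client could be swapped in.

```python
import json
from typing import Callable, Iterable

# Hypothetical session configuration; field names are illustrative only.
SESSION_CONFIG = {
    "system_prompt": "You are a friendly scheduling assistant.",
    "voice": "example-voice",
}


def handle_event(raw: str, state: dict) -> None:
    """Dispatch one incoming message by its (hypothetical) event type."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "transcript":
        state["transcript"].append(event["text"])
    elif kind == "agent_audio":
        state["audio_chunks"] += 1  # a real client would play the audio here
    elif kind == "interrupted":
        state["audio_chunks"] = 0  # drop buffered audio when the user barges in
    elif kind == "error":
        raise RuntimeError(event.get("message", "unknown error"))
    # Unrecognized event types are ignored so new ones don't break the loop.


def run_session(messages: Iterable[str], send: Callable[[str], None]) -> dict:
    """Send the config once, then process events until the stream ends."""
    send(json.dumps({"type": "session_config", **SESSION_CONFIG}))
    state = {"transcript": [], "audio_chunks": 0}
    for raw in messages:
        handle_event(raw, state)
    return state
```

With only a handful of event types, the whole client fits in one dispatch function, which is the property that keeps generated integrations short and readable. The real event names and payloads come from the Voice Agent API docs.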

Why the build interface is a coding agent, not a form

The other major choice AssemblyAI made is about how you interact with the API. Most voice agent platforms offer a visual UI as the primary interface. You navigate forms, change fields, click save. This is the model Vapi built around: an interface designed for the era before coding agents existed, when "accessible" meant "visual."

Coding agents change what accessible means. When a coding agent can read a setup prompt and build a working voice agent project in an afternoon, the bottleneck isn't navigating a form. It's understanding what you're building well enough to describe it. That's a problem education solves, not one that requires a better UI. Developers can vibe code a voice agent with the Voice Agent API — paste the setup prompt into a coding agent and ship a working project the same day.

AssemblyAI's position is that the coding agent is the better long-term build experience. Not because the product is unfinished, but because the interaction model is more powerful. When your agent's behavior lives in code you own, you can change anything (system prompt, voice, tool definitions, turn detection thresholds) by describing the change to the coding agent in plain language. No field to find. No form to navigate. No constraint on what the UI was designed to support.

The tradeoff is that this requires you to understand what a coding agent is and how to use it. That's the gap the companion post closes.

What this means for you as a builder

When you connect to the Voice Agent API, you're not configuring components. You're configuring a complete, coordinated agent. The system prompt is the primary control surface: plain text describing who the agent is, what it should do, and what it should avoid. Changing the agent means changing the prompt or adding tools. The infrastructure stays stable underneath.

The unified pipeline also handles voice agents in noisy environments more effectively than multi-vendor stacks, because the speech recognition and turn detection share context about background audio conditions rather than operating independently.

Five terms you'll encounter in the documentation:

Session: one continuous conversation from connect to disconnect. Context persists across the full session.
Turn: one exchange, consisting of one user utterance and one agent response.
System prompt: the plain-text description of the agent's behavior. This is what you edit most. No code required to change it.
Tool calling: the mechanism for giving the agent the ability to take actions mid-conversation: look up a record, book an appointment, query an API. You define the functions; the agent decides when to call them.
WebSocket: a persistent two-way connection that stays open for the length of the session, streaming audio in both directions continuously. Unlike a standard REST API call, it doesn't close after a single request.
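Of these terms, tool calling is the one that most directly turns into code you write. The sketch below shows one common pattern: a registry of plain functions and a dispatcher that runs whichever one the agent asks for. The JSON shape of the request (`name` and `arguments`), the `tool` decorator, and the `book_appointment` function are assumptions for illustration; the actual tool-call format is defined in the documentation.

```python
import json
from typing import Any, Callable

# Registry of functions the agent is allowed to call, keyed by name.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def book_appointment(day: str, time: str) -> dict:
    """Hypothetical tool: pretend to book a slot and echo it back."""
    return {"status": "booked", "day": day, "time": time}


def dispatch_tool_call(call_json: str) -> str:
    """Run the requested tool and return a JSON result for the agent."""
    call = json.loads(call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        result = fn(**call.get("arguments", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

Returning errors as data rather than raising lets the agent recover mid-conversation, for example by asking the caller for a missing detail.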

Where to go from here

If you've never used a coding agent: start with the companion post, which explains what a coding agent is, what the setup prompt does, and what the build looks like step by step.

If you're ready to build: the Voice Agent API quickstart is at assemblyai.com/docs/voice-agents/voice-agent-api. The setup prompt is in the docs. Paste it into Claude Code or Cursor and the agent will build the project.

If you want to see one working before you write any code: the playground is at assemblyai.com/playground.

The choices AssemblyAI made (unified pipeline, small event surface, coding agent as the build interface) are connected. They reflect a single bet: that the developers who build the best voice agents will be the ones who understand what they're building, own the code that runs it, and have a coding agent to help them change it. This post was the first part of that. The build is next.

Frequently asked questions

What is AssemblyAI's Voice Agent API?

The Voice Agent API is a single WebSocket-based API that handles the full voice agent pipeline — speech-to-text, language model reasoning, and text-to-speech — in one connection. Instead of wiring together separate vendors for each stage, developers connect once and get a coordinated system at a flat rate of $4.50 per hour.

How is a unified pipeline different from wiring together separate vendors?

With separate vendors, you manage four connections, four billing relationships, four log streams, and write glue code to coordinate handoffs between systems that don't share context. A unified pipeline runs all stages inside one system with shared state, which means turn detection and interruption handling work more reliably because each component can see what the others are doing. One connection, one bill, one place to debug.

What makes the Voice Agent API different from the OpenAI Realtime API?

The Voice Agent API exposes approximately six event types compared to over 30 for the OpenAI Realtime API, because the internal coordination is handled by the system rather than delegated to developers. Pricing is a flat $4.50 per hour versus roughly $18 per hour for comparable usage on the Realtime API. The Voice Agent API is purpose-built for speech, while the Realtime API serves a broader multimodal scope.
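To put the two hourly figures in context, here is a small cost comparison. It is a rough sketch that assumes usage is billed purely by connected hours and takes the approximate $18 figure at face value; real bills depend on each provider's actual pricing model.

```python
VOICE_AGENT_RATE = 4.50  # USD per hour, the flat rate quoted above
REALTIME_RATE = 18.00    # USD per hour, the rough estimate quoted above


def usage_cost(hours: float, rate_per_hour: float) -> float:
    """Cost in USD for a number of conversation hours at a flat hourly rate."""
    return round(hours * rate_per_hour, 2)


if __name__ == "__main__":
    for hours in (10, 100, 1000):
        print(f"{hours:>5} h: ${usage_cost(hours, VOICE_AGENT_RATE):,.2f} "
              f"vs ${usage_cost(hours, REALTIME_RATE):,.2f}")
```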

Do I need to use a coding agent to build with the Voice Agent API?

No. The API works with any WebSocket client in any language. However, coding agents like Claude Code or Cursor are the fastest path: you paste the setup prompt, describe what you want, and the coding agent produces a working integration in minutes. The API surface is small enough that the generated code is short, readable, and easy to modify.

What use cases work well with voice agents?

Voice agents are effective for any scenario where a spoken conversation replaces a human interaction. Common use cases include customer support, appointment scheduling, lead qualification, clinical intake, sales coaching and training, and order status inquiries. The key requirement is that the conversation follows patterns structured enough for an AI agent to handle reliably.

How do I get started with the Voice Agent API?

Sign up for a free AssemblyAI account at assemblyai.com/dashboard/signup, grab your API key, and open the Voice Agent API docs. From there, paste the setup prompt into a coding agent like Claude Code or Cursor, describe the voice agent you want to build, and you can have a working prototype the same afternoon.