Insights & Use Cases
April 29, 2026

When to use Voice Agent API vs. Universal-3 Pro Streaming

Voice agent API guide: compare voice agent APIs with Universal-3 Pro Streaming and learn how to evaluate latency, tool calling, speech accuracy, and production fit.


AssemblyAI now offers two paths for building voice agents. The Voice Agent API handles the full pipeline—speech recognition, LLM reasoning, and voice synthesis—over a single WebSocket connection. Universal-3 Pro Streaming is a standalone speech-to-text model you plug into your own stack when you already have an LLM and TTS provider. Both are built on the same underlying speech recognition technology, but they solve different problems.

This article explains how each product works, when to choose one over the other, and what the core capabilities of the Voice Agent API mean in practice—turn detection, interruption handling, tool calling, and session resumption. If you're deciding between building on top of existing infrastructure or letting AssemblyAI handle the full voice pipeline as invisible infrastructure, this is the comparison that matters.

Voice Agent API vs. Universal-3 Pro Streaming: which one do you need?

The decision comes down to one question: do you want to own the full voice pipeline, or do you want someone else to handle it?

Choose the Voice Agent API when you want the fastest path to a working voice agent. You connect to a single WebSocket, stream audio in, and get audio back—STT, LLM, and TTS all handled for you at a flat rate of $4.50/hr. No separate providers to manage, no orchestration framework to learn. Most developers get a working agent running the same afternoon. It's Voice AI infrastructure that's invisible to your end users—they feel like you built the whole thing from scratch.

Choose Universal-3 Pro Streaming standalone when you already have your own LLM and TTS stack and only need the speech-to-text layer. At $0.45/hr for STT only, it gives you ~300ms transcript latency, unlimited concurrent streams, and autoscaling included. You're responsible for stitching together the pipeline, handling turn detection, and managing multiple provider relationships—but you get full control over every component.

Here's the comparison at a glance:

|  | Voice Agent API | U3 Pro Streaming (standalone) | U3 Pro Streaming + orchestrator |
| --- | --- | --- | --- |
| What it provides | Full pipeline: STT + LLM + TTS | STT only | STT + orchestration framework |
| Pricing | $4.50/hr flat (all-in) | $0.45/hr + your LLM + your TTS | $0.45/hr + LLM + TTS + infra |
| Turn detection | Built-in, speech-aware VAD | You build it | Orchestrator handles it |
| Interruption handling | Built-in barge-in | You build it | Orchestrator handles it |
| Tool calling | Built-in via JSON Schema | You build it | Varies by framework |
| Pipeline control | Configuration-level | Full code-level | Framework-level |
| Integration effort | Hours (one WebSocket) | Days–weeks (build pipeline) | Days (learn framework) |
| Best for | Ship fast, own the product | Existing stack, full control | Framework users (LiveKit, Pipecat) |

There's also a middle path. If you're already using an orchestration framework like LiveKit or Pipecat, AssemblyAI ships official plugins for both. You get Universal-3 Pro Streaming's accuracy with the orchestrator handling turn detection and pipeline management—a supported path that sits between the full Voice Agent API and raw standalone STT.

What is a voice agent API?

A voice agent is software that listens to spoken input, understands it, and responds with generated speech—enabling real-time voice conversations between a person and an AI. Unlike the old phone menus that forced you to press numbers, a voice agent handles natural, back-and-forth conversation.

A voice agent API is the interface that gives developers programmatic access to build these voice conversations. Think of it as the set of instructions your code uses to connect to the full voice pipeline—transcribing what someone says, figuring out how to respond, and speaking that response out loud.

But here's where the term gets important: traditionally, building a voice agent meant connecting three separate providers. One for speech-to-text, one for the language model, and one for text-to-speech. A voice agent API replaces all three with a single connection. You send audio in, and spoken responses come back out—without managing three different SDKs, three billing systems, or three sets of logs.

There's also a distinction worth knowing between a voice agent API and a voice agent platform. Platforms like Retell or Vapi are complete solutions with no-code interfaces, pre-built templates, and managed call routing—great for teams that want to move fast without writing code. APIs are raw infrastructure you build on top of, which gives you complete control over what the product looks like and how it behaves.

The quality of the speech-to-text model at the input stage determines everything downstream. If the agent mishears your user, the language model responds to the wrong thing—and the whole conversation breaks down from there. This is why AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming.

How AssemblyAI's Voice Agent API works

The core loop is straightforward: audio comes in, gets converted to text, passes through a language model that generates a response, and converts back to speech that streams out to the user. The entire pipeline runs over a single WebSocket connection—a persistent, two-way connection between your application and the server, meaning audio streams in and responses stream back in real time without reopening the connection for each message.

So what makes this feel like a real conversation rather than a slow call-and-response system? Four capabilities:

  • Turn detection: The API determines when you've finished speaking so the agent knows when to respond—not when you pause, but when you're actually done.
  • Interruption handling: If you speak while the agent is responding, it stops immediately and listens. This is called barge-in, and without it, voice agents feel robotic.
  • Tool calling: The agent can execute real functions during the conversation—checking an account balance, booking a time slot, or looking up a product.
  • Session resumption: If the connection drops, you can reconnect within a 30-second window and the conversation context is preserved, so the user doesn't have to start over.

Turn detection

Turn detection is the mechanism that figures out when you've finished speaking. This sounds simple, but it's one of the hardest problems in Voice AI. People pause mid-sentence to think. They trail off. They leave long gaps before finishing a complex question. A system that mistakes a pause for an end-of-turn will interrupt you constantly.

AssemblyAI's Voice Agent API uses speech-aware voice activity detection (VAD) that distinguishes thoughtful pauses from conversation endings. It combines acoustic signals with conversational context—so it can tell the difference between "I need to check my... account number" and a sentence that's genuinely finished. You can also adjust VAD settings mid-conversation without reconnecting, which matters for workflows where conversation pace varies.
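Because VAD settings can change mid-session, the client side is just another `session.update` message. Here is a minimal sketch of building one; the `session.update` type comes from the API, but the `vad` settings object and its `end_of_turn_silence_ms` field are illustrative assumptions, not documented names.

```javascript
// Build a mid-session update that adjusts turn-detection sensitivity.
// NOTE: the vad object and end_of_turn_silence_ms field below are
// hypothetical names for illustration; check the API reference for
// the actual VAD configuration keys.
function buildVadUpdate(silenceMs) {
  return JSON.stringify({
    type: 'session.update',
    session: {
      vad: { end_of_turn_silence_ms: silenceMs }, // assumed field names
    },
  });
}

// On an already-open WebSocket: ws.send(buildVadUpdate(800));
```

The point is that no reconnect is involved: the same socket that carries audio also carries configuration changes.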

Interruption handling

When you speak while the agent is responding, the agent needs to stop immediately. This is called barge-in—and it's what makes a voice agent feel like a conversation rather than a recording.

In AssemblyAI's Voice Agent API, when the user interrupts, the server emits a reply.done event with a status of "interrupted" and stops generating audio. Your client code flushes the audio buffer and gets ready for the next response. If you've ever used a voice interface that keeps talking over you no matter what you say, you've experienced what happens when barge-in isn't implemented properly.
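A sketch of that client-side logic, under the event shape described above: a `reply.done` event whose status is `"interrupted"` means any queued audio should be flushed. The `player` object here is a hypothetical stand-in for your audio playback layer.

```javascript
// Handle end-of-reply events. Returns true when the reply was cut off
// by barge-in (and the playback buffer was flushed), false otherwise.
function handleReplyDone(event, player) {
  if (event.type !== 'reply.done') return false;
  if (event.status === 'interrupted') {
    player.flush(); // drop any queued reply.audio chunks immediately
    return true;
  }
  return false; // reply finished normally; let remaining audio play out
}
```

Flushing on the client matters because the server has already stopped generating; any chunks still sitting in your playback queue are what would otherwise keep talking over the user.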

Tool calling

Tool calling is what separates a voice agent from a voice chatbot. Without it, the agent can talk—but it can't do anything. With tool calling, the agent executes real functions mid-conversation.

Here's how it works: you register tools using JSON Schema when you configure your session. Each tool is a function your server can run—look up a record, check inventory, confirm a booking. When the agent decides a tool is needed, it calls your function and speaks a natural transition phrase while it waits: "Let me check that for you." Once your function returns a result, the agent incorporates it into its response.

Tools can be updated mid-conversation without reconnecting, which matters for flows where available actions depend on conversation state.
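Servicing a tool call on your side can be sketched as a lookup-and-reply step. The `tool.call` event name comes from the API's event list; the shape of the result message sent back (`tool.result`, `call_id`) and the `lookup_account` implementation are assumptions for illustration only.

```javascript
// Map of tool names (as registered via JSON Schema) to implementations.
const toolHandlers = {
  // Hypothetical lookup_account implementation; a real one would query
  // your database or an internal API.
  lookup_account: async ({ email }) => ({ email, plan: 'pro' }),
};

// Given a tool.call event from the server, run the matching function
// and build the message to send back over the WebSocket.
async function handleToolCall(event) {
  const handler = toolHandlers[event.name];
  if (!handler) throw new Error(`Unknown tool: ${event.name}`);
  const result = await handler(event.arguments);
  return JSON.stringify({
    type: 'tool.result',    // assumed response message type
    call_id: event.call_id, // assumed correlation field
    result,
  });
}
```

While your function runs, the agent covers the wait with a transition phrase, so latency in your handler shows up as dead air only once it exceeds what a "Let me check that for you" can cover.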

Session resumption

Network connections drop. Session resumption means the user doesn't pay for it. If the WebSocket disconnects, you can reconnect within a 30-second window and resume with the full conversation context intact—no repeated introductions, no lost information.
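On the client, resumption is a timing decision plus a reconnect. The 30-second window is from the article; how the session is re-identified on reconnect (a `session_id` query parameter in this sketch) is an assumption, so treat the URL shape as hypothetical.

```javascript
// The documented resumption window: reconnect within 30 seconds and the
// conversation context is preserved.
const RESUME_WINDOW_MS = 30_000;

// Decide whether a dropped session is still resumable.
function canResume(disconnectedAtMs, nowMs) {
  return nowMs - disconnectedAtMs <= RESUME_WINDOW_MS;
}

// Hypothetical reconnect URL: pass the prior session id so the server
// can restore context. The actual mechanism may differ; see the docs.
function reconnectUrl(baseUrl, sessionId) {
  return `${baseUrl}?session_id=${encodeURIComponent(sessionId)}`;
}
```

Past the window, fall back to starting a fresh session rather than retrying indefinitely against state the server has already discarded.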

Hear the difference accuracy makes

Stream audio in your browser and compare partial and final transcripts side by side. See why Universal-3 Pro Streaming is the foundation for production voice agents.

Try playground

What you can build with the Voice Agent API

The Voice Agent API works for any application where voice is the natural interface. The most common production deployments break into a few clear categories:

Customer support agents

Voice agents that handle routine inquiries—password resets, order status, billing questions—without a human on the line. The key requirement here is speech accuracy. If the agent mishears an account number, product code, or email address, the interaction fails immediately.

Entity recognition matters a lot in this context: the difference between hearing "B as in boy" and "D as in dog" determines whether the customer gets helped or gives up. Universal-3 Pro is built for exactly this problem, with 92.7% mixed-entity accuracy: speech recognition trained on real-world audio (accented speech, background noise, overlapping sound) performs dramatically better than generic models trained on clean recordings.

AI companions and coaching apps

These are conversational applications where the interaction itself is the product. Language practice, sales training simulations, interview prep tools, and mental wellness check-ins all fall into this category.

What's different here is the priority. These use cases care less about entity recognition and more about natural conversation flow. Turn detection quality matters more—the agent needs to feel like a good conversation partner, not a system waiting for a keyword. The Voice Agent API's speech-aware VAD and intelligent barge-in handling are designed for exactly this kind of natural back-and-forth.

Clinical and intake workflows

Voice interfaces for patient intake, triage documentation, and clinical data capture come with the highest accuracy requirements of any Voice AI use case. Medical terminology, medication names, and dosages must be transcribed correctly—there's no tolerance for error when documenting a prescription.

The Voice Agent API supports Medical Mode, which enhances transcription accuracy for clinical terminology, and can be combined with keyterms prompting for proper-noun accuracy on specific drug names and procedures. These workflows also involve protected health information. AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA) required under HIPAA to ensure PHI is appropriately safeguarded.

How to choose a voice agent API

When you're evaluating options, these are the criteria that matter in production—not features listed on a pricing page, but the things that determine whether your product works and your team can maintain it.

Speech accuracy: The speech-to-text model is the foundation. Errors at the transcription layer cascade through the entire pipeline. Look for real-world benchmark results on mixed entities—not just word error rate on clean-audio test sets. AssemblyAI's Voice Agent API runs on Universal-3 Pro Streaming.

Latency: Conversations feel natural under one second of end-to-end response time. Over two seconds, they feel broken. The Voice Agent API targets ~1 second end-to-end latency. Ask any provider for specific p50 and p99 numbers—not ranges, not theoretical minimums.

Developer experience: Can you read the full API reference in one sitting? Can you get a working agent running in an afternoon? The Voice Agent API is standard JSON over a WebSocket—no proprietary SDK required. Copy the docs, paste them into Claude Code, and build what you're imagining. If that workflow doesn't work, the API isn't developer-friendly enough.

Pricing model: Token-based billing across three separate providers—STT, LLM, TTS—makes cost modeling nearly impossible. The OpenAI Realtime API, for example, runs roughly $18/hr with per-token billing across 30+ event types. The Voice Agent API is $4.50/hr flat, covering everything. One bill measured in hours, not token math across three invoices.

Customization and control: Can you update the system prompt, available tools, and voice mid-conversation? The Voice Agent API supports live mid-session updates to all of these—no reconnection, no redeployment. If you need even deeper control over individual pipeline components, that's when Universal-3 Pro Streaming standalone or an orchestration framework makes more sense.

Observability: Debugging voice agents is hard enough. Debugging across three providers with three separate log streams is painful. Unified logging is a genuine quality-of-life improvement that compounds as your product scales.

Build with the Voice Agent API

Create an account to integrate a single WebSocket that unifies STT, LLM, and TTS with flat, predictable pricing at $4.50/hr.

Sign up free

How to get started with the Voice Agent API

The basic pattern for connecting to AssemblyAI's Voice Agent API follows five steps:

  1. Connect to the WebSocket endpoint
  2. Send session.update with your system prompt, voice selection, and registered tools
  3. Stream PCM16 audio via input.audio messages
  4. Receive reply.audio chunks and play them back
  5. Handle events that signal conversation state changes

Here's what a minimal connection and session configuration looks like:

// Open the Voice Agent WebSocket, then configure the session on connect.
const ws = new WebSocket('wss://agents.assemblyai.com/v1/voice');

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      system_prompt: 'You are a helpful support agent.',
      voice: 'ivy',
      // Tools are registered as JSON Schema function definitions.
      tools: [{
        type: 'function',
        name: 'lookup_account',
        description: 'Look up a customer account by email',
        parameters: {
          type: 'object',
          properties: { email: { type: 'string' } },
          required: ['email']
        }
      }]
    }
  }));
};

No SDK required. This is a standard WebSocket and JSON—it works natively in any language or framework. You can also paste the full API reference into Claude Code and have it scaffold your integration for you.
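Step 3 of the pattern above, streaming PCM16 audio in via `input.audio` messages, can be sketched as a small encoder. The message type is from the steps listed here; the base64 `audio` field name is an assumption for illustration.

```javascript
// Wrap a chunk of raw 16-bit PCM samples in an input.audio message.
// NOTE: the base64 'audio' field name is assumed; confirm the exact
// payload shape in the API reference.
function encodeAudioMessage(pcm16Chunk /* Node Buffer of PCM16 samples */) {
  return JSON.stringify({
    type: 'input.audio',
    audio: pcm16Chunk.toString('base64'),
  });
}

// On an open socket, send chunks as your microphone capture produces
// them: ws.send(encodeAudioMessage(chunk));
```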

The event flow follows a consistent sequence: audio arrives → speech is detected → transcript is generated → reply starts → audio streams out → reply completes. When a tool call is triggered, the agent pauses naturally while your function executes, then incorporates the result into its response.
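That sequence maps naturally onto a single dispatcher in your WebSocket message handler. The event names below are the ones this article lists; what each branch does with them (queueing audio, displaying transcripts) is an illustrative assumption about your client, and the `text`/`audio` payload field names are assumed.

```javascript
// Route server events to the relevant parts of a hypothetical client.
function dispatch(event, client) {
  switch (event.type) {
    case 'session.ready':   client.onReady(); break;
    case 'transcript.user': client.showTranscript(event.text); break; // field name assumed
    case 'reply.audio':     client.enqueueAudio(event.audio); break;  // field name assumed
    case 'reply.done':      client.onReplyDone(event.status); break;
    case 'tool.call':       client.runTool(event); break;
    default:                break; // ignore events this client doesn't use
  }
}

// Wired up as: ws.onmessage = (m) => dispatch(JSON.parse(m.data), client);
```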

| Event | Direction | What it signals |
| --- | --- | --- |
| session.ready | Server → Client | Connection established, ready for audio |
| input.speech.started | Server → Client | User began speaking |
| input.speech.stopped | Server → Client | User finished speaking |
| transcript.user | Server → Client | User's speech transcribed |
| reply.started | Server → Client | Agent beginning response |
| reply.audio | Server → Client | Audio chunk ready to play |
| reply.done | Server → Client | Agent response complete (or interrupted) |
| tool.call | Server → Client | Function execution needed |

The fastest way to know whether the Voice Agent API is right for your use case is to talk to it. If the conversation feels natural—if the turn detection is tight, the voice sounds human, and responses come back quickly—the technical integration will follow. Try the live demo before writing any code.

Final words

The question isn't whether you need voice—it's how much of the pipeline you want to own. If you already have an LLM and TTS stack you're happy with, Universal-3 Pro Streaming gives you the industry's best speech-to-text layer at $0.45/hr and lets you keep control of everything else. If you'd rather focus on what your agent does instead of how the audio flows, the Voice Agent API handles the full pipeline at $4.50/hr over a single WebSocket—invisible infrastructure that lets your product be the star.

Both paths are built on the same Universal-3 Pro Streaming foundation. The gap between them will keep narrowing as the Voice Agent API adds more customization surface and the standalone STT gains richer framework integrations. For now, the right choice depends on where your team wants to spend its engineering time: building pipeline plumbing, or building the product that sits on top of it.

Start building your voice agent today

Set up your API key and launch a session over one WebSocket using standard JSON—no SDK required. $4.50/hr, everything included.

Sign up free

Frequently asked questions

What's the difference between a voice agent API and a voice agent platform?

A voice agent API is raw infrastructure—a WebSocket connection you build on top of to create your own product. A voice agent platform includes pre-built UIs, no-code configuration, and managed call routing. APIs give you more control; platforms give you faster initial setup.

When should I use Universal-3 Pro Streaming instead of the Voice Agent API?

Use Universal-3 Pro Streaming standalone when you already have your own LLM and TTS providers and want to own the full pipeline. At $0.45/hr for STT only, it's the right choice when you need code-level control over every component. Use the Voice Agent API when you want the full pipeline handled for you at $4.50/hr flat.

Which languages does AssemblyAI's Voice Agent API support?

AssemblyAI's Voice Agent API supports English, Spanish, French, German, Italian, and Portuguese, with speech recognition powered by Universal-3 Pro across all six languages.

How is the Voice Agent API priced compared to managing separate STT, LLM, and TTS providers?

Managing three separate providers means three invoices, three usage dashboards, and token math that's hard to forecast. The OpenAI Realtime API, for comparison, runs roughly $18/hr with per-token billing. AssemblyAI's Voice Agent API is $4.50/hr flat—one bill measured in hours, covering STT, LLM, and TTS.

How long does it take to get a working voice agent running with this API?

Most developers get a working agent running in an afternoon. The API is a standard WebSocket and JSON—no proprietary SDK to learn, and the full API reference is short enough to read in one sitting.

Can a voice agent API handle medical use cases that involve protected health information?

Yes, with the right provider. AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI) and offers a Business Associate Addendum (BAA) required under HIPAA. The Voice Agent API also supports Medical Mode for enhanced clinical terminology accuracy.

AI voice agents
Universal-3 Pro Streaming
Voice Agent API