May 19, 2026

Using the Voice Agent API alongside an existing voice stack

If you're already on LiveKit or Pipecat with Deepgram, you don't have to rebuild. Here's exactly what to evaluate — and what to change — to get Universal-3 Pro accuracy without touching your architecture.

Devon Malloy

Staff Growth Manager

The standard sales playbook for a new voice agent platform tells you your existing architecture is wrong and you should rebuild from scratch.

This post doesn't do that.

If you're on LiveKit with Deepgram today, you made a reasonable decision with the information you had. You've written the glue code. You know which turn detection thresholds work for your use case. You've debugged the STT latency issues. You have working production deployments. Rebuilding that from zero has a real cost—and no one should ask you to eat it just because a new provider launched.

This post is written specifically for you: the developer who has already committed to a cascaded stack and is evaluating whether to change anything, and if so, what.

The short answer: you don't have to change your architecture to get Universal-3 Pro's accuracy. You just have to change one line.

The two lanes, named honestly

AssemblyAI's Voice AI offering has two distinct paths. Neither is the "real" product and neither is the consolation prize—they're designed for genuinely different situations.

Lane 1: Voice Agent API ($4.50/hr)

One WebSocket. Everything managed. AssemblyAI handles speech-to-text, turn detection, LLM routing, and TTS. This lane is for new builds—teams that want to skip the infrastructure layer entirely and go straight to product. If you're starting from zero, or if you want a proof-of-concept running in an afternoon, this is the path.

Lane 2: Universal-3 Pro Streaming STT ($0.45/hr)

This is for teams already on LiveKit, Pipecat, Vapi, or any other orchestration framework. Plug the Universal-3 Pro Streaming model into your pipeline as the STT layer. Keep everything else—your LLM, your TTS, your turn detection logic, your existing integrations. Better accuracy immediately, without rebuilding anything.

Both paths run on the same underlying model. The listening layer is identical. The architecture around it is your choice.
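
To make "the architecture around it is your choice" concrete, here is a minimal Python sketch of a cascaded pipeline in which the STT stage is a single swappable field. Everything in it is illustrative: `VoicePipeline`, `current_stt`, `candidate_stt`, and the stand-in LLM and TTS functions are hypothetical names, not LiveKit, Pipecat, or AssemblyAI APIs.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Each stage is modeled as a plain callable. All names are hypothetical.
Transcriber = Callable[[bytes], str]
Responder = Callable[[str], str]
Synthesizer = Callable[[str], bytes]


@dataclass(frozen=True)
class VoicePipeline:
    stt: Transcriber
    llm: Responder
    tts: Synthesizer

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one caller turn through STT -> LLM -> TTS."""
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


def current_stt(audio: bytes) -> str:
    # Stand-in for the STT provider already in production.
    return "current:" + audio.decode()


def candidate_stt(audio: bytes) -> str:
    # Stand-in for the STT provider under evaluation.
    return "candidate:" + audio.decode()


def echo_llm(text: str) -> str:
    # Stand-in for the LLM stage.
    return f"reply({text})"


def fake_tts(text: str) -> bytes:
    # Stand-in for the TTS stage.
    return text.encode()


existing = VoicePipeline(stt=current_stt, llm=echo_llm, tts=fake_tts)

# Swap only the STT stage; the LLM and TTS stages are reused untouched.
swapped = replace(existing, stt=candidate_stt)
```

In a real framework the swap happens in that framework's STT configuration, but the shape is the same: one stage changes and the rest is reused.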

One of our enterprise customers—a team building highly custom voice workflows—put it plainly: "We wouldn't use a centralized give-us-a-prompt-we-wire-the-rest API. Our stack is highly custom. But we'd be interested in modular APIs we can plug into our own pipeline." Lane 2 exists exactly for that team.

Why STT is still the lever worth pulling

You control your LLM. You control your TTS. But your STT model is the foundation—it determines what your LLM gets to reason about.

If the transcription is wrong, everything downstream inherits the error. The LLM can't recover a hallucinated entity from a bad transcript. The TTS can't fix an answer that was based on a misheard account number. Accuracy at the listening layer isn't just a nice-to-have—it's the input quality constraint for your entire pipeline.

Swapping Deepgram Nova-3 for Universal-3 Pro Streaming doesn't require rebuilding anything. It's an endpoint change. What you get:

Accuracy on the things that actually matter. In independent benchmarks, Universal-3 Pro Streaming shows a 16.7% missed entity rate versus 25.5% for Deepgram Nova-3, so every prescription number, account ID, product name, and customer surname is more likely to be transcribed correctly. In the Voice Agent Report, 76% of developers rate accuracy as the most critical non-negotiable when choosing a voice stack. Entity accuracy in particular is the metric that predicts whether the agent gives the right answer downstream. (For background on how accuracy is measured, see our primer on word error rate.)

Promptable STT. You can tell the model what domain it's in and what vocabulary to expect, and you can update that context dynamically mid-conversation. If your agent handles both billing questions and technical support, you can shift the model's context as the conversation pivots. No major STT provider currently offers this.

Keyterms boosting. Specific words—product names, competitor names, internal terminology—can be explicitly boosted. The model pays more attention to them.

Six languages with consistent accuracy. No degradation on non-English calls.
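
As a rough illustration of how prompting and keyterms boosting might be organized on your side, the sketch below keeps one set of hints per conversational intent and switches sets when the conversation pivots. The `SttContext` class, the intents, the example terms, and the `update_context` payload are all assumptions made for illustration; the real parameter names and update mechanism are whatever the streaming docs define, not what is shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttContext:
    """Hypothetical container for the hints sent to a promptable STT model."""
    prompt: str
    keyterms: tuple[str, ...] = ()


# Placeholder domains and vocabulary; replace with your own.
CONTEXTS = {
    "billing": SttContext(
        prompt="Customer billing call. Expect invoice numbers and plan names.",
        keyterms=("autopay", "proration", "Acme Plus"),
    ),
    "support": SttContext(
        prompt="Technical support call. Expect device models and error codes.",
        keyterms=("firmware", "Acme Router X2", "error E-104"),
    ),
}
DEFAULT = SttContext(prompt="General customer service call.")


def context_for(intent: str) -> SttContext:
    """Pick the STT hints for the current conversational intent."""
    return CONTEXTS.get(intent, DEFAULT)


def as_update_message(ctx: SttContext) -> dict:
    """Build an illustrative mid-session update payload (field names are made up)."""
    return {
        "type": "update_context",
        "prompt": ctx.prompt,
        "keyterms": list(ctx.keyterms),
    }
```

Keeping the hints in one lookup table means the pivot from billing to support is a single call, for example `as_update_message(context_for("support"))`, whenever your intent classifier or LLM signals the change.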

Here's how the two lanes line up at a glance:

Capability | Lane 1: Voice Agent API | Lane 2: Universal-3 Pro Streaming
Price | $4.50/hr | $0.45/hr
Best for | New builds, fast proofs-of-concept | Teams already on LiveKit, Pipecat, Vapi
What's managed | STT, turn detection, LLM routing, TTS | STT only; you keep the rest of your stack
Integration effort | One WebSocket | Endpoint swap inside your existing pipeline

One insight worth naming directly: not every application benefits equally from improved entity accuracy. If your voice agent is scheduling appointments or handling simple FAQs, Deepgram Nova-3 may be genuinely sufficient. But if you're in healthcare, financial services, insurance, or any domain with specific vocabulary—where a misheard entity doesn't just cause confusion, it causes a wrong answer—that 8.8 percentage point gap in entity accuracy is the specific metric that predicts whether your agent works in production. It's not an edge case. It's the business-critical path. For healthcare deployments specifically, AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing PHI.
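
For a sense of scale, the arithmetic behind that gap can be worked through in a few lines. The two rates are the benchmark figures quoted in this post; the per-1,000-entities view is only an illustration and assumes the benchmark rates carry over to your traffic, which is precisely what the evaluation is meant to check.

```python
MISSED_U3_PRO = 0.167  # missed entity rate quoted for Universal-3 Pro
MISSED_NOVA_3 = 0.255  # missed entity rate quoted for Deepgram Nova-3


def gap_points(baseline: float, candidate: float) -> float:
    """Absolute gap between two rates, in percentage points."""
    return round((baseline - candidate) * 100, 1)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Share of the baseline's misses that the candidate avoids, in percent."""
    return round((baseline - candidate) / baseline * 100, 1)


def expected_misses(rate: float, entities: int) -> float:
    """Expected number of missed entities at a given rate."""
    return round(rate * entities, 1)


print(gap_points(MISSED_NOVA_3, MISSED_U3_PRO))          # 8.8
print(relative_reduction(MISSED_NOVA_3, MISSED_U3_PRO))  # 34.5
print(expected_misses(MISSED_NOVA_3, 1000))              # 255.0
print(expected_misses(MISSED_U3_PRO, 1000))              # 167.0
```

Put differently, at these rates roughly a third of the entities the baseline misses would be recovered, or about 88 fewer missed entities per 1,000 spoken.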

How to evaluate it without disrupting your stack

The evaluation doesn't require a migration decision. Run both in parallel and measure on your actual traffic.

Step 1: Grab an API key. Free tier, no credit card. Takes two minutes.

Step 2: Update your STT configuration. Both LiveKit and Pipecat have integration guides that walk through the specific parameters for Universal-3 Pro Streaming in the streaming docs. It's a configuration change—you're not touching your LLM routing, your TTS layer, or your turn detection logic.

Step 3: Run both in parallel for a week. Shadow the Universal-3 Pro output against your current real-time speech recognition output on the same audio streams. You don't need to switch production traffic over; you're just comparing transcription quality on real inputs.

Step 4: Measure what matters for your use case. Transcription accuracy on your specific entity types. Latency. Turn detection behavior. The last one sometimes surprises teams: better transcription at the STT layer can cut down on spurious clarification turns downstream, because the LLM is less likely to get confused by a garbled entity and stop to ask the caller to repeat it. If your workflow depends on knowing who said what, speaker diarization is worth evaluating alongside it.
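
The shadowing in Steps 3 and 4 could be wired up along these lines. This is a minimal sketch, assuming each provider is wrapped in an async function that takes audio bytes and returns text; `shadow_turn`, `timed`, and the log format are hypothetical helpers, not part of any SDK. The shadow provider runs as a background task so it never delays or breaks the production path.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

Transcribe = Callable[[bytes], Awaitable[str]]


async def timed(name: str, fn: Transcribe, chunk: bytes) -> dict:
    """Run one transcriber and capture its text, latency, and any error."""
    start = time.perf_counter()
    try:
        text, error = await fn(chunk), None
    except Exception as exc:  # a failing provider must not break the call
        text, error = None, repr(exc)
    return {
        "provider": name,
        "text": text,
        "error": error,
        "latency_s": time.perf_counter() - start,
    }


async def shadow_turn(
    chunk: bytes,
    primary: Transcribe,
    shadow: Transcribe,
    log: list,
    background: set,
) -> Optional[str]:
    """Send the same audio to both providers; only the primary text is returned."""
    shadow_task = asyncio.create_task(timed("shadow", shadow, chunk))
    background.add(shadow_task)  # hold a reference until the task finishes

    # Returns None if the primary failed; real code would keep its own error handling.
    primary_result = await timed("primary", primary, chunk)

    def record(task: asyncio.Task) -> None:
        background.discard(task)
        if not task.cancelled():
            log.append({"primary": primary_result, "shadow": task.result()})

    shadow_task.add_done_callback(record)
    return primary_result["text"]
```

Each log entry pairs the two transcripts for the same audio with their latencies, which is the raw material for the accuracy and latency comparisons in Step 4.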

A practical note from enterprise evaluation playbooks: the most useful test isn't a generic accuracy benchmark—it's running the model against your hardest cases. Pull a set of calls where your current STT made errors, run them through Universal-3 Pro, and count how many of those specific errors are fixed. That's the number that actually tells you whether the switch is worth it for your application.
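
One way to turn that hardest-cases test into a number is sketched below. It assumes you have, for each call, a hand-labeled list of the entities that should appear plus the transcript from each provider; the case format and function names are hypothetical, and the matching is deliberately simple (case-insensitive, punctuation-insensitive, whole-token).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation so 'RX-4471' matches 'rx 4471'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def missed(entities: list[str], transcript: str) -> list[str]:
    """Return the expected entities that do not appear in the transcript."""
    haystack = f" {normalize(transcript)} "
    return [e for e in entities if f" {normalize(e)} " not in haystack]


def fixed_errors(cases: list[dict]) -> dict:
    """Count entities the current STT missed and how many the candidate recovers."""
    missed_before = fixed = regressions = 0
    for case in cases:
        before = set(missed(case["entities"], case["current"]))
        after = set(missed(case["entities"], case["candidate"]))
        missed_before += len(before)
        fixed += len(before - after)        # missed before, correct now
        regressions += len(after - before)  # correct before, missed now
    return {
        "missed_before": missed_before,
        "fixed": fixed,
        "regressions": regressions,
    }
```

Tracking regressions alongside fixes matters: a candidate that recovers ten entities but newly breaks eight is a much weaker result than the fix count alone suggests.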

The evaluation is low-risk by design. If Universal-3 Pro doesn't outperform your current provider on your specific traffic, all you've spent is a free API key and a week of shadow testing. If it does, you swap the endpoint and you're done.

When the Voice Agent API becomes interesting to you

This is the part where a typical sales post would try to convince you to migrate.

Instead: file this away for later.

If you ever start a new project from zero—a new product line, a proof of concept for a different use case, a V2 where you're rethinking the infrastructure—the Voice Agent API is the other lane. Some teams use it to validate a voice use case quickly, before committing to a fully orchestrated stack. Others use it permanently for simpler workflows and maintain a custom stack for more complex ones. Adjacent use cases like meeting intelligence and AI notetakers often start in Lane 1 for the same reason.

The two lanes aren't mutually exclusive long-term. You may use both for different products or different stages of your build. The speech-to-text layer is the same either way.

For now: Lane 2 is the immediate lever. The Voice Agent API is something to know exists when the next greenfield opportunity comes up.

The work you did doesn't need to be undone

You built something real. The glue code, the threshold tuning, the debugging—that's real infrastructure with real value. Nobody should tell you to throw it away.

The thing worth evaluating is narrower: whether the listening layer is as good as it can be. If the transcription is wrong, everything downstream inherits the error. That's true regardless of how good your LLM is, how polished your TTS sounds, or how carefully you've tuned your turn detection.

Lane 2 exists specifically for you. Same model. Same accuracy. Plugs into what you already have.

Either way, the listening is ours.

Frequently asked questions

Can I use Universal-3 Pro Streaming with LiveKit or Pipecat without rebuilding my voice agent?

Yes. Universal-3 Pro Streaming is designed to drop into existing orchestration frameworks like LiveKit, Pipecat, and Vapi as the STT layer. You keep your LLM, TTS, and turn detection logic in place—only the speech-to-text endpoint changes. See the docs for integration parameters.

What's the difference between AssemblyAI's Voice Agent API and Universal-3 Pro Streaming?

The Voice Agent API is a fully managed path—one WebSocket that handles STT, turn detection, LLM routing, and TTS for $4.50/hr. Universal-3 Pro Streaming is a standalone STT model at $0.45/hr that plugs into any orchestration framework. Both run on the same underlying listening model; the difference is how much of the stack AssemblyAI manages.

How does Universal-3 Pro compare to Deepgram Nova-3 on entity accuracy?

In independent benchmarks, Universal-3 Pro Streaming shows a 16.7% missed entity rate compared to 25.5% for Deepgram Nova-3—an 8.8 percentage point gap on entities like account IDs, prescription numbers, and customer surnames. For domain-specific voice agents in healthcare, finance, or insurance, that gap often determines whether the agent gives the right answer.

How do I evaluate Universal-3 Pro Streaming on production traffic without switching over?

Run a shadow test. Grab a free API key, point a parallel pipeline at Universal-3 Pro Streaming, and compare transcripts against your current STT on the same audio streams for about a week. Measure transcription accuracy on your hardest cases—calls where your current provider made errors—rather than generic benchmarks.

Is AssemblyAI suitable for healthcare voice agents that handle PHI?

AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing protected health information (PHI). To execute a BAA, contact the sales team via the contact page.

Where can I see pricing for streaming speech-to-text and the Voice Agent API?

Streaming speech-to-text with Universal-3 Pro is $0.45/hr and the managed Voice Agent API is $4.50/hr. Both prices are listed on the public pricing page, with no upfront commits and unlimited concurrency.
