June 15, 2026

AssemblyAI vs Deepgram for voice agents

AssemblyAI and Deepgram both offer cascaded voice agent infrastructure at around $4.50/hr—on paper, they look alike. But the details that decide production outcomes—entity accuracy, developer experience, billing model, and streaming STT quality—tell a different story. Here's an honest, data-driven comparison of both their voice agent APIs and their standalone streaming STT products.

Kelsey Foster
Growth

Both AssemblyAI and Deepgram now offer dedicated voice agent infrastructure. Both use cascaded architectures—separate STT, LLM, and TTS models working in sequence rather than a single multimodal model. Both charge around $4.50/hr for their full-pipeline APIs. On paper, they look pretty similar.

But when you dig into the details that actually matter for production voice agents—speech accuracy on real-world entities, developer experience, pricing models, and streaming STT quality—meaningful differences emerge.

This comparison covers both their voice agent APIs and their standalone streaming speech-to-text products. We'll be honest, specific, and data-driven. Here's what you need to know.

The architecture: similar approach, different foundations

AssemblyAI and Deepgram both chose the cascaded pipeline architecture for their voice agent APIs: speech-to-text feeds into an LLM, which feeds into TTS. This is the right call. Dedicated models for each step outperform multimodal approaches on speech understanding tasks because each model is optimized for its specific job. For a deeper look at how this voice agent architecture works, we've covered that separately.

The key difference is the STT foundation. AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming, the #1-ranked model on the Hugging Face Open ASR Leaderboard. Deepgram uses its Nova-3 model. Both are capable speech models, but they perform very differently where it counts most for voice agents: entity accuracy.

Voice Agent API comparison

Let's start with the head-to-head on their full-pipeline voice agent APIs. This is the all-in-one option where you send audio in and get audio back through a single connection.

| Feature | AssemblyAI Voice Agent API | Deepgram Voice Agent API |
| --- | --- | --- |
| Pricing | $4.50/hr flat | ~$4.50/hr + concurrency metering |
| ASR model | Universal-3 Pro Streaming (#1 WER) | Nova-3 |
| Word accuracy | 94.07% (6.3% mean WER) | 92.10% |
| Missed entity rate | 16.7% | 25.5% |
| End-to-end latency | ~1 second | ~1–1.5 seconds |
| Languages | EN, ES, FR, DE, IT, PT | EN, ES, NL, FR, DE, IT, JA |
| Turn detection | Speech-aware VAD (semantic + neural) | Traditional VAD |
| Mid-session updates | Prompt + voice + tools + VAD | Prompt + voice only |
| Session resumption | 30-second reconnect window | Not available |
| Billing model | Flat per-minute | Concurrency metering |
| Tool calling | Custom functions via JSON Schema | Custom functions supported |
| Compliance | SOC 2 Type 2, BAA available, ISO 27001 | SOC 2 |
| Medical terminology | Medical Mode available | No equivalent |

The table tells one story, but the numbers that matter most are in the accuracy rows: word accuracy and missed entity rate.

Speech accuracy: where the gap shows up

Here's the thing about AI voice agents: they're not just transcribing for the record. The transcript feeds directly into the LLM that decides what to do next. If the speech-to-text layer gets an email address wrong, the agent sends a confirmation to the wrong person. If it misses a digit in an account number, the agent looks up the wrong account. Errors cascade.

AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy with a 6.3% mean word error rate across English domains. On entity accuracy specifically, it has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers. Deepgram Nova-3's missed entity rate on the same content runs at 25.5%.

That's not a small gap. It's the difference between an agent that completes tasks on the first try and one that needs to ask "could you repeat that?" regularly.

The pharmacy refill scenario

Consider a real-world pharmacy refill call. A caller says their date of birth, prescription number, medication, dosage, address, and phone number—all in one natural flow.

AssemblyAI correctly transcribes "RX-7704132" and "Metoprolol 80mg"—the RX prefix is preserved, the dosage formatting is correct, and the street address and phone number come through clean. Deepgram's transcription outputs "dash seven seven zero four one three two" and "metoprolol eighty milligrams"—missing the RX prefix entirely and expanding the dosage into words instead of the alphanumeric format the downstream LLM expects.
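To make the downstream impact concrete, here is a minimal sketch of the kind of validation a pipeline might run on a transcript before acting on it. The regular expressions, the function name `extract_refill_entities`, and the surrounding sentence text are assumptions made for illustration; only the entity strings themselves come from the example above, and none of this is either vendor's API.

```python
import re

# Hypothetical patterns for the two entities in the pharmacy example.
RX_PATTERN = re.compile(r"\bRX-\d{7}\b")
DOSE_PATTERN = re.compile(r"\b\d+\s?mg\b", re.IGNORECASE)


def extract_refill_entities(transcript: str) -> dict:
    """Return the prescription number and dosage if they appear in machine-readable form."""
    rx = RX_PATTERN.search(transcript)
    dose = DOSE_PATTERN.search(transcript)
    return {
        "prescription": rx.group(0) if rx else None,
        "dosage": dose.group(0) if dose else None,
    }


# Illustrative transcripts built around the two outputs quoted above.
formatted = "My prescription is RX-7704132 for Metoprolol 80mg."
spelled_out = (
    "My prescription is dash seven seven zero four one three two "
    "for metoprolol eighty milligrams."
)

print(extract_refill_entities(formatted))
print(extract_refill_entities(spelled_out))
```

The formatted transcript yields both entities, while the spelled-out version yields neither, so the agent would have to re-prompt the caller or add a separate normalization step.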

This isn't cherry-picked. It reflects the systematic accuracy advantage that comes from building the entire voice agent pipeline around a purpose-built speech model. Across broader benchmarks, AssemblyAI shows 59.64% fewer missed emails and 34.79% fewer missed phone numbers compared to Deepgram Nova-3.

When your agent is confirming a shipping address or reading back an order number, those aren't rounding errors. They're task failures.

Test speech accuracy on your own audio

See how Universal-3 Pro Streaming captures entities, handles accents, and detects turns—then compare the results to what you're getting from Deepgram or any other provider.


Streaming STT comparison: Universal-3 Pro Streaming vs Nova-3

Not everyone wants a full-pipeline voice agent API. Many teams are building their own cascading pipeline with a preferred LLM and TTS provider, and they just need the best streaming speech-to-text layer they can get. So how do AssemblyAI and Deepgram compare as standalone streaming STT products?

AssemblyAI Universal-3 Pro Streaming

Universal-3 Pro Streaming is priced at $0.45/hr and was purpose-built for voice agent deployments. Here's what makes it distinct:

Immutable transcripts in ~300ms. Every word emitted is final from the moment it arrives. No revision handling, no partials that get corrected later. Your agent can start thinking while the user is still talking, because the transcript it's reasoning over won't change.

Intelligent endpointing. This is the big one. Rather than relying solely on silence thresholds, Universal-3 Pro Streaming uses audio-contextual signals—tonality, pacing, punctuation patterns, and speech patterns—to determine when a speaker is done. It distinguishes "I'm thinking" from "I'm done talking." Traditional silence-based VAD can't do this.

Unlimited concurrent streams. No rate limits or upfront commitments. Scale from one call to millions on pay-as-you-go pricing.

Dynamic key term prompting. Boost recognition of up to 1,000 domain-specific terms—product names, medications, policy IDs—updated mid-conversation over the same WebSocket connection. If your agent handles both billing questions and technical support, you can shift the model's vocabulary as the conversation pivots.

Real-time speaker diarization. Track and label speakers inline as audio arrives. No post-processing pipeline required.

Framework integrations. Works natively with LiveKit, Pipecat, Vapi, Twilio, and any standard WebSocket client.

v3 streaming API with turn-based events. The event model is built around conversation turns, not just raw audio chunks. SpeechStarted, Turn events with end_of_turn signals—your agent pipeline gets structured conversation data, not a firehose of text fragments.
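As a rough sketch of what consuming turn-based events can look like, the snippet below filters a stream of JSON messages down to completed turns. The event names `SpeechStarted` and `Turn` and the `end_of_turn` signal are mentioned above; the exact message shape, including the `type` and `transcript` keys, is an assumption for illustration and should be checked against the API reference.

```python
import json
from typing import Iterable, Iterator


def completed_turns(raw_messages: Iterable[str]) -> Iterator[str]:
    """Yield one transcript per finished turn from a stream of JSON event strings."""
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") != "Turn":
            # SpeechStarted and any other event types are ignored in this sketch.
            continue
        if event.get("end_of_turn"):
            yield event.get("transcript", "")


# Simulated event stream; a real client would read these from the WebSocket.
stream = [
    json.dumps({"type": "SpeechStarted"}),
    json.dumps({"type": "Turn", "transcript": "I need to refill", "end_of_turn": False}),
    json.dumps({"type": "Turn", "transcript": "I need to refill my prescription.", "end_of_turn": True}),
]

for text in completed_turns(stream):
    print(text)
```

Only the final message reaches the LLM here, which is the point of a turn-based event model: the agent reacts to finished turns instead of every text fragment.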

Deepgram Nova-3 Streaming

Deepgram's Nova-3 streaming API is priced lower than Universal-3 Pro Streaming, uses traditional endpointing, and has a solid developer ecosystem. It supports more languages than AssemblyAI's streaming product, including Dutch and Japanese, and has been a reliable choice for teams already in the Deepgram ecosystem.

But there are meaningful gaps. Independent benchmarks from Hamming.ai across 4M+ production calls show AssemblyAI with 41% faster median latency in word emission (307ms vs 516ms) and nearly 2x faster on P99 latencies (1,012ms vs 1,907ms). Nova-3 doesn't support mid-session prompting, neural turn detection, or anti-hallucination controls.
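The latency claims can be checked from the four figures quoted above. This is only arithmetic on the published numbers; the helper names are made up for the example.

```python
def percent_faster(fast_ms: float, slow_ms: float) -> float:
    """Latency reduction of the faster system relative to the slower one, in percent."""
    return (slow_ms - fast_ms) / slow_ms * 100


def speedup(fast_ms: float, slow_ms: float) -> float:
    """How many times faster the faster system is."""
    return slow_ms / fast_ms


median_gain = percent_faster(307, 516)   # median word-emission latency
p99_ratio = speedup(1012, 1907)          # P99 latency

print(f"median: {median_gain:.1f}% faster")
print(f"P99: {p99_ratio:.2f}x faster")
```

The median works out to about 40.5%, which rounds to the 41% quoted, and the P99 ratio is about 1.88x, consistent with "nearly 2x".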

The endpointing difference

For voice agents, endpointing quality can matter even more than raw accuracy. When your STT model fires an end-of-turn signal too early, the LLM starts generating a response while the user is still mid-sentence. When it fires too late, the conversation drags with awkward silences.

AssemblyAI's endpointing uses semantic + neural + VAD signals combined. It looks at punctuation patterns in the transcribed text to understand when sentences are actually complete. When silence reaches the minimum threshold, the model transcribes the audio and checks for terminal punctuation. Period, question mark, or exclamation point? The turn ends. No terminal punctuation? A partial is emitted and the turn continues.
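The punctuation check described above can be sketched in a few lines. This is a simplified illustration of that one rule only, not AssemblyAI's implementation: the function name, the `min_silence_ms` parameter, and its 400 ms default are assumptions, and the tonality and pacing signals are left out entirely.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def should_end_turn(transcript: str, silence_ms: int, min_silence_ms: int = 400) -> bool:
    """Decide whether the speaker's turn is over.

    The turn ends only when silence has reached the minimum threshold AND
    the transcript so far ends in terminal punctuation. Otherwise the turn
    continues and the text is treated as a partial.
    """
    if silence_ms < min_silence_ms:
        return False
    return transcript.rstrip().endswith(TERMINAL_PUNCTUATION)


print(should_end_turn("My number is five five five.", silence_ms=500))
print(should_end_turn("My number is", silence_ms=500))
print(should_end_turn("All done.", silence_ms=100))
```

The second call is the case silence-only VAD gets wrong: the pause is long enough, but the sentence is unfinished, so the turn stays open.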

Deepgram uses traditional silence-based VAD. It works, but it can't distinguish a thoughtful pause from a conversation ending. This directly impacts conversation flow quality—and it's the kind of difference you notice immediately when you talk to agents built on each platform.

Entity accuracy across the board

Universal-3 Pro Streaming has a 16.7% average missed entity rate versus 25.5% for Deepgram Nova-3 across names, emails, phone numbers, and credit card numbers. That's an 8.8 percentage-point gap. For teams in healthcare, financial services, insurance, or any domain with specific vocabulary, that gap is the metric that predicts whether your agent works in production.
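The gap can be restated with a little arithmetic on the two missed entity rates quoted above. The first two helpers only rearrange those published figures. The last one assumes entity misses are independent of one another, which is a simplification introduced here for illustration and not a claim from the benchmark.

```python
ASSEMBLYAI_MISS_RATE = 16.7  # percent, from the figures above
DEEPGRAM_MISS_RATE = 25.5    # percent, from the figures above


def absolute_gap(lower: float, higher: float) -> float:
    """Difference between two rates, in percentage points."""
    return higher - lower


def relative_reduction(lower: float, higher: float) -> float:
    """How much smaller the lower rate is, as a percent of the higher rate."""
    return (higher - lower) / higher * 100


def all_entities_captured(miss_rate_pct: float, entities_per_call: int) -> float:
    """Chance that every entity in a call is captured, ASSUMING independent misses."""
    return (1 - miss_rate_pct / 100) ** entities_per_call


gap = absolute_gap(ASSEMBLYAI_MISS_RATE, DEEPGRAM_MISS_RATE)
relative = relative_reduction(ASSEMBLYAI_MISS_RATE, DEEPGRAM_MISS_RATE)
print(f"gap: {gap:.1f} points, relative reduction: {relative:.1f}%")

for name, rate in (("AssemblyAI", ASSEMBLYAI_MISS_RATE), ("Deepgram", DEEPGRAM_MISS_RATE)):
    chance = all_entities_captured(rate, entities_per_call=3)
    print(f"{name}: {chance:.3f} chance of capturing 3 of 3 entities")
```

Under that independence assumption, a per-entity gap compounds quickly once a call carries several entities, as in the pharmacy example.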

Ship your voice agent today

One WebSocket, a handful of JSON messages, and most developers ship the same day. Get started with free credits—no credit card required.


Developer experience

Both APIs use WebSocket connections and JSON messages—the fundamentals are similar. For a broader perspective on choosing an STT API for voice agents, we've covered that topic in depth. But the details of the developer experience differ in ways that compound over time.

AssemblyAI's approach is deliberately minimalist. A handful of JSON message types, no SDK required, and the entire API reference is readable in about 10 minutes. The team designed the API so it works natively with tools like Claude Code—you can literally copy the docs, paste them in, and scaffold a working integration. Most developers get a working agent running the same afternoon.

Mid-conversation updates are a major advantage. You can change the system prompt, swap voices, add or remove tools, and adjust VAD settings—all via a JSON message without dropping the connection. For applications that need dynamic behavior—a support agent escalating from English to Spanish, a coaching app switching modes—this is the kind of flexibility that saves weeks of engineering workarounds.

Deepgram's developer experience is solid but more conventional. Their documentation is well-organized, and the API follows patterns familiar to developers who've used their transcription products. If you're already building on Deepgram, adding their voice agent API is a natural extension. But the mid-session update surface is narrower: prompt and voice only, without tool or VAD modifications.

Pricing and scaling

Both APIs come in at roughly $4.50/hr for the full pipeline (STT + LLM + TTS). But the billing models have differences worth understanding before you commit.

AssemblyAI uses straightforward per-minute billing with a flat rate. $4.50/hr covers everything—speech understanding, LLM reasoning, and voice generation. No separate input/output token charges, no per-feature add-ons. Your cost model is simple: hours of usage times $4.50.

Deepgram uses concurrency metering alongside usage-based pricing. Your costs depend not just on total usage but on how many simultaneous sessions you're running. For applications with bursty traffic patterns—a contact center during peak hours, for instance—concurrency metering can make costs harder to predict and potentially more expensive during spikes.

For standalone streaming STT, AssemblyAI's Universal-3 Pro Streaming runs $0.45/hr with no concurrency caps and unlimited autoscaling. Deepgram's streaming pricing sits at a lower tier, which can be attractive for cost-sensitive deployments where entity accuracy isn't the primary concern.

The math is straightforward: if your voice agent handles 10,000 hours per month, AssemblyAI's full-pipeline cost is $45,000 with no surprises. With concurrency metering, your bill depends on traffic patterns that are harder to model in advance.
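The flat-rate calculation is simple enough to put in a helper. The sketch below uses only the two AssemblyAI rates quoted in this article. The STT-only figure excludes whatever you would pay your own LLM and TTS providers, and concurrency metering is not modeled because no formula for it is given here.

```python
VOICE_AGENT_RATE = 4.50   # USD per hour, full pipeline (STT + LLM + TTS)
STREAMING_STT_RATE = 0.45  # USD per hour, Universal-3 Pro Streaming only


def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """Flat-rate cost: hours of usage times the hourly rate."""
    return hours * rate_per_hour


full_pipeline = monthly_cost(10_000, VOICE_AGENT_RATE)
stt_only = monthly_cost(10_000, STREAMING_STT_RATE)

print(f"Voice Agent API: ${full_pipeline:,.0f}/month")
print(f"Streaming STT only: ${stt_only:,.0f}/month (LLM and TTS billed separately)")
```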

Turn detection and conversation flow

This is where voice agents live or die in production—and it's hard to evaluate from docs alone.

AssemblyAI's Voice Agent API uses a speech-aware Voice Activity Detection system that combines semantic, neural, and acoustic signals. The turn detection is baked into Universal-3 Pro Streaming, which means it benefits from the same deep speech understanding that drives transcription accuracy. When someone pauses to think, the agent waits. When someone is genuinely done talking, the agent responds promptly.

Barge-in handling is intelligent too. Back-channel sounds like "uh-huh" or "mmhmm" don't trigger interruption—the agent recognizes these as acknowledgments, not attempts to take the floor. But genuine interruptions like "wait, stop" are handled immediately.
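To illustrate the distinction only, here is a deliberately naive stand-in that uses a fixed phrase list. It is not how AssemblyAI detects back-channels, and the phrase set and function name are hypothetical. It shows the decision an agent has to make whenever the user speaks during playback.

```python
# Hypothetical list of acknowledgment sounds; a production system would not use a fixed set.
BACKCHANNELS = {"uh-huh", "mmhmm", "mm-hmm", "yeah", "right", "okay"}


def is_interruption(utterance: str) -> bool:
    """Return True when speech heard during agent playback should stop the agent."""
    normalized = utterance.lower().strip(" .,!?")
    if not normalized:
        return False
    return normalized not in BACKCHANNELS


print(is_interruption("uh-huh"))
print(is_interruption("Mmhmm."))
print(is_interruption("wait, stop"))
```

A phrase list breaks down quickly, since "right" can be an acknowledgment or the start of a correction, which is why the article stresses speech-aware detection over simple rules.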

Deepgram also offers turn detection and VAD, but relies on traditional silence-based approaches. Developers building on both platforms have noted that AssemblyAI's implementation feels more natural in practice—particularly around the "pause vs. done talking" distinction that makes or breaks conversation flow.

The honest recommendation: try both. Have a real conversation with agents built on each platform. The difference in conversational feel is something you notice immediately, even if it's hard to quantify in a feature table.

Healthcare and compliance

If you're building voice agents for healthcare, the comparison tilts heavily toward AssemblyAI.

AssemblyAI's Medical Mode enhances transcription accuracy for clinical terminology—medication names, dosages, procedures, anatomical terms. For voice agents handling patient intake, prescription refills, or clinical documentation, this is the difference between "metoprolol 80mg" and "metoprolol eighty milligrams." The LLM downstream needs the structured format to take the right action.

On compliance, AssemblyAI holds SOC 2 Type 2 and ISO 27001 certifications. AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA) for enterprise customers processing protected health information (PHI). Deepgram holds SOC 2 certification but doesn't offer an equivalent to Medical Mode or the same depth of compliance infrastructure.

When to choose each

Choose AssemblyAI when

Speech accuracy is your top priority, as it should be for most production voice agents. If your agent needs to capture email addresses, phone numbers, account IDs, medical terms, or any structured entity correctly on the first try, the 16.7% vs 25.5% missed entity rate gap is the number that matters. AssemblyAI is also the better fit when you want flat, predictable pricing without concurrency metering; when you need healthcare or medical terminology support; when mid-session updates to prompts, tools, voice, and VAD are important; or when you want the option to use either the full Voice Agent API or Universal-3 Pro Streaming as a standalone STT layer within the same platform. For more on the voice AI stack for building agents, we've broken down the architecture decisions in detail.

Choose Deepgram when

You're already deeply integrated with Deepgram's ecosystem and the migration cost outweighs the accuracy gains for your specific use case.

But be clear-eyed about what you're trading. In the Voice Agent Report, 76% of developers rated accuracy as the most critical non-negotiable when choosing a voice stack. Entity accuracy in particular is the metric that predicts whether the agent gives the right answer downstream.

Build voice agents with the best speech accuracy

The lowest missed entity rate in the market, flat $4.50/hr pricing, and an API most developers ship with the same day. Start building with clear docs and a free account.


The bottom line

Both AssemblyAI and Deepgram are capable platforms for building voice agents. They made similar architectural choices—cascaded pipelines, competitive pricing, WebSocket-first APIs. That's not a coincidence; it's the right approach for production voice agents today.

But when speech accuracy determines whether your agent completes tasks on the first try, the data points to AssemblyAI. A 94.07% word accuracy rate, 59.64% fewer missed emails, 34.79% fewer missed phone numbers, and an 8.8 percentage-point advantage in overall entity accuracy—these aren't benchmark curiosities. They're the metrics that determine whether a customer has to repeat themselves, whether an order gets shipped to the right address, whether a prescription refill goes through correctly.

Whether you use the Voice Agent API for the fastest path to production or Universal-3 Pro Streaming for full architectural control within your existing pipeline, you're building on the same speech foundation. And it's the most accurate one available.

Frequently asked questions

Which has better speech accuracy for voice agents—AssemblyAI or Deepgram?

AssemblyAI's Universal-3 Pro Streaming achieves 94.07% word accuracy with a 16.7% missed entity rate, compared to Deepgram Nova-3 at 92.10% word accuracy and 25.5% missed entity rate. The gap is especially significant on structured entities like email addresses (59.64% fewer missed), phone numbers (34.79% fewer missed), and alphanumeric strings like prescription and order IDs. Since voice agent transcripts feed directly into LLMs for reasoning and action, higher entity accuracy means fewer cascading errors and more tasks completed on the first try.

How does pricing compare between AssemblyAI and Deepgram for voice agents?

Both voice agent APIs cost approximately $4.50/hr for the full STT + LLM + TTS pipeline. The key difference is billing model: AssemblyAI uses flat per-minute billing with no concurrency caps, making costs predictable at any scale. Deepgram uses concurrency metering, which means costs depend on how many simultaneous sessions are running—this can make budgeting harder during traffic spikes. For standalone streaming STT, AssemblyAI's Universal-3 Pro Streaming is $0.45/hr with unlimited concurrent streams.

Can I use AssemblyAI's streaming STT with my own LLM and TTS?

Yes. Universal-3 Pro Streaming works as a standalone streaming speech-to-text layer that you can integrate into any cascading pipeline. It has native integrations with LiveKit, Pipecat, Vapi, and Twilio, and works with any standard WebSocket client. At $0.45/hr, it gives you full architectural control over your LLM and TTS choices while still providing the same speech accuracy foundation that powers the Voice Agent API.

Which voice agent API is better for healthcare applications?

AssemblyAI has a clear advantage for healthcare. Medical Mode enhances transcription accuracy on clinical terminology, medication names, and dosages. AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA) for customers processing protected health information (PHI). The platform also holds SOC 2 Type 2 and ISO 27001 certifications. Deepgram holds SOC 2 certification but does not offer an equivalent to Medical Mode or BAA support.

What is the difference between AssemblyAI's Voice Agent API and Universal-3 Pro Streaming?

The Voice Agent API is a fully managed pipeline—one WebSocket handles STT, LLM reasoning, and TTS at $4.50/hr. It's the fastest path from idea to production. Universal-3 Pro Streaming is the standalone streaming STT layer at $0.45/hr that you drop into your own pipeline with your preferred LLM and TTS providers. Both are built on the same Universal-3 Pro speech foundation, so transcription quality is consistent regardless of which path you choose. Teams often prototype with the Voice Agent API for speed, then move to Universal-3 Pro Streaming as their architecture matures and they want full control.

How do AssemblyAI and Deepgram handle turn detection differently?

AssemblyAI uses speech-aware turn detection that combines semantic, neural, and traditional VAD signals. It analyzes punctuation patterns, tonality, and pacing to determine when a speaker is genuinely done talking versus just pausing to think. Barge-in handling is intelligent—back-channel sounds like "uh-huh" don't trigger interruption, but genuine interruptions are handled immediately. Deepgram uses traditional silence-based VAD, which triggers end-of-turn based on silence duration alone. This can lead to premature cutoffs during thinking pauses or delayed responses when silence thresholds are set too high.
