Power best-in-class voice agents

Ultra-fast and ultra-accurate streaming STT built for voice agents. Get 300ms immutable transcripts and intelligent endpointing so your agents feel more natural and finish tasks successfully.

Two Solutions

Pick the API that fits your build

Different architectures, different tradeoffs. Both powered by industry-leading speech models.

Recommended

Voice Agent API

Our proprietary voice stack, built on Universal-3.5 Pro, via one WebSocket. Connect, stream audio in, get audio back — we handle the rest.
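To make "stream audio in" concrete, here is a minimal client-side sketch of preparing microphone audio for a single WebSocket that speaks JSON. The sample rate, frame size, and the message fields (`type`, `data`) are assumptions for illustration, not the documented Voice Agent API schema. Take the real values from the docs.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed; check the docs for supported rates
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
FRAME_MS = 50            # assumed frame duration


def pcm_frames(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last slice may be shorter."""
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def audio_message(frame: bytes) -> str:
    """Wrap one frame in a JSON text message (hypothetical schema)."""
    payload = base64.b64encode(frame).decode("ascii")
    return json.dumps({"type": "audio", "data": payload})


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    messages = [audio_message(f) for f in pcm_frames(one_second_of_silence)]
    print(len(messages))  # one message per 50 ms frame
```

Each string returned by `audio_message` would be sent over the open socket by whichever WebSocket client you already use.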

Best for

Best-in-class voice agents — the preferred way to build with AssemblyAI
Customer support agents, AI companions, clinical intake, language learning
Teams shipping fast — working agent in an afternoon, no infra to manage
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Get started for free

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The STT layer for your cascading voice agent architecture. Works natively with your preferred orchestrator.

Best for

Teams already using LiveKit, Pipecat, or Vapi as their orchestration layer
Teams running cascading architectures (STT → LLM → TTS)
High-scale deployments where margin and full control matter
Complex workflows with RAG, custom tooling, or proprietary LLMs
HIPAA, SOC 2 — bring your own compliance infrastructure
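The cascading shape named above (STT → LLM → TTS) can be sketched as three swappable stages. This is an illustrative outline, not an orchestrator's real interface: `CascadingAgent` and the `fake_*` stages are hypothetical stand-ins for your own provider clients.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable so any vendor client can be dropped in.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadingAgent:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, utterance_audio: bytes) -> bytes:
        """Run one conversational turn: audio in, audio out."""
        transcript = self.stt(utterance_audio)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Stand-in stages for illustration only; real ones would call your providers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadingAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is only a function signature, swapping the STT layer means replacing one argument while the LLM and TTS stages stay untouched.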

$0.45/hr — transcription only, unlimited concurrent streams

View integration docs

No concurrency caps · Autoscaling included

Voice Agent API Demo

Try it for yourself

Speak into your browser and watch your words appear in real time.

Try the Voice Agent API live. This support agent is built on the Voice Agent API — the same one you can ship with. Click to start talking and experience real-time Voice AI in action. Ask about our products, APIs, or docs.

Please note: This agent provides customer support for AssemblyAI products only. Do not share sensitive or non-public information.

Explore in our Playground

Compare

Choose based on your architecture

Not sure which to pick? Use this to decide.

Voice Agent API (AssemblyAI's proprietary voice stack) vs. Universal-3.5 Pro Realtime STT API (best-in-class STT for your stack)

Features
Industry-leading speech models
Unlimited concurrency
Enterprise-grade reliability
Session-based pricing

Setup time
Voice Agent API: Working agent in an afternoon
Universal-3.5 Pro Realtime STT API: Minutes to swap STT in an existing stack

Architecture
Voice Agent API: 1 WebSocket · JSON messages · No frameworks required
Universal-3.5 Pro Realtime STT API: Cascading (STT → LLM → TTS), you own the full pipeline

LLM
Voice Agent API: Managed, with system prompt updates mid-conversation
Universal-3.5 Pro Realtime STT API: Bring your own

Voice (TTS)
Voice Agent API: Included, select from natural-sounding voices
Universal-3.5 Pro Realtime STT API: Bring your own

Pricing
Voice Agent API: $4.50/hr all-in, no token math across three invoices
Universal-3.5 Pro Realtime STT API: $0.45/hr, STT only, unlimited concurrent streams

Integrations
Voice Agent API: LiveKit, Pipecat, any WebSocket client, Claude Code
Universal-3.5 Pro Realtime STT API: LiveKit, Pipecat, custom WebSocket, Twilio SIP

Session resume
Voice Agent API: 30-second reconnect window, context preserved
Universal-3.5 Pro Realtime STT API: Via your orchestrator
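The 30-second reconnect window listed under session resume suggests client logic along these lines. This is a sketch under assumptions: the `ResumableSession` class is hypothetical, and whether the boundary at exactly 30 seconds still counts as resumable is a guess, not documented behavior.

```python
import time
from typing import Callable, Optional

RECONNECT_WINDOW_S = 30.0  # taken from the comparison above


class ResumableSession:
    """Tracks whether a dropped connection can still resume its session."""

    def __init__(self, session_id: str,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.session_id = session_id
        self._clock = clock
        self._dropped_at: Optional[float] = None

    def mark_disconnected(self) -> None:
        """Record the moment the socket dropped."""
        self._dropped_at = self._clock()

    def mark_reconnected(self) -> None:
        """Clear the drop marker after a successful reconnect."""
        self._dropped_at = None

    def can_resume(self) -> bool:
        """True while connected, or within the window after a drop."""
        if self._dropped_at is None:
            return True
        return self._clock() - self._dropped_at <= RECONNECT_WINDOW_S
```

A client would check `can_resume()` before reconnecting and start a fresh session, with fresh context, once the window has passed. The injectable `clock` keeps the logic testable without real waiting.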

Ready to plug into your voice-agent stack

Pre-built integrations with step-by-step docs let you implement quickly without disrupting existing workflows.

LiveKit

Vapi

Pipecat

“The speed difference is immediately noticeable — our users see their conversations transcribed almost instantaneously. It feels so much more responsive than what we were using before.”

Jonathan Kim, Software Engineer

Building a Voice Agent?

Voice Agent API

Stream audio in, get audio back. We handle the rest with our proprietary voice stack, so you can focus on your product.

Universal-3.5 Pro Realtime

Universal-3.5 Pro Realtime gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale.

Sync Speech-to-Text API

Run your own VAD and turn detection? Send each finished utterance in one HTTP call and get a flagship-accuracy transcript back in ~134ms — no streaming session to manage.
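As a rough picture of "one HTTP call per finished utterance", the sketch below builds a single POST request with Python's standard library. The URL, the header names, and the raw-bytes body format are placeholders chosen for illustration, not the Sync Speech-to-Text API's documented contract.

```python
import urllib.request

# Hypothetical endpoint; take the real URL and headers from the API docs.
SYNC_STT_URL = "https://api.example.com/v1/transcribe"


def build_utterance_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Build one POST request carrying a complete utterance."""
    return urllib.request.Request(
        SYNC_STT_URL,
        data=audio,
        method="POST",
        headers={
            "Authorization": api_key,  # assumed auth scheme
            "Content-Type": "application/octet-stream",  # assumed body format
        },
    )


def transcribe(audio: bytes, api_key: str, timeout_s: float = 5.0) -> bytes:
    """Send the request and return the raw response body."""
    request = build_utterance_request(audio, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```

Your own VAD decides when an utterance is finished. Only then is `transcribe` called, so there is no session state to hold between turns.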

Start Building

Explore our comprehensive docs with integration guides and best practices to optimize accuracy and latency for your application.

You may also be interested in

Voice agents for real-time agent assist
Voice agents for meeting intelligence & AI notetakers
Voice agents for field service operations
Voice agents for customer support & contact centers
Voice agents for sales & call intelligence
Voice agents for healthcare
Voice agents for financial services
Voice agents for AI medical scribe & ambient documentation
Voice agents for multilingual global support

Common questions

It depends on how much of the stack you want to own. The Voice Agent API is a fully managed pipeline: one WebSocket handles STT, LLM, and TTS with ~1 second end-to-end latency, making it the fastest path from idea to production. The Realtime Speech-to-Text API, at $0.45 per hour, gives you full architectural control: drop Universal-3.5 Pro Realtime into your existing orchestrator and bring your own LLM and TTS. Choose the Voice Agent API for speed and simplicity; choose the Realtime Speech-to-Text API when you need to customize every layer. A third option suits the most sophisticated builds: if your orchestrator already handles VAD and turn detection, the Sync Speech-to-Text API transcribes each finished utterance in a single HTTP request (~134ms), with no streaming connection to manage.
Both the Voice Agent API and the Realtime Speech-to-Text API offer pre-built integrations with LiveKit, Pipecat, Claude Code, Python SDK, Twilio SIP, and Nextup SDK. The Voice Agent API connects via any standard WebSocket client with no proprietary SDK required, while the Realtime Speech-to-Text API works natively with LiveKit Agents, Pipecat pipelines, and custom WebSocket architectures. Step-by-step integration docs are available for each framework so you can go live without disrupting existing workflows.
The Voice Agent API delivers ~1 second end-to-end latency (speech in to speech out) at a flat $4.50 per hour covering the full pipeline. The Realtime Speech-to-Text API delivers transcript results in ~300ms at $0.45 per hour, billed on session duration, meaning the time your WebSocket stays open rather than the audio duration. The Voice Agent API uses session-based flat-rate billing with no concurrency caps; the Realtime Speech-to-Text API also offers unlimited, autoscaling concurrent streams. Both include a free tier to test before committing.
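To see what session-based billing means in practice, the arithmetic can be sketched as below. The hourly rates come from this page. Linear pro-rating of partial hours and the rounding are assumptions, so treat the outputs as estimates rather than invoice figures.

```python
VOICE_AGENT_USD_PER_HOUR = 4.50   # speech, LLM, and voice included
REALTIME_STT_USD_PER_HOUR = 0.45  # transcription only


def session_cost(open_seconds: float, usd_per_hour: float) -> float:
    """Estimate one session's cost from how long the WebSocket stayed open."""
    # Assumes partial hours are pro-rated linearly.
    return round(open_seconds / 3600 * usd_per_hour, 4)


def monthly_cost(sessions: int, avg_open_seconds: float,
                 usd_per_hour: float) -> float:
    """Estimate a month's spend for a number of similar sessions."""
    total_hours = sessions * avg_open_seconds / 3600
    return round(total_hours * usd_per_hour, 2)


if __name__ == "__main__":
    # A 10-minute session on each API.
    print(session_cost(600, VOICE_AGENT_USD_PER_HOUR))
    print(session_cost(600, REALTIME_STT_USD_PER_HOUR))
    # 10,000 three-minute sessions per month on each API.
    print(monthly_cost(10_000, 180, VOICE_AGENT_USD_PER_HOUR))
    print(monthly_cost(10_000, 180, REALTIME_STT_USD_PER_HOUR))
```

Since the clock runs while the socket is open, closing idle connections promptly matters as much to the bill as the amount of audio sent.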
Yes. Both APIs share the same Universal-3.5 Pro speech recognition foundation, so transcription quality stays consistent regardless of which path you choose. Developers often prototype with the Voice Agent API for speed, then move individual components to the Realtime Speech-to-Text API as their architecture matures and they want more control over LLM routing or TTS selection. Since both use WebSocket connections and standard JSON, switching means updating your connection endpoint and message handling rather than rebuilding from scratch.
If you already use an orchestrator like LiveKit or Pipecat with your own LLM and TTS providers, the Realtime Speech-to-Text API is the natural fit: it plugs directly into your pipeline as the STT layer, with intelligent endpointing and immutable transcripts. If you're starting from scratch or want to replace a multi-vendor stack with a single API, the Voice Agent API removes most of the integration work. The comparison table above breaks down the differences across setup time, architecture, pricing, and feature support to help you decide.