June 2, 2026

How to build with the Voice Agent API

The Voice Agent API handles the full voice agent pipeline — STT, LLM, TTS, turn detection, and tool calling — over a single WebSocket. This post walks through the key capabilities with code you can drop into a working project.

Kelsey Foster

Growth

AI voice agents

Voice Agent API

The Voice Agent API is a single WebSocket that handles the full voice agent pipeline — speech-to-text, LLM reasoning, text-to-speech, turn detection, tool calling, and session resumption. You send audio in. You get audio back. Everything in between is handled for you at a flat rate of $4.50/hr.

That simplicity is the point. Most teams building voice agents end up stitching together separate STT, LLM, and TTS providers — three invoices, three sets of logs, three debugging surfaces — before they've written a line of product logic. The Voice Agent API collapses all of that into one connection.

This post walks through the key capabilities and how to use them, with code you can drop into a working project.

Connect to the Voice Agent API

Open a WebSocket to wss://agents.assemblyai.com/v1/ws with a Bearer token, send a session.update with your config, and wait for session.ready. That's the whole handshake.

import asyncio
import json
import websockets

API_KEY = "YOUR_API_KEY"
URL = "wss://agents.assemblyai.com/v1/ws"

async def main():
    async with websockets.connect(
        URL,
        # additional_headers needs websockets 14 or newer; older releases call it extra_headers.
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a friendly support agent. "
                                 "Keep replies to one or two sentences.",
                "greeting": "Hi there, how can I help today?",
                "output": {"voice": "ivy"},
            },
        }))

        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "session.ready":
                session_id = event["session_id"]
                print(f"Session ready: {session_id}")
                # Start streaming input.audio events here
            elif event["type"] == "transcript.user":
                print(f"User: {event['text']}")
            elif event["type"] == "transcript.agent":
                print(f"Agent: {event['text']}")

asyncio.run(main())

A few things to know:

Authentication is Authorization: Bearer <key>. The Bearer prefix is required for this endpoint.
Audio is PCM16 mono at 24 kHz, base64-encoded, sent as input.audio events.
For browser apps, mint a temporary token server-side and pass it as ?token=... instead of exposing your API key.
The voice catalog includes named voices like ivy, james, tyler, and sophie, plus multilingual options. See Voices for the full list.
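
To make the audio format concrete, here is a minimal sketch of turning raw PCM16 mono audio at 24 kHz into input.audio events. The event type and the base64 encoding come from the list above. The payload field name "audio", the 50 ms chunk length, and the helper names are assumptions made for illustration, so check the Events reference before relying on them.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000   # PCM16 mono at 24 kHz, as stated above
BYTES_PER_SAMPLE = 2      # 16-bit samples


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of bytes of PCM16 mono audio in chunk_ms milliseconds."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def input_audio_events(pcm: bytes, chunk_ms: int = 50):
    """Yield JSON-encoded input.audio events for a buffer of raw PCM16 audio."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        yield json.dumps({
            "type": "input.audio",
            # ASSUMPTION: "audio" is a guessed field name, not confirmed by this post.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Inside the connection loop, something like `for msg in input_audio_events(pcm): await ws.send(msg)` would stream a captured buffer once session.ready has arrived.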

The full event reference — every payload, every field — is at Events reference.

Try it: grab your API key and run the browser quickstart. You'll have a working agent in under five minutes.

How turn detection works

Turn detection is the thing people notice first — even if they can't articulate it. When it's off, the agent either cuts you off mid-sentence or sits in dead air for three seconds. When it's right, conversations just flow.

The Voice Agent API handles turn detection server-side using semantic signals, not just silence timers. It looks at what the user actually said to decide whether they're done speaking, pausing to think, or in the middle of rattling off a phone number. This is on by default with no configuration required.

A few behaviors worth knowing about:

Adaptive endpointing. The agent learns the user's speaking pace as the conversation progresses. Someone who pauses between thoughts gets more room. Someone who speaks crisply gets faster responses. Leave min_silence and max_silence unset to let this work.

Smart end-of-turn for structured answers. When the agent asks for a phone number, email, or date, turn detection adapts to the kind of answer expected. The agent won't cut someone off mid-digit-sequence, and it won't sit waiting after a clean "yes."

When you need to tune it. The defaults work for most conversational use cases. Override them when you have a specific UX to solve for:

import json

await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "input": {
            "turn_detection": {
                "vad_threshold": 0.5,
                "min_silence": 1500,   # patient, accessibility-friendly
                "max_silence": 5000,
                "interrupt_response": True
            }
        }
    }
}))

Use case	Suggested change
Snappy back-and-forth (sales, short Q&A)	Lower `min_silence` (e.g., 500)
Accessibility, slower speakers	Raise `min_silence` (e.g., 1500) and `max_silence` (e.g., 5000)
Noisy environments (contact center, drive-thru)	Raise `vad_threshold` (e.g., 0.6–0.7)
Fixed-script IVR where you don't want interrupts	`interrupt_response: false`

Note: Setting min_silence or max_silence explicitly disables the server's adaptive endpointing for the rest of the session. The values you provide are used as-is. If you want the agent to keep adapting to the user's pace, leave both fields unset.
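
One way to keep these overrides manageable is a small preset table that sends only the fields each use case changes. The sketch below uses the values from the table above. The preset names and helper functions are made up for illustration, and it assumes the server accepts a partial turn_detection object, which this post does not confirm.

```python
import json
from typing import Optional

# Overrides taken from the table above; preset names are hypothetical.
PRESETS = {
    "default": {},                                        # leave everything adaptive
    "snappy": {"min_silence": 500},
    "accessible": {"min_silence": 1500, "max_silence": 5000},
    "noisy": {"vad_threshold": 0.65},
    "fixed_script": {"interrupt_response": False},
}


def turn_detection_update(preset: str) -> Optional[dict]:
    """Build a session.update payload for a preset, or None if nothing changes."""
    overrides = PRESETS[preset]
    if not overrides:
        return None
    # ASSUMPTION: sending only the changed fields is accepted by the server.
    return {
        "type": "session.update",
        "session": {"input": {"turn_detection": dict(overrides)}},
    }


def keeps_adaptive_endpointing(preset: str) -> bool:
    """True when the preset sets neither min_silence nor max_silence."""
    return not (PRESETS[preset].keys() & {"min_silence", "max_silence"})
```

With this layout, `keeps_adaptive_endpointing("noisy")` is true because that preset only touches vad_threshold, while "snappy" and "accessible" trade adaptive endpointing for fixed values.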

See Turn detection and interruptions for the full explanation.

How interruption handling works

The Voice Agent API uses a multimodal model that looks at both the audio and the Universal-3 Pro transcript text to decide whether the user actually wants the floor. There's nothing to configure — it's on by default for every session.

Backchannels stay quiet. "Uh-huh," "yeah," "okay," "makes sense" — these let the agent keep talking. No interruption event fires.

True interruptions land immediately. "Wait, stop," "hold on," "that's not right" — the agent stops speaking, and the server emits reply.done with status: "interrupted".

Spelling and digit sequences are protected. Pauses inside a phone number, email, or account number don't trigger end-of-turn.

On the client side, the only thing you need to handle is flushing the audio playback buffer when an interruption lands:

async for raw in ws:
    event = json.loads(raw)
    if event["type"] == "reply.done" and event.get("status") == "interrupted":
        # User actually interrupted. Flush playback buffer.
        flush_playback()

If your code was wrapping interruption handling in custom logic to filter out "uh-huh," you can rip that out. The API handles it semantically. See Turn detection and interruptions for the full event flow.
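
The snippet above calls flush_playback() without defining it. A minimal sketch of what that could look like, assuming your client queues decoded audio chunks before writing them to the speaker, is below. PlaybackBuffer and handle_reply_done are hypothetical names, not part of the API.

```python
from collections import deque


class PlaybackBuffer:
    """Queue of decoded audio chunks waiting to be written to the speaker."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def next_chunk(self):
        """Return the next chunk to play, or None when the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop all queued audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


def handle_reply_done(event: dict, playback: PlaybackBuffer) -> bool:
    """Flush queued audio if the event is an interrupted reply.done."""
    interrupted = (
        event.get("type") == "reply.done"
        and event.get("status") == "interrupted"
    )
    if interrupted:
        playback.flush()
    return interrupted
```

Flushing only the queue is the design choice here: audio already handed to the sound device may still play for a few milliseconds, so a real client might also need to stop or reset its output stream.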

Update your agent mid-conversation

Most fields on session.update are mutable after session.ready. You can send another session.update at any time to change system_prompt, input.turn_detection, input.keyterms, output.volume, or tools — without reconnecting.

This opens up a useful pattern: when the agent just asked a question, give the user more time to think, then snap back to the baseline once they answer.

baseline = {
    "vad_threshold": 0.5,
    "min_silence": 1000,
    "max_silence": 3000,
    "interrupt_response": True
}

waiting_for_answer = False

async def set_turn_detection(td):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"input": {"turn_detection": td}}
    }))

async for raw in ws:
    event = json.loads(raw)
    if event["type"] == "transcript.agent" and event.get("text", "").rstrip().endswith("?"):
        waiting_for_answer = True
        await set_turn_detection({**baseline, "min_silence": 2200, "max_silence": 6000})
    elif event["type"] == "transcript.user" and waiting_for_answer:
        waiting_for_answer = False
        await set_turn_detection(baseline)

Same trick works for "take your time" prompts, long menu reads, or anywhere a structured pause makes sense. You can also swap the system_prompt mid-conversation to change the agent's personality or instructions, register new tools on the fly, or adjust output.volume.
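
If you send many mid-conversation updates, a small builder can keep the payloads consistent. The sketch below accepts only the five fields this post names as mutable and nests dotted paths under "session". The allowlist is deliberately conservative (the post says most fields are mutable, not only these), and the session_update helper is a hypothetical name.

```python
import json

# Fields this post lists as mutable, written as dotted paths inside "session".
MUTABLE_FIELDS = {
    "system_prompt",
    "input.turn_detection",
    "input.keyterms",
    "output.volume",
    "tools",
}


def session_update(changes: dict) -> str:
    """Build a JSON session.update message from {dotted_path: value} pairs."""
    session = {}
    for path, value in changes.items():
        if path not in MUTABLE_FIELDS:
            raise ValueError(f"{path!r} is not in the mutable-field allowlist")
        *parents, leaf = path.split(".")
        node = session
        for key in parents:
            # Create intermediate objects such as "input" or "output" on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return json.dumps({"type": "session.update", "session": session})
```

A call such as `await ws.send(session_update({"system_prompt": "You are now handling billing questions."}))` would then swap the prompt without touching any other setting.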

Add tool calling to your agent

The agent can do real things during a conversation. Register tools with JSON Schema in session.tools, and the agent calls them when appropriate — look up an account, check order status, book an appointment, trigger a workflow.

The flow: the agent emits a tool.call event when it wants to invoke a tool. You run it, then send a tool.result back. The key timing detail: send the result only while reply.done is the latest event you've received. That means after the agent has finished its current reply, and before a new reply.started or input.speech.started arrives.

import json

last_event = None
pending_tools = []

async def flush_if_idle():
    if last_event != "reply.done" or not pending_tools:
        return
    for tool in pending_tools:
        await ws.send(json.dumps({
            "type": "tool.result",
            "call_id": tool["call_id"],
            "result": json.dumps(tool["result"]),
        }))
    pending_tools.clear()

async for raw in ws:
    event = json.loads(raw)
    t = event.get("type")
    if t == "tool.call":
        # event["arguments"] is already a parsed dict
        result = run_tool(event["name"], event["arguments"])
        pending_tools.append({"call_id": event["call_id"], "result": result})
        await flush_if_idle()
    elif t in ("reply.started", "input.speech.started"):
        last_event = t
    elif t == "reply.done":
        last_event = t
        if event.get("status") == "interrupted":
            pending_tools.clear()
        else:
            await flush_if_idle()

A few non-obvious details:

event["arguments"] is already a dict — no json.loads step needed.
result on tool.result must be a JSON-encoded string, not an object.
Always echo back the original call_id.
For long-running tools (transfers, payment auth), use execution_mode: "hold" so the agent stays silent while the operation runs.
Use reply.create to make the agent speak status updates during a hold ("still working on that transfer for you").

Full reference and execution-mode details in the Tool calling docs.
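
The loop above calls run_tool without showing it. One possible shape, sketched under assumptions, is a name-to-function registry that never lets a tool exception escape into the event loop. The check_order_status tool, the {"error": ...} result shape, and the helper names are all hypothetical. Only the tool.result fields (type, call_id, and a JSON-encoded result string) come from this post.

```python
import json

TOOLS = {}  # tool name -> Python callable


def tool(fn):
    """Register a function under its own name so run_tool can find it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def check_order_status(order_id: str) -> dict:
    # Hypothetical tool: a real one would query your order system.
    return {"order_id": order_id, "status": "shipped"}


def run_tool(name: str, arguments: dict) -> dict:
    """Dispatch a tool.call; report failures as data instead of raising."""
    fn = TOOLS.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    try:
        # arguments arrives as an already-parsed dict, so it can be splatted.
        return fn(**arguments)
    except Exception as exc:  # ASSUMPTION: errors are returned as {"error": ...}
        return {"error": str(exc)}


def tool_result_message(call_id: str, result: dict) -> str:
    """Build a tool.result message; result is a JSON string, not an object."""
    return json.dumps({
        "type": "tool.result",
        "call_id": call_id,
        "result": json.dumps(result),
    })
```

Returning errors as ordinary results lets the agent explain the failure to the user rather than leaving the call unanswered.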

Go multilingual

The Voice Agent API understands six input languages: English, French, German, Italian, Portuguese, and Spanish. It can speak eleven languages for output — those six plus Hindi, Japanese, Korean, Mandarin, and Russian. The agent can speak a language it can't transcribe, which is useful for translation-style flows where the user speaks one language and the agent replies in another.

Pick a voice that matches the language and accent you want. American and British English voices like ivy, james, and sophie carry their accent into other output languages. Native-accent voices like arjun (Hindi), pierre (French), and lucia (Spanish) code-switch naturally between their primary language and English.
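
The input and output language lists are easy to encode as a pre-flight check before you open a session. This is a sketch of the support matrix described above. The function name and return labels are made up, and it does not show how a language is selected in the session config, which this post does not cover.

```python
# Languages the agent can understand, per the list above.
INPUT_LANGUAGES = frozenset({
    "English", "French", "German", "Italian", "Portuguese", "Spanish",
})
# Languages the agent can speak but not transcribe.
OUTPUT_ONLY_LANGUAGES = frozenset({
    "Hindi", "Japanese", "Korean", "Mandarin", "Russian",
})
OUTPUT_LANGUAGES = INPUT_LANGUAGES | OUTPUT_ONLY_LANGUAGES


def check_language_pair(user_language: str, agent_language: str) -> str:
    """Classify a flow, raising ValueError if either side is unsupported."""
    if user_language not in INPUT_LANGUAGES:
        raise ValueError(f"{user_language} is not a supported input language")
    if agent_language not in OUTPUT_LANGUAGES:
        raise ValueError(f"{agent_language} is not a supported output language")
    if user_language == agent_language:
        return "same-language"
    return "translation-style"
```

For example, a Spanish-speaking user with a Japanese-speaking agent is a valid translation-style flow, while a Hindi-speaking user is rejected because Hindi is output-only.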

What's next

A few things on the roadmap that builders have been asking about:

Knowledge base integration for agents that need to answer from your documentation, product specs, or support content
Better long-running tool patterns for transfers, escalations, and async jobs
Continued improvements to interruption detection as we collect more real-world conversation data

Get started

The fastest path to a working prototype is the browser quickstart — drop in your API key, open the HTML file, start talking. From there, the vibe-coding guide walks through using Claude Code, Cursor, or Windsurf to extend the agent for your use case.

Talk to a live agent to hear it in production, or grab an API key and start building.

Frequently asked questions

What is the AssemblyAI Voice Agent API?

The AssemblyAI Voice Agent API is a single WebSocket endpoint that handles the full voice agent pipeline — speech-to-text, LLM reasoning, text-to-speech, turn detection, tool calling, and session resumption — for $4.50/hr. Developers send audio frames in and receive audio frames back, with no need to wire together separate STT, LLM, and TTS providers.

How much does the Voice Agent API cost?

The Voice Agent API is $4.50 per hour, flat rate. That covers speech-to-text, LLM reasoning, text-to-speech, and turn detection in a single bill. There are no per-character TTS fees or pre-purchased concurrency packages.

What is the WebSocket endpoint for the Voice Agent API?

The Voice Agent API WebSocket endpoint is wss://agents.assemblyai.com/v1/ws for US traffic and wss://agents.eu.assemblyai.com/v1/ws for EU traffic. Authenticate with a Bearer token in the Authorization header, or pass a temporary token as ?token=... for browser-based clients.
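
As a rough sketch, both endpoints and both authentication styles can be resolved in one helper. The URLs, the Bearer header, and the ?token= query parameter come from the answer above. The "us" and "eu" region keys and the connection_target name are assumptions for this example.

```python
from urllib.parse import urlencode

ENDPOINTS = {
    "us": "wss://agents.assemblyai.com/v1/ws",
    "eu": "wss://agents.eu.assemblyai.com/v1/ws",
}


def connection_target(region="us", api_key=None, temp_token=None):
    """Return (url, headers) for either API-key or temporary-token auth."""
    if (api_key is None) == (temp_token is None):
        raise ValueError("pass exactly one of api_key or temp_token")
    base = ENDPOINTS[region]
    if temp_token is not None:
        # Browser-style auth: the token travels in the query string.
        return f"{base}?{urlencode({'token': temp_token})}", {}
    # Server-side auth: the Bearer prefix is required.
    return base, {"Authorization": f"Bearer {api_key}"}
```

The returned pair maps directly onto a WebSocket client call: the URL is the connection target and the headers dict is what you pass as the extra request headers.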

How does the Voice Agent API handle interruptions?

Interruption handling is semantic and on by default. A multimodal model looks at both the audio and the Universal-3 Pro transcript text to decide whether the user is back-channeling ("uh-huh," "okay") or actually trying to take the floor ("wait, stop"). Backchannels don't interrupt the agent. True interruptions emit reply.done with status: "interrupted", at which point the client should flush its audio playback buffer.

Can the Voice Agent API call external tools or APIs?

Yes. Register tools in session.tools on session.update. When the agent decides to call one, the server emits a tool.call event with call_id, name, and arguments (already parsed as a dict). Run your tool, then send a tool.result event with the matching call_id and a JSON-encoded result string. Tools support both interactive mode (agent fills the wait with a transition phrase) and hold mode (agent stays silent during long-running operations like transfers).

What languages does the AssemblyAI Voice Agent API support?

The Voice Agent API understands six input languages — English, French, German, Italian, Portuguese, and Spanish — and can speak eleven output languages, adding Hindi, Japanese, Korean, Mandarin, and Russian. The agent can speak a language it can't transcribe, which supports translation-style flows where the user speaks one recognized language and the agent replies in another.

Do I need an SDK to use the Voice Agent API?

No. The Voice Agent API is a standard JSON-over-WebSocket protocol that works with any WebSocket client in any language.

Related posts

How to catch voice agent regressions before your users do

Fast ASR for voice agents: bring your own turn detection

Time to first token: the latency metric that decides voice agents

How to choose the best speech-to-text API for voice agents

Twilio phone agent with AssemblyAI Universal-3 Pro Streaming

Speech-to-Text AI for product managers: How it works and key considerations

AssemblyAI named to Fast Company’s list of Most Innovative Companies for 2025

Build a voice agent with LiveKit and AssemblyAI’s Voice Agent API