May 19, 2026

Build a real-time voice AI agent in Python with the AssemblyAI Voice Agent API

Build a real-time voice agent in Python in under 100 lines of code using AssemblyAI's Voice Agent API — one WebSocket that handles STT, LLM, TTS, turn detection, and tool calling at $4.50/hour flat. Full code, latency tuning guide, and production deployment tips included.

Kelsey Foster

Growth

AI voice agents

Voice Agent API


You can build a working real-time voice agent in Python in well under 100 lines of code if you use the right primitive. This tutorial walks through building one on the AssemblyAI Voice Agent API — a single WebSocket that wraps streaming speech-to-text, an LLM, text-to-speech, turn detection, and tool calling at $4.50/hour flat. No three-provider pipeline to wire up, no separate STT WebSocket plus LLM HTTP plus TTS stream to coordinate. Audio in, audio out, tool calls in between.

By the end of this guide, you'll have a runnable Python voice agent that listens through your microphone, holds a real conversation, and calls Python functions to take actions. The companion repository is linked at the end. If you'd rather chain streaming STT, an LLM, and a TTS provider yourself, our Python voice agent tutorial covers that path, or see the 5-minute Voice Agent API quickstart for an even faster path.

Why use the Voice Agent API for a Python voice agent

The traditional "voice agent in Python" tutorial wires together a streaming STT API, an LLM HTTP endpoint, and a TTS streaming connection — three providers, three sets of credentials, three sets of latency variables to tune, and your own turn detection logic to write. The result works, but it's a lot of plumbing.

The Voice Agent API replaces all of that with one WebSocket. You connect once, send audio frames, and receive both audio output and tool call events on the same stream. Three properties make it useful for production Python voice agents:

One bill, one set of logs. $4.50/hour of session time covers STT, LLM inference, TTS, turn detection, and tool calling. You're not pasting three invoices into a cost spreadsheet.
Speech accuracy that works on real audio. Universal-3 Pro Streaming sits underneath — 307ms P50 latency, immutable transcripts, native 8kHz mulaw support for telephony, and 21% fewer alphanumeric errors than the previous generation of streaming STT.
Tool calling that maps to Python functions cleanly. Define tools as JSON schemas, the LLM calls them, results stream back into the conversation. No separate function-calling API or LLM provider to manage.
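
To make the flat rate concrete, here is a minimal sketch of the arithmetic. It assumes cost scales linearly with session time; the helper name `session_cost_usd` is hypothetical, and how partial hours are rounded on an actual invoice is an assumption to confirm against your own billing data.

```python
HOURLY_RATE_USD = 4.50  # flat session rate quoted in this article


def session_cost_usd(session_seconds: float, rate: float = HOURLY_RATE_USD) -> float:
    """Estimate the cost of one session, assuming linear proration (an assumption)."""
    if session_seconds < 0:
        raise ValueError("session_seconds must be non-negative")
    return round(session_seconds / 3600 * rate, 4)


# Example: a 4-minute call, then 1,000 calls of the same length.
per_call = session_cost_usd(4 * 60)
print(f"per call: ${per_call:.2f}, per 1,000 calls: ${per_call * 1000:.2f}")
```

Because the rate covers every layer, session duration is the only input you need to track for a first-pass estimate.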

Architecture

  Microphone
     │  PCM16 24kHz mono
     ▼
  Your Python script
     │  WebSocket: input.audio frames
     ▼
  AssemblyAI Voice Agent API
   ┌────────────────────────────────┐
   │  STT + Turn detection           │
   │      ↓                          │
   │  LLM + tool calling             │
   │      ↓                          │
   │  TTS                            │
   └────────────────────────────────┘
     │
     │  WebSocket: reply.audio + tool.call events
     ▼
  Your Python script
     ├─► Speaker playback
     └─► Dispatch tool calls back to LLM

Audio flows in two directions on the same WebSocket. Your script captures mic audio, base64-encodes it, and sends it as input.audio events. The API returns audio playback chunks as reply.audio events and structured tool.call events when the LLM decides to invoke one of your tools. You dispatch the tool, send back a tool.result, and the conversation continues.
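
The framing described above can be sketched as two small helpers. The field names (`audio` on input.audio, `data` on reply.audio) mirror the agent code later in this tutorial; the helper names `encode_input_audio` and `decode_reply_audio` are hypothetical.

```python
import base64
import json
from typing import Optional


def encode_input_audio(pcm_chunk: bytes) -> str:
    """Wrap raw PCM16 bytes in an input.audio event, ready to send as a text frame."""
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def decode_reply_audio(raw_event: str) -> Optional[bytes]:
    """Return PCM bytes from a reply.audio event, or None for any other event type."""
    event = json.loads(raw_event)
    if event.get("type") != "reply.audio":
        return None
    return base64.b64decode(event["data"])


# Round trip with two samples of silence (4 bytes of PCM16).
frame = encode_input_audio(b"\x00\x00\x00\x00")
print(frame)
```

Keeping the encode and decode steps in plain functions makes them easy to unit test without a live connection.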

Before you start

You'll need:

An AssemblyAI account with Voice Agent API access
Python 3.11+
A working microphone and speakers (use headphones for clean barge-in — desktop mics pick up the agent's own voice and cause it to interrupt itself)
portaudio installed system-wide (brew install portaudio on macOS, apt install portaudio19-dev on Debian/Ubuntu)

Install the dependencies:

pip install "websockets>=14" python-dotenv pyaudio

Drop your API key into a .env file:

ASSEMBLYAI_API_KEY=your_key_here

Step 1: Capture microphone audio

PyAudio captures raw PCM audio. The Voice Agent API's default audio/pcm encoding is 24 kHz, 16-bit, mono — the audio format docs recommend ~50 ms chunks for low latency.

# audio.py
import threading
from queue import Queue
import pyaudio

SAMPLE_RATE = 24000
CHUNK_SIZE = 1200  # 50ms at 24kHz 16-bit mono

class Mic:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self.queue = Queue()
        self._running = False

    def start(self):
        self._running = True
        self._stream = self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
            input=True, frames_per_buffer=CHUNK_SIZE,
        )
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while self._running:
            self.queue.put(
                self._stream.read(CHUNK_SIZE, exception_on_overflow=False)
            )

    def stop(self):
        self._running = False
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()

class Speaker:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self._stream = self._open()

    def _open(self):
        return self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE, output=True,
        )

    def play(self, audio_bytes):
        self._stream.write(audio_bytes)

    def flush_and_restart(self):
        # Called on barge-in: drop any queued speech and reopen the stream.
        try:
            self._stream.stop_stream(); self._stream.close()
        except Exception:
            pass
        self._stream = self._open()

    def close(self):
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()
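
If you want a different chunk duration, the numbers in audio.py follow from simple arithmetic. The sketch below is illustrative only; the helper names are hypothetical. PyAudio counts `frames_per_buffer` and `read()` sizes in frames, so CHUNK_SIZE is a frame count, not a byte count.

```python
SAMPLE_RATE = 24000     # Hz, matches audio.py
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # mono


def frames_for(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio frames in a chunk of the given duration."""
    return sample_rate * duration_ms // 1000


def bytes_for(duration_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Raw PCM payload size of that chunk before base64 encoding."""
    return frames_for(duration_ms, sample_rate) * BYTES_PER_SAMPLE * CHANNELS


def base64_len(raw_bytes: int) -> int:
    """Length of the base64 text for a payload of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


print(frames_for(50), bytes_for(50), base64_len(bytes_for(50)))
# 1200 frames, 2400 bytes, 3200 base64 characters
```

So each 50 ms chunk is 1,200 frames and 2,400 bytes of PCM, which grows to 3,200 characters once base64-encoded.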

Step 2: Open the Voice Agent API session

The Voice Agent API connection starts with a session.update message that declares your system prompt, the tools you want available, the agent's voice, and an opening greeting. The API picks audio/pcm (24 kHz) by default, so you don't need to specify input/output format explicitly.

# agent.py
import asyncio, base64, json, os
import websockets
from dotenv import load_dotenv

from audio import Mic, Speaker
from tools import TOOLS, dispatch_tool

load_dotenv()

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"

SYSTEM_PROMPT = """You are a helpful voice assistant.
Keep replies short and conversational — one or two sentences.
Use the available tools to answer questions when relevant."""

async def open_session(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT,
            "greeting": "Hi! How can I help?",
            "tools": TOOLS,
            "output": {"voice": "ivy"},
        },
    }))

A few details worth flagging up front, because they're the easy ones to get wrong:

The auth header for the Voice Agent API uses Authorization: Bearer YOUR_KEY — note the Bearer prefix. This is different from every other AssemblyAI endpoint, which accepts the raw API key with no prefix.
The first message you send is session.update, not session.start. All config nests under a session object.
The voice field is a named voice from the Voice Agent API catalog (e.g. ivy, james, sophie) — not an ElevenLabs voice ID. See the voices reference for the full list.
You must wait for the server's session.ready event before sending any audio.
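
The first three points above can be captured in two small builders. This is a sketch that mirrors the open_session() payload shown earlier; the function names `auth_headers` and `build_session_update` are hypothetical, and the check for a doubled prefix is an extra guard, not something the API documentation is known to require.

```python
import json


def auth_headers(api_key: str) -> dict:
    """Build the Authorization header with the Bearer prefix."""
    if not api_key:
        raise ValueError("API key is empty")
    if api_key.lower().startswith("bearer "):
        raise ValueError("pass the raw key; the Bearer prefix is added here")
    return {"Authorization": f"Bearer {api_key}"}


def build_session_update(system_prompt: str, greeting: str, tools: list, voice: str) -> str:
    """Serialize a session.update message with all config nested under 'session'."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": system_prompt,
            "greeting": greeting,
            "tools": tools,
            "output": {"voice": voice},
        },
    })


print(auth_headers("your_key_here"))
print(build_session_update("You are a helpful voice assistant.", "Hi!", [], "ivy"))
```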

Step 3: Pump audio in, route events out

Two coroutines run concurrently: one sends mic chunks once the session is ready, the other reads events as they arrive.

async def run_agent():
    mic = Mic()
    speaker = Speaker()

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={
            "Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"
        },
    ) as ws:
        await open_session(ws)

        ready = asyncio.Event()
        pending_tools = []
        loop = asyncio.get_running_loop()

        async def send_audio():
            await ready.wait()
            mic.start()
            while True:
                chunk = await loop.run_in_executor(None, mic.queue.get)
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def receive():
            async for raw in ws:
                event = json.loads(raw)
                kind = event["type"]

                if kind == "session.ready":
                    ready.set()
                    print(f"Session ready: {event.get('session_id')}")

                elif kind == "reply.audio":
                    speaker.play(base64.b64decode(event["data"]))

                elif kind == "tool.call":
                    # Accumulate — flush on reply.done, not now.
                    result = dispatch_tool(
                        event["name"], event.get("arguments", {})
                    )
                    pending_tools.append(
                        {"call_id": event["call_id"], "result": result}
                    )

                elif kind == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                        speaker.flush_and_restart()
                    elif pending_tools:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif kind == "transcript.user":
                    print(f"You:   {event['text']}")

                elif kind == "transcript.agent":
                    print(f"Agent: {event['text']}")

        await asyncio.gather(send_audio(), receive())


if __name__ == "__main__":
    asyncio.run(run_agent())

That's the entire voice agent loop. The Voice Agent API handles every layer of the pipeline (STT, LLM, TTS, turn detection) inside the WebSocket. Your job is to feed it audio, play what comes back, and dispatch tool calls.
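
The gating between the two coroutines is easy to check without a network connection. The sketch below replaces the WebSocket with an in-memory queue and a hypothetical `fake_server` task, purely to show that the sender stays idle until session.ready arrives. It is a simulation of the pattern, not of the API.

```python
import asyncio


async def demo():
    ready = asyncio.Event()
    server_events = asyncio.Queue()   # stands in for the WebSocket receive side
    sent = []                         # stands in for frames written to the socket
    sent_before_ready = []

    async def send_audio():
        await ready.wait()            # gate: no audio until session.ready
        for i in range(3):
            sent.append(f"chunk-{i}")
            await asyncio.sleep(0)    # yield control, as ws.send() would

    async def receive():
        while True:
            event = await server_events.get()
            if event == "session.ready":
                ready.set()
            elif event == "close":
                return

    async def fake_server():
        await asyncio.sleep(0.01)
        sent_before_ready.extend(sent)   # snapshot taken before readiness
        await server_events.put("session.ready")
        await asyncio.sleep(0.01)
        await server_events.put("close")

    await asyncio.gather(send_audio(), receive(), fake_server())
    return sent_before_ready, sent


before, after = asyncio.run(demo())
print("sent before ready:", before)
print("sent after ready:", after)
```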

Two more easy-to-miss details:

Tool result timing. Per the tool calling docs, accumulate tool results when tool.call fires and send them inside the reply.done handler — not immediately. The agent generates a short transition phrase ("let me check on that") while the tools run; sending results too early can cause timing issues.
Interruption handling. When the user barges in, the server sends reply.done with status: "interrupted". Drop any queued tool results and flush the speaker so the caller doesn't keep hearing the previous reply.
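
Both rules can be isolated in a small state holder, which keeps the receive() loop shorter and makes the behavior testable. This is one possible refactor, not the structure used in the companion repository; the class name `PendingToolResults` is hypothetical.

```python
import json


class PendingToolResults:
    """Collect tool results during a reply and release them on reply.done."""

    def __init__(self):
        self._items = []

    def add(self, call_id, result):
        # Results are sent as strings, so serialize anything that is not one.
        if not isinstance(result, str):
            result = json.dumps(result)
        self._items.append(
            {"type": "tool.result", "call_id": call_id, "result": result}
        )

    def on_reply_done(self, status=None):
        """Return the messages to send now; nothing if the reply was interrupted."""
        items, self._items = self._items, []
        return [] if status == "interrupted" else items


pending = PendingToolResults()
pending.add("call_1", {"temp_f": 68})
for message in pending.on_reply_done(status="completed"):
    print(json.dumps(message))
```

In the receive() loop, tool.call would call `add()`, and reply.done would send each message returned by `on_reply_done(event.get("status"))`.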

Step 4: Implement the tools

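The dispatch_tool function is where your agent does real work. The Voice Agent API delivers tool.call events with arguments as a JSON object rather than a JSON-encoded string, so once the event itself is parsed, the arguments are already a Python dict. No second json.loads() call is needed.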

# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "type": "function",
        "name": "remember",
        "description": "Save something the user wants you to remember.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
]

_memory = []

def dispatch_tool(name, args):
    if name == "get_weather":
        # In production: call a real weather API.
        return f"It's 68°F and partly cloudy in {args['city']}."
    if name == "remember":
        _memory.append(args["fact"])
        return f"Got it. I'll remember: {args['fact']}"
    return f"Unknown tool: {name}"

The "type": "function" field on each tool is required. Forget it and the API will reject the session.update with a validation error.

In production, replace the stubs with calls to a real weather API, your CRM, a database, or whatever your application actually does. The tool dispatcher is pure Python — anything you can do from a Python function, the voice agent can do.
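
As the tool list grows, a chain of if statements gets hard to maintain. One alternative is a registry keyed by tool name, sketched below. The `tool` decorator and `_REGISTRY` dict are hypothetical names, and returning error text to the model instead of raising is a design choice, not something the API requires.

```python
from typing import Callable, Dict

_REGISTRY: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Register a function under the tool name declared in TOOLS."""
    def decorator(fn):
        _REGISTRY[name] = fn
        return fn
    return decorator


@tool("get_weather")
def get_weather(city: str) -> str:
    # Stub, as in tools.py above; swap in a real weather API call.
    return f"It's 68°F and partly cloudy in {city}."


def dispatch_tool(name: str, args: dict) -> str:
    fn = _REGISTRY.get(name)
    if fn is None:
        return f"Unknown tool: {name}"
    try:
        return fn(**args)
    except TypeError as exc:
        # Usually a missing or unexpected argument from the model.
        return f"Invalid arguments for {name}: {exc}"
    except Exception as exc:
        # Keep the call alive; let the model explain the failure to the user.
        return f"Tool {name} failed: {exc}"


print(dispatch_tool("get_weather", {"city": "San Francisco"}))
print(dispatch_tool("get_weather", {}))
```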

Step 5: Run it

python agent.py

The agent greets you. Try:

"What's the weather in San Francisco?"
"Remember that my passport expires in March."
"What did I just ask you to remember?"

The full flow: your speech → STT → LLM (with tools available) → tool call (if applicable) → tool result → LLM continues → TTS → speaker. All in under a second, on one WebSocket.

Latency: getting under 500ms perceived

A natural-feeling voice agent responds in under 800ms from when you stop talking to when you hear the reply. Best-in-class teams target sub-500ms. Where your milliseconds go on the Voice Agent API:

Stage	Typical latency
Mic chunk → server	~50–100ms
End-of-turn detection	~100–200ms
LLM first-token	~200–400ms
TTS first-byte → speaker	~100–250ms
Perceived total	~450–950ms
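
The perceived total is the sum of the stage ranges, which is easy to confirm and to re-run with your own measurements. The sketch below copies the numbers from the table; the `perceived_range` helper is hypothetical.

```python
# (low_ms, high_ms) per stage, copied from the table above
STAGES_MS = {
    "mic chunk -> server": (50, 100),
    "end-of-turn detection": (100, 200),
    "LLM first token": (200, 400),
    "TTS first byte -> speaker": (100, 250),
}


def perceived_range(stages: dict) -> tuple:
    """Sum the best-case and worst-case latency across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = perceived_range(STAGES_MS)
print(f"perceived total: {low}-{high} ms")  # perceived total: 450-950 ms
```

Only the best case of every stage combined lands under 500ms, so every stage matters if that is your target.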

The Voice Agent API streams audio output as it's generated, so the user hears the first word of the reply while the rest is still being synthesized. The biggest latency wins on the Python side:

Don't buffer mic audio. Send 50ms chunks as they arrive — that's what the audio.py example does.
Don't block in the tool dispatcher. If a tool call takes more than 500ms, the silence becomes audible. Cache hot data, set aggressive timeouts, and consider returning a placeholder ("Let me check on that") while the real call resolves.
Use the streaming audio output. Play reply.audio chunks as they arrive; never wait for the full response.
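
For the second point, one way to bound tool latency is to run the blocking function in a worker thread and give up after a deadline. This is a minimal sketch using only the standard library; `call_tool_with_timeout`, the 0.5 second default, and the fallback text are all assumptions to tune for your application.

```python
import asyncio
import time


async def call_tool_with_timeout(fn, args, timeout_s=0.5,
                                 fallback="Sorry, that lookup is taking too long."):
    """Run a blocking tool off the event loop; return fallback text on timeout."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn, **args), timeout_s)
    except asyncio.TimeoutError:
        # The worker thread is not killed; it finishes in the background
        # and its result is discarded.
        return fallback


def slow_lookup(city):
    time.sleep(0.2)  # stands in for a slow backend call
    return f"Weather for {city}"


async def main():
    in_time = await call_tool_with_timeout(slow_lookup, {"city": "Oslo"}, timeout_s=1.0)
    too_slow = await call_tool_with_timeout(slow_lookup, {"city": "Oslo"}, timeout_s=0.05)
    return in_time, too_slow


print(asyncio.run(main()))
```

Running the tool in a thread also keeps the receive() loop free to process events such as an interruption while the tool is still working.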

Handling interruptions

Real conversations include interruptions. The user changes their mind, asks a follow-up while the agent is still talking, or says "wait, no, the other one." The Voice Agent API handles this server-side: barge-in is semantic — back-channels like "uh-huh" don't trigger an interruption, but "wait, stop" does.

When the user actually interrupts, the server sends reply.done with status: "interrupted" (and transcript.agent with interrupted: true and the trimmed text). Your client should flush any queued speaker audio and drop any pending tool results, exactly as shown in the receive() loop above.
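
PyAudio's blocking `write()` holds the calling thread until the audio is handed to the device, so an alternative to calling it from the receive loop is to put chunks in a buffer that a separate playback thread drains. The sketch below shows only the buffer, with hypothetical names; the playback thread would call `pop()` in a loop and write each chunk to the output stream.

```python
import threading
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """Thread-safe FIFO of audio chunks that can be emptied on barge-in."""

    def __init__(self):
        self._chunks = deque()
        self._lock = threading.Lock()

    def push(self, chunk: bytes) -> None:
        """Called from the receive loop for every reply.audio event."""
        with self._lock:
            self._chunks.append(chunk)

    def pop(self) -> Optional[bytes]:
        """Called from the playback thread; None means nothing is queued."""
        with self._lock:
            return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop everything queued (on status 'interrupted'); return the count dropped."""
        with self._lock:
            dropped = len(self._chunks)
            self._chunks.clear()
            return dropped


buf = PlaybackBuffer()
for _ in range(5):
    buf.push(b"\x00" * 2400)   # five 50 ms chunks of silence
buf.pop()                      # the playback thread consumed one chunk
print("dropped on barge-in:", buf.flush())
```

With this design an interruption discards the queued audio at once, and at most the chunk currently being written is still heard.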

Going to production

The agent above runs against your local microphone. To deploy it, swap the audio transport:

Phone calls (PSTN) — Bridge through Twilio Media Streams. The Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, so phone audio stays in μ-law end-to-end with no resampling. See our LiveKit voice agent guide if you'd rather use an orchestrator.
Web apps — Capture audio in the browser with AudioWorklet, then stream it to the Voice Agent API. See Browser integration for the temporary-token flow that keeps your API key off the client.
Mobile — Same pattern. The native audio capture APIs (iOS AVAudioEngine, Android AudioRecord) emit PCM you can forward through your server.

For all production deployments, add:

Session persistence (save the session_id from session.ready and use session.resume to reconnect within 30 seconds without losing context)
Per-session structured logs (user transcript, agent transcript, tool calls, tool results)
PII redaction on transcripts before they hit your warehouse
A timeout-and-retry policy for tool calls so a slow backend doesn't kill the call
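
The logging and redaction items can start as something as small as the sketch below: one JSON line per event, with transcripts passed through a redaction step first. The record fields and the two regular expressions are illustrative assumptions; the patterns are deliberately naive and not a substitute for a real PII redaction service.

```python
import json
import re
import time

# Naive patterns for illustration only: email addresses and long digit runs.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_DIGITS = re.compile(r"\b\d(?:[\s-]?\d){6,}\b")


def redact(text: str) -> str:
    """Mask emails and digit sequences of seven or more digits."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _DIGITS.sub("[NUMBER]", text)


def log_record(session_id: str, kind: str, text: str, now=None) -> str:
    """Build one JSON line for a transcript, tool call, or tool result."""
    return json.dumps({
        "ts": time.time() if now is None else now,
        "session_id": session_id,
        "kind": kind,            # e.g. "transcript.user", "tool.call"
        "text": redact(text),
    })


print(log_record("sess_123", "transcript.user",
                 "Call me at 415-555-0100 or jo@example.com"))
```

Writing one line per event keeps the logs easy to load into a warehouse and to filter by session_id.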

The complete repository

Fork the runnable Python repo at github.com/kelsey-aai/python-voice-agent-api. It includes mic capture, speaker playback, the WebSocket loop, the tool dispatcher, and example tools you can swap for your own. Around 200 lines of Python end-to-end.

Frequently asked questions

How do I build a real-time voice agent in Python?

The fastest way to build a real-time voice agent in Python in 2026 is to open a WebSocket to the AssemblyAI Voice Agent API at wss://agents.assemblyai.com/v1/ws, stream microphone audio in as input.audio events, and play the reply.audio events you get back. The Voice Agent API handles streaming speech-to-text, the LLM, text-to-speech, turn detection, and tool calling on a single connection at $4.50/hour, so you don't need to wire up three separate providers. With PyAudio for microphone access and the websockets library, the entire agent fits in well under 100 lines of Python.

What's the difference between the Voice Agent API and chaining STT-LLM-TTS in Python?

The chained approach uses three providers: a streaming STT API like AssemblyAI Universal-3 Pro Streaming, an LLM like GPT-4o, and a streaming TTS like ElevenLabs. You write the WebSocket bridge, turn detection logic, and retry handling yourself. The Voice Agent API replaces all of that with a single WebSocket — one provider, one bill, one set of logs. Chained pipelines give you finer control over each layer; the Voice Agent API is faster to ship and easier to operate at scale.

How do I add tool calling to a Python voice agent?

Define tools as JSON schemas in the tools field of your session.update message — each tool needs "type": "function", a name, a description, and a parameter schema. When the LLM decides to call a tool, the Voice Agent API emits a tool.call event on the WebSocket with the tool name, arguments (as a Python dict), and a call_id. Your Python dispatcher runs the actual function, then you send back a tool.result event with that call_id and the result. Send tool results inside your reply.done handler, not immediately on tool.call — the agent speaks a transition phrase while the tools run.

How low can latency go on a Python voice agent?

A well-tuned Python voice agent on the Voice Agent API typically lands at 450–950ms perceived latency from end-of-turn to first audio out. The biggest wins are: (1) keep mic chunks small (~50ms) so end-of-turn detection fires fast, (2) don't block in your tool dispatcher — cache and timeout aggressively, and (3) play reply.audio chunks as they arrive instead of buffering. Universal-3 Pro Streaming alone hits 307ms P50 for transcription, which is the floor for the STT layer.

Can I use a different LLM with the Voice Agent API?

The Voice Agent API ships with frontier-quality LLMs under the hood, selected for low-latency conversational performance. If you specifically need a model that isn't available through the Voice Agent API, you can fall back to a chained architecture where you use AssemblyAI Universal-3 Pro Streaming for the STT layer and bring your own LLM and TTS. Most teams find the Voice Agent API model selection meets their needs and prefer the simpler architecture.

How do I handle interruptions in a Python voice agent?

The Voice Agent API detects barge-in semantically: back-channels like "uh-huh" don't interrupt, but "wait, stop" does. When the user actually interrupts, the server emits reply.done with status: "interrupted" and transcript.agent with interrupted: true. Your Python client should flush the speaker buffer (close and reopen the PyAudio output stream, or call abort() on the output stream if you use sounddevice), drop any pending tool results, and continue listening for the user's new turn. This is what makes interruptions feel natural: the agent stops talking immediately instead of finishing the previous reply.