How to vibe code a voice agent with AssemblyAI's Voice Agent API
Exact prompts to build voice agents with Claude Code, ChatGPT, or Cursor using AssemblyAI's Voice Agent API — browser apps, phone agents, terminal tools, and more. Includes setup tips, follow-up prompts for customization, and troubleshooting for common mistakes.



Vibe coding a voice agent used to mean wrangling three separate APIs—one for speech-to-text, one for the LLM, one for text-to-speech—and hoping your AI coding assistant could hold the glue logic together across all three. That's changed.
AssemblyAI's Voice Agent API collapses the entire pipeline into a single WebSocket. You send mic audio in, you get spoken audio back. Speech recognition, LLM reasoning, text-to-speech, turn detection, and barge-in handling all happen server-side. The integration surface is small enough that Claude Code, ChatGPT, or Cursor can generate a working voice agent on the first prompt.
This post is a prompt guide. You'll get the exact prompts to build different types of voice agents, the setup that keeps your AI coding agent grounded in current docs, and the follow-up prompts that customize behavior without you touching the code directly. There's a companion repo with the complete output if you want to see what these prompts produce.
Set up your AI coding agent first
AI coding assistants work best when they have current context about the APIs they're integrating with. AssemblyAI's API surface changes—model names, parameters, and SDK methods evolve—so agents that rely on training data alone will sometimes produce outdated code. There are three ways to ground your agent, ranked by how reliably they keep output correct.
Project instructions (most effective)
Add this to your project's CLAUDE.md, .cursorrules, or equivalent agent instructions file:
Always fetch https://www.assemblyai.com/docs/llms.txt before writing AssemblyAI code.
The API has changed — do not rely on memorized parameter names.

This runs on every prompt. The agent will check current docs before generating any AssemblyAI code, catching breaking changes automatically.
You can also filter the full docs by language to reduce token usage:
/docs/llms-full.txt?lang=python
/docs/llms-full.txt?lang=typescript

MCP server (tool-based doc access)
Connect the AssemblyAI docs MCP server to give your agent on-demand access to search and read documentation.
Claude Code:
claude mcp add assemblyai-docs --transport http https://mcp.assemblyai.com/docs

Cursor (.cursor/mcp.json):
{
"mcpServers": {
"assemblyai-docs": {
"url": "https://mcp.assemblyai.com/docs"
}
}
}

This gives your agent four tools: search_docs, get_pages, list_sections, and get_api_reference. Any MCP client that supports Streamable HTTP transport can connect using the server URL.
If your agent still relies on training data instead of looking up the docs, add this to your project instructions:
For anything AssemblyAI related, use the assemblyai-docs MCP tools first.
Do not rely on training data.

AssemblyAI skill (deep SDK context)
The AssemblyAI skill gives your AI coding assistant curated instructions and context for the Python and JavaScript SDKs, streaming, voice agents, audio intelligence, and more. It works with Claude Code, Cursor, Copilot, and 60+ other coding agents via the universal skills CLI:
npx skills add AssemblyAI/assemblyai-skill

You can also clone it directly into your skills directory. For Claude Code that's ~/.claude/skills/assemblyai-skill/.
Layer all three for best results. Project instructions catch every prompt, the MCP server provides on-demand lookups, and the skill gives deep SDK context. You don't need all three to get started—but if you're building anything beyond a proof-of-concept, using them together prevents the most common mistakes.
Why the Voice Agent API is ideal for vibe coding
Before the prompts, it helps to understand why this particular API plays so well with AI coding assistants.
A traditional voice agent needs you to wire up streaming STT, an LLM, and a TTS provider, then orchestrate audio routing between them. That's three different SDKs, three auth patterns, and audio format conversions at every boundary. AI coding assistants struggle with this because there are too many moving parts and too many ways to get the integration wrong.
The Voice Agent API is one WebSocket URL: wss://agents.assemblyai.com/v1/ws. You send 24 kHz PCM audio, you get 24 kHz PCM audio back. One API key, one auth pattern (temporary tokens for browser apps), and a handful of JSON event types. That's a small enough surface that an AI coding assistant can generate it correctly without follow-up corrections.
It also handles the hard stuff server-side—neural turn detection that knows the difference between a pause and the end of a sentence, barge-in that stops the agent immediately when the user interrupts, and tool calling that lets the agent execute real functions mid-conversation. These are features that would each take hundreds of lines to implement in a multi-service pipeline. With the Voice Agent API, they're just configuration.
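Here's what "just configuration" looks like in practice: a minimal server-side sketch in Node.js using the ws package. The event names come from the prompts later in this guide, but the exact session.update nesting is an assumption, so have your coding agent verify field names against the current docs.

import WebSocket from "ws";

// Server-side apps authenticate with the API key directly; no token needed.
const ws = new WebSocket("wss://agents.assemblyai.com/v1/ws", {
  headers: { Authorization: process.env.ASSEMBLYAI_API_KEY },
});

ws.on("open", () => {
  // Configure the session once the socket opens.
  // Field nesting is illustrative; confirm the schema in the docs.
  ws.send(JSON.stringify({
    type: "session.update",
    session: { system_prompt: "You are a concise, friendly assistant." },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "session.ready") {
    // Now it's safe to stream mic audio: 24 kHz, 16-bit PCM, base64-encoded.
    // ws.send(JSON.stringify({ type: "input.audio", audio: base64Chunk }));
  } else if (event.type === "reply.audio") {
    // A base64 PCM chunk of the agent's spoken reply; queue it for playback.
  } else if (event.type === "transcript.user" || event.type === "transcript.agent") {
    console.log(event.type, event); // running transcript of both sides
  }
});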
Starter prompts
These prompts work with Claude Code, ChatGPT, Cursor, and any capable AI coding assistant. Each one builds a different type of voice agent. Copy them as-is, or use them as templates.
Browser voice assistant (full-featured)
Build a browser voice assistant using AssemblyAI's Voice Agent API.
Requirements:
- Node.js Express server that mints temporary tokens at /api/voice-token
(GET https://agents.assemblyai.com/v1/token with Bearer auth)
- Browser client that opens a WebSocket to wss://agents.assemblyai.com/v1/ws?token=...
- AudioWorklet for mic capture at 24 kHz (Int16 PCM, base64-encoded)
- Chat-style transcript UI with user/agent bubbles
- Barge-in support: flush scheduled AudioBufferSourceNodes on reply.done interrupted
- Voice picker dropdown, system prompt textarea, greeting selector
- Dark theme, polished CSS
- Use .env for the API key. Never expose the key to the browser.

This prompt produced the companion app—about 70 lines of server code and ~400 lines of client code. It's specific enough that the AI gets the architecture right on the first try: temporary tokens on the server, AudioWorklet for glitch-free mic capture, a playback cursor for scheduling reply audio chunks, and barge-in handling that flushes queued audio when the user interrupts.
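That token-minting endpoint is small enough to sanity-check by hand. A minimal sketch in Node/Express, assuming Node 18+ for built-in fetch; the token endpoint's exact response shape isn't specified here, so the handler just forwards whatever JSON comes back:

import express from "express";

const app = express();

// Mint a short-lived token server-side so the API key never reaches the browser.
app.get("/api/voice-token", async (req, res) => {
  const response = await fetch("https://agents.assemblyai.com/v1/token", {
    headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
  });
  res.json(await response.json()); // forward the token payload as-is
});

app.listen(3000);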
Minimal proof-of-concept
Build the simplest possible browser voice agent with AssemblyAI's Voice Agent API.
Single HTML file served by a tiny Express server. The server mints tokens at
/api/voice-token, the browser opens a WebSocket with the token, sends mic audio
as base64 PCM at 24 kHz, and plays reply.audio back through AudioBufferSourceNode.
Handle session.update, session.ready, reply.audio, and transcript events. No frameworks.

Good when you want to validate the concept in 15 minutes before committing to a polished build.
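For reference, the playback half of that single-file client comes down to a few lines of Web Audio scheduling. This is a hedged sketch rather than the companion code; the event.audio field name is an assumption to verify against the docs:

// Inside an async setup function: fetch a token, connect, and schedule audio.
const { token } = await (await fetch("/api/voice-token")).json();
const ws = new WebSocket(`wss://agents.assemblyai.com/v1/ws?token=${token}`);
const ctx = new AudioContext({ sampleRate: 24000 });
let playhead = 0; // where the next chunk should start, in AudioContext time

ws.onmessage = ({ data }) => {
  const event = JSON.parse(data);
  if (event.type !== "reply.audio") return;
  // base64 -> Int16 PCM -> the Float32 samples the Web Audio API expects
  const bytes = Uint8Array.from(atob(event.audio), (c) => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer);
  const buffer = ctx.createBuffer(1, pcm.length, 24000);
  buffer.getChannelData(0).set(Float32Array.from(pcm, (s) => s / 32768));
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  playhead = Math.max(playhead, ctx.currentTime);
  source.start(playhead); // queue this chunk right after the previous one
  playhead += buffer.duration;
};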
Customer-support agent with tool calling
Build a customer-support voice agent for a SaaS product using AssemblyAI's
Voice Agent API. Browser-based, with a Node.js backend for token minting.
The agent should:
- Greet callers professionally
- Answer questions about pricing, features, and account billing
- Use tool calling to look up account status (mock the tool handler)
- Escalate to a human when it can't help
System prompt should scope the agent to support-only conversations.
Include turn detection tuning for patient, deliberate speech.

This exercises tool calling and custom turn detection—two features the Voice Agent API handles natively that would take significant effort to wire up across three separate services.
Phone agent with Twilio
Build a voice agent that handles inbound phone calls using AssemblyAI's
Voice Agent API and Twilio Media Streams. Python backend.
- Twilio sends audio as PCMU (mulaw) at 8 kHz via WebSocket
- Bridge Twilio's Media Stream WebSocket to wss://agents.assemblyai.com/v1/ws
- Set BOTH input and output encoding to audio/pcmu in session.update
(otherwise the agent's reply comes back as 24 kHz PCM that Twilio can't play
without transcoding)
- Handle call events: connected, media, stop
- System prompt: appointment scheduling agent for a dental office
- Use tool calling to check available time slots (mock data)

Telephony audio is harder than microphone audio—8 kHz sampling, compression artifacts, background noise. This prompt tells the AI exactly which encoding to use on both sides of the bridge so it doesn't default to the browser's 24 kHz PCM format and force unnecessary transcoding.
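The payload in question looks roughly like this. The input.format.encoding path comes from the tips later in this post; the matching output.format nesting is an assumption to confirm against the docs:

{
  "type": "session.update",
  "session": {
    "input":  { "format": { "encoding": "audio/pcmu" } },
    "output": { "format": { "encoding": "audio/pcmu" } }
  }
}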
Python terminal agent (no browser)
Build a terminal-based voice agent in Python using AssemblyAI's Voice Agent API.
- Use sounddevice for mic input and speaker output at 24 kHz
- Open a WebSocket to wss://agents.assemblyai.com/v1/ws with API key in
Authorization header (no token needed for server-side apps)
- Send input.audio with base64-encoded PCM frames from the mic
- Play reply.audio PCM frames through the speaker
- Print transcript.user and transcript.agent events to the console
- Handle barge-in: stop playback on reply.done with status interrupted
- .env for the API key

No browser, no server, no frontend. Useful for testing prompts and voice configurations quickly from the terminal.
Follow-up prompts for customization
Once the base app is running, you iterate with follow-up prompts. This is where vibe coding really shines—you describe what you want changed and the AI handles the implementation.
Change the voice
Change the default voice to "james" and add these voices to the picker:
diego (Spanish), arjun (Hindi/Hinglish), pierre (French), ren (Japanese)

The Voice Agent API offers 18 English voices and 16 multilingual voices. Multilingual voices code-switch with English automatically. See the voices catalog for samples.
Tune turn detection
Add turn detection tuning to the session config so the agent is more patient
and doesn't cut people off mid-sentence. Raise min_silence to 800ms and
max_silence to 2500ms. Leave vad_threshold at the default of 0.5.

For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence. Set interrupt_response: false to disable barge-in entirely.
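The resulting config payload looks something like this. The parameter names come from the prompt above, but the turn_detection grouping is an assumption; check the docs for the exact schema:

{
  "type": "session.update",
  "session": {
    "turn_detection": {
      "min_silence": 800,
      "max_silence": 2500,
      "vad_threshold": 0.5
    }
  }
}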
Add tool calling
Add a tool called "check_order_status" that takes an order_id string parameter.
Register it in session.tools on session.update. When the agent calls the tool,
return a mock response with order status "shipped" and tracking number
"1Z999AA10123456784". Send the tool.result on the next reply.done, not
immediately on tool.call.

Tool calling is what separates a voice agent from a voice chatbot. The Voice Agent API handles it natively—register tools with JSON Schema, and the agent calls your functions when appropriate, speaking a natural transition while it waits for results.
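For orientation, here's a hedged sketch of that round trip in Node.js. The registration uses JSON Schema as described above; the tool.call event's id field is an assumption, as is the exact tool.result shape:

// Register the tool in session.update (JSON Schema parameters).
const tools = [{
  name: "check_order_status",
  description: "Look up the status of a customer order",
  parameters: {
    type: "object",
    properties: { order_id: { type: "string" } },
    required: ["order_id"],
  },
}];

// Answer the agent's tool.call with a tool.result.
ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "tool.call" && event.name === "check_order_status") {
    ws.send(JSON.stringify({
      type: "tool.result",
      id: event.id, // assumed: echo the call id so results can be matched
      result: { status: "shipped", tracking_number: "1Z999AA10123456784" },
    }));
  }
});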
Add session resumption
If the WebSocket disconnects, automatically reconnect within 30 seconds using
session.resume. Fetch a new temporary token for the reconnection. Pass the
previous session_id in the session.resume event so context is preserved.

Swap the system prompt live
Add a button that sends a new session.update mid-conversation to change the
system prompt without disconnecting. Note: greeting and output.voice are
immutable after the first apply — only system_prompt, tools, and input can
change mid-session.

Prompting tips that prevent common mistakes
A few specific details in your prompts make the difference between code that works on the first run and code that needs debugging.
Specify the audio format. Include "24 kHz PCM, 16-bit signed, little-endian, mono, base64-encoded" in your prompt. Without this, AI coding assistants often generate code that captures audio at 44.1 or 48 kHz—which sounds chipmunky or slowed-down when the Voice Agent API interprets it at 24 kHz.
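If you want to sanity-check what the AI generates, the conversion from the browser's Float32 samples to base64 PCM should look roughly like this:

// Convert one Float32 frame from the AudioWorklet to base64-encoded Int16 PCM.
function floatToBase64Pcm(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp before scaling
    int16[i] = s < 0 ? s * 32768 : s * 32767;
  }
  let binary = "";
  for (const b of new Uint8Array(int16.buffer)) binary += String.fromCharCode(b);
  return btoa(binary); // Int16Array buffers are little-endian on mainstream platforms
}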
Name the auth pattern. "Temporary tokens minted server-side, passed as a query param to the WebSocket" prevents the AI from putting your API key in browser code. For server-side apps (Python, Node.js), specify "API key in Authorization header" instead—no token needed.
Mention barge-in explicitly. If you don't ask for interruption handling, most AI coding assistants skip it. Without barge-in handling, queued audio keeps playing after the user has already interrupted and moved on.
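A sketch of that handling with Web Audio scheduling, assuming you keep a reference to each scheduled AudioBufferSourceNode and reuse the playback cursor from the earlier sketch (the reply.done status value comes from the prompts above):

// Track every scheduled chunk so an interruption can cancel them all.
const scheduled = new Set(); // add each AudioBufferSourceNode when you start() it

function onReplyDone(event) {
  if (event.status !== "interrupted") return;
  for (const source of scheduled) source.stop(); // silence queued audio now
  scheduled.clear();
  playhead = ctx.currentTime; // reset the scheduling cursor for the next reply
}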
Ask for AudioWorklet by name (browser apps). The alternative is ScriptProcessorNode, which is deprecated and runs on the main thread. Specifying AudioWorklet gets you the modern, glitch-free approach.
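For comparison, a minimal worklet processor is only a few lines. It runs off the main thread and forwards raw Float32 frames to be converted and sent:

// mic-processor.js: runs on the audio rendering thread, not the main thread.
class MicProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0][0]; // 128 Float32 samples per render quantum
    if (channel) this.port.postMessage(channel.slice()); // copy the frame out
    return true; // keep the processor alive
  }
}
registerProcessor("mic-processor", MicProcessor);

// Main thread: await ctx.audioWorklet.addModule("mic-processor.js"),
// then new AudioWorkletNode(ctx, "mic-processor") and listen on node.port.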
List the event types you need handled. Including session.ready, transcript.user.delta, transcript.user, reply.audio, transcript.agent, reply.done, and session.error in your prompt means the AI covers the full event surface instead of guessing which events matter.
Tell the AI about echo cancellation. For browser apps, include "enable echoCancellation, noiseSuppression, and autoGainControl on getUserMedia." Without this, the mic picks up the agent's TTS through the speakers and the agent interrupts itself constantly.
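Those three constraints go directly into the getUserMedia call:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true, // keep the mic from hearing the agent's own voice
    noiseSuppression: true,
    autoGainControl: true,
  },
});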
For Twilio bridges, set encoding on both input and output. A common mistake is setting only input.format.encoding to audio/pcmu and leaving output at the default. The agent then replies in 24 kHz PCM, Twilio can't play it without transcoding, and you lose the zero-transcoding benefit of the integration. Set both to audio/pcmu.
Troubleshooting common issues
When your vibe-coded agent doesn't work on the first run, the cause is usually one of the mistakes covered in the prompting tips above: a sample-rate mismatch, an API key exposed to the browser, missing barge-in handling, or echo feedback from the speakers. Describe the symptom to your AI coding assistant and point it at the matching tip.
The full troubleshooting guide is in the Voice Agent API docs.
Frequently asked questions
What is AssemblyAI's Voice Agent API?
AssemblyAI's Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side—speech-to-text with Universal-3 Pro Streaming, LLM reasoning, and text-to-speech—so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, session resumption, and 30+ voices out of the box. The endpoint is wss://agents.assemblyai.com/v1/ws, and it costs $4.50/hr all-in.
Can I vibe code a voice agent with Claude Code, ChatGPT, or Cursor?
Yes—and AI coding assistants do particularly well with the Voice Agent API because the integration surface is small. One WebSocket, a handful of JSON event types, and a simple auth pattern. The prompts in this guide work with Claude Code, ChatGPT, Cursor, and any capable coding assistant. For best results, set up the MCP server or project instructions so your agent pulls current docs instead of relying on training data.
How do I keep my AI coding agent up to date with AssemblyAI's API?
Three methods, ranked by effectiveness: (1) Add Always fetch https://www.assemblyai.com/docs/llms.txt before writing AssemblyAI code to your project instructions file. (2) Connect the MCP server at https://mcp.assemblyai.com/docs for on-demand doc access. (3) Install the AssemblyAI skill with npx skills add AssemblyAI/assemblyai-skill for deep SDK context. Layer all three for the best results. See AssemblyAI's coding agent prompts page for full setup instructions.
What's the most common mistake when vibe coding a voice agent?
Sample-rate mismatch. The Voice Agent API expects 24 kHz PCM audio. If your prompt doesn't specify this, AI coding assistants often generate code that captures audio at 44.1 or 48 kHz—the browser's default—which produces garbled audio. Include "24 kHz PCM, 16-bit signed, little-endian, mono, base64-encoded" in your prompt to avoid this entirely.
Do I need to understand WebSockets to vibe code a voice agent?
Not really. You need to understand the concept—a persistent two-way connection where you send audio in and get audio back—but the AI coding assistant handles the implementation details. The important thing is to include the right details in your prompt: the WebSocket URL (wss://agents.assemblyai.com/v1/ws), the auth pattern (temporary tokens for browser apps, API key header for server apps), and the event types you need handled.
How much does the Voice Agent API cost?
$4.50/hr all-in—speech recognition, LLM, and voice synthesis included. No per-token math across three separate invoices. AssemblyAI offers a free tier so you can build and test without a credit card. See the pricing page for current details.

