June 23, 2026

Prompting Claude to build voice agents

Claude can write voice agent code in seconds. Shipping it is about context, not wording. Here are four techniques that move it from "demo that compiles" to "agent that ships."

Kelsey Foster
Growth

Ask Claude to build you a voice agent and you'll get working code back in about thirty seconds. That part is easy now. The gap between that first draft and something you'd actually put in front of customers — that's where the prompting craft lives.

Most of the difference doesn't come from clever wording. It comes from what you put in front of the model before you ask for anything. A coding agent is only as good as its context, and for voice agents the two things that matter most — current API docs and a real sense of how your calls actually go — are exactly the two things the model doesn't have by default.

So this isn't a list of magic prompts. It's four techniques that consistently move a voice agent from "demo that compiles" to "agent that ships":

  • Ground the agent in live docs before you prompt anything.
  • Pick your coding model deliberately — but know what actually moves the needle.
  • Keep business context out of the first prompt.
  • Feed the model real call transcripts as context.

If you're brand new to this, start with our walkthrough on how to vibe code a voice agent — it covers the basics of what comes out the other side. This post assumes you've done that once and want the version that holds up.

Ground the agent in live docs first

Here's the failure mode that wrecks more voice agent builds than any other: the model writes confident, clean, completely outdated code.

Voice APIs move fast. Endpoints get versioned, parameters get renamed, whole capabilities get added. AssemblyAI's streaming endpoint moved to wss://streaming.assemblyai.com/v3/ws a while back. Ask a coding agent working from training data alone, though, and there's a real chance it reaches for the retired v2/realtime URL, writes it perfectly, and hands you something that simply won't connect. The code looks right. It isn't. And you lose an hour learning that the hard way.

The same goes for the newest parameters: features like agent_context and conversation carryover on Universal-3.5 Pro Realtime don't exist in any model's training data, so an ungrounded agent can't use the levers that most improve accuracy.

The fix is to stop letting the model guess. Point it at the live docs before your first real prompt. AssemblyAI publishes a docs MCP server exactly for this — it lets a coding agent query current documentation instead of recalling stale training data. In Claude Code, add it once:

claude mcp add --transport http --scope user assemblyai-docs https://mcp.assemblyai.com/docs

There's also a skill that bundles the correct patterns and gotchas:

npx skills add AssemblyAI/assemblyai-skill --global

Now when you ask Claude to wire up streaming transcription or the Voice Agent API, it pulls the current endpoint, the current parameters, and the current auth flow. The hallucinated-endpoint class of bug mostly disappears. This one step does more for code quality than any amount of prompt tuning.
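
Grounding makes the stale-endpoint bug rare, but it is cheap to verify before you run anything. Here is a minimal sketch of such a check, assuming you simply want to scan generated source for the retired v2/realtime path mentioned earlier. The function name, the marker list, and the placeholder host are illustrative and not part of any AssemblyAI tooling.

```python
# Hypothetical pre-flight check: flag generated code that still references
# the retired streaming path. The marker list is an assumption; extend it
# with whatever retired names the current docs call out.
STALE_MARKERS = ("v2/realtime",)
CURRENT_STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws"


def find_stale_endpoints(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that mention a retired path."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(marker in line for marker in STALE_MARKERS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # example.invalid is a placeholder host, not a real AssemblyAI address.
    generated = "\n".join([
        "import json",
        'OLD_URL = "wss://example.invalid/v2/realtime/ws"',
        f'NEW_URL = "{CURRENT_STREAMING_URL}"',
    ])
    for number, line in find_stale_endpoints(generated):
        print(f"line {number}: {line}")
```

A check like this only catches patterns you already know are retired. It complements the docs server; it does not replace it.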

It's not Claude-specific, either. Any MCP-aware tool — Cursor, Windsurf, Codex — can connect to the same server, and prompt-based builders like Lovable or v0 can take the docs URL (https://www.assemblyai.com/docs/voice-agents/voice-agent-api) right in the prompt. Which leads to the question everyone asks next.

Point your coding agent at the docs

Grab a free API key, connect the AssemblyAI docs MCP server, and let your agent write current, correct voice AI code instead of guessing from stale training data.


Claude or Codex? The model matters less than you think

Developers ask this constantly: Claude or Codex for building voice agents? It's a fair question, and the honest answer is the less satisfying one — for this task, the coding agent you pick matters less than whether it can see current docs.

Both are strong. Codex is a capable coding agent and plenty of people ship good work with it. On the Claude side, if you want the most capable model for the kind of long, multi-step, agentic build a voice agent turns into, that's Claude Opus 4.8 — run it at xhigh effort, which is the default Claude Code uses for coding and agentic work. If you care more about speed and cost on a tighter loop, Claude Sonnet 4.6 is the better trade. We won't pretend to hand you a head-to-head leaderboard here; the relevant comparison shifts every few weeks, and you should run your own task on both.

But here's the thing that doesn't shift: a voice agent's correctness depends on getting transport, turn detection, and audio formats exactly right, and none of those live in any model's training data at the version you need. Both Claude and Codex write better voice-agent code with the AssemblyAI docs MCP server attached than either does without it. The grounding is the variable that actually controls your outcome. The model is a smaller one.

For what it's worth, the Voice Agent API was built to work with Claude Code out of the box — one WebSocket, JSON in and out, no SDK to adopt — so if you're starting fresh and undecided, that's the path of least resistance. If you already live in Codex, point it at the same MCP server and keep moving. Either way, ground it first.

Keep business context out of the first prompt

This one feels backwards, so stick with me.

When people sit down to build a customer-facing agent, the instinct is to pour everything into the opening prompt: the product catalog, the refund policy, the escalation rules, the twelve edge cases sales keeps hitting. It feels thorough. It produces a mess.

The problem is that you've asked the model to solve two different problems at once — get the real-time pipeline working and encode your entire business — before either of you has confirmed the pipeline even runs. You get a sprawling first draft where the architecture and the domain logic are tangled together, and when something breaks (it will), you can't tell whether it's the WebSocket bridge or your refund rules.

Split it into two phases. The first prompt is about the engineering, and you should spec that fully and up front: the transport, the model, the latency target, barge-in handling, the STT→LLM→TTS loop. Be specific and complete here — modern models like Opus 4.8 do their best work when the technical task is well-defined in one shot.

What you leave out is the domain:

Build a phone voice agent on AssemblyAI's Voice Agent API, bridged through Twilio over a WebSocket. Target 
~1 second end-to-end latency, handle barge-in, and give me a clean way to plug in a system prompt and tools 
later. Don't add any business logic yet — I want a working loop I can call and talk to first.

That last sentence is the whole technique. You get back a skeleton you can actually dial and have a conversation with. You verify the hard part — the real-time plumbing — in isolation. Then you layer in the business context, against a foundation you already trust. Which is exactly where the next technique comes in, because the best business context isn't written from memory.
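
One way to picture "a clean way to plug in a system prompt and tools later" is a seam in the code: engineering settings live in one structure, and domain context lives in another that starts out empty. The sketch below is only an assumption about how such a skeleton might be organized. Every class and field name is hypothetical, and none of it reflects the Voice Agent API's actual configuration schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineSettings:
    """Phase 1: engineering choices, fixed before any business logic exists."""
    latency_target_ms: int = 1000  # the ~1 second end-to-end target
    barge_in: bool = True          # the caller may interrupt the agent


@dataclass
class DomainContext:
    """Phase 2: business context, deliberately empty in the first draft."""
    system_prompt: str = ""
    keyterms: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True while no business context has been layered in."""
        return not (self.system_prompt or self.keyterms or self.tools)


def within_latency_target(settings: PipelineSettings, measured_ms: float) -> bool:
    """Compare one measured turn against the target from the spec."""
    return measured_ms <= settings.latency_target_ms


if __name__ == "__main__":
    settings = PipelineSettings()
    domain = DomainContext()
    print("phase 1 only:", domain.is_empty())
    print("turn within target:", within_latency_target(settings, 850.0))
```

With the seam in place, a failure during phase 1 can only come from the pipeline, because the domain side has nothing in it yet.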

Feed Claude real call transcripts as context

Once you've got a working skeleton, the most valuable thing you can hand the model isn't a document describing how your calls should go. It's a pile of transcripts showing how they actually go.

Think about what a real call contains that a spec never will: the exact way customers say their account numbers, the product names they mangle, the three different ways people ask for a refund, where they interrupt, what they say when they're confused. That's the raw material for a system prompt that sounds right, a keyterms list that catches the entities your agent will really hear, and a tool list that matches what customers actually ask for. You can guess at all of it. Or you can read it off real calls.

You almost certainly have the recordings sitting in your contact center or call platform already. Transcribe them with AssemblyAI's pre-recorded speech-to-text API — Universal-3 Pro for the highest accuracy, with speaker labels so the agent's lines and the caller's lines stay separate:

import os
import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

# Universal-3 Pro for accuracy; speaker_labels splits agent vs. caller.
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    speaker_labels=True,
)
transcriber = aai.Transcriber(config=config)

# Transcribe a folder of real call recordings into plain-text transcripts.
os.makedirs("transcripts", exist_ok=True)
for filename in sorted(os.listdir("recordings")):
    path = os.path.join("recordings", filename)
    if filename.startswith(".") or not os.path.isfile(path):
        continue  # skip hidden files such as .DS_Store, and subfolders

    transcript = transcriber.transcribe(path)
    if transcript.error:
        print(f"Skipped {filename}: {transcript.error}")
        continue

    # Speaker-labeled turns read the way a conversation actually flows.
    # `or []` guards against a missing utterances list.
    lines = [f"{u.speaker}: {u.text}" for u in transcript.utterances or []]
    out_path = os.path.join("transcripts", os.path.splitext(filename)[0] + ".txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    print(f"Wrote {out_path}")

Now point Claude at the transcripts/ folder and let the real conversations do the work:

Read these 50 real support call transcripts. From them: (1) draft a system prompt for a voice agent that 
handles these calls, in the tone the reps actually use; (2) build a keyterms list of the product names, 
account formats, and domain terms that show up, so streaming transcription gets them right; (3) list the 
tools this agent would need to resolve these calls without a human.

The output is grounded in evidence instead of imagination. The keyterms list contains the SKUs your customers actually say, not the ones you remembered. The system prompt mirrors how your best reps actually talk. And because Universal-3.5 Pro Realtime lets you update keyterms mid-session, and carries conversation context from turn to turn on its own, you can keep feeding that list back into the live agent as you learn more.

This is the technique that's genuinely worth stealing. Most teams prompt their agent from a blank page. The teams whose agents sound human are reading from the transcript.
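
A quick frequency count over the same transcripts gives you something to check the model's keyterms list against. This is a minimal sketch, assuming the "Speaker: text" line format written by the transcription script above. The rule for what counts as an entity-like token (letters and digits mixed, such as a product code) and the function name are illustrative guesses, so adjust them to how your own product names and account formats look.

```python
import re
from collections import Counter

# Assumption: tokens are runs of letters, digits, and hyphens.
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def candidate_keyterms(lines, min_count=2):
    """Count tokens that mix letters and digits, e.g. product codes."""
    counts = Counter()
    for line in lines:
        # Drop the "Speaker: " prefix when present; otherwise keep the line.
        _, sep, text = line.partition(": ")
        if not sep:
            text = line
        for token in TOKEN.findall(text):
            has_alpha = any(c.isalpha() for c in token)
            has_digit = any(c.isdigit() for c in token)
            if has_alpha and has_digit:
                counts[token.upper()] += 1
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Made-up sample lines; in practice, read them from transcripts/*.txt.
    sample = [
        "A: Thanks for calling, how can I help?",
        "B: My AX-200 stopped charging.",
        "A: Is that the ax-200 or the AX-300?",
        "B: The AX-200, order 12345.",
    ]
    for term, n in candidate_keyterms(sample):
        print(term, n)
```

Terms that show up in this count but not in the generated keyterms list are worth a second look, and so are terms in the list that never occur in the calls.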

Transcribe your own calls

Run a real call recording through Universal-3 Pro in the playground and see the transcript accuracy — entities, speaker labels, and all — that your agent's context will be built on.


The two-phase workflow, end to end

Put the four techniques in order and you get a repeatable way to build:

| Phase | What you prompt | What you feed the model | What you get |
| --- | --- | --- | --- |
| Setup | | AssemblyAI docs MCP server + skill | An agent that writes current, correct API code |
| Phase 1: skeleton | The full engineering spec; no business logic | Live docs | A working STT→LLM→TTS loop you can call and verify |
| Phase 2: ground it | "Build the prompt, keyterms, and tools from these" | Real call transcripts | A system prompt, keyterms, and tools grounded in real calls |
| Tune | Targeted fixes as you test | Fresh transcripts of your agent's own calls | An agent that keeps getting more accurate |

Notice the model brand never appears in that table. Claude or Codex, Opus 4.8 or Sonnet 4.6 — the workflow is the same, and the workflow is what carries the quality.

The takeaway

The interesting shift in building voice agents isn't that AI can write the code. It's that the bottleneck moved. It used to be syntax and SDK archaeology. Now it's context, and context is something you control completely.

Stale docs and remembered business rules give you a demo. Live docs and real transcripts give you an agent. Whichever model you prompt, that's the lever. Ground it, spec the engineering cleanly, and let your own calls write the hard part.

Build your voice agent today

Get a free API key and $50 in credits, point your coding agent at the AssemblyAI docs, and ship your first voice agent on a single WebSocket — no SDK required.


Frequently asked questions

Can you build a voice agent just by prompting Claude?

Yes. Claude — through Claude Code — can scaffold a complete voice agent on AssemblyAI's Voice Agent API, including the speech-to-text, LLM, and text-to-speech loop and the transport bridge. The quality depends far more on context than on wording: give it current documentation through the AssemblyAI docs MCP server and ground it in real call transcripts, rather than letting it work from training data alone. Spec the engineering task first, get a working loop, then layer in your business logic.

How do I stop an AI coding agent from writing outdated AssemblyAI code?

Connect the agent to the AssemblyAI docs MCP server so it queries current documentation instead of recalling stale training data. In Claude Code, run claude mcp add --transport http --scope user assemblyai-docs https://mcp.assemblyai.com/docs, and optionally add the skill with npx skills add AssemblyAI/assemblyai-skill --global. This eliminates the most common failure — confidently written code that uses a retired endpoint or a renamed parameter, or that misses newer capabilities like agent_context on Universal-3.5 Pro Realtime. Any MCP-aware tool, including Cursor and Codex, can use the same server.

Claude or Codex — which is better for building voice agents?

Both are capable coding agents, and you should run your own task on each rather than trust a leaderboard that shifts every few weeks. Among Claude models, Opus 4.8 is the most capable for the long, multi-step build a voice agent becomes — run it at xhigh effort, the default Claude Code uses for coding work — while Sonnet 4.6 trades some capability for speed and cost. For voice agents specifically, the bigger lever is whether the agent can see current API docs: both Claude and Codex write better code with the AssemblyAI docs MCP server attached, and the Voice Agent API works with Claude Code out of the box.

How do I use real call recordings to build a better voice agent?

Transcribe your existing call recordings with AssemblyAI's pre-recorded speech-to-text API — Universal-3 Pro with speaker labels turned on — then hand the transcripts to your coding agent as context. Real calls reveal the actual account-number formats, product names, phrasing, and turn-taking your agent will face, so the system prompt, keyterms list, and tool definitions get built from evidence instead of guesswork. Because Universal-3.5 Pro Realtime lets you update keyterms mid-session, you can keep feeding what you learn back into the live agent.

Should I include my business logic in the first prompt?

No — separate the engineering prompt from the domain prompt. Spec the full technical task (transport, model, latency target, barge-in handling) up front and get a working speech-to-text → LLM → text-to-speech loop you can call and verify first. Add the business context — system prompt, keyterms, tools — in a second phase, ideally generated from real call transcripts. Dumping everything into the opening prompt tangles the architecture with domain logic and makes it much harder to tell what broke when something does.

What do I need to prompt Claude to build a voice agent on AssemblyAI?

You need a free AssemblyAI API key, a coding agent (Claude Code is the smoothest path), and the AssemblyAI docs MCP server connected so the agent writes current code. From there, prompt for the engineering skeleton first, then ground it with transcripts of your real calls. The Voice Agent API is a single WebSocket with one flat rate and no SDK to adopt, which keeps the code the agent generates simple enough to verify quickly.
