May 21, 2026

How I built a voice agent without writing (or understanding) any code

A follow-along tutorial for building a real, deployed Voice AI agent—even if you've never touched a terminal.

Devon Malloy

Staff Growth Manager

Voice Agent API

I am not a developer. I can read code well enough to feel confident about what it's doing, and I can spot when something seems wrong, but I cannot write it from scratch. I don't know what half the error messages mean on first read. I have never in my life set up a server.

And yet: I built and deployed a fully functional AI voice agent that talks F1 regulations with fans in real time—complete with a custom persona, a private knowledge base, and a web interface—and it's live at assemblyai.com/pete.

This is the story of how I did it, and a guide for how you can do the same thing with your own topic.

What we're building

A voice agent is a program that:

Listens to you through your microphone
Understands what you said
Thinks about it (optionally, searches a knowledge base)
Speaks a response back to you through your speaker

The hard parts—speech recognition, language understanding, voice synthesis—are all handled by AssemblyAI's Voice Agent API. Your job is to tell the agent who it is, what it knows, and how it should behave.

My agent? Pit Lane Pete: a retired F1 pit crew mechanic with 22 years of tire changes behind him, now unwillingly detained in a podcast booth to explain the 2026 F1 regulations to people who've never had to bolt on a front wing at 300 kph.

Your agent can be anything. A customer support rep for your product. A knowledgeable docent for a museum exhibit. A study partner for medical board exams. A thought-partner for your job search. The architecture is identical.

Phase 1: Build the brain (Claude Cowork)

Before touching any code, I built two things in Claude Cowork: the knowledge base and the system prompt. Claude Cowork is Anthropic's desktop AI tool for non-developers. I chose it for this task because it lets you work with files, do research, and create documents through conversation. No coding required.

The knowledge base

A knowledge base is the source of truth your agent will search when users ask questions. For Pete, this was 847 pages of 2026 F1 regulations distilled into a single structured markdown file. For your project, this might be:

Your product documentation
A company FAQ
A textbook or manual
Any domain-specific reference you want the agent to draw from

To build yours in Claude Cowork:

Open Cowork and describe what you need. Something like: "I need to create a knowledge base for a voice agent about [your topic]. Here are the source documents..." — then attach your PDFs, paste in your text, or ask Claude to research the topic and structure the information for you.

The output should be a clean markdown file, structured with headings, short sections, and plain language. The agent doesn't read it like a human; it searches it for relevant keywords, so clarity matters more than prose quality.
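
To make "searches it for relevant keywords" concrete, here is a minimal sketch of keyword retrieval over a markdown file split by headings. It is an illustration only: this post doesn't describe how the Voice Agent API actually retrieves content, and `splitSections`, `searchSections`, and the sample text are invented for the example.

```javascript
// Illustrative only: a toy keyword search over a markdown knowledge base.
// Function names and sample text are hypothetical, not part of any AssemblyAI API.

// Split a markdown document into sections, one per heading line.
function splitSections(markdown) {
  const sections = [];
  let current = { heading: "(intro)", body: [] };
  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      sections.push(current);
      current = { heading: match[1].trim(), body: [] };
    } else {
      current.body.push(line);
    }
  }
  sections.push(current);
  // Drop sections with no text at all (for example an empty intro).
  return sections
    .map((s) => ({ heading: s.heading, text: s.body.join("\n").trim() }))
    .filter((s) => s.text.length > 0);
}

// Lowercase a string and break it into words.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Rank sections by how many distinct query words appear in heading + text.
function searchSections(sections, query) {
  const queryWords = new Set(tokenize(query));
  return sections
    .map((section) => {
      const words = new Set(tokenize(section.heading + " " + section.text));
      let score = 0;
      for (const word of queryWords) {
        if (words.has(word)) score += 1;
      }
      return { ...section, score };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);
}

// Invented sample content, standing in for a real knowledge base file.
const sampleKnowledgeBase = [
  "# Returns",
  "Items can be returned within 30 days of delivery.",
  "",
  "# Shipping",
  "Standard shipping takes three to five business days.",
].join("\n");

const results = searchSections(splitSections(sampleKnowledgeBase), "How long does shipping take?");
console.log(results[0].heading); // "Shipping"
```

Short sections under clear headings help here because each section becomes one searchable unit, and the heading words count toward the match.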

The system prompt

The system prompt is the instruction sheet your agent reads before every conversation. It defines the persona, the rules, the tone, and the constraints.

AssemblyAI has pre-built starter prompts you can use directly—they're production-tested and cover common use cases like customer support, scheduling assistants, and general Q&A. If you're not sure where to start, grab one of those and modify the persona section for your use case.

I wrote my own with help from Cowork. The conversation looked roughly like: "I'm building a voice agent with this knowledge base [attached]. Write me a system prompt that..." and then I described Pete's character in detail.

Phase 2: Actually build the app + interface (Claude Code + AssemblyAI)

Once the KB and system prompt existed, I moved into Claude Code—Anthropic's AI coding tool—to build the actual application. Claude Code runs in your terminal and can write, edit, and debug code on your behalf through conversation. You can vibe code a voice agent with just a setup prompt and a few minutes of conversation with your coding agent.

I gave Claude Code my prompt.md and knowledge-base.md files and described what I wanted to build. I told Claude Code to build a simple, clean interface that covered the most important actions a user would need to take: start a conversation, talk, and end it. That was the brief.

What to consider before you hand it off to Claude Code

Before Claude Code writes a single line, spend two minutes articulating the "shape" of your ideal voice agent: what does "done" actually look like?

Not technically. Just practically. Who uses it, and how? Is it something you share with customers? Something just for your team? Something only you use? Does it live on a webpage, or is it more like a tool you run privately? Is it always on, or do you start it when you need it?

Create a picture in your head, jot down your thoughts, and include them in your first message to Claude Code. Claude's job is to translate that picture, and all of the context you've provided, into the right approach. It will ask follow-up questions if it needs to. Think of it less like giving instructions to a contractor, and more like describing a problem to a very patient consultant who happens to know how to build everything.

The one thing worth knowing upfront—not technically, just conceptually—is that your AssemblyAI API key is like a password. It unlocks your account and bills your usage. However you describe your vision to Claude Code, mention that you want the API key kept private and protected. Claude will figure out the right way to do that given your specific setup. You don't need to know what that looks like—just name the concern, and let Claude Code solve for it.

Hosting

Claude and I decided to deploy my code to Vercel. It's free for small projects, handles serverless functions well, and deploys from a GitHub push. The Vercel API route format is simple:

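// api/token.js
// Serverless route: exchanges the secret API key for a short-lived session token,
// so the key itself never reaches the browser.
export default async function handler(req, res) {
  const response = await fetch('https://api.assemblyai.com/v1/realtime/token', {
    method: 'POST',
    headers: {
      'Authorization': process.env.ASSEMBLYAI_API_KEY,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ expires_in: 300, max_session_duration: 3600 })
  });
  if (!response.ok) {
    // Surface upstream failures instead of returning an undefined token.
    return res.status(502).json({ error: 'Token request failed' });
  }
  const data = await response.json();
  res.json({ token: data.token });
}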

A vercel.json at the root routes the base URL to your HTML file:

{ "rewrites": [{ "source": "/", "destination": "/index.html" }] }
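
As a rough sketch of the other half of this setup, the page can request a short-lived token from `/api/token` and never see the API key itself. The function name `getSessionToken` and the injectable `fetchFn` parameter are assumptions made for illustration; only the `/api/token` path and the `token` field come from the route above.

```javascript
// Hypothetical browser-side helper: asks our own /api/token route for a
// short-lived session token. The API key never leaves the server.
async function getSessionToken(fetchFn = fetch) {
  const response = await fetchFn("/api/token");
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  const data = await response.json();
  if (!data.token) {
    throw new Error("Token response did not include a token");
  }
  return data.token;
}
```

Passing `fetchFn` in as a parameter is only there so the helper can be exercised with a stand-in during testing; in the page itself it would be called with no arguments.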

Phase 3: Gotchas and things to think about

A few things I wish someone had told me before I started — decisions to make, behaviors to expect, and how to handle it when something goes wrong.

How should the conversation end?

A voice agent session doesn't automatically know when it's "done." If nobody says anything for a while, should it hang up? Should it wait indefinitely? Should it say something after a long silence?

These are design decisions, not technical ones. Think about the experience you want: a customer support agent probably shouldn't disconnect after 10 seconds of silence — the user might be looking something up. But a quick-demo agent on a marketing page probably shouldn't run indefinitely if someone walks away from their computer.

Tell Claude Code what you want: "If the user hasn't spoken for 30 seconds, the agent should say goodbye and end the session" or "The session should stay open until the user explicitly ends it." You can also tune how sensitive the agent is to silence and background noise — whether it treats a pause as the end of a thought, or waits longer. Describe the behavior you want; Claude will handle the settings.
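
The "30 seconds of silence" rule can be pictured as a countdown that restarts whenever the user speaks. This is a minimal sketch, not how the Voice Agent API implements it; `createIdleTimer`, `onUserSpeech`, and the callback are hypothetical names.

```javascript
// Hypothetical idle timer: calls onIdle if no speech arrives within timeoutMs.
// The timers argument exists only so the countdown can be faked in tests.
function createIdleTimer(
  timeoutMs,
  onIdle,
  timers = {
    setTimeout: (fn, ms) => setTimeout(fn, ms),
    clearTimeout: (handle) => clearTimeout(handle),
  }
) {
  let handle = null;

  function stop() {
    if (handle !== null) {
      timers.clearTimeout(handle);
      handle = null;
    }
  }

  function start() {
    stop(); // cancel any countdown already running
    handle = timers.setTimeout(() => {
      handle = null;
      onIdle();
    }, timeoutMs);
  }

  return {
    start,
    // Call this whenever the user says something: the countdown restarts.
    onUserSpeech: start,
    stop,
  };
}

// Example wiring: say goodbye and end the session after 30 seconds of silence.
const idleTimer = createIdleTimer(30 * 1000, () => {
  console.log("No speech for 30 seconds: say goodbye and close the session.");
});
// idleTimer.start();        // when the session opens
// idleTimer.onUserSpeech(); // each time the user speaks
// idleTimer.stop();         // when the user ends the session
```

For the "stay open until the user explicitly ends it" option, the same session would simply never start the timer.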

The agent is triggering on background noise

If your agent keeps responding to sounds that aren't speech (keyboard noise, background conversation, the TV), that's a sensitivity setting that can be dialed down. Likewise, if it keeps cutting you off mid-sentence, the silence threshold is too short. Both are turn detection problems, and turn detection is one of the hardest parts of voice agent development.

You don't need to find and edit these settings yourself. Just describe the problem: "The agent keeps triggering when I'm not speaking. Make it less sensitive to background noise." Or: "It keeps interrupting me before I finish my sentence." Claude will know what to adjust.
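
To show what "sensitivity" and "silence threshold" mean in practice, here is a toy end-of-turn detector that runs over a list of audio frame volumes. It is a simplified illustration with made-up names and numbers (`energyThreshold`, `silenceFramesToEnd`), not the Voice Agent API's actual algorithm or settings.

```javascript
// Toy turn detector. Each frame is a loudness value between 0 and 1.
// energyThreshold: frames louder than this count as speech (higher = less sensitive).
// silenceFramesToEnd: quiet frames in a row needed to end the turn (higher = more patient).
function detectTurnEnd(frames, { energyThreshold, silenceFramesToEnd }) {
  let speaking = false;
  let quietRun = 0;
  for (let i = 0; i < frames.length; i++) {
    if (frames[i] > energyThreshold) {
      speaking = true;
      quietRun = 0;
    } else if (speaking) {
      quietRun += 1;
      if (quietRun >= silenceFramesToEnd) {
        return i; // index of the frame where the turn is considered finished
      }
    }
  }
  return -1; // no completed turn: either no speech yet, or still mid-sentence
}

// Speech, a short pause, more speech, then a long silence.
const sampleFrames = [0.6, 0.7, 0.05, 0.05, 0.8, 0.05, 0.05, 0.05, 0.05];

// Impatient setting: the short pause ends the turn early (the "cutting you off" bug).
console.log(detectTurnEnd(sampleFrames, { energyThreshold: 0.2, silenceFramesToEnd: 2 })); // 3
// Patient setting: only the long silence at the end counts.
console.log(detectTurnEnd(sampleFrames, { energyThreshold: 0.2, silenceFramesToEnd: 4 })); // 8
// Quiet keyboard clatter never crosses the threshold, so no turn starts.
console.log(detectTurnEnd([0.1, 0.1, 0.1, 0.1], { energyThreshold: 0.2, silenceFramesToEnd: 2 })); // -1
```

The two complaints map onto the two knobs: triggering on noise means the threshold is too low, and interruptions mean the required silence is too short.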

Something broke. How do I fix it?

Bugs will happen. The process for dealing with them is the same regardless of what's broken: copy the error message exactly as it appears, paste it into Claude Code, and say "I'm getting this error — fix it."

That's it. Don't try to interpret the error, don't Google the error code, don't try to edit the code yourself. The error message almost always contains the information Claude needs to diagnose and fix the problem. The more of the error message you include, the faster it gets resolved.

A few things that caught me specifically: the API documentation had some field names wrong — the actual names were different from what the docs said, and Claude had to correct them by reading the error responses. A URL I was using changed silently at some point, which produced a confusing authentication error that had nothing to do with authentication. In both cases, pasting the error into Claude and asking it to fix the problem was the right move. It found the cause both times.

The docs won't always be right

API documentation is written by humans and doesn't always keep up with the product. If something that looks correct isn't working, trust the error message over the documentation. Claude Code is good at reading error responses and translating them into fixes — it's often faster to let it try something, read the error, and correct course than to carefully read the docs first.

Your API key is a password — treat it like one

Before you share your URL with anyone, make sure your API key isn't exposed. If someone finds it, they can rack up usage billed to your account. Ask Claude Code to add a check that blocks requests from anywhere other than your own site. Then verify it worked by testing it yourself. This is a five-minute addition that's easy to skip and worth not skipping.

Phase 4: Security before you ship

Before making the URL public, I found a significant hole: anyone who knew the URL could call /api/token directly and get a valid session token billed to my account. Each token allows up to an hour of usage.

The fix is an origin allowlist:

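const ALLOWED_HOSTS = [
  "your-domain.vercel.app",
  "www.yourdomain.com",
  "localhost",
  "127.0.0.1",
];
// Use the Origin header when present, otherwise fall back to Referer.
const source = req.headers.origin || req.headers.referer || "";
let hostname = "";
try {
  // Compare the parsed hostname exactly; a substring check would also
  // accept lookalike hosts such as "localhost.evil.example".
  hostname = new URL(source).hostname;
} catch {
  // Missing or malformed header: hostname stays empty and is rejected below.
}
if (!ALLOWED_HOSTS.includes(hostname)) {
  return res.status(403).json({
    error: "Forbidden"
  });
}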

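Browsers normally attach an Origin or Referer header to requests made from your pages, so your site works as before. Other sites that embed or call the endpoint are blocked, and so are plain direct calls that send neither header. A script can still forge these headers, so this filters out casual abuse; it is not full authentication.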

Verify it worked: curl https://your-app.vercel.app/api/token should return {"error":"Forbidden"}.
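
The curl check covers the case where no header is sent at all. For a fuller check, the comparison can be pulled into a small function and run against sample headers before deploying. This is a sketch that assumes exact hostname matching; `isAllowedRequest` is a hypothetical name and the hosts are the same placeholders used above.

```javascript
// Placeholder hosts: swap in your own domains.
const ALLOWED_HOSTS = ["your-domain.vercel.app", "www.yourdomain.com", "localhost", "127.0.0.1"];

// Hypothetical helper: true when the request's Origin (or Referer) hostname
// is exactly one of the allowed hosts.
function isAllowedRequest(headers, allowedHosts = ALLOWED_HOSTS) {
  const source = headers.origin || headers.referer || "";
  try {
    return allowedHosts.includes(new URL(source).hostname);
  } catch {
    return false; // header missing or not a valid URL
  }
}

console.log(isAllowedRequest({ origin: "https://www.yourdomain.com" })); // true
console.log(isAllowedRequest({ referer: "http://localhost:3000/index.html" })); // true
console.log(isAllowedRequest({})); // false, same as the bare curl call
console.log(isAllowedRequest({ origin: "https://localhost.evil.example" })); // false
console.log(isAllowedRequest({ referer: "https://evil.example/?x=www.yourdomain.com" })); // false
```

The last two cases are the ones worth asking Claude Code about: a lookalike hostname and an allowed domain hidden inside someone else's URL should both be rejected.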

The honest account

A few things I'd tell myself at the start:

Claude Code does the heavy lifting, but you still have to steer. The bugs above required me to understand what was failing well enough to describe it accurately. "It doesn't work" produces worse results than "the audio goes silent after 2–3 turns and here's the console error." The more precisely you describe a problem, the faster it gets solved.

Prompt engineering is the real work. Getting Pete to sound like Pete — dry, blunt, deeply resigned, fond of the sport — took more iteration than any of the technical bugs. The system prompt went through more revisions than the code did. If you're building a voice agent for your business, invest at least as much time in the persona and the knowledge base as you do in the deployment.

Try it

Pete is live at assemblyai.com/pete. Ask him about the 2026 active aero regulations, the power unit changes, or which team he thinks made the best car this year. He'll probably answer your question and then say something slightly rude about it.

If you want to build your own, AssemblyAI's Voice Agent API documentation is where to start — they have starter prompts for common use cases you can grab and modify. Bring your own knowledge base, describe your persona, and let Claude Code handle the scaffolding.

Under the hood, the Voice Agent API is powered by Universal-3 Pro Streaming for speech-to-text, with LLM reasoning and text-to-speech handled in a single pipeline — one WebSocket connection at a flat $4.50/hr. When choosing a speech-to-text API for voice agents, that kind of simplicity matters as much as raw accuracy.

You're not the only one building this way

While I was building Pete, half the people I work with were doing the same thing. The AssemblyAI team has been running an internal build sprint — everyone building their own voice agent from scratch, shipping it, and putting it in a showcase. The results are genuinely varied and creative: an angry customer simulator for testing support agents, a clinic receptionist that books and cancels appointments, a LinkedIn post brainstormer that interviews you about your week and drafts the post in your voice.

None of these were built by people with years of voice AI experience. They were built by people who had an idea, described it to Claude, and figured it out as they went. Even handling noisy environments and fine-tuning turn detection became solvable problems once the coding agent had access to the right docs.

You can see all of them — live demos included — at assemblyai.com/showcase.

You don't need to understand the code to build something real with it. You just need to know what you want it to do.

Frequently asked questions

Do I need to know how to code to build a voice agent with AssemblyAI?

No. The Voice Agent API handles the hard technical parts — speech recognition, LLM reasoning, and text-to-speech — in a single managed pipeline. You describe what you want to a coding agent like Claude Code, and it writes the integration for you. You need to be comfortable opening a terminal and running a command, but you don't need to understand or write the code yourself.

How long does it take to build and deploy a voice agent from scratch?

The full build — knowledge base, system prompt, working app, and deployment — took me a few days, including significant time iterating on the persona. The actual technical setup (getting a working agent talking) happened in an afternoon. Most of the time went into prompt engineering and testing the agent's behavior, not writing or debugging code.

What does the Voice Agent API cost?

The Voice Agent API costs $4.50 per hour of conversation, and that flat rate covers speech-to-text (powered by Universal-3 Pro Streaming), LLM reasoning, and text-to-speech together. There are no per-token surcharges or hidden fees for individual pipeline components. AssemblyAI offers a free tier so you can build and test without a credit card.
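
As a quick worked example of the flat rate, the arithmetic below assumes usage is prorated linearly by the minute, which this post doesn't confirm. `estimateCostUsd` is a hypothetical helper, not part of any SDK.

```javascript
const RATE_USD_PER_HOUR = 4.5; // flat rate quoted above

// Estimated cost in US dollars, assuming linear proration by the minute.
function estimateCostUsd(minutes) {
  return (RATE_USD_PER_HOUR * minutes) / 60;
}

console.log(estimateCostUsd(20).toFixed(2)); // "1.50"
console.log(estimateCostUsd(60).toFixed(2)); // "4.50"
```

Under that assumption, a single session token capped at one hour could cost at most $4.50 if someone else used it in full.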

What coding agent should I use to build a voice agent?

Claude Code, Cursor, GitHub Copilot, and Windsurf all work. Claude Code is particularly well-suited because of how it handles multi-file project generation and iterative debugging through conversation. The choice matters less than the setup — whichever agent you use, give it the AssemblyAI docs context so it generates current, working code.

How do I stop the agent from cutting me off or responding to background noise?

These are turn detection and voice activity detection settings that can be tuned. You don't need to find the settings yourself — describe the problem to your coding agent ("the agent keeps interrupting me" or "it triggers on keyboard noise") and it will adjust the silence thresholds and VAD sensitivity for you.

Can I build a voice agent for any topic or use case?

Yes. The architecture is the same regardless of the domain — customer support, education, healthcare intake, sales training, personal assistants, or anything else. What changes is the knowledge base (your source material) and the system prompt (the persona and rules). The Voice Agent API handles the voice pipeline; your job is defining what the agent knows and how it behaves.