Building a voice agent with a coding agent: why this approach beats a visual builder
Why AssemblyAI built its Voice Agent API for coding agents instead of a point-and-click GUI — and how to get started, even if you've never used one.


If you've looked at other voice agent platforms before landing here, you may have noticed something: many of them lead with a visual interface. Click to configure. Fill in a field. Pick a voice from a dropdown. It looks approachable, and in some ways it is, at least on day one.
AssemblyAI took a different path. Instead of a point-and-click interface, the Voice Agent API is built to work with coding agents. That's a deliberate product decision, and this post explains what it means, why it produces a better experience both when you're building and after you've shipped, and how to get started, even if you've never used a coding agent before.
What a visual builder actually is
A visual builder is a product interface that lets you configure software by filling out forms rather than writing code. You log in to a web dashboard, navigate to settings, type your agent's personality into a text field, choose a voice from a list, and click save. The platform runs the agent on its infrastructure and hides the underlying code from you entirely.
This model has genuine appeal. If you've never connected to an API or worked in a terminal, a visual interface removes those barriers. You don't need to know what a WebSocket is. You don't need to install anything. You configure, click, and the thing runs.
The tradeoff is that a visual builder can only expose what its designers built fields for. The agent's behavior is limited not by what's technically possible but by what someone decided to include in the form. When your needs go beyond those fields (a custom tool that looks up a live database, logic that changes based on conversation context, behavior that responds to specific phrases), you quickly reach the edge of what the UI supports. Getting past that edge usually means workarounds, custom integrations, or waiting for the platform to add the feature you need.
For a simple agent, that ceiling might not matter. For an agent that needs to do real work, it does.
What a coding agent is—and why it changes the equation
A coding agent (Claude Code, Cursor, GitHub Copilot, Windsurf) is an AI-powered programming tool that reads instructions and writes code on your behalf. It's not a chatbot. It's not an autocomplete tool. It's a tool that takes a description of what you want to build and produces working project files, installs dependencies, and tells you how to run the result.
You don't need to be a professional developer to use one. You do need to be comfortable opening a terminal and running a command. If that's unfamiliar, the companion post walks through what that looks like in practice. But the core mechanic is simpler than it sounds: you describe what you want, and the coding agent builds it. In fact, you can vibe code a voice agent with just a setup prompt and a few minutes of conversation with your coding agent.
The reason this changes the accessibility equation is that language is a more powerful interface than a form. A form can only accept inputs its designer anticipated. Language can accept any description you can articulate. When you want your agent to do something new, you describe the change, and the coding agent implements it directly in the code, without any constraints imposed by a UI.
How AssemblyAI uses this: the setup prompt
AssemblyAI created a setup prompt specifically designed for this workflow. It's a detailed set of instructions that tells a coding agent exactly how to build a working Voice Agent API integration: how to structure the project, which libraries to install, how to connect to the API, and what a complete working session looks like.
You don't write any of this yourself. You paste the prompt into your coding agent, and the agent builds the project. The whole setup takes an afternoon, including a first conversation with a working voice agent.
The prompt is long (several hundred lines), and that sometimes causes hesitation. Here's what it actually is: text. It is a set of instructions for the coding agent to read, not a script for your computer to execute. Pasting it in runs nothing on its own. The coding agent reads the instructions, produces code files, and stops. Those files don't run until you run them yourself. You are in control of every execution step.
The length is a feature, not a warning sign. A short prompt produces vague, incomplete code. A detailed prompt gives the coding agent everything it needs to produce a working integration without requiring you to already understand the API. The expertise is in the prompt. You benefit from it without having to possess it first.
Better to build—and better to maintain
This is where the coding agent approach pulls clearly ahead of a visual builder, and it's worth being specific about why.
When you build with a visual builder, your agent lives on the platform's infrastructure and is configured through its interface. Changing the agent means navigating back to the form. Adding a capability means checking whether the form has a field for it. Debugging means working with whatever logs and visibility the platform decides to expose.
When you build with a coding agent and the Voice Agent API, your agent is a project that lives in your environment. You own the code. Every aspect of the agent's behavior (the system prompt, the voice, the tools it can call, the turn detection settings) exists as readable, changeable text in files you control.
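To make "readable, changeable text" concrete, here is a minimal sketch of how a generated Python project might hold those settings. The class and field names (`AgentConfig`, `TurnDetection`, `silence_ms`, the `example-voice` value) are illustrative assumptions, not the Voice Agent API's actual schema; the project your coding agent generates may organize them differently.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class TurnDetection:
    # Hypothetical setting: how long to wait after the user stops speaking.
    silence_ms: int = 500


@dataclass
class AgentConfig:
    # All field names are illustrative, not the Voice Agent API's schema.
    system_prompt: str
    voice: str
    tools: list[str] = field(default_factory=list)
    turn_detection: TurnDetection = field(default_factory=TurnDetection)


config = AgentConfig(
    system_prompt="You are a friendly receptionist for a dental clinic.",
    voice="example-voice",
    tools=["check_availability"],
)

# The whole configuration is ordinary data you can print, diff, and version.
print(json.dumps(asdict(config), indent=2))
```

Because the configuration is plain data in a file, a request like "wait a full second before responding" would come down to changing one value (here, the hypothetical `silence_ms`) rather than hunting for a field in a dashboard.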
The iteration loop reflects this directly. You run the agent, notice what you want to change, and describe the change to the coding agent in plain language: "Have it introduce itself by name," "Add a tool that checks appointment availability," "Make it wait a full second after the user stops speaking before it responds." The coding agent updates the code. You run it again. There's no form to navigate. There's no field to find. There's no ceiling.
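As an illustration, a request like "Add a tool that checks appointment availability" might lead the coding agent to write something along these lines. This is a sketch under stated assumptions: `check_availability`, `OPEN_SLOTS`, and `call_tool` are hypothetical names, the schedule is a hard-coded stand-in for a real calendar or database, and the way tools are registered with the Voice Agent API is not shown here.

```python
from datetime import date

# Hypothetical in-memory schedule standing in for a real database or calendar API.
OPEN_SLOTS = {
    date(2025, 3, 10): ["09:00", "14:30"],
    date(2025, 3, 11): [],
}


def check_availability(day: str) -> dict:
    """Return open slots for an ISO date string such as '2025-03-10'."""
    slots = OPEN_SLOTS.get(date.fromisoformat(day), [])
    return {"date": day, "available": bool(slots), "slots": slots}


# A name-to-function registry lets the agent code look up a tool by name
# when the model asks for one.
TOOLS = {"check_availability": check_availability}


def call_tool(name: str, **arguments) -> dict:
    """Run a registered tool, or report an error for an unknown name."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**arguments)
```

Calling `call_tool("check_availability", day="2025-03-10")` returns the two open slots for that day. Swapping the dictionary for a real lookup is the kind of follow-up change you would again describe in plain language.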
That same ownership matters after you've shipped. When something needs to change (a new tool, an updated system prompt, a behavior refinement based on how real users are interacting), the change is a conversation with the coding agent, not a UI update. As the agent gets more sophisticated, the coding agent approach scales with it. The visual builder stays at whatever complexity its interface was designed to handle.
What AssemblyAI designed for this workflow
The Voice Agent API was built specifically to work well with coding agents. Powered by Universal-3 Pro Streaming, it handles speech-to-text, LLM reasoning, and text-to-speech in a single managed pipeline. A few decisions reflect that design directly.
The setup prompt is written so Claude Code, Cursor, or GitHub Copilot can produce a complete, working integration from it alone, with no prior knowledge of the API required. The event surface is intentionally small (approximately six event types) so the generated code is short, readable, and easy to change without expertise. The system prompt that defines the agent's behavior is plain text in the session configuration, so you can modify it by describing what you want, with no infrastructure knowledge required.
The API is also designed to be legible to the coding agent, not just to human developers. The documentation structure, the example patterns, the parameter naming: all of it is organized so that a coding agent reading it can reason about what to build. When choosing an STT API for voice agents, this kind of developer ergonomics matters as much as raw accuracy. That's why developers consistently report getting a working agent running the same afternoon they start: the path from prompt to running project is short because it was designed to be.
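A small event surface usually means the generated code can route everything through one short handler table. The sketch below shows that pattern only. The event type names (`session.ready`, `transcript.user`, `reply.agent`) and the `text` field are placeholders chosen for illustration; the real API defines its own event names and payloads.

```python
import json

transcript = []  # collected (speaker, text) pairs


def on_session_ready(event):
    print("session started")


def on_user_transcript(event):
    transcript.append(("user", event["text"]))


def on_agent_reply(event):
    transcript.append(("agent", event["text"]))


# Event type names are placeholders; the real API defines its own.
HANDLERS = {
    "session.ready": on_session_ready,
    "transcript.user": on_user_transcript,
    "reply.agent": on_agent_reply,
}


def handle_message(raw: str) -> bool:
    """Dispatch one JSON message; return False if its type is not handled."""
    event = json.loads(raw)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return False
    handler(event)
    return True
```

With only a handful of entries in the table, adding or changing behavior for one event is a small, local edit, which is what keeps this kind of code easy to hand back to a coding agent.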
Where to start
If you haven't used a coding agent before: pick one (Claude Code is well-suited to this workflow), open a new session, and paste the AssemblyAI Voice Agent API setup prompt as your first message. The agent will build the project, explain what it created, and give you the command to run it. From there, the iteration is yours.
If you have used coding agents but not for voice: the setup prompt handles the API connection, audio routing, session management, and turn detection. What remains is product logic, which is exactly what you should be spending your time on.
If you want the conceptual grounding before you start: the companion post explains what an AI voice agent is, how the streaming transcription pipeline works, and why the Voice Agent API is built the way it is.
Frequently asked questions
What is a coding agent and do I need to know how to code to use one?
A coding agent is an AI-powered tool like Claude Code, Cursor, or GitHub Copilot that reads your plain-language instructions and writes working code for you. You don't need to be a developer to use one. You need to be comfortable opening a terminal and running a command, but the coding agent handles the actual programming.
How long does it take to build a voice agent with a coding agent?
Most developers ship a working voice agent the same afternoon they start. You paste the setup prompt into your coding agent, it builds the project, and you run it. The detailed setup prompt handles the heavy lifting so you're not starting from scratch or debugging boilerplate.
What's the difference between building with a visual builder vs. a coding agent?
A visual builder constrains you to whatever fields and options its designers included in the form. A coding agent lets you describe any change in plain language and implements it directly in the code. When you need a custom tool, non-standard logic, or behavior the UI doesn't have a field for, the coding agent approach has no ceiling.
What coding agents work with the Voice Agent API?
The setup prompt is designed to work with Claude Code, Cursor, GitHub Copilot, and Windsurf. Any coding agent that can read a detailed prompt and produce project files will work. Claude Code is particularly well-suited to this workflow because of how it handles multi-file project generation.
How much does the Voice Agent API cost?
The Voice Agent API costs $4.50 per hour of conversation, and that flat rate covers speech-to-text, LLM reasoning, and text-to-speech together. There are no per-token surcharges or hidden fees for individual pipeline components.
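For a rough budget, the flat rate turns into simple arithmetic. The helper below is a back-of-the-envelope sketch: `conversation_cost` is a hypothetical name, and it assumes usage is prorated linearly by the minute, which the pricing statement above does not spell out.

```python
from decimal import Decimal

RATE_PER_HOUR = Decimal("4.50")  # flat rate quoted above, in USD


def conversation_cost(minutes: int) -> Decimal:
    """Estimated cost in USD, assuming linear proration by the minute."""
    return (RATE_PER_HOUR * minutes / 60).quantize(Decimal("0.01"))
```

Under that assumption, a 10-minute conversation comes to $0.75, and 1,000 such conversations come to $750.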
Do I need to understand WebSockets to use the Voice Agent API?
No. The coding agent handles the WebSocket implementation details for you based on the setup prompt. You describe what you want your voice agent to do in plain language, and the coding agent writes the connection logic, event handling, and audio routing code.


