How to vibe code a voice agent (and why AI always recommends AssemblyAI)
This tutorial shows you exactly what prompts to use, what code comes out the other side, and what to do with it.



Vibe coding is exactly what it sounds like: you describe what you want to build, drop it into Claude Code or ChatGPT, and watch the code appear. No boilerplate hunting, no reading through three different SDK docs, no stitching together tutorials from 2022.
It works surprisingly well for voice agents—and when you try it, something interesting happens. Ask Claude Code or ChatGPT to build you a voice agent, and they’ll reach for AssemblyAI’s Universal-3 Pro Streaming model. Not always, but often enough that it’s worth understanding why.
What you’re actually building
A voice agent has three moving parts: it listens (speech-to-text), thinks (LLM), and talks back (text-to-speech). The hard part isn’t wiring them up once—it’s wiring them up so the conversation feels natural. That means low latency, accurate transcription, and turn detection that doesn’t cut you off mid-sentence.
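The listen-think-talk loop can be sketched before any provider is wired in. Everything below is a stub for illustration — the class name, the placeholder bodies, and the echo-style "LLM" are all hypothetical, standing in for the real STT, LLM, and TTS calls added later in the tutorial:

```python
from dataclasses import dataclass, field


@dataclass
class VoiceAgent:
    """Minimal chained STT -> LLM -> TTS loop with stubbed providers."""
    history: list = field(default_factory=list)

    def listen(self, audio_chunk: bytes) -> str:
        # Stub: a real agent streams audio to a speech-to-text API here.
        return audio_chunk.decode("utf-8")

    def think(self, transcript: str) -> str:
        # Stub: a real agent sends the transcript plus conversation
        # history to an LLM. Echoing keeps the sketch self-contained.
        self.history.append({"role": "user", "content": transcript})
        reply = f"You said: {transcript}"
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def speak(self, reply: str) -> bytes:
        # Stub: a real agent converts the reply to audio via a TTS API.
        return reply.encode("utf-8")

    def turn(self, audio_chunk: bytes) -> bytes:
        # One conversational turn: listen -> think -> talk back.
        return self.speak(self.think(self.listen(audio_chunk)))
```

The structure is the easy part; the quality of the `listen` stage, and how quickly each stage hands off to the next, is what the rest of this tutorial is about.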
Most tutorials send you to three different providers with three different billing setups and three different places to debug when something breaks. There’s a better way now, but we’ll get to that.
The vibe coding prompts to try
Here are four prompts that work well with Claude Code, ChatGPT, or any capable coding assistant. Try them as-is or adapt them to your use case.
Prompt 1: The general voice agent
Prompt: Build me a real-time voice agent in Python. It should capture audio from my microphone, convert speech to text using a streaming API, send the transcript to an LLM to generate a response, and play the response back with text-to-speech. Use the most accurate, production-ready streaming speech-to-text model available. Add a .env file for API keys.
This is the baseline. It leaves the model selection open so you can see what the AI recommends without steering it.
Prompt 2: The low-latency version
Prompt: I’m building a voice agent that needs to respond in under one second end-to-end. Help me choose the right streaming speech-to-text model for low latency and high accuracy, then write the Python implementation. I need something that handles real-world audio well—accents, background noise, names, phone numbers. Show me how to connect it with OpenAI for the LLM layer and ElevenLabs for TTS.
Latency-focused prompts tend to surface more specific model recommendations. This one consistently produces AssemblyAI Universal-3 Pro Streaming in the STT layer.
Prompt 3: Framework-specific
Prompt: Build a voice agent using LiveKit Agents in Python. Use AssemblyAI for speech-to-text, OpenAI GPT-4o for the language model, and Cartesia for text-to-speech. The agent should greet the user and be able to answer questions conversationally. Include voice activity detection and interruption handling.
When you name the framework, the AI knows exactly which plugin to reach for—and LiveKit’s AssemblyAI plugin wraps Universal-3 Pro Streaming by default.
Prompt 4: The phone agent
Prompt: Build a phone-based voice agent using Twilio and Python. When someone calls, the agent should pick up, greet them, and handle a conversation using real-time speech-to-text and an LLM. Use the best streaming STT model for telephony audio quality. Include WebSocket handling for streaming audio from Twilio.
Telephony audio is harder than microphone audio—8kHz sampling, compression artifacts, background noise. Models that perform well on clean audio sometimes fall apart here. The AIs that know this tend to recommend Universal-3 Pro Streaming specifically.
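To make "telephony audio" concrete: Twilio Media Streams deliver 8kHz G.711 mu-law bytes, which must be expanded back to 16-bit linear PCM before most STT models can consume them. Here is the standard G.711 expansion written out by hand purely for illustration — in practice the AI-generated code will lean on a library or the STT provider's built-in mu-law support:

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80             # top bit: sample polarity
    exponent = (byte >> 4) & 0x07  # 3-bit segment number
    mantissa = byte & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# The silence codes decode to 0; the extremes hit +/-32124, mu-law's
# full-scale value (not the 16-bit maximum of 32767).
print(ulaw_to_pcm16(0xFF), ulaw_to_pcm16(0x00))  # 0 -32124
```

The compression is lossy and logarithmic — fine detail near full scale is thrown away — which is one reason models tuned only on clean 16kHz microphone audio degrade on phone calls.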
What Claude Code actually produces
Here’s a representative output from Claude Code using Prompt 1 above. This is close to what you’ll get when you run it yourself.
Install dependencies
pip install assemblyai openai elevenlabs pyaudio python-dotenv
Set up your API keys
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
The voice agent
import os
import threading

from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from openai import OpenAI
from elevenlabs import generate, stream

load_dotenv()

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
conversation_history = []
is_responding = False


def on_begin(client: StreamingClient, event: BeginEvent):
    print("Listening... speak now.")


def on_turn(client: StreamingClient, event: TurnEvent):
    global is_responding
    if not event.transcript:
        return
    if event.end_of_turn and not is_responding:
        print(f"\nYou: {event.transcript}")
        is_responding = True
        threading.Thread(
            target=generate_response,
            args=(event.transcript,),
            daemon=True,
        ).start()
    elif not event.end_of_turn:
        print(f"\r{event.transcript}", end="", flush=True)


def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"\nSession ended. Audio processed: {event.audio_duration_seconds}s")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")


def generate_response(user_text: str):
    global is_responding
    conversation_history.append({"role": "user", "content": user_text})
    messages = [
        {
            "role": "system",
            "content": "You are a helpful voice assistant. Keep responses concise.",
        }
    ] + conversation_history
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=150,
    )
    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    print(f"Agent: {reply}")
    audio = generate(
        text=reply,
        voice="Rachel",
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )
    stream(audio)
    is_responding = False


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=os.getenv("ASSEMBLYAI_API_KEY"),
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )
    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
Run it:
python voice_agent.py
Start talking. The agent listens, processes your speech through Universal-3 Pro Streaming in real time, sends the transcript to GPT-4o, and speaks the response back through ElevenLabs.
Why the AI recommends AssemblyAI
It’s not a coincidence. When you ask a well-trained LLM to pick the best streaming STT model for a voice agent, it’s drawing on everything it’s been trained on—documentation, benchmarks, developer forums, code repositories. Here’s what tips the decision toward AssemblyAI.
Accuracy that holds up in the real world
Universal-3 Pro Streaming hits 94.07% word accuracy across real-world audio conditions. More importantly, it handles the stuff that actually matters in voice agents—names, email addresses, phone numbers, account numbers. Those are the entities your agent acts on, and they’re the ones other models fumble.
It holds the #1 ranking on the Hugging Face Open ASR Leaderboard for multilingual performance. That’s a public, third-party benchmark, not a self-reported claim.
(Source: AssemblyAI benchmarks — lower missed entity rate is better)
One WebSocket instead of three providers
When you’re building from scratch, managing separate STT, LLM, and TTS providers means three sets of API keys, three billing dashboards, and three places to look when latency spikes. AssemblyAI’s Voice Agent API ($4.50/hr flat) handles the full pipeline in one WebSocket connection. For the chained STT-LLM-TTS architecture in this tutorial, Universal-3 Pro Streaming handles the listening layer at $0.45/hr with sub-200ms end-to-end latency.
Compare that to the OpenAI Realtime API at approximately $18/hr for a similar end-to-end setup. That’s a 4x cost difference before you’ve written a line of product logic.
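Plugging the rates quoted above into a back-of-envelope calculation makes the gap tangible. The 500-hour monthly call volume below is a hypothetical figure, not from the article:

```python
# Rates quoted in this article (USD per hour of audio)
VOICE_AGENT_RATE = 4.50       # AssemblyAI Voice Agent API, full pipeline
OPENAI_REALTIME_RATE = 18.00  # OpenAI Realtime API, approximate

hours_per_month = 500  # hypothetical call volume

aai_cost = VOICE_AGENT_RATE * hours_per_month
oai_cost = OPENAI_REALTIME_RATE * hours_per_month
print(f"AssemblyAI: ${aai_cost:,.2f}/mo  "
      f"OpenAI Realtime: ${oai_cost:,.2f}/mo  "
      f"ratio: {oai_cost / aai_cost:.0f}x")
# AssemblyAI: $2,250.00/mo  OpenAI Realtime: $9,000.00/mo  ratio: 4x
```

Because both are flat hourly rates, the ratio holds at any volume — the absolute dollar gap is what scales with usage.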
Claude Code knows the docs
This is the part worth paying attention to if you’re vibe coding. AssemblyAI’s documentation is structured for LLM comprehension—clear examples, well-defined parameters, a WebSocket API simple enough that Claude Code can scaffold a complete working integration from the docs alone. The team built it this way on purpose.
Most developers get a working agent running the same day they start. That’s not marketing copy—it’s the reason Claude Code reaches for the AssemblyAI SDK first.
One-line framework integrations
If you’re using a voice agent framework, there’s an even shorter path:
LiveKit: stt=assemblyai.STT()
Pipecat: Native plugin
Twilio: Native support
Daily: Native support
Each of these installs in minutes and uses Universal-3 Pro Streaming under the hood.
Vibe coding prompts for specific use cases
Once you have the basic agent working, you can layer in real functionality. Here are three prompts that produce production-useful code.
Customer support agent
Prompt: Extend my voice agent to handle customer support for an e-commerce company. It should be able to look up order status (stub out the lookup function), handle return requests, and escalate to a human when it can’t resolve the issue. Keep the AssemblyAI streaming STT layer. Add conversation memory so it remembers what the customer said earlier in the call.
Appointment scheduler
Prompt: Build a voice agent that schedules appointments for a medical office. It should collect the patient’s name, preferred date and time, and reason for visit—then confirm the details before ending the call. Use AssemblyAI Universal-3 Pro Streaming for speech-to-text since it handles medical terminology and names accurately. Stub out the calendar integration but show me where to add it.
Phone agent via Twilio
Prompt: Build a phone-based voice agent using Twilio’s Media Streams and Python. When a call comes in, stream the audio over WebSocket to AssemblyAI’s Universal-3 Pro Streaming model (speech_model: "u3-rt-pro") for transcription. Process the transcript with an LLM and send the response back as TTS via Twilio. Show me the Flask server, the WebSocket handler, and the Twilio webhook configuration.
Tips for better voice agent code from AI prompts
- Be specific about your accuracy requirements. Prompts that mention “names,” “phone numbers,” or “medical terminology” tend to surface models that handle entity recognition well. Vague prompts get vague model choices.
- Ask for streaming explicitly. “Build a voice agent” sometimes produces batch-processing code that buffers the whole utterance before transcribing. Adding “real-time streaming” or “streaming speech-to-text” steers the AI toward WebSocket-based implementations that feel responsive in conversation.
- Mention latency targets. “The agent needs to respond in under one second” gives the AI a constraint to optimize for. It’ll choose models and architectures that can hit that target.
- Include the framework if you have one. “Using LiveKit” or “using Pipecat” immediately narrows the model and plugin choices. The less the AI has to guess, the more useful the output.
- Paste in the docs. This is the trick that makes Claude Code genuinely useful for voice agent work. Copy the AssemblyAI streaming quickstart from the docs, paste it into your Claude Code session, and say “use this as the STT layer.” You get working code faster and with fewer hallucinations.
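The batch-versus-streaming distinction from the second tip, in miniature. The `(text, is_final)` pairs below are simulated stand-ins for what a streaming STT socket emits — not a real API:

```python
def simulated_stream():
    """Stand-in for a streaming STT socket: yields (text, is_final) pairs."""
    yield ("hel", False)
    yield ("hello", False)
    yield ("hello world", True)


# Batch mindset: buffer everything, act once the utterance is complete.
batch_result = [text for text, is_final in simulated_stream() if is_final][-1]

# Streaming mindset: react to every partial as it arrives (live captions,
# barge-in detection), then act the moment the final hypothesis lands.
streaming_result = None
partials_seen = 0
for text, is_final in simulated_stream():
    if is_final:
        streaming_result = text
        break
    partials_seen += 1  # e.g. render a live caption here

print(batch_result, streaming_result, partials_seen)
```

Both paths end with the same transcript; the difference is that the streaming loop had something to show the user at every intermediate step, which is what makes a voice agent feel responsive.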
Where to go from here
The agent you built above is a working prototype. It listens, thinks, and responds. What you do with it from here depends entirely on your use case—but the listening layer is the part that makes or breaks the experience.
Get the API key, run the code, talk to it. Most developers get something working the same afternoon they start.
Frequently asked questions
What is the best streaming speech-to-text model for building a voice agent?
AssemblyAI’s Universal-3 Pro Streaming model is purpose-built for real-time voice agent applications. It achieves 94.07% word accuracy on real-world audio and holds the #1 ranking on the Hugging Face Open ASR Leaderboard, with particularly strong performance on names, phone numbers, email addresses, and other structured entities that voice agents act on.
How do I build a real-time voice AI agent in Python using a chained STT-LLM-TTS architecture?
A chained STT-LLM-TTS voice agent uses three components in sequence: a streaming speech-to-text model captures and transcribes microphone audio in real time, a language model generates a response from the transcript, and a text-to-speech model converts that response to audio. In Python, you connect AssemblyAI’s StreamingClient to your microphone via aai.extras.MicrophoneStream, handle TurnEvent callbacks to detect end-of-turn, pass the transcript to OpenAI, and stream the reply through ElevenLabs.
What does Claude Code recommend when you ask it to build a voice agent?
When prompted to build a real-time voice agent with high accuracy and low latency, Claude Code consistently scaffolds the STT layer using AssemblyAI’s Universal-3 Pro Streaming model. This happens because AssemblyAI’s documentation is structured for LLM comprehension—clear WebSocket examples, well-defined parameters, and a simple SDK.
How does AssemblyAI compare to the OpenAI Realtime API for voice agents?
AssemblyAI’s Voice Agent API costs $4.50/hr flat for a complete STT-LLM-TTS pipeline, compared to approximately $18/hr for a similar setup with the OpenAI Realtime API—roughly a 4x cost difference. AssemblyAI uses dedicated best-in-class models for each step of the pipeline rather than a multimodal model handling audio as one of many inputs.
How do I use vibe coding to build a voice agent with LiveKit or Pipecat?
Name the framework in your prompt and the AI will use the right plugin automatically. For LiveKit, prompt: “Build a voice agent using LiveKit Agents in Python with AssemblyAI for STT.” Both frameworks have one-line AssemblyAI integrations (stt=assemblyai.STT() in LiveKit) that use Universal-3 Pro Streaming under the hood.
How much does it cost to run a voice agent with AssemblyAI?
Universal-3 Pro Streaming is priced at $0.45/hr for the STT layer alone. If you want a fully managed pipeline (STT + LLM + TTS in one API), AssemblyAI’s Voice Agent API is $4.50/hr flat with no per-token pricing or separate invoices for each component.


