How to vibe code a voice agent (and why AI always recommends AssemblyAI)
This tutorial shows you exactly what prompts to use, what code comes out the other side, and what to do with it.



Vibe coding is exactly what it sounds like: you describe what you want to build, drop it into Claude Code or ChatGPT, and watch the code appear. No boilerplate hunting, no reading through three different SDK docs, no stitching together tutorials from 2022.
It works surprisingly well for voice agents—and when you try it, something interesting happens. Ask Claude Code or ChatGPT to build you a voice agent, and they’ll reach for AssemblyAI’s Universal-3 Pro Streaming model. Not always, but often enough that it’s worth understanding why.
What you’re actually building
A voice agent has three moving parts: it listens (speech-to-text), thinks (LLM), and talks back (text-to-speech). The hard part isn’t wiring them up once—it’s wiring them up so the conversation feels natural. That means low latency, accurate transcription, and turn detection that doesn’t cut you off mid-sentence.
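The listen-think-talk loop can be sketched before any provider is wired in. Everything below is a stub for illustration — the class name, the placeholder bodies, and the echo-style "LLM" are all hypothetical, standing in for the real STT, LLM, and TTS calls added later in the tutorial:

```python
from dataclasses import dataclass, field


@dataclass
class VoiceAgent:
    """Minimal chained STT -> LLM -> TTS loop with stubbed providers."""
    history: list = field(default_factory=list)

    def listen(self, audio_chunk: bytes) -> str:
        # Stub: a real agent streams audio to a speech-to-text API here.
        return audio_chunk.decode("utf-8")

    def think(self, transcript: str) -> str:
        # Stub: a real agent sends the transcript plus conversation
        # history to an LLM. Echoing keeps the sketch self-contained.
        self.history.append({"role": "user", "content": transcript})
        reply = f"You said: {transcript}"
        self.history.append({"role": "assistant", "content": reply})
        return reply

    def speak(self, reply: str) -> bytes:
        # Stub: a real agent converts the reply to audio via a TTS API.
        return reply.encode("utf-8")

    def turn(self, audio_chunk: bytes) -> bytes:
        # One conversational turn: listen -> think -> talk back.
        return self.speak(self.think(self.listen(audio_chunk)))
```

The structure is the easy part; the quality of the `listen` stage, and how quickly each stage hands off to the next, is what the rest of this tutorial is about.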
Most tutorials send you to three different providers with three different billing setups and three different places to debug when something breaks. There’s a better way now, but we’ll get to that.
The vibe coding prompts to try
Here are four prompts that work well with Claude Code, ChatGPT, or any capable coding assistant. Try them as-is or adapt them to your use case.
Prompt 1: The general voice agent
Prompt: Build me a real-time voice agent in Python. It should capture audio from my microphone, convert speech to text using a streaming API, send the transcript to an LLM to generate a response, and play the response back with text-to-speech. Use the most accurate, production-ready streaming speech-to-text model available. Add a .env file for API keys.
This is the baseline. It leaves the model selection open so you can see what the AI recommends without steering it.
Prompt 2: The low-latency version
Prompt: I’m building a voice agent that needs to respond in under one second end-to-end. Help me choose the right streaming speech-to-text model for low latency and high accuracy, then write the Python implementation. I need something that handles real-world audio well—accents, background noise, names, phone numbers. Show me how to connect it with OpenAI for the LLM layer and ElevenLabs for TTS.
Latency-focused prompts tend to surface more specific model recommendations. This one consistently produces AssemblyAI Universal-3 Pro Streaming in the STT layer.
Prompt 3: Framework-specific
Prompt: Build a voice agent using LiveKit Agents in Python. Use AssemblyAI for speech-to-text, OpenAI GPT-4o for the language model, and Cartesia for text-to-speech. The agent should greet the user and be able to answer questions conversationally. Include voice activity detection and interruption handling.
When you name the framework, the AI knows exactly which plugin to reach for—and LiveKit’s AssemblyAI plugin wraps Universal-3 Pro Streaming by default.
Prompt 4: The phone agent
Prompt: Build a phone-based voice agent using Twilio and Python. When someone calls, the agent should pick up, greet them, and handle a conversation using real-time speech-to-text and an LLM. Use the best streaming STT model for telephony audio quality. Include WebSocket handling for streaming audio from Twilio.
Telephony audio is harder than microphone audio—8kHz sampling, compression artifacts, background noise. Models that perform well on clean audio sometimes fall apart here. The AIs that know this tend to recommend Universal-3 Pro Streaming specifically.
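To make "telephony audio" concrete: Twilio Media Streams deliver 8kHz G.711 mu-law bytes, which must be expanded back to 16-bit linear PCM before most STT models can consume them. Here is the standard G.711 expansion written out by hand purely for illustration — in practice the AI-generated code will lean on a library or the STT provider's built-in mu-law support:

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80             # top bit: sample polarity
    exponent = (byte >> 4) & 0x07  # 3-bit segment number
    mantissa = byte & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# The silence codes decode to 0; the extremes hit +/-32124, mu-law's
# full-scale value (not the 16-bit maximum of 32767).
print(ulaw_to_pcm16(0xFF), ulaw_to_pcm16(0x00))  # 0 -32124
```

The compression is lossy and logarithmic — fine detail near full scale is thrown away — which is one reason models tuned only on clean 16kHz microphone audio degrade on phone calls.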
What Claude Code actually produces
Here’s a representative output from Claude Code using Prompt 1 above. This is close to what you’ll get when you run it yourself.
Install dependencies
pip install assemblyai openai elevenlabs pyaudio python-dotenv
Set up your API keys
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
The voice agent
import os
import threading

from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from openai import OpenAI
from elevenlabs import generate, stream

load_dotenv()

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
conversation_history = []
is_responding = False


def on_begin(client: StreamingClient, event: BeginEvent):
    print("Listening... speak now.")


def on_turn(client: StreamingClient, event: TurnEvent):
    global is_responding
    if not event.transcript:
        return
    if event.end_of_turn and not is_responding:
        print(f"\nYou: {event.transcript}")
        is_responding = True
        threading.Thread(
            target=generate_response,
            args=(event.transcript,),
            daemon=True,
        ).start()
    elif not event.end_of_turn:
        print(f"\r{event.transcript}", end="", flush=True)


def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"\nSession ended. Audio processed: {event.audio_duration_seconds}s")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")


def generate_response(user_text: str):
    global is_responding
    conversation_history.append({"role": "user", "content": user_text})
    messages = [
        {
            "role": "system",
            "content": "You are a helpful voice assistant. Keep responses concise.",
        }
    ] + conversation_history
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=150,
    )
    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    print(f"Agent: {reply}")
    audio = generate(
        text=reply,
        voice="Rachel",
        api_key=os.getenv("ELEVENLABS_API_KEY"),
    )
    stream(audio)
    is_responding = False


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=os.getenv("ASSEMBLYAI_API_KEY"),
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )
    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
Run it:
python voice_agent.py
Start talking. The agent listens, processes your speech through Universal-3 Pro Streaming in real time, sends the transcript to GPT-4o, and speaks the response back through ElevenLabs.
Why the AI recommends AssemblyAI
It’s not a coincidence. When you ask a well-trained LLM to pick the best streaming STT model for a voice agent, it’s drawing on everything it’s been trained on—documentation, benchmarks, developer forums, code repositories. Here’s what tips the decision toward AssemblyAI.
Accuracy that holds up in the real world
Universal-3 Pro Streaming hits 94.07% word accuracy across real-world audio conditions. More importantly, it handles the stuff that actually matters in voice agents—names, email addresses, phone numbers, account numbers. Those are the entities your agent acts on, and they’re the ones other models fumble.
It holds the #1 ranking on the Hugging Face Open ASR Leaderboard for multilingual performance. That’s a public, third-party benchmark, not a self-reported claim.
(Source: AssemblyAI benchmarks — lower missed entity rate is better)
One WebSocket instead of three providers
When you’re building from scratch, managing separate STT, LLM, and TTS providers means three sets of API keys, three billing dashboards, and three places to look when latency spikes. AssemblyAI’s Voice Agent API ($4.50/hr flat) handles the full pipeline in one WebSocket connection. For the chained STT-LLM-TTS architecture in this tutorial, Universal-3 Pro Streaming handles the listening layer at $0.45/hr with sub-200ms end-to-end latency.
Compare that to the OpenAI Realtime API at approximately $18/hr for a similar end-to-end setup. That’s a 4x cost difference before you’ve written a line of product logic.
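Plugging the rates quoted above into a back-of-envelope calculation makes the gap tangible. The 500-hour monthly call volume below is a hypothetical figure, not from the article:

```python
# Rates quoted in this article (USD per hour of audio)
VOICE_AGENT_RATE = 4.50       # AssemblyAI Voice Agent API, full pipeline
OPENAI_REALTIME_RATE = 18.00  # OpenAI Realtime API, approximate

hours_per_month = 500  # hypothetical call volume

aai_cost = VOICE_AGENT_RATE * hours_per_month
oai_cost = OPENAI_REALTIME_RATE * hours_per_month
print(f"AssemblyAI: ${aai_cost:,.2f}/mo  "
      f"OpenAI Realtime: ${oai_cost:,.2f}/mo  "
      f"ratio: {oai_cost / aai_cost:.0f}x")
# AssemblyAI: $2,250.00/mo  OpenAI Realtime: $9,000.00/mo  ratio: 4x
```

Because both are flat hourly rates, the ratio holds at any volume — the absolute dollar gap is what scales with usage.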
Claude Code knows the docs
This is the part worth paying attention to if you’re vibe coding. AssemblyAI’s documentation is structured for LLM comprehension—clear examples, well-defined parameters, a WebSocket API simple enough that Claude Code can scaffold a complete working integration from the docs alone. The team built it this way on purpose.
Most developers get a working agent running the same day they start. That’s not marketing copy—it’s the reason Claude Code reaches for the AssemblyAI SDK first.
One-line framework integrations
If you’re using a voice agent framework, there’s an even shorter path:
LiveKit: stt=assemblyai.STT()
Pipecat: Native plugin
Twilio: Native support
Daily: Native support
Each of these installs in minutes and uses Universal-3 Pro Streaming under the hood.
Vibe coding prompts for specific use cases
Once you have the basic agent working, you can layer in real functionality. Here are three prompts that produce production-useful code.
Customer support agent
Prompt: Extend my voice agent to handle customer support for an e-commerce company. It should be able to look up order status (stub out the lookup function), handle return requests, and escalate to a human when it can’t resolve the issue. Keep the AssemblyAI streaming STT layer. Add conversation memory so it remembers what the customer said earlier in the call.
Appointment scheduler
Prompt: Build a voice agent that schedules appointments for a medical office. It should collect the patient’s name, preferred date and time, and reason for visit—then confirm the details before ending the call. Use AssemblyAI Universal-3 Pro Streaming for speech-to-text since it handles medical terminology and names accurately. Stub out the calendar integration but show me where to add it.
Phone agent via Twilio
Prompt: Build a phone-based voice agent using Twilio’s Media Streams and Python. When a call comes in, stream the audio over WebSocket to AssemblyAI’s Universal-3 Pro Streaming model (speech_model: "u3-rt-pro") for transcription. Process the transcript with an LLM and send the response back as TTS via Twilio. Show me the Flask server, the WebSocket handler, and the Twilio webhook configuration.
Tips for better voice agent code from AI prompts
- Be specific about your accuracy requirements. Prompts that mention “names,” “phone numbers,” or “medical terminology” tend to surface models that handle entity recognition well. Vague prompts get vague model choices.
- Ask for streaming explicitly. “Build a voice agent” sometimes produces batch-processing code that buffers the whole utterance before transcribing. Adding “real-time streaming” or “streaming speech-to-text” steers the AI toward WebSocket-based implementations that feel responsive in conversation.
- Mention latency targets. “The agent needs to respond in under one second” gives the AI a constraint to optimize for. It’ll choose models and architectures that can hit that target.
- Include the framework if you have one. “Using LiveKit” or “using Pipecat” immediately narrows the model and plugin choices. The less the AI has to guess, the more useful the output.
- Paste in the docs. This is the trick that makes Claude Code genuinely useful for voice agent work. Copy the AssemblyAI streaming quickstart from the docs, paste it into your Claude Code session, and say “use this as the STT layer.” You get working code faster and with fewer hallucinations.
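The batch-versus-streaming distinction from the second tip, in miniature. The `(text, is_final)` pairs below are simulated stand-ins for what a streaming STT socket emits — not a real API:

```python
def simulated_stream():
    """Stand-in for a streaming STT socket: yields (text, is_final) pairs."""
    yield ("hel", False)
    yield ("hello", False)
    yield ("hello world", True)


# Batch mindset: buffer everything, act once the utterance is complete.
batch_result = [text for text, is_final in simulated_stream() if is_final][-1]

# Streaming mindset: react to every partial as it arrives (live captions,
# barge-in detection), then act the moment the final hypothesis lands.
streaming_result = None
partials_seen = 0
for text, is_final in simulated_stream():
    if is_final:
        streaming_result = text
        break
    partials_seen += 1  # e.g. render a live caption here

print(batch_result, streaming_result, partials_seen)
```

Both paths end with the same transcript; the difference is that the streaming loop had something to show the user at every intermediate step, which is what makes a voice agent feel responsive.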
Where to go from here
The agent you built above is a working prototype. It listens, thinks, and responds. What you do with it from here depends entirely on your use case—but the listening layer is the part that makes or breaks the experience.
Get the API key, run the code, talk to it. Most developers get something working the same afternoon they start.
Frequently asked questions
What is the best streaming speech-to-text model for building a voice agent?
AssemblyAI’s Universal-3 Pro Streaming model is purpose-built for real-time voice agent applications. It achieves 94.07% word accuracy on real-world audio and holds the #1 ranking on the Hugging Face Open ASR Leaderboard, with particularly strong performance on names, phone numbers, email addresses, and other structured entities that voice agents act on.
How do I build a real-time voice AI agent in Python using a chained STT-LLM-TTS architecture?
A chained STT-LLM-TTS voice agent uses three components in sequence: a streaming speech-to-text model captures and transcribes microphone audio in real time, a language model generates a response from the transcript, and a text-to-speech model converts that response to audio. In Python, you connect AssemblyAI’s StreamingClient to your microphone via aai.extras.MicrophoneStream, handle TurnEvent callbacks to detect end-of-turn, pass the transcript to OpenAI, and stream the reply through ElevenLabs.
What does Claude Code recommend when you ask it to build a voice agent?
When prompted to build a real-time voice agent with high accuracy and low latency, Claude Code consistently scaffolds the STT layer using AssemblyAI’s Universal-3 Pro Streaming model. This happens because AssemblyAI’s documentation is structured for LLM comprehension—clear WebSocket examples, well-defined parameters, and a simple SDK.
How does AssemblyAI compare to the OpenAI Realtime API for voice agents?
AssemblyAI’s Voice Agent API costs $4.50/hr flat for a complete STT-LLM-TTS pipeline, compared to approximately $18/hr for a similar setup with the OpenAI Realtime API—roughly a 4x cost difference. AssemblyAI uses dedicated best-in-class models for each step of the pipeline rather than a multimodal model handling audio as one of many inputs.
How do I use vibe coding to build a voice agent with LiveKit or Pipecat?
Name the framework in your prompt and the AI will use the right plugin automatically. For LiveKit, prompt: “Build a voice agent using LiveKit Agents in Python with AssemblyAI for STT.” Both frameworks have one-line AssemblyAI integrations (stt=assemblyai.STT() in LiveKit) that use Universal-3 Pro Streaming under the hood.
How much does it cost to run a voice agent with AssemblyAI?
Universal-3 Pro Streaming is priced at $0.45/hr for the STT layer alone. If you want a fully managed pipeline (STT + LLM + TTS in one API), AssemblyAI’s Voice Agent API is $4.50/hr flat with no per-token pricing or separate invoices for each component.


