April 8, 2026

How to build a voice agent with Python in 5 minutes

Python voice agent tutorial: build a real-time app with AssemblyAI speech-to-text, GPT-4, and ElevenLabs voice output in 5 minutes with clear Python code examples.

Kelsey Foster
Growth

This tutorial shows you how to build a complete voice agent that listens, thinks, and responds naturally using Python. You’ll create a streaming application that processes speech in real-time, generates intelligent responses, and speaks back to users—all in a single Python file of about 150 lines.

The voice agent combines three APIs: AssemblyAI’s Universal-3 Pro Streaming model for speech-to-text, OpenAI’s GPT-4 for conversational AI, and ElevenLabs for natural voice synthesis. Each component streams data to minimize response delays and create smooth, human-like conversations.

What you’ll need to get started

You need Python 3.9 or higher, three API keys, and a computer with a microphone and speakers. The setup takes about 2 minutes once you have everything ready.

Install Python dependencies

Open your terminal and run this command to install everything you need:

pip install "assemblyai>=1.0.0" openai "elevenlabs>=1.0.0" pyaudio python-dotenv

Here’s what each package does for your voice agent:

  • assemblyai: Handles real-time speech recognition with Universal-3 Pro Streaming
  • openai: Connects to GPT models for smart responses
  • elevenlabs: Creates natural-sounding voices
  • pyaudio: Provides access to your microphone
  • python-dotenv: Loads your API keys from a .env file

Configure your API keys

Create a .env file in your project directory with your API keys:

ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here

Never share this file or commit it to version control. Add .env to your .gitignore file to protect your API keys.
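It also helps to fail fast at startup if a key is missing, rather than getting a confusing authentication error mid-conversation. Here’s a minimal sketch you could call after `load_dotenv()`; the `require_env` helper is just an illustration, not part of any SDK:

```python
import os

def require_env(*names):
    """Return the named environment variables, raising early if any is missing."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing keys in your .env file: {', '.join(missing)}")
    return tuple(os.getenv(name) for name in names)

# Example usage at startup, after load_dotenv():
# aai_key, openai_key, el_key = require_env(
#     "ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"
# )
```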

What are the components of a voice agent?

A voice agent is a program that talks to you like a human using three connected parts. These parts work together to create conversations: speech-to-text converts your voice into text, a language model thinks about what you said and creates a response, and text-to-speech turns that response back into spoken words.

This pipeline needs to work in real-time to feel natural. When you speak to Siri or Alexa, you expect quick responses—not awkward pauses that break the conversational flow. Here’s what each component does in your voice agent:

| Component | Role | Why streaming matters | Example |
| --- | --- | --- | --- |
| AssemblyAI | Speech-to-text | Transcribes audio as it arrives, so the LLM can start responding sooner | Converts "what's the weather" to text before you finish speaking |
| OpenAI GPT-4 | Language model | Generates a response token by token, so text-to-speech can begin immediately | Starts answering while still composing the full response |
| ElevenLabs | Text-to-speech | Plays audio while more is being generated | Speaks the first sentence while generating the second |

The difference between good and bad voice agents comes down to speed. Batch processing—where each step waits for the previous one to finish completely—creates those robotic pauses that make conversations feel unnatural.

AssemblyAI’s Universal-3 Pro Streaming model solves this by processing speech as it happens. You get accurate transcription with minimal delay, making conversations feel smooth and responsive.
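To make the overlap concrete, here’s a toy sketch in plain Python (no APIs involved) of the sentence-chunking idea: tokens arrive one at a time, and each completed sentence can be handed to text-to-speech immediately instead of waiting for the full reply. The function names are illustrative only:

```python
def stream_tokens(text):
    """Simulate an LLM streaming its reply one token (word) at a time."""
    for word in text.split():
        yield word + " "

def speak_while_streaming(token_stream):
    """Buffer streamed tokens into sentences, handing each to TTS as soon as it completes."""
    buffer = ""
    spoken = []
    for token in token_stream:
        buffer += token
        # A sentence is ready once we see terminal punctuation.
        if buffer.rstrip().endswith((".", "?", "!")):
            spoken.append(buffer.strip())  # in a real agent: send this to TTS now
            buffer = ""
    if buffer.strip():
        spoken.append(buffer.strip())  # flush any trailing fragment
    return spoken
```

With batch processing, nothing would be spoken until the entire reply finished; with this pattern, the first sentence plays while later ones are still being generated.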

Set up speech-to-text with AssemblyAI

Speech recognition forms the foundation of your voice agent. AssemblyAI’s Universal-3 Pro Streaming API listens to your microphone and converts speech to text in real-time. The SDK handles all WebSocket complexity automatically—no manual connection management required.

Create a new file called voice_agent.py and add this code:

import assemblyai as aai
 
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
 
from dotenv import load_dotenv
import os
 
load_dotenv()
 
 
class VoiceAgent:
 
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
 
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
 
        self.is_processing = False
 
    def on_begin(self, event: BeginEvent):
        print("Listening... Start speaking!")
 
    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
 
        if turn.end_of_turn:
            print(f"You said: {turn.transcript}")
            # AI processing added in next section
        else:
            print(f"Hearing: {turn.transcript}", end="\r")
 
    def on_error(self, error: StreamingError):
        print(f"Error: {error}")
 
    def on_terminated(self, event: TerminationEvent):
        print("Connection closed")

This code creates a real-time transcription system that gives you two types of output. Partial transcripts (where end_of_turn is False) show you what the system is hearing as you speak, and final transcripts (where end_of_turn is True) provide the complete sentence when you pause.

Universal-3 Pro uses punctuation-based turn detection—it ends a turn when it detects terminal punctuation (. ? !) after a natural pause. This means you don’t need to press buttons or give special commands—just speak naturally and pause.

Try Speech-to-Text in Playground

Validate accuracy and formatting on your own audio before wiring up the full agent. No code required to start.

Open playground

Connect the language model

The language model is the brain of your voice agent—it understands what you said and decides how to respond. OpenAI’s GPT-4 streams responses token by token, so ElevenLabs can begin speaking before the full response is generated.

Add this OpenAI integration to your VoiceAgent class:

from openai import OpenAI
 
 
class VoiceAgent:
 
    def __init__(self):
        # Previous code...
 
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
 
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
 
Keep responses short and conversational.
 
Talk like you're having a normal conversation with someone."""}
        ]
 
    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
 
        response_text = ""
 
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
 
        print("Assistant: ", end="")
 
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
 
        print()
 
        self.conversation.append({"role": "assistant", "content": response_text})
 
        self.speak(response_text)

The conversation list keeps the full chat history, so your agent remembers context across turns. The system prompt instructs GPT-4 to keep responses short and conversational—long answers feel awkward when spoken aloud.
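One caveat: because the history grows with every turn, a long session will eventually hit the model’s context limit and inflate per-request cost. A simple mitigation, sketched below, keeps the system prompt plus only the most recent exchanges; the `trim_history` helper is illustrative, not part of the OpenAI SDK:

```python
def trim_history(conversation, max_turns=10):
    """Keep the system prompt plus the last `max_turns` exchanges.

    Each exchange is one user message plus one assistant message, so at most
    2 * max_turns messages are retained after the system prompt.
    """
    system_prompt, rest = conversation[:1], conversation[1:]
    return system_prompt + rest[-max_turns * 2:]
```

You could call this right before each request, e.g. `self.conversation = trim_history(self.conversation)` at the top of `process_with_llm`.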

Add text-to-speech output

Text-to-speech completes your voice agent by converting AI responses into natural-sounding speech. ElevenLabs provides high-quality voice synthesis that starts playing audio before the entire response finishes generating.

Add voice synthesis to your VoiceAgent class:

from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
import threading
 
 
class VoiceAgent:
 
    def __init__(self):
        # Previous code...
 
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
 
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"  # Sarah voice
 
    def speak(self, text):
 
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
 
                play_stream(audio_stream)
 
            except Exception as e:
                print(f"Voice error: {e}")
 
        thread = threading.Thread(target=generate_and_play, daemon=True)
        thread.start()

ElevenLabs offers different voices with distinct personalities:

  • Sarah (EXAVITQu4vr4xnSDxMaL): Clear, professional female voice (used in this example)
  • Josh (TxGEqnHWrfWFTfGW9XjX): Warm, friendly male voice
  • Elli (MF3mGyEYCl7XYWbV9V6O): Young, energetic female voice

The background thread prevents voice synthesis from freezing your program. While audio generates and plays, your voice agent continues listening for the next input.
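One thing this simple version doesn’t handle is overlap: if a new reply arrives before the previous one finishes playing, two audio streams can play at once. A small sketch of one way to serialize playback is below; the `SpeechManager` class is illustrative, not part of the ElevenLabs SDK:

```python
import threading

class SpeechManager:
    """Track the current playback thread so replies play one at a time."""

    def __init__(self):
        self._thread = None

    def speak(self, play_fn, text):
        # Wait for any previous utterance to finish so replies don't overlap.
        if self._thread is not None and self._thread.is_alive():
            self._thread.join()
        # daemon=True so playback never blocks program exit.
        self._thread = threading.Thread(target=play_fn, args=(text,), daemon=True)
        self._thread.start()
```

Here `play_fn` stands in for whatever actually synthesizes and plays the text (like the `generate_and_play` closure above). A fancier agent might cancel the old playback instead of waiting, to support barge-in.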

Build the complete voice agent

Here’s your complete voice_agent.py file:

import assemblyai as aai
import os
import sys
import threading
 
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
 
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
from openai import OpenAI
from dotenv import load_dotenv
 
load_dotenv()
 
 
class VoiceAgent:
 
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )
 
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
 
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
 
        self.is_processing = False
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"
 
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
 
Keep responses short and conversational.
 
Talk like you're having a normal conversation with someone."""}
        ]
 
    def on_begin(self, event: BeginEvent):
        print("\nVoice Agent Ready! Start speaking...\n")
 
    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
 
        if turn.end_of_turn:
            print("\r" + " " * 50 + "\r", end="")
            print(f"You: {turn.transcript}")
 
            if not self.is_processing:
                self.is_processing = True
                self.process_with_llm(turn.transcript)
                self.is_processing = False
        else:
            print(f"Listening: {turn.transcript}...", end="\r")
 
    def on_error(self, error: StreamingError):
        print(f"\nError: {error}\n")
 
    def on_terminated(self, event: TerminationEvent):
        print("\nVoice Agent stopped\n")
 
    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
 
        response_text = ""
 
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )
 
        print("Agent: ", end="")
 
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
 
        print()
 
        self.conversation.append({"role": "assistant", "content": response_text})
 
        self.speak(response_text)
 
    def speak(self, text):
 
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
 
                play_stream(audio_stream)
 
            except Exception as e:
                print(f"Voice error: {e}")
 
        voice_thread = threading.Thread(target=generate_and_play)
        voice_thread.daemon = True
        voice_thread.start()
 
    def start(self):
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                speech_model="u3-rt-pro",
            )
        )
 
        try:
            self.client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
        except KeyboardInterrupt:
            self.stop()
 
    def stop(self):
        print("\nStopping voice agent...")
        self.client.disconnect(terminate=True)
        sys.exit(0)
 
 
if __name__ == "__main__":
    agent = VoiceAgent()
    agent.start()

This complete implementation includes error handling, conversation memory, and clean shutdown. The agent remembers what you’ve talked about during each session, enabling natural back-and-forth conversations.

Run your voice agent

Start your voice agent with this command:

python voice_agent.py

When you see "Voice Agent Ready! Start speaking..." the agent is listening. Speak normally into your microphone and the agent responds with both text and voice.

Try these conversation starters to test your agent:

  • "What’s the weather like today?"
  • "Tell me a quick joke"
  • "Help me plan dinner"
  • "Explain how WiFi works simply"

Common problems and solutions:

  • No microphone input: Check system permissions and microphone settings
  • Slow responses: Test your internet connection and consider using gpt-3.5-turbo for faster processing
  • Voice cuts off: Add a small delay after TTS playback or check your ElevenLabs API quota

Final words

You’ve built a complete streaming voice agent that processes speech in real-time and responds with natural conversation. This implementation combines speech recognition, AI processing, and voice synthesis into a single program that demonstrates the power of modern Voice AI models.

AssemblyAI’s Universal-3 Pro Streaming model makes this possible by providing the accuracy and speed that voice agents require. The SDK handles complex WebSocket connections and audio processing, letting you focus on building your application instead of managing low-level networking code.

To go further, explore the Universal-3 Pro Streaming docs for advanced features like keyterm prompting, speaker diarization, and real-time configuration updates—all without restarting your agent.

Start building with AssemblyAI

Get your API key and access the Universal-3 Pro Streaming model used in this tutorial. Spin up real-time transcription in minutes with the SDK.

Get API key

Frequently asked questions

Do I need WebSocket knowledge to build this voice agent?

No. The AssemblyAI Python SDK handles the WebSocket connection, reconnection logic, and audio streaming protocol automatically. You write event handlers and the SDK takes care of the rest.

How much does running this voice agent cost per hour?

This voice agent costs approximately $0.50–$1.00 per hour of conversation across all three services. AssemblyAI charges about $0.45/hr for Universal-3 Pro Streaming transcription, OpenAI costs roughly $0.30/hour for GPT-4 responses, and ElevenLabs runs about $0.20/hour for voice synthesis.

Can I replace AssemblyAI with a different speech-to-text service?

While technically possible, switching providers requires implementing WebSocket handling, audio streaming protocols, turn detection logic, and connection management yourself. You’d lose AssemblyAI’s built-in punctuation-based turn detection and the SDK simplicity that enables 5-minute implementation.

Can I use this pattern inside a framework like Pipecat or LiveKit?

Yes — AssemblyAI has first-party integrations for Pipecat, LiveKit, Vapi, and Twilio. These frameworks handle telephony, orchestration, and turn-taking so you can focus on your agent’s logic.

Does this work with languages other than English?

Yes. You can configure the speech model for other languages using StreamingParameters, for example: StreamingParameters(speech_model="u3-rt-pro", sample_rate=16_000, prompt="Transcribe Spanish."). For a full list of supported languages, see the Supported Languages page.
