August 7, 2025

Build a real-time AI voice bot using Python, AssemblyAI, and ElevenLabs

Learn how to build a real-time AI voice bot using Python, AssemblyAI's Universal-Streaming speech-to-text, OpenAI, and ElevenLabs.

Smitha Kolan
Developer Educator
Reviewed by
Ryan O'Connor
Senior Developer Educator

AI voice bots are rapidly transforming how businesses handle customer interactions. For developers, this presents a significant opportunity to build intelligent, scalable solutions that improve both the efficiency and the user experience of those interactions.

This tutorial will guide you through building an AI-powered dental assistant in Python, using AssemblyAI for speech-to-text, OpenAI for response generation, and ElevenLabs for voice synthesis.

The complete code for this tutorial can also be found in this GitHub repository.

Key components of the AI voice bot

There are three major components of an AI voice bot:

  • Universal-Streaming Speech-to-Text: AssemblyAI's Universal-Streaming Speech-to-Text API enables real-time transcription with high accuracy.
  • Natural Language Processing (NLP): OpenAI's language models generate intelligent, context-aware responses.
  • Voice Synthesis: ElevenLabs synthesizes text responses into natural-sounding audio, completing the conversational loop.
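
Conceptually, the bot runs these three components in a loop. The sketch below is illustrative pseudocode only, not the final implementation (which is event-driven):

while conversation_active:
    user_text = transcribe_speech()          # AssemblyAI Universal-Streaming
    reply_text = generate_reply(user_text)   # OpenAI chat completion
    speak(reply_text)                        # ElevenLabs text-to-speech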

The steps below demonstrate how to build the AI voice bot, including code snippets and an overview of how each component interacts to form a cohesive voice bot.

Step 1: Install required Python libraries

To begin, run the following commands in your terminal to install the necessary libraries (the `brew` command applies to macOS; a Linux alternative is sketched below):

brew install portaudio mpv
pip install "assemblyai[extras]" elevenlabs openai 

These libraries will power the core functionalities of the AI voice bot: streaming transcription, response generation, and speech synthesis.

Step 2: Import libraries and set up credentials

In this step, import the libraries needed for this project and set up API credentials for AssemblyAI, OpenAI, and ElevenLabs. Start by signing up for AssemblyAI's API and setting up billing to access Universal-Streaming transcription. As a first-time user, you'll get $50 in free credits that can be used for both asynchronous and real-time transcription, audio intelligence, and other features.
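
The code below hardcodes the keys as string placeholders for simplicity. In a real project, load them from environment variables instead; a minimal sketch using python-dotenv (assumes a local .env file containing ASSEMBLYAI_API_KEY, OPENAI_API_KEY, and ELEVENLABS_API_KEY entries):

# pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # Reads key=value pairs from a local .env file

assemblyai_api_key = os.environ["ASSEMBLYAI_API_KEY"]
openai_api_key = os.environ["OPENAI_API_KEY"]
elevenlabs_api_key = os.environ["ELEVENLABS_API_KEY"]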

Create a new file in the project directory called main.py and add the following code, making sure to replace the string placeholders for the AssemblyAI, OpenAI, and ElevenLabs API keys with your personal key values.

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TerminationEvent,
    TurnEvent,
)
from elevenlabs import ElevenLabs, play, VoiceSettings
from openai import OpenAI
import logging
from typing import Type

# Configuration
DEBUG_MODE = False  # Set to True to see all logs and debug messages

# Configure logging based on DEBUG_MODE
if DEBUG_MODE:
    logging.basicConfig(level=logging.INFO)
else:
    logging.basicConfig(level=logging.CRITICAL)  # Suppress all logs except critical
    # Also suppress HTTP logs from libraries for clean conversation output
    logging.getLogger("httpx").setLevel(logging.CRITICAL)
    logging.getLogger("httpcore").setLevel(logging.CRITICAL)
    logging.getLogger("openai").setLevel(logging.CRITICAL)
    logging.getLogger("elevenlabs").setLevel(logging.CRITICAL)

logger = logging.getLogger(__name__)


class AI_Assistant:
    def __init__(self):
        # API Keys - Use python-dotenv or similar to manage these securely in production.
        self.assemblyai_api_key = "ASSEMBLYAI_API_KEY"  # Replace with your actual AssemblyAI API key
        self.openai_client = OpenAI(api_key="OPEN_AI_API_KEY")  # Replace with your actual OpenAI API key
        self.elevenlabs_api_key = "ELEVEN_LABS_API_KEY"  # Replace with your actual ElevenLabs API key

        # Initialize ElevenLabs client
        self.elevenlabs_client = ElevenLabs(api_key=self.elevenlabs_api_key)

        self.client = None
        self.microphone_stream = None

        # Prompt
        self.full_transcript = [
            {"role": "system", "content": "You are a receptionist at a dental clinic. Be resourceful and efficient."},
        ]

        # Track conversation state for latency optimization
        self.is_processing = False
        self.running_transcript = ""  # Accumulates finalized transcripts
        self.latest_partial = ""      # Current partial transcript
        self.should_process_on_next_final = False  # Flag to process when we see end_of_turn

        # Store reference to AI assistant for use in callbacks
        global ai_assistant_instance
        ai_assistant_instance = self

Step 3: Universal-Streaming transcription with AssemblyAI

At the core of the AI voice bot is Universal-Streaming transcription. AssemblyAI's StreamingClient handles the audio stream and listens for the different types of events emitted during the streaming session. In this step, you'll set up a real-time streaming client and callback functions that handle the different events.

First, create a method called `start_transcription` in the `AI_Assistant` class. It creates the streaming client, sets up the event handlers, and starts streaming from the microphone:

    def start_transcription(self):
        # Create the streaming client
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=self.assemblyai_api_key,
                api_host="streaming.assemblyai.com",
            )
        )

        # Set up event handlers
        self.client.on(StreamingEvents.Begin, on_begin)
        self.client.on(StreamingEvents.Turn, on_turn)
        self.client.on(StreamingEvents.Termination, on_terminated)
        self.client.on(StreamingEvents.Error, on_error)

        # Connect with parameters
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                format_turns=False,  # Disabled for lowest latency
                end_of_turn_confidence_threshold=0.7,
                min_end_of_turn_silence_when_confident=160,
                max_turn_silence=2400,
            )
        )

        # Start streaming from microphone
        self.microphone_stream = aai.extras.MicrophoneStream(sample_rate=16000)
        self.client.stream(self.microphone_stream)

Similarly, we'll implement a `stop_transcription` method to cleanly disconnect from the streaming service. While it's possible to keep the stream open during LLM processing, this would require complex conversation state management and interrupt handling. For this tutorial, we'll use a simpler approach: stop transcription while generating the AI response, then restart it when ready to listen again.

    def stop_transcription(self):
        if self.client:
            self.client.disconnect(terminate=True)
            self.client = None
        if self.microphone_stream:
            self.microphone_stream = None

Now let's create a function called `on_turn` to handle transcripts. Like the other event handler functions, it is defined outside the `AI_Assistant` class. With Universal-Streaming (v3), the streaming client returns immutable transcripts via Turn objects. Each Turn contains a `transcript` field with only finalized words, an `end_of_turn` boolean indicating whether the speaker has finished their turn, and other metadata like `end_of_turn_confidence` and individual `words` with timing information. Unlike v2's partial/final transcript model, Universal-Streaming provides a continuous stream of text that builds incrementally without overwriting previous content.
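
For example, within a single spoken turn, successive Turn events might carry transcripts like the following (illustrative values only, not real API output; with `format_turns=False` the text arrives lowercase and unpunctuated):

event.transcript = "i would like to"                      end_of_turn=False
event.transcript = "i would like to book an"              end_of_turn=False
event.transcript = "i would like to book an appointment"  end_of_turn=True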

The `on_turn` function accumulates text in `latest_partial` and processes it when `end_of_turn` is True. This approach achieves the lowest possible latency by immediately sending the accumulated transcript to the LLM as soon as the user finishes speaking, without waiting for formatted text. The function also handles duplicate processing prevention by tracking whether the next end-of-turn event should be ignored (since it corresponds to text already sent to the LLM).

ai_assistant_instance = None


def on_turn(self: Type[StreamingClient], event: TurnEvent):
    if not event.transcript or ai_assistant_instance.is_processing:
        return

    # Always update latest partial and show real-time transcription
    ai_assistant_instance.latest_partial = event.transcript
    print(f"\r{event.transcript}", end='', flush=True)

    # Check if this is marked as end of turn
    if event.end_of_turn:
        if ai_assistant_instance.should_process_on_next_final:
            # This is the final for the partial we already processed - ignore it
            ai_assistant_instance.should_process_on_next_final = False
            # Clear latest_partial since it was already included in the LLM call
            ai_assistant_instance.latest_partial = ""
        else:
            # This is a new final - process immediately for lowest latency
            ai_assistant_instance.should_process_on_next_final = True
            ai_assistant_instance.process_turn()

Create a method called `process_turn` that is called when AssemblyAI detects the user has finished speaking (i.e. when `end_of_turn` is True). It combines any finalized text from `running_transcript` with the current `latest_partial`, clears the running transcript buffer, and sends the complete text to the LLM for processing. This ensures the lowest possible latency by processing text immediately when the user finishes speaking. If you're only interested in processing final transcripts without the latency optimization, you can simplify this by sending `event.transcript` directly when `end_of_turn` is True (see the sketch after the next code block).

    def process_turn(self):
        """Process the accumulated transcript following AssemblyAI's recommended strategy"""
        # Combine running transcript with latest partial
        complete_text = self.running_transcript
        if self.latest_partial:
            if complete_text:
                complete_text += " " + self.latest_partial
            else:
                complete_text = self.latest_partial

        # Clear running_transcript
        self.running_transcript = ""
        # Note: We keep latest_partial as it will become final later

        # Process with LLM
        if complete_text.strip():
            self.generate_ai_response(complete_text)
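
As noted above, if you don't need the latency optimization, a simplified `on_turn` handler can skip the partial-tracking state entirely and send the finalized transcript straight to the LLM. A minimal sketch (a hypothetical replacement for the `on_turn` above, not part of the tutorial's final code):

def on_turn_simple(self: Type[StreamingClient], event: TurnEvent):
    # Ignore empty transcripts and events that arrive mid-processing
    if not event.transcript or ai_assistant_instance.is_processing:
        return
    print(f"\r{event.transcript}", end='', flush=True)
    if event.end_of_turn:
        # Send the finalized transcript directly to the LLM
        ai_assistant_instance.generate_ai_response(event.transcript)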

In the main.py file, add the following functions outside the `AI_Assistant` class to handle the other Universal-Streaming events:

def on_begin(self: Type[StreamingClient], event: BeginEvent):
    if DEBUG_MODE:
        logger.info(f"Session started: {event.id}")
        print(f"Session ID: {event.id}")
    print("\n[Listening... Start speaking]")


def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    if DEBUG_MODE:
        logger.info(f"Session terminated: {event.audio_duration_seconds} seconds of audio processed")


def on_error(self: Type[StreamingClient], error: StreamingError):
    if DEBUG_MODE:
        print(f"An error occurred: {error}")
        logger.error(f"Streaming error: {error}")

Step 4: Generate responses with OpenAI

With real-time transcription in place, use OpenAI's API to generate context-aware responses based on the conversation. Add the following method to the same `AI_Assistant` class:

    def generate_ai_response(self, transcript_text):
        self.is_processing = True
        self.stop_transcription()

        self.full_transcript.append({"role": "user", "content": transcript_text})
        print(f"\nPatient: {transcript_text}\n")

        # Stream the response from OpenAI
        stream = self.openai_client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=self.full_transcript,
            stream=True  # Enable streaming
        )

        # Print AI response header
        print("AI Receptionist: ", end="", flush=True)

        # Collect the full response while streaming
        ai_response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                ai_response += content
                print(content, end="", flush=True)  # Print as it streams

        print()  # New line after streaming completes

        # Generate audio with the complete response
        self.generate_audio(ai_response)

        self.is_processing = False
        # Reset state before restarting
        self.running_transcript = ""
        self.latest_partial = ""
        self.should_process_on_next_final = False

        self.start_transcription()

This method is called by `process_turn` when the user has finished speaking, and it handles the complete conversation flow: it first stops transcription to prevent interruptions, then appends the user's message to the conversation history and sends it to OpenAI for response generation. It streams OpenAI's response in real time, printing each chunk as it arrives, then passes the complete response to ElevenLabs for voice synthesis. After the audio plays, it resets the transcription state and restarts listening for the next user input.

Step 5: Voice synthesis with ElevenLabs

Once OpenAI generates a response based on the user's input, it is passed to the ElevenLabs API for voice synthesis, which converts the text into a natural-sounding audio stream that is played back to the user. In the same main.py file, add the following method to the `AI_Assistant` class:

    def generate_audio(self, text):
        self.full_transcript.append({"role": "assistant", "content": text})

        # Generate audio using the new ElevenLabs API
        audio = self.elevenlabs_client.text_to_speech.convert(
            text=text,
            voice_id="pNInz6obpgDQGcFmaJgB",
            output_format="mp3_22050_32",
            model_id="eleven_turbo_v2_5",
            voice_settings=VoiceSettings(
                stability=0.0,
                similarity_boost=1.0,
                style=0.0,
                use_speaker_boost=True,
                speed=1.0,
            ),
        )

        # Play the audio
        play(audio)

This method first appends the response from OpenAI to the full_transcript list, which keeps track of the full conversation. It then sends the latest response to ElevenLabs for text-to-speech using the `convert` method, and finally plays the synthesized audio with the `play` helper.
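
The `voice_id` used above is one of ElevenLabs' prebuilt voices. To try a different voice, you can list the voices available to your account; a hedged sketch (the listing method varies by SDK version: `voices.get_all()` on 1.x releases, `voices.search()` on newer ones):

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="ELEVEN_LABS_API_KEY")  # Replace with your actual key

# Print the name and ID of each voice available to your account
for voice in client.voices.get_all().voices:
    print(voice.name, voice.voice_id)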

Step 6: Finalizing the AI voice bot

In this step, add the following code to the end of the main.py file outside of the `AI_Assistant` class:

if __name__ == "__main__":
    greeting = "Thank you for calling Vancouver dental clinic. My
name is Sonny, how may I assist you?"
    print(f"AI Receptionist: {greeting}")  # Print the initial
greeting
    ai_assistant = AI_Assistant()
    ai_assistant.generate_audio(greeting)
    
    try:
        ai_assistant.start_transcription()
        # Keep the program running
        if DEBUG_MODE:
            input("Press Enter to stop...\n")
        else:
            input()  # Silent input wait
    except KeyboardInterrupt:
        if DEBUG_MODE:
            print("\nStopping...")
    finally:
        ai_assistant.stop_transcription()

This completes the code for the AI voice bot, enabling it to handle real-time conversations, transcribing and responding to users in a natural, human-like voice. To start it, run the file in your terminal with the following command:

python main.py

Once the application starts, you'll hear a greeting from the AI voice bot; respond by speaking, and the conversation begins.

Conclusion

By integrating AssemblyAI, OpenAI, and ElevenLabs, this tutorial demonstrates how to build a powerful AI voice bot capable of managing real-time conversations. This application is ideal for call centers, customer support, and virtual receptionists, where human-like interaction is a key part of user experience.

The complete code for this tutorial can also be found in this GitHub repository.

To get started building voice agents with AssemblyAI's Universal-Streaming Speech-to-Text API, check out the docs to learn more.
