Build a real-time AI voice bot using Python, AssemblyAI, and ElevenLabs
Learn how to build a real-time AI voice bot using Python, AssemblyAI's Universal-Streaming speech-to-text, OpenAI, and ElevenLabs.



AI voice bots are rapidly transforming how businesses handle customer interactions. For developers, this presents a significant opportunity to build intelligent, scalable solutions that improve the efficiency and user experience of customer interactions.
This written tutorial will guide you through the process of building an AI-powered dental assistant in Python, using AssemblyAI for speech-to-text, OpenAI for generating responses, and ElevenLabs for voice synthesis.
The complete code for this tutorial can also be found in this GitHub repository.
Key components of the AI voice bot
There are three major components of an AI voice bot:
- Universal-Streaming Speech-to-Text: AssemblyAI's Universal-Streaming Speech-to-Text API enables real-time transcription with high accuracy.
- Natural Language Processing (NLP): OpenAI's language models generate intelligent, context-aware responses.
- Voice Synthesis: ElevenLabs synthesizes text responses into natural-sounding audio, completing the conversational loop.
The steps below walk through building the AI voice bot, with code snippets and an overview of how each component interacts to form a cohesive whole.
Step 1: Install required Python libraries
To begin, run the following commands in your terminal to install the necessary libraries (the `brew` command assumes macOS; on other platforms, install portaudio and mpv with your system's package manager):
brew install portaudio mpv
pip install "assemblyai[extras]" elevenlabs openai
These libraries will power the core functionalities of the AI voice bot: streaming transcription, response generation, and speech synthesis.
Step 2: Import libraries and set up credentials
In this step, import the libraries needed for the project and set up API credentials for AssemblyAI, OpenAI, and ElevenLabs. Sign up for AssemblyAI's API and set up billing to access Universal-Streaming transcription. As a first-time user, you'll get $50 in free credits that can be used for asynchronous and real-time transcription, audio intelligence, and other features.
Create a new file in the project directory called main.py and add the following code, making sure to replace the string placeholders for the AssemblyAI, OpenAI, and ElevenLabs API keys with your personal key values.
import assemblyai as aai
from assemblyai.streaming.v3 import (
BeginEvent,
StreamingClient,
StreamingClientOptions,
StreamingError,
StreamingEvents,
StreamingParameters,
TerminationEvent,
TurnEvent,
)
from elevenlabs import ElevenLabs, play, VoiceSettings
from openai import OpenAI
import logging
from typing import Type
# Configuration
DEBUG_MODE = False  # Set to True to see all logs and debug messages

# Configure logging based on DEBUG_MODE
if DEBUG_MODE:
    logging.basicConfig(level=logging.INFO)
else:
    logging.basicConfig(level=logging.CRITICAL)  # Suppress all logs except critical

# Also suppress HTTP logs from libraries for clean conversation output
logging.getLogger("httpx").setLevel(logging.CRITICAL)
logging.getLogger("httpcore").setLevel(logging.CRITICAL)
logging.getLogger("openai").setLevel(logging.CRITICAL)
logging.getLogger("elevenlabs").setLevel(logging.CRITICAL)

logger = logging.getLogger(__name__)
class AI_Assistant:
    def __init__(self):
        # API Keys - Use python-dotenv or similar to manage these securely in production.
        self.assemblyai_api_key = "ASSEMBLYAI_API_KEY"  # Replace with your actual AssemblyAI API key
        self.openai_client = OpenAI(api_key="OPEN_AI_API_KEY")  # Replace with your actual OpenAI API key
        self.elevenlabs_api_key = "ELEVEN_LABS_API_KEY"  # Replace with your actual ElevenLabs API key

        # Initialize ElevenLabs client
        self.elevenlabs_client = ElevenLabs(api_key=self.elevenlabs_api_key)

        self.client = None
        self.microphone_stream = None

        # Prompt
        self.full_transcript = [
            {"role": "system", "content": "You are a receptionist at a dental clinic. Be resourceful and efficient."},
        ]

        # Track conversation state for latency optimization
        self.is_processing = False
        self.running_transcript = ""  # Accumulates finalized transcripts
        self.latest_partial = ""  # Current partial transcript
        self.should_process_on_next_final = False  # Flag to process when we see end_of_turn

        # Store reference to AI assistant for use in callbacks
        global ai_assistant_instance
        ai_assistant_instance = self
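The placeholder strings above are fine for experimenting, but as the comment suggests, production code should load keys from the environment rather than hardcoding them. Here is a minimal, optional sketch assuming the python-dotenv package and a local .env file; the variable names are just examples:

import os
from dotenv import load_dotenv

load_dotenv()  # Read key=value pairs from a local .env file into environment variables

# Example variable names - adjust to whatever you store in your .env file
assemblyai_api_key = os.getenv("ASSEMBLYAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
elevenlabs_api_key = os.getenv("ELEVENLABS_API_KEY")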
Step 3: Universal-Streaming transcription with AssemblyAI
At the core of the AI voice bot is Universal-Streaming transcription. AssemblyAI's StreamingClient handles the audio stream and listens for the different types of events emitted during the streaming session. In this step, you'll set up a real-time streaming client and callback functions that handle the different events.
First, create a function called `start_transcription` to create the streaming client, set up event handlers, and start streaming from a microphone:
def start_transcription(self):
    # Create the streaming client
    self.client = StreamingClient(
        StreamingClientOptions(
            api_key=self.assemblyai_api_key,
            api_host="streaming.assemblyai.com",
        )
    )

    # Set up event handlers
    self.client.on(StreamingEvents.Begin, on_begin)
    self.client.on(StreamingEvents.Turn, on_turn)
    self.client.on(StreamingEvents.Termination, on_terminated)
    self.client.on(StreamingEvents.Error, on_error)

    # Connect with parameters
    self.client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=False,  # Disabled for lowest latency
            end_of_turn_confidence_threshold=0.7,
            min_end_of_turn_silence_when_confident=160,
            max_turn_silence=2400,
        )
    )

    # Start streaming from microphone
    self.microphone_stream = aai.extras.MicrophoneStream(sample_rate=16000)
    self.client.stream(self.microphone_stream)
Similarly, we'll implement a stop_transcription function to cleanly disconnect from the streaming service. While it's possible to keep the stream open during LLM processing, this would require complex conversation state management and interrupt handling. For this tutorial, we'll use a simpler approach: stop transcription while generating the AI response, then restart it when ready to listen again.
def stop_transcription(self):
    if self.client:
        self.client.disconnect(terminate=True)
        self.client = None
    if self.microphone_stream:
        self.microphone_stream = None
Now let’s create a function to handle transcripts called `on_turn`. Like the other event handler functions, this will also be defined outside the `AI_Assistant` class. With Universal-Streaming (v3), the streaming client returns immutable transcripts via Turn objects. Each Turn contains a `transcript` field with only finalized words, an `end_of_turn` boolean indicating if the speaker has finished their turn, and other metadata like `end_of_turn_confidence` and individual `words` with timing information. Unlike v2's partial/final transcript model, Universal-Streaming provides a continuous stream of text that builds incrementally without overwriting previous content.
The `on_turn` function accumulates text in `latest_partial` and processes it when `end_of_turn` is True. This approach achieves the lowest possible latency by immediately sending the accumulated transcript to the LLM as soon as the user finishes speaking, without waiting for formatted text. The function also handles duplicate processing prevention by tracking whether the next end-of-turn event should be ignored (since it corresponds to text already sent to the LLM).
ai_assistant_instance = None
def on_turn(self: Type[StreamingClient], event: TurnEvent):
    if not event.transcript or ai_assistant_instance.is_processing:
        return

    # Always update latest partial and show real-time transcription
    ai_assistant_instance.latest_partial = event.transcript
    print(f"\r{event.transcript}", end='', flush=True)

    # Check if this is marked as end of turn
    if event.end_of_turn:
        if ai_assistant_instance.should_process_on_next_final:
            # This is the final for the partial we already processed - ignore it
            ai_assistant_instance.should_process_on_next_final = False
            # Clear latest_partial since it was already included in the LLM call
            ai_assistant_instance.latest_partial = ""
        else:
            # This is a new final - process immediately for lowest latency
            ai_assistant_instance.should_process_on_next_final = True
            ai_assistant_instance.process_turn()
Create a method called `process_turn` that is called when AssemblyAI detects the user has finished speaking (i.e. when `end_of_turn` is True). It combines any finalized text from `running_transcript` with the current `latest_partial`, clears the running transcript buffer, and sends the complete text to the LLM for processing. This ensures the lowest possible latency by processing text immediately when the user finishes speaking. If you're only interested in processing final transcripts without the latency optimization, you could simplify this by sending `event.transcript` directly to the LLM when `end_of_turn` is True (a minimal sketch of that variant follows the code below).
def process_turn(self):
    """Process the accumulated transcript following AssemblyAI's recommended strategy"""
    # Combine running transcript with latest partial
    complete_text = self.running_transcript
    if self.latest_partial:
        if complete_text:
            complete_text += " " + self.latest_partial
        else:
            complete_text = self.latest_partial

    # Clear running_transcript
    self.running_transcript = ""
    # Note: We keep latest_partial as it will become final later

    # Process with LLM
    if complete_text.strip():
        self.generate_ai_response(complete_text)
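For reference, here is a minimal illustrative sketch of the simpler variant mentioned above: it skips the partial-transcript bookkeeping entirely and only reacts once `end_of_turn` is True. It assumes the same `ai_assistant_instance` global and `generate_ai_response` method used in this tutorial, and you would register it with `StreamingEvents.Turn` in place of `on_turn`:

def on_turn_simple(self: Type[StreamingClient], event: TurnEvent):
    # Illustrative alternative: no latency optimization, just complete turns
    if not event.transcript or ai_assistant_instance.is_processing:
        return
    if event.end_of_turn:
        # Send the finalized turn text straight to the LLM
        ai_assistant_instance.generate_ai_response(event.transcript)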
In the main.py file, add the following functions outside the AI_Assistant class to handle the other Universal-Streaming message types:
def on_begin(self: Type[StreamingClient], event: BeginEvent):
    if DEBUG_MODE:
        logger.info(f"Session started: {event.id}")
        print(f"Session ID: {event.id}")
    print("\n[Listening... Start speaking]")

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    if DEBUG_MODE:
        logger.info(f"Session terminated: {event.audio_duration_seconds} seconds of audio processed")

def on_error(self: Type[StreamingClient], error: StreamingError):
    if DEBUG_MODE:
        print(f"An error occurred: {error}")
        logger.error(f"Streaming error: {error}")
Step 4: Generate responses with OpenAI
Once the code for real-time transcription has been written, use OpenAI's API to generate context-aware responses based on the conversation. To the same AI_Assistant class, add the following method:
def generate_ai_response(self, transcript_text):
    self.is_processing = True
    self.stop_transcription()

    self.full_transcript.append({"role": "user", "content": transcript_text})
    print(f"\nPatient: {transcript_text}\n")

    # Stream the response from OpenAI
    stream = self.openai_client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=self.full_transcript,
        stream=True  # Enable streaming
    )

    # Print AI response header
    print(f"AI Receptionist: ", end="", flush=True)

    # Collect the full response while streaming
    ai_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            ai_response += content
            print(content, end="", flush=True)  # Print as it streams

    print()  # New line after streaming completes

    # Generate audio with the complete response
    self.generate_audio(ai_response)

    self.is_processing = False

    # Reset state before restarting
    self.running_transcript = ""
    self.latest_partial = ""
    self.should_process_on_next_final = False

    self.start_transcription()
This method is called by `process_turn` when it detects the user has finished speaking. It handles the complete conversation flow: first stopping the transcription to prevent interruptions, then appending the user's message to the conversation history and sending it to OpenAI for response generation. The method streams OpenAI's response in real-time, displaying it character by character as it's generated, then passes the complete response to ElevenLabs for voice synthesis. After the audio plays, it resets the transcription state and restarts listening for the next user input.
Step 5: Voice synthesis with ElevenLabs
Once OpenAI generates a response based on the user's input, it is passed to the ElevenLabs API for voice synthesis. Here, the text is converted into a natural-sounding audio stream that can be played back to the user. In the same main.py file, add the following method:
def generate_audio(self, text):
    self.full_transcript.append({"role": "assistant", "content": text})

    # Generate audio using the new ElevenLabs API
    audio = self.elevenlabs_client.text_to_speech.convert(
        text=text,
        voice_id="pNInz6obpgDQGcFmaJgB",
        output_format="mp3_22050_32",
        model_id="eleven_turbo_v2_5",
        voice_settings=VoiceSettings(
            stability=0.0,
            similarity_boost=1.0,
            style=0.0,
            use_speaker_boost=True,
            speed=1.0,
        ),
    )

    # Play the audio
    play(audio)
This method first appends the OpenAI response to the `full_transcript` list, which keeps track of the full conversation. It then sends the response to ElevenLabs for text-to-speech using the `convert` method, and finally plays the synthesized speech with the `play` helper.
Step 6: Finalizing the AI voice bot
In this step, add the following code to the end of the main.py file outside of the `AI_Assistant` class:
if __name__ == "__main__":
    greeting = "Thank you for calling Vancouver dental clinic. My name is Sonny, how may I assist you?"
    print(f"AI Receptionist: {greeting}")  # Print the initial greeting

    ai_assistant = AI_Assistant()
    ai_assistant.generate_audio(greeting)

    try:
        ai_assistant.start_transcription()
        # Keep the program running
        if DEBUG_MODE:
            input("Press Enter to stop...\n")
        else:
            input()  # Silent input wait
    except KeyboardInterrupt:
        if DEBUG_MODE:
            print("\nStopping...")
    finally:
        ai_assistant.stop_transcription()
This completes the code for the AI voice bot, enabling it to handle real-time conversations, transcribing and responding to users in a natural, human-like voice. To start using it, run the file in your terminal with the following command:
python main.py
Once the application starts, you will hear a greeting from the AI voice bot. Respond by speaking, and the conversation will begin.
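As an illustration only, a short exchange might look something like this in your terminal (the patient lines echo whatever you actually say, and the AI responses will vary):

AI Receptionist: Thank you for calling Vancouver dental clinic. My name is Sonny, how may I assist you?

[Listening... Start speaking]
Patient: Hi, I'd like to book a cleaning for next week.

AI Receptionist: Of course! Could I get your name and a day that works best for you?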
Conclusion
By integrating AssemblyAI, OpenAI, and ElevenLabs, this tutorial demonstrates how to build a powerful AI voice bot capable of managing real-time conversations. This application is ideal for call centers, customer support, and virtual receptionists, where human-like interaction is a key part of user experience.
The complete code for this tutorial can also be found in this GitHub repository.
To get started building voice agents with AssemblyAI's Universal-Streaming Speech-to-Text API, check out the docs to learn more.