Building a Voice Agent with Pipecat and AssemblyAI
Overview
Build a complete voice agent from scratch using Pipecat and AssemblyAI’s streaming speech-to-text with advanced turn detection. This guide walks you through creating a fully functional voice agent that can have natural conversations with users in real-time.
Pipecat is an open-source framework for building conversational AI applications, created by Daily.co, the company behind a real-time video and audio API platform. Pipecat provides the infrastructure for real-time voice interactions, handling the complex orchestration of speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) services.
Because Pipecat manages the real-time media processing pipeline for you, you can focus on your agent’s behavior rather than the underlying technical complexity.
New to Pipecat? This guide assumes no prior Pipecat experience and walks you through everything from setup to deployment.
What you’ll build
By the end of this guide, you’ll have:
- A real-time voice agent with sub-second response times
- Natural conversation flow with AssemblyAI’s advanced turn detection model
- Voice Activity Detection (VAD)-based interruption handling for responsive interactions
- A complete setup ready for production deployment
Prerequisites
- Python 3.10 or higher
- A microphone and speakers/headphones
- API keys for AssemblyAI, OpenAI, and Cartesia
Step 1: Installation
Create a new Python environment and install the required packages:
Create and activate a virtual environment:
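A minimal sketch using Python’s built-in venv module:

```bash
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```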
Install Pipecat with all required plugins:
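The plugin extras below are an assumption based on the services this guide uses (AssemblyAI STT, OpenAI, Cartesia, plus a WebRTC transport and Silero VAD); check Pipecat’s plugin list for the exact extra names in your version:

```bash
# Extras assumed from the services used in this guide — verify against
# Pipecat's plugin list for your installed version
pip install "pipecat-ai[assemblyai,openai,cartesia,webrtc,silero]" python-dotenv
```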
Download the Pipecat run helper file:
Linux/macOS
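The URL below is illustrative, not confirmed; copy the exact one from Pipecat’s quickstart guide:

```bash
# Illustrative URL — use the exact one from Pipecat's quickstart
curl -O https://raw.githubusercontent.com/pipecat-ai/pipecat/main/examples/foundational/runner.py
```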
Windows (PowerShell)
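The same download via PowerShell, again with an illustrative URL:

```powershell
# Illustrative URL — use the exact one from Pipecat's quickstart
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/pipecat-ai/pipecat/main/examples/foundational/runner.py" -OutFile "runner.py"
```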
Step 2: Get API Keys
To build your voice agent, you’ll use:
- AssemblyAI for STT (speech-to-text)
- GPT-4o mini from OpenAI for the LLM (language model)
- Cartesia for TTS (text-to-speech)
You’ll need API keys for each:
AssemblyAI (STT)
- Sign up: assemblyai.com/signup
- API Key: assemblyai.com/api-keys
OpenAI (LLM)
- Sign up: auth.openai.com/create-account
- API Key: platform.openai.com/api-keys
Cartesia (TTS)
- Sign up: cartesia.ai/sign-up
- API Key: cartesia.ai/keys
Looking for alternatives? Pipecat supports multiple TTS and LLM providers; explore the full plugin list here.
Step 3: Environment Setup
Create a .env file in your project directory with your API keys:
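A sketch of the expected contents; the variable names match what the Step 4 code reads via os.getenv, but verify them against your own code:

```
ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
CARTESIA_API_KEY=your_cartesia_api_key_here
```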
Replace the placeholder values with your actual API keys from Step 2.
Step 4: Create Your Voice Agent
Pipecat provides many examples for testing; see their quickstart guide and the full set of examples on GitHub.
You can use AssemblyAI within any of these Pipecat examples as long as you place our STT service in the Pipecat Pipeline (Pipecat’s processing chain for voice agents).
Below is a code snippet that uses our STT within Pipecat’s basic voice agent example structure. This example is a good starting point because it ships with a built-in web UI for testing.
Create a file called voice_agent.py and copy and paste this code:
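The following is a condensed sketch of that example. It assumes recent Pipecat module paths (pipecat.services.assemblyai.stt and friends) and that the run helper downloaded in Step 1 creates the browser transport and calls into this file; if your Pipecat version differs, lift the exact boilerplate from the quickstart and swap in the AssemblyAISTTService shown here:

```python
# voice_agent.py
#
# A condensed sketch of Pipecat's basic voice agent wired to AssemblyAI STT.
# Module paths and constructor arguments follow recent Pipecat releases and
# may differ in yours — compare against Pipecat's quickstart if imports fail.

import os

from dotenv import load_dotenv

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService

load_dotenv()


async def run_bot(transport):
    """Build and run the STT -> LLM -> TTS pipeline on the given transport.

    The transport (microphone in, speaker out, plus the test web UI) is
    supplied by the run helper file downloaded in Step 1.
    """
    # Speech-to-text: AssemblyAI Universal-Streaming with advanced turn detection.
    stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))

    # Language model: GPT-4o mini.
    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")

    # Text-to-speech: Cartesia. The voice_id is a placeholder — pick one from
    # your Cartesia dashboard.
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="YOUR_CARTESIA_VOICE_ID",
    )

    # The system message controls your agent's personality and behavior.
    messages = [
        {
            "role": "system",
            "content": (
                "You are a friendly voice assistant. Keep responses short and "
                "conversational, since your output will be spoken aloud."
            ),
        }
    ]
    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    # The Pipeline is the ordered chain of processors that audio flows through.
    pipeline = Pipeline(
        [
            transport.input(),               # audio from the user
            stt,                             # speech-to-text (AssemblyAI)
            context_aggregator.user(),       # append user turns to the context
            llm,                             # generate a response
            tts,                             # text-to-speech (Cartesia)
            transport.output(),              # audio back to the user
            context_aggregator.assistant(),  # append bot turns to the context
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)
```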
Step 5: Run Your Voice Agent
Start your voice agent:
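Assuming the file name from Step 4 and an active virtual environment:

```bash
python voice_agent.py
```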
You’ll see a URL (typically http://localhost:7860) in the console output. Open this URL in your browser to test your voice agent!
Step 6: Test Your Voice Agent
- Open the provided URL in your browser (usually http://localhost:7860)
- Allow microphone access when prompted by your browser
- Click “Connect” to join the session
- Start talking to your voice agent and have a conversation!
The web interface is provided automatically by Pipecat’s example framework, making testing straightforward.
For more testing examples and advanced configurations, check out the Pipecat examples repository and the quickstart guide.
Configuration
Turn Detection (Key Feature)
AssemblyAI’s new turn detection model was built specifically for voice agents, and you can tune it to fit your use case. It processes both audio and linguistic information to produce an end-of-turn confidence score on every inference; if that score exceeds the configured threshold, end of turn is triggered.
This custom model was designed to address two major issues with voice agents. Traditional VAD (voice activity detection) approaches, based on silence alone, can end a user’s turn even when the words make clear they aren’t finished. Consider “My credit card number is…” - if the user pauses to look the number up, silence-based VAD may cut them off, while our turn detection model recognizes the unfinished sentence and keeps waiting.
Conversely, when the user is clearly done speaking, as in “What is my credit score?”, the model returns a high end-of-turn confidence that exceeds the threshold and triggers end of turn immediately, keeping turnaround latency minimal in those scenarios.
Parameter tuning (a configuration sketch follows below):
- end_of_turn_confidence_threshold: raise or lower how confident the model must be before a confidence score triggers end of turn
- min_end_of_turn_silence_when_confident: increase or decrease how long we wait before triggering end of turn when the model is confident
- max_turn_silence: raise or lower how much silence triggers end of turn when a high confidence score hasn’t already done so
You can also set vad_force_turn_endpoint=True if you’d like turn detection to be based on VAD instead of our advanced turn detection model.
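Here’s how those parameters might be wired into the STT service from Step 4. A sketch, assuming the connection_params field and AssemblyAIConnectionParams model from recent Pipecat releases (names may differ in your version), with illustrative values:

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Assumed import path for the connection parameters model — verify against
# your installed Pipecat version.
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # True = endpoint turns with VAD instead
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.7,        # confidence needed to end a turn
        min_end_of_turn_silence_when_confident=160,  # ms of silence when confident
        max_turn_silence=2400,                       # ms of silence that forces end of turn
    ),
)
```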
For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.
Customizing your agent:
Modify the system message in the messages array to change your agent’s personality and behavior. This replaces the system message shown in the Step 4 code.
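For example, to give the agent a narrowly scoped persona (the prompt text below is illustrative):

```python
# Illustrative system prompt — swap this into the messages array in Step 4.
messages = [
    {
        "role": "system",
        "content": (
            "You are a concise cooking assistant. Answer questions about "
            "recipes and techniques in one or two spoken-friendly sentences."
        ),
    }
]
```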
For customizing Cartesia TTS voices, see the Pipecat Cartesia TTS documentation.
For configuring OpenAI models and parameters, see the Pipecat OpenAI LLM documentation.
For complete details on all AssemblyAI parameters, see the AssemblyAI Universal-Streaming API Reference.
Want to build more advanced voice agents? Pipecat has guides on building custom voice agents and workflows. Explore their documentation to see how you can add custom processors, integrate with databases, or build multi-modal experiences.
Production Deployment
When your voice agent is working well in development, it’s time to deploy it to production. Pipecat has comprehensive deployment guides to help you get started - check out their deployment overview for detailed instructions.
Pipecat Cloud
Pipecat offers a managed cloud service for deploying voice agents at scale. Check out Pipecat Cloud for managed infrastructure that handles scaling, monitoring, and deployment automatically.
Self-Hosting
Since Pipecat is open source, you have complete control over your deployment and data. You can deploy on AWS, Google Cloud, Azure, or any other hosting platform. Consider using containerization with Docker for easier deployment and scaling.
More Questions?
If you get stuck, or have any other questions, we’d love to help you out. Contact our support team at support@assemblyai.com or create a support ticket.