Building a Voice Agent with Pipecat and AssemblyAI | AssemblyAI

Overview

Build a complete voice agent from scratch using Pipecat and AssemblyAI’s streaming speech-to-text with advanced turn detection. This guide walks you through creating a fully functional voice agent that can have natural conversations with users in real-time.

Pipecat is an open-source framework for building conversational AI applications, created by Daily.co. Daily.co is a platform that provides real-time video and audio APIs, and they built Pipecat to make it easier for developers to create AI-powered voice experiences. Pipecat provides the infrastructure for real-time voice interactions, handling the complex orchestration of speech-to-text (STT), large language models (LLM), and text-to-speech (TTS) services.

Pipecat specializes in building AI-powered voice agents and handles the real-time media processing pipeline, allowing you to focus on your agent’s behavior rather than the underlying technical complexity.

New to Pipecat? This guide assumes no prior Pipecat experience and walks you through everything from setup to deployment.

Pipecat

Learn more about Pipecat’s voice agent framework.

YouTube video guide

What you’ll build

By the end of this guide, you’ll have:

A real-time voice agent with sub-second response times
Natural conversation flow with AssemblyAI’s advanced turn detection model
Voice Activity Detection based interruption handling for responsive interactions
A complete setup ready for production deployment

Prerequisites

Python 3.10 or higher
A microphone and speakers/headphones
API keys for AssemblyAI, OpenAI, and Cartesia

Step 1: Installation

Create a new Python environment and install the required packages:

Create virtual environment

$ python -m venv voice-agent
> source voice-agent/bin/activate  # On Windows: voice-agent\\Scripts\\activate

Install Pipecat with all required plugins

$ pip install "pipecat-ai[assemblyai,openai,cartesia,silero,daily,webrtc]" python-dotenv fastapi uvicorn pipecat-ai-small-webrtc-prebuilt

Download the Pipecat run helper file:

Linux/macOS

Windows (PowerShell)

$ curl -O https://raw.githubusercontent.com/pipecat-ai/pipecat/9f223442c2799d22aac8a552c0af1d0ae7ff42c2/src/pipecat/examples/run.py

Step 2: Get API Keys

To build your voice agent, you’ll use:

AssemblyAI for STT (speech-to-text)
GPT-4o mini from OpenAI for the LLM (language model)
Cartesia for TTS (text-to-speech)

You’ll need API keys for each:

AssemblyAI (STT)

Sign up: assemblyai.com/signup
API Key: assemblyai.com/api-keys

OpenAI (LLM)

Sign up: auth.openai.com/create-account
API Key: platform.openai.com/api-keys

Cartesia (TTS)

Sign up: cartesia.ai/sign-up
API Key: cartesia.ai/keys

Looking for alternatives? Pipecat supports multiple TTS and LLM providers.
Explore the full plugin list here.

Step 3: Environment Setup

Create a .env file in your project directory with your API keys:

1 # API Keys
2 ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
3 OPENAI_API_KEY=your_openai_api_key_here
4 CARTESIA_API_KEY=your_cartesia_api_key_here

Replace the placeholder values with your actual API keys from Step 2.

Step 4: Create Your Voice Agent

Pipecat has many examples for testing, which you can see in their quickstart guide and all their examples on GitHub.

You can utilize AssemblyAI within any of these Pipecat examples as long as you use us in the Pipecat Pipeline (Pipecat’s code system for voice agents) shown here:

1 from pipecat.services.assemblyai.stt import AssemblyAISTTService
2 
3 # Configure service
4 stt = AssemblyAISTTService(
5     connection_params=AssemblyAIConnectionParams(
6         end_of_turn_confidence_threshold=0.7,
7         min_end_of_turn_silence_when_confident=160,
8         max_turn_silence=2400,
9     ),
10     api_key=os.getenv("ASSEMBLYAI_API_KEY"),
11     vad_force_turn_endpoint=False
12 )
13 
14 # Use in pipeline
15 pipeline = Pipeline([
16     transport.input(),
17     stt,
18     llm,
19     ...
20 ])

Below is a code snippet that uses our STT within Pipecat’s basic voice agent example structure. This example is great because it comes with a great UI.

Please create a file called voice_agent.py and copy and paste this code:

1 #
2 # Copyright (c) 2024–2025, Daily
3 #
4 # SPDX-License-Identifier: BSD 2-Clause License
5 #
6 
7 import argparse
8 import os
9 
10 from dotenv import load_dotenv
11 from loguru import logger
12 
13 from pipecat.audio.vad.silero import SileroVADAnalyzer
14 from pipecat.pipeline.pipeline import Pipeline
15 from pipecat.pipeline.runner import PipelineRunner
16 from pipecat.pipeline.task import PipelineParams, PipelineTask
17 from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
18 from pipecat.services.cartesia.tts import CartesiaTTSService
19 from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams
20 from pipecat.services.openai.llm import OpenAILLMService
21 from pipecat.transports.base_transport import BaseTransport, TransportParams
22 from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketParams
23 from pipecat.transports.services.daily import DailyParams
24 
25 load_dotenv(override=True)
26 
27 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
28 # instantiated. The function will be called when the desired transport gets
29 # selected.
30 transport_params = {
31     "daily": lambda: DailyParams(
32         audio_in_enabled=True,
33         audio_out_enabled=True,
34         vad_analyzer=SileroVADAnalyzer(),
35     ),
36     "twilio": lambda: FastAPIWebsocketParams(
37         audio_in_enabled=True,
38         audio_out_enabled=True,
39         vad_analyzer=SileroVADAnalyzer(),
40     ),
41     "webrtc": lambda: TransportParams(
42         audio_in_enabled=True,
43         audio_out_enabled=True,
44         vad_analyzer=SileroVADAnalyzer(),
45     ),
46 }
47 
48 
49 async def run_example(transport: BaseTransport, _: argparse.Namespace, handle_sigint: bool):
50     logger.info(f"Starting bot")
51 
52     # Configure AssemblyAI STT with advanced turn detection
53     stt = AssemblyAISTTService(
54         api_key=os.getenv("ASSEMBLYAI_API_KEY"),
55         vad_force_turn_endpoint=False,
56         connection_params=AssemblyAIConnectionParams(
57             end_of_turn_confidence_threshold=0.7,
58             min_end_of_turn_silence_when_confident=160,
59             max_turn_silence=2400,
60         )
61     )
62 
63     tts = CartesiaTTSService(
64         api_key=os.getenv("CARTESIA_API_KEY"),
65         voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",  # British Reading Lady
66     )
67 
68     llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
69 
70     messages = [
71         {
72             "role": "system",
73             "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
74         },
75     ]
76 
77     context = OpenAILLMContext(messages)
78     context_aggregator = llm.create_context_aggregator(context)
79 
80     pipeline = Pipeline(
81         [
82             transport.input(),  # Transport user input
83             stt,
84             context_aggregator.user(),  # User responses
85             llm,  # LLM
86             tts,  # TTS
87             transport.output(),  # Transport bot output
88             context_aggregator.assistant(),  # Assistant spoken responses
89         ]
90     )
91 
92     task = PipelineTask(
93         pipeline,
94         params=PipelineParams(
95             allow_interruptions=True,
96             enable_metrics=True,
97             enable_usage_metrics=True,
98             report_only_initial_ttfb=True,
99         ),
100     )
101 
102     @transport.event_handler("on_client_connected")
103     async def on_client_connected(transport, client):
104         logger.info(f"Client connected")
105         # Kick off the conversation.
106         messages.append({"role": "system", "content": "Please introduce yourself to the user."})
107         await task.queue_frames([context_aggregator.user().get_context_frame()])
108 
109     @transport.event_handler("on_client_disconnected")
110     async def on_client_disconnected(transport, client):
111         logger.info(f"Client disconnected")
112         await task.cancel()
113 
114     runner = PipelineRunner(handle_sigint=handle_sigint)
115 
116     await runner.run(task)
117 
118 
119 if __name__ == "__main__":
120     from pipecat.examples.run import main
121 
122     main(run_example, transport_params=transport_params)

Step 5: Run Your Voice Agent

Start your voice agent:

$ python voice_agent.py

You’ll see a URL (typically http://localhost:7860) in the console output. Open this URL in your browser to test your voice agent!

Step 6: Test Your Voice Agent

Open the provided URL in your browser (usually http://localhost:7860)
Allow microphone access when prompted by your browser
Click “Connect” to join the session
Start talking to your voice agent and have a conversation!

The web interface is provided automatically by Pipecat’s example framework, making testing simple and straightforward.

For more testing examples and advanced configurations, check out the Pipecat examples repository and the quickstart guide.

Configuration

Turn Detection (Key Feature)

AssemblyAI’s new turn detection model was built specifically for voice agents and you can tweak it to fit your use case. It processes both audio and linguistic information to determine an end of turn confidence score on every inference, and if that confidence score is past the set threshold, it triggers end of turn.

This custom model was designed to address 2 major issues with voice agents. With traditional VAD (voice activity detection) approaches based on silence alone, there are situations where the agent wouldn’t wait for a user to finish their turn even if the audio data suggested it. Think of a situation like “My credit card number is____” - if someone is looking that up, traditional VAD may not wait for the user, where our turn detection model is far better in these situations.

Additionally, in situations where we are certain that the user is done speaking like “What is my credit score?”, a high end of turn confidence is returned, greater than the threshold, and triggering end of turn, allowing for minimal turnaround latency in those scenarios.

1 stt = AssemblyAISTTService(
2     api_key=os.getenv("ASSEMBLYAI_API_KEY"),
3     vad_force_turn_endpoint=False,  # Use AssemblyAI's STT-based turn detection
4     connection_params=AssemblyAIConnectionParams(
5         end_of_turn_confidence_threshold=0.7,
6         min_end_of_turn_silence_when_confident=160,  # in ms
7         max_turn_silence=2400,  # in ms
8     )
9 )

Parameter tuning:

end_of_turn_confidence_threshold: Raise or lower the threshold based on how confident you’d like us to be before triggering end of turn based on confidence score
min_end_of_turn_silence_when_confident: Increase or decrease the amount of time we wait to trigger end of turn when confident
max_turn_silence: Lower or raise the amount of time needed to trigger end of turn when end of turn isn’t triggered by a high confidence score

You can also set vad_force_turn_endpoint=True if you’d like turn detection to be based on VAD instead of our advanced turn detection model.

For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.

Customizing your agent:

Modify the system message in the messages array:

1 messages = [
2     {
3         "role": "system",
4         "content": "You are a friendly customer service representative. Help users with technical questions and maintain a professional tone. Keep responses under 30 seconds.",
5     },
6 ]

This replaces the existing system message in the code. The current example uses:

1 messages = [
2     {
3         "role": "system",
4         "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
5     },
6 ]

For customizing Cartesia TTS voices, see the Pipecat Cartesia TTS documentation.

For configuring OpenAI models and parameters, see the Pipecat OpenAI LLM documentation.

For complete details on all AssemblyAI parameters, see the AssemblyAI Universal-Streaming API Reference.

Want to build more advanced voice agents? Pipecat has guides on building custom voice agents and workflows. Explore their documentation to see how you can add custom processors, integrate with databases, or build multi-modal experiences.

Production Deployment

When your voice agent is working well in development, it’s time to deploy it to production. Pipecat has comprehensive deployment guides to help you get started - check out their deployment overview for detailed instructions.

Pipecat Cloud

Pipecat offers a managed cloud service for deploying voice agents at scale. Check out Pipecat Cloud for managed infrastructure that handles scaling, monitoring, and deployment automatically.

Self-Hosting

Since Pipecat is open source, you have complete control over your deployment and data. You can deploy on AWS, Google Cloud, Azure, or any other hosting platform. Consider using containerization with Docker for easier deployment and scaling.