Building a Voice Agent with Pipecat and AssemblyAI

Overview

Build a complete voice agent from scratch using Pipecat and AssemblyAI’s streaming speech-to-text with advanced turn detection. This guide walks you through creating a fully functional voice agent that can hold natural conversations with users in real time.

Pipecat is an open-source framework for building conversational AI applications, created by Daily.co, a platform that provides real-time video and audio APIs. Daily built Pipecat to make it easier for developers to create AI-powered voice experiences. Pipecat provides the infrastructure for real-time voice interactions, handling the complex orchestration of speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) services.

Pipecat specializes in building AI-powered voice agents and handles the real-time media processing pipeline, allowing you to focus on your agent’s behavior rather than the underlying technical complexity.

New to Pipecat? This guide assumes no prior Pipecat experience and walks you through everything from setup to deployment.

What you’ll build

By the end of this guide, you’ll have:

  • A real-time voice agent with sub-second response times
  • Natural conversation flow with AssemblyAI’s advanced turn detection model
  • Voice Activity Detection (VAD) based interruption handling for responsive interactions
  • A complete setup ready for production deployment

Prerequisites

  • Python 3.10 or higher
  • A microphone and speakers/headphones
  • API keys for AssemblyAI, OpenAI, and Cartesia

Step 1: Installation

Create a new Python environment and install the required packages:

Create virtual environment

$python -m venv voice-agent
$source voice-agent/bin/activate # On Windows: voice-agent\Scripts\activate

Install Pipecat with all required plugins

$pip install "pipecat-ai[assemblyai,openai,cartesia,silero,daily,webrtc]" python-dotenv fastapi uvicorn pipecat-ai-small-webrtc-prebuilt

Download the Pipecat run helper file:

$curl -O https://raw.githubusercontent.com/pipecat-ai/pipecat/9f223442c2799d22aac8a552c0af1d0ae7ff42c2/src/pipecat/examples/run.py

Step 2: Get API Keys

To build your voice agent, you’ll use:

  • AssemblyAI for STT (speech-to-text)
  • GPT-4o mini from OpenAI for the LLM (language model)
  • Cartesia for TTS (text-to-speech)

You’ll need an API key for each of these services; you can create one in each provider’s dashboard.

Looking for alternatives? Pipecat supports multiple TTS and LLM providers.
Explore the full plugin list here.

Step 3: Environment Setup

Create a .env file in your project directory with your API keys:

# API Keys
ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
CARTESIA_API_KEY=your_cartesia_api_key_here

Replace the placeholder values with your actual API keys from Step 2.
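
To confirm the keys are picked up before you wire up the full agent, you can run a quick check like this (an optional sketch using python-dotenv, which you installed in Step 1):

import os

from dotenv import load_dotenv

load_dotenv()

for key in ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY"):
    value = os.getenv(key)
    # Fail fast if a key is missing or still the placeholder value.
    if not value or value.startswith("your_"):
        raise RuntimeError(f"{key} is not set in .env")
print("All API keys loaded.")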

Step 4: Create Your Voice Agent

Pipecat provides many examples for testing, which you can find in their quickstart guide and their examples repository on GitHub.

You can use AssemblyAI within any of these Pipecat examples, as long as you include our STT service in the Pipecat Pipeline (Pipecat’s processing pipeline for voice agents), as shown here:

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams

# Configure service
stt = AssemblyAISTTService(
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=160,
        max_turn_silence=2400,
    ),
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    ...
])

Below is a code snippet that uses our STT within Pipecat’s basic voice agent example structure. This example is a good starting point because it comes with a built-in web UI.

Create a file called voice_agent.py and paste in the following code:

#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import argparse
import os

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketParams
from pipecat.transports.services.daily import DailyParams

load_dotenv(override=True)

# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
    "daily": lambda: DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
    "twilio": lambda: FastAPIWebsocketParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
    "webrtc": lambda: TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
    ),
}


async def run_example(transport: BaseTransport, _: argparse.Namespace, handle_sigint: bool):
    logger.info("Starting bot")

    # Configure AssemblyAI STT with advanced turn detection
    stt = AssemblyAISTTService(
        api_key=os.getenv("ASSEMBLYAI_API_KEY"),
        vad_force_turn_endpoint=False,
        connection_params=AssemblyAIConnectionParams(
            end_of_turn_confidence_threshold=0.7,
            min_end_of_turn_silence_when_confident=160,
            max_turn_silence=2400,
        ),
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",  # British Reading Lady
    )

    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Transport user input
            stt,
            context_aggregator.user(),  # User responses
            llm,  # LLM
            tts,  # TTS
            transport.output(),  # Transport bot output
            context_aggregator.assistant(),  # Assistant spoken responses
        ]
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
            enable_usage_metrics=True,
            report_only_initial_ttfb=True,
        ),
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info("Client connected")
        # Kick off the conversation.
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        await task.queue_frames([context_aggregator.user().get_context_frame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info("Client disconnected")
        await task.cancel()

    runner = PipelineRunner(handle_sigint=handle_sigint)

    await runner.run(task)


if __name__ == "__main__":
    from pipecat.examples.run import main

    main(run_example, transport_params=transport_params)

Step 5: Run Your Voice Agent

Start your voice agent:

$python voice_agent.py

You’ll see a URL (typically http://localhost:7860) in the console output. Open this URL in your browser to test your voice agent!

Step 6: Test Your Voice Agent

  1. Open the provided URL in your browser (usually http://localhost:7860)
  2. Allow microphone access when prompted by your browser
  3. Click “Connect” to join the session
  4. Start talking to your voice agent and have a conversation!

The web interface is provided automatically by Pipecat’s example framework, making testing straightforward.

For more testing examples and advanced configurations, check out the Pipecat examples repository and the quickstart guide.

Configuration

Turn Detection (Key Feature)

AssemblyAI’s new turn detection model was built specifically for voice agents, and you can tune it to fit your use case. On every inference it processes both audio and linguistic information to produce an end-of-turn confidence score, and if that score exceeds the configured threshold, end of turn is triggered.

This custom model was designed to address two major issues with voice agents. Traditional VAD (voice activity detection) approaches are based on silence alone, so the agent may jump in before a user has actually finished their turn, even when the audio suggests more is coming. Think of an utterance like “My credit card number is____” - if the user pauses to look the number up, a silence-based VAD may cut them off, while our turn detection model recognizes the turn is incomplete and waits.

Conversely, when it’s clear the user is done speaking - for example, “What is my credit score?” - the model returns a high end-of-turn confidence score that exceeds the threshold and triggers end of turn immediately, keeping turnaround latency to a minimum.
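
Conceptually, these settings interact roughly like this (an illustrative sketch of the decision logic, not the model’s actual internals):

def is_end_of_turn(confidence: float, silence_ms: int) -> bool:
    """Illustrative only: how the turn detection parameters interact."""
    end_of_turn_confidence_threshold = 0.7
    min_end_of_turn_silence_when_confident = 160  # ms
    max_turn_silence = 2400  # ms

    if confidence >= end_of_turn_confidence_threshold:
        # Model is confident the user is done: only a short silence is needed.
        return silence_ms >= min_end_of_turn_silence_when_confident
    # Model is not confident: fall back to a longer silence timeout.
    return silence_ms >= max_turn_silence

In Pipecat, these parameters are set on the STT service: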

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's STT-based turn detection
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=160,  # in ms
        max_turn_silence=2400,  # in ms
    ),
)

Parameter tuning (see the example configurations below):

  • end_of_turn_confidence_threshold: how confident the model must be before the confidence score triggers end of turn; raise it to wait for more certainty, lower it for snappier responses
  • min_end_of_turn_silence_when_confident: how long (in ms) we wait before triggering end of turn once the model is confident
  • max_turn_silence: how much silence (in ms) triggers end of turn when the confidence score alone hasn’t

You can also set vad_force_turn_endpoint=True if you’d like turn detection to be based on VAD instead of our advanced turn detection model.
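
For example, you might start from different presets depending on your application (illustrative values only, not tuned recommendations):

# A "snappy" agent that favors low latency (may cut off slow speakers):
snappy_params = AssemblyAIConnectionParams(
    end_of_turn_confidence_threshold=0.5,
    min_end_of_turn_silence_when_confident=100,
    max_turn_silence=1500,
)

# A "patient" agent that tolerates long pauses (e.g., users reading numbers aloud):
patient_params = AssemblyAIConnectionParams(
    end_of_turn_confidence_threshold=0.85,
    min_end_of_turn_silence_when_confident=300,
    max_turn_silence=4000,
)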

For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.

Customizing your agent:

Modify the system message in the messages array:

messages = [
    {
        "role": "system",
        "content": "You are a friendly customer service representative. Help users with technical questions and maintain a professional tone. Keep responses under 30 seconds.",
    },
]

This replaces the existing system message in the code. The current example uses:

messages = [
    {
        "role": "system",
        "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
    },
]

For customizing Cartesia TTS voices, see the Pipecat Cartesia TTS documentation.

For configuring OpenAI models and parameters, see the Pipecat OpenAI LLM documentation.
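
As a quick sketch of both customizations in voice_agent.py (the voice_id shown is the one from the example above; the model parameter follows Pipecat’s OpenAILLMService constructor):

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",  # swap in any Cartesia voice ID
)

llm = OpenAILLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-mini",  # set explicitly rather than relying on the default
)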

For complete details on all AssemblyAI parameters, see the AssemblyAI Universal-Streaming API Reference.

Want to build more advanced voice agents? Pipecat has guides on building custom voice agents and workflows. Explore their documentation to see how you can add custom processors, integrate with databases, or build multi-modal experiences.
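
For a taste of what a custom processor looks like, here is a minimal sketch that logs every transcription frame flowing through the pipeline (the FrameProcessor API shown may vary slightly between Pipecat versions):

from loguru import logger

from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class TranscriptLogger(FrameProcessor):
    """Logs user transcripts as they pass downstream."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            logger.info(f"User said: {frame.text}")
        # Always pass the frame along so the pipeline keeps flowing.
        await self.push_frame(frame, direction)

You would then add it to the Pipeline list, for example between stt and context_aggregator.user().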

Production Deployment

When your voice agent is working well in development, it’s time to deploy it to production. Pipecat has comprehensive deployment guides to help you get started - check out their deployment overview for detailed instructions.

Pipecat Cloud

Pipecat offers a managed cloud service for deploying voice agents at scale. Check out Pipecat Cloud for managed infrastructure that handles scaling, monitoring, and deployment automatically.

Self-Hosting

Since Pipecat is open source, you have complete control over your deployment and data. You can deploy on AWS, Google Cloud, Azure, or any other hosting platform. Consider using containerization with Docker for easier deployment and scaling.

More Questions?

If you get stuck, or have any other questions, we’d love to help you out. Contact our support team at support@assemblyai.com or create a support ticket.