Build a voice agent with LiveKit
Voice agents are no longer demo territory. Developers are shipping them to production—for customer support, sales coaching, medical intake, and anything else where a keyboard gets in the way. The challenge isn't the idea. It's wiring up the pipeline without it becoming a second full-time job.
This tutorial walks you through building a complete voice agent using LiveKit Agents as your orchestration framework and AssemblyAI's Universal-3 Pro Streaming model as the speech-to-text layer. You'll wire in OpenAI GPT-4o for the LLM and Cartesia for text-to-speech, run it locally first, and then connect it to the LiveKit Agents Playground.
If you've never touched LiveKit before, that's fine. We'll cover what you need to know.
What is LiveKit?
LiveKit is an open-source real-time communication platform built on top of WebRTC. It handles the hard parts of real-time audio and video—signaling, media routing, scaling—so you can focus on your application logic instead.
At the center of every LiveKit application is a room. Participants join rooms and publish audio or video tracks. Other participants subscribe to those tracks. A LiveKit server (either self-hosted or via LiveKit Cloud) acts as the Selective Forwarding Unit that routes media between participants without requiring peer-to-peer connections at scale.
LiveKit Agents is the framework layer built on top of this infrastructure specifically for AI-powered voice and video agents. It handles orchestration between STT, LLM, and TTS services—you configure the components, and LiveKit Agents manages the flow. Your agent joins a room like any other participant, listens to audio tracks, and responds.
The result: you write agent logic, not media plumbing.
The voice agent pipeline
Voice agents follow a three-step cascading pipeline:
- Speech-to-text (STT): The user speaks. Audio streams into your STT model, which returns a transcript in real time.
- LLM: The transcript goes to a language model, which generates a response.
- Text-to-speech (TTS): The LLM's response gets converted to audio and streamed back to the user.
Here's what that looks like in this stack:
WebRTC (LiveKit room)
↓
LiveKit Cloud
↓
AssemblyAI Universal-3 Pro Streaming ← speech-to-text
↓ transcript + neural turn signal
OpenAI GPT-4o ← LLM
↓ text response
Cartesia Sonic ← text-to-speech
↓ audio
Back to LiveKit room
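The cascading pipeline above can be sketched as three plain functions. This is a toy illustration of the data flow only; the stage callables are hypothetical stand-ins, not real LiveKit, AssemblyAI, or OpenAI APIs:

```python
def run_turn(transcribe, generate, synthesize, audio: bytes) -> bytes:
    """Toy cascade: one conversational turn through STT -> LLM -> TTS."""
    transcript = transcribe(audio)   # STT: audio in, text out
    reply = generate(transcript)     # LLM: text in, text out
    return synthesize(reply)         # TTS: text in, audio out
```

In the real system each stage is streaming and asynchronous, and LiveKit Agents handles the plumbing between them; the point here is just that each stage's output is the next stage's input, which is why transcript quality compounds downstream.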
The key variable in this pipeline is the STT layer. If the transcript is wrong, the LLM responds to the wrong input. That's why the STT choice matters more than most people realize—and it's why we're using Universal-3 Pro Streaming here instead of the default.
Why Universal-3 Pro Streaming for STT?
The LiveKit AssemblyAI plugin supports multiple models. The default is universal-streaming—fast, solid for English-only use cases. But for production voice agents, you want u3-rt-pro.
The difference shows up in two places: accuracy and turn detection.
Accuracy
Real voice agent conversations involve email addresses, phone numbers, account numbers, URLs—the kind of structured entities that most STT models routinely mangle. Universal-3 Pro Streaming was built with this in mind. Benchmarks from Hamming.ai across 4M+ production calls show a significant accuracy gap: Universal-3 Pro Streaming achieves an 8.14% word error rate and makes 21% fewer alphanumeric errors than Deepgram Nova-3.
Turn detection
This is where Universal-3 Pro Streaming has the clearest edge for voice agents. Most STT models use voice activity detection (VAD)—essentially, if there's silence for long enough, assume the user is done talking. It works, but it's blunt. It triggers on mid-sentence pauses. It waits too long when someone trails off.
Universal-3 Pro Streaming uses neural turn detection: it combines acoustic signals and linguistic signals to understand whether a pause is a mid-sentence breath or an actual end-of-turn. The result is fewer false triggers, faster response times, and a conversation that doesn't feel like it's fighting you.
At $0.45/hr, it's the STT layer your voice agent should be running on.
Prerequisites
- Python 3.11+
- A microphone and speakers (for local testing)
- API keys for:
- AssemblyAI — free tier available at assemblyai.com
- LiveKit Cloud — free tier available at livekit.io
- OpenAI
- Cartesia
Step 1: Installation
Create a virtual environment and install the dependencies:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install "livekit-agents[assemblyai,silero,codecs]~=1.0" python-dotenv
You'll also need plugins for your LLM and TTS providers. For GPT-4o and Cartesia:
pip install "livekit-agents[openai,cartesia]~=1.0"
One note: Universal-3 Pro Streaming support was added in livekit-agents@1.4.4. If you're on an older version, the u3-rt-pro model ID will throw a validation error. Run pip install --upgrade "livekit-agents~=1.0" if you hit that.
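If you'd rather check programmatically than wait for the validation error, a quick stdlib sketch works. The 1.4.4 floor comes from the note above; the parsing helper is our own (real livekit-agents versions may carry suffixes like rc1, so we only read leading digits):

```python
import re
from importlib.metadata import PackageNotFoundError, version

def parse_version(s: str) -> tuple:
    """Parse the leading numeric components of a version string, e.g. '1.4.4rc1' -> (1, 4, 4)."""
    parts = []
    for piece in s.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def supports_u3_rt_pro() -> bool:
    """True if the installed livekit-agents is at least 1.4.4."""
    try:
        return parse_version(version("livekit-agents")) >= (1, 4, 4)
    except PackageNotFoundError:
        return False
```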
Step 2: Configure your API keys
Create a .env file in your project root:
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key
Get your AssemblyAI API key from the AssemblyAI dashboard under API Keys. Get your LiveKit URL, API key, and secret from your LiveKit Cloud project settings.
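A missing key tends to surface as a confusing mid-session error, so it can help to fail fast at startup. A small stdlib-only sketch (the key names match the .env above; calling load_dotenv first, as the agent code does, populates os.environ):

```python
import os

REQUIRED_KEYS = [
    "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET",
    "ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "CARTESIA_API_KEY",
]

def missing_keys(env=os.environ) -> list:
    """Return the names of required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Usage at startup, after load_dotenv():
#   if missing_keys():
#       raise SystemExit(f"Missing env vars: {', '.join(missing_keys())}")
```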
Step 3: Build the agent
Create a file called agent.py and add the following:
from dotenv import load_dotenv

from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import (
    assemblyai,
    cartesia,
    openai,
    silero,
)

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,  # Higher ceiling needed when turn_detection="stt"
            vad_threshold=0.3,      # Match Silero's activation_threshold below
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(
            activation_threshold=0.3,
        ),
        turn_detection="stt",
        min_endpointing_delay=0,  # Avoid additive delay in STT mode
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
Let's walk through what's happening here.
The Assistant class
Agent is where you define your agent's personality and behavior. The instructions string is the system prompt—it goes directly to the LLM. This is where you'd define your agent's persona, rules, and any context it needs to do its job.
The AgentSession
This is the core of your agent. You're wiring together four components:
- stt: AssemblyAI's Universal-3 Pro Streaming model. We're setting model="u3-rt-pro" explicitly, along with three turn detection parameters (more on those below).
- llm: OpenAI GPT-4o. LiveKit's OpenAI plugin is straightforward—swap the model string to use GPT-4o mini or any other OpenAI-compatible model.
- tts: Cartesia. The default voice works well out of the box; you can configure specific voices via the Cartesia plugin options.
- vad: Silero VAD. This handles voice activity detection for interruption handling. The activation_threshold is set to 0.3 to match Universal-3 Pro Streaming's internal VAD threshold—keeping them aligned means fewer edge cases where one system thinks the user is speaking and the other doesn't.
The turn_detection="stt" setting
This is the key configuration line. Setting turn_detection="stt" tells LiveKit Agents to use Universal-3 Pro Streaming's neural turn detection instead of LiveKit's own turn detector. This is the recommended mode for Universal-3 Pro Streaming—it's what the model was built for. When you use it, set min_endpointing_delay=0 to avoid stacking delay on top of the STT's already-fast response.
The entrypoint function
This runs when a new job comes in. ctx.connect() joins the LiveKit room. Then we start the session and fire off an initial greeting. The agent keeps running until the room closes or the session ends.
Step 4: Run it
There are two modes for running locally.
Console mode
Console mode lets you talk to the agent directly from your terminal—no LiveKit Cloud connection needed. It's the fastest way to test.
python agent.py console
Speak into your microphone. You'll see transcripts printed in real time, and the agent's responses will play through your speakers.
Dev mode
Dev mode connects your agent to LiveKit Cloud and makes it available through the Agents Playground:
python agent.py dev
Then open agents-playground.livekit.io, enter your LiveKit URL and API key, and click Connect. A room will be created automatically, your agent will join it, and you can start a real conversation through the browser UI.
Tuning turn detection
Universal-3 Pro Streaming's neural turn detection is controlled by three parameters. The defaults work well for general conversational agents, but you'll likely want to tune them for your specific use case.
stt=assemblyai.STT(
    model="u3-rt-pro",
    end_of_turn_confidence_threshold=0.4,  # 0.0–1.0; how confident before declaring turn end
    min_turn_silence=300,                  # ms of silence before checking for end-of-turn
    max_turn_silence=1200,                 # hard ceiling: turn ends after this much silence
)
Here's how to think about each one:
- end_of_turn_confidence_threshold: Lower means the agent responds faster (and risks cutting the user off). Higher means fewer false triggers on noisy lines. For call center environments, raise it to 0.6.
- min_turn_silence: The minimum silence window before the model checks for end-of-turn. Lower this to 200 for fast-paced back-and-forth. For general conversational chat, 300 is a good baseline.
- max_turn_silence: The hard ceiling. After this much silence, the turn ends regardless. Raise to 2000 for healthcare conversations or users who speak more deliberately.
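To build intuition for how the three knobs interact, here's a simplified decision rule. This is purely illustrative, not the model's actual internals; the real end-of-turn signal is a learned combination of acoustic and linguistic features, not a threshold comparison:

```python
def should_end_turn(confidence: float, silence_ms: int,
                    *, threshold: float = 0.4,
                    min_silence: int = 300,
                    max_silence: int = 1200) -> bool:
    """Simplified end-of-turn rule illustrating the three tuning parameters."""
    if silence_ms >= max_silence:
        return True   # hard ceiling: turn ends regardless of confidence
    if silence_ms < min_silence:
        return False  # too early to even check
    return confidence >= threshold  # confident enough the speaker is done
```

Lowering `threshold` or `min_silence` makes the agent snappier at the cost of occasional interruptions; raising them trades latency for fewer false triggers.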
You can also update these parameters mid-session without reconnecting—useful for use cases where the conversation shifts context:
# Increase while user is dictating an address or account number
stt.update_options(max_turn_silence=3000)
# Later, reset to baseline
stt.update_options(max_turn_silence=1200)
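If you toggle the ceiling often, a small context manager keeps the reset from being forgotten. The wrapper below is our own sketch; it assumes the synchronous `update_options` call style shown above, and the 3000/1200 values mirror that example:

```python
from contextlib import contextmanager

@contextmanager
def dictation_mode(stt, ceiling_ms: int = 3000, baseline_ms: int = 1200):
    """Temporarily raise max_turn_silence, restoring the baseline on exit."""
    stt.update_options(max_turn_silence=ceiling_ms)
    try:
        yield
    finally:
        stt.update_options(max_turn_silence=baseline_ms)

# Usage:
#   with dictation_mode(stt):
#       ...collect the address or account number...
```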
Enabling keyterm prompting
If your agent handles domain-specific vocabulary—product names, medical terms, financial jargon, brand names—you can boost recognition accuracy with keyterm prompting. This doesn't require a restart and takes effect immediately.
# After session.start():
await session.stt.update_options(
    keyterms_prompt=["AssemblyAI", "Universal-3 Pro", "LiveKit"]
)
You can provide up to 1,000 terms, each up to 50 characters. This is especially useful when your agent is handling structured data like account numbers, medication names, or product SKUs—exactly the kind of entities that generic STT models get wrong most often.
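Since the limits are 1,000 terms of up to 50 characters each, it's worth sanitizing a vocabulary list before sending it. The limits below come from the paragraph above; the helper function itself is our own:

```python
def clean_keyterms(terms, max_terms: int = 1000, max_len: int = 50) -> list:
    """Strip whitespace, dedupe, drop over-length or empty terms, cap list size."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        if not term or len(term) > max_len or term in seen:
            continue
        seen.add(term)
        cleaned.append(term)
    return cleaned[:max_terms]
```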
Swapping components
One of the cleaner aspects of LiveKit Agents is how easy it is to swap providers. The abstract interfaces for STT, LLM, and TTS are consistent—you change the plugin, not the surrounding code.
To swap the LLM from GPT-4o to Claude Sonnet:
from livekit.plugins import anthropic
session = AgentSession(
    stt=assemblyai.STT(model="u3-rt-pro", ...),
    llm=anthropic.LLM(model="claude-sonnet-4-5"),
    tts=cartesia.TTS(),
    ...
)
To swap TTS from Cartesia to ElevenLabs:
from livekit.plugins import elevenlabs
tts=elevenlabs.TTS(voice_id="your-voice-id")
The STT layer is the one component you don't want to casually swap out—that's where your accuracy comes from. But the rest of the pipeline is genuinely modular.
Next steps
You now have a working voice agent running Universal-3 Pro Streaming. A few directions from here:
- Deploy to LiveKit Cloud: Run python agent.py start to register your agent as a persistent worker that handles incoming jobs automatically.
- Add tool calling: LiveKit Agents supports function calling through the LLM layer. Give your agent the ability to look things up, take actions, or connect to your backend.
- Enable speaker diarization: For multi-party conversations, Universal-3 Pro Streaming supports real-time speaker identification as a per-session toggle—no extra configuration needed.
- Add telephony: If you want your agent to handle phone calls, LiveKit has native SIP support. Pair it with Telnyx or Twilio and your agent is live on a phone number.
The full parameters reference for Universal-3 Pro Streaming on LiveKit is at assemblyai.com/docs/voice-agents/livekit-u3-rt-pro.
Frequently asked questions
What is LiveKit Agents and how does it work for building voice agents?
LiveKit Agents is a framework within the LiveKit platform for building AI-powered voice and video agents. It handles the orchestration between a speech-to-text model, a large language model, and a text-to-speech model—so you configure the components and LiveKit manages the real-time audio routing, turn flow, and session lifecycle. Your agent joins a LiveKit room as a participant, listens to audio tracks published by users, runs audio through the STT → LLM → TTS pipeline, and streams audio responses back. It's built on LiveKit's WebRTC infrastructure and supports both self-hosted and LiveKit Cloud deployments.
What's the difference between Universal-3 Pro Streaming and Universal-Streaming for voice agents?
Universal-3 Pro Streaming (u3-rt-pro) is optimized for production voice agent workflows, while Universal-Streaming is the faster, lower-cost model designed for English-only use cases. The key differences are accuracy and turn detection: Universal-3 Pro Streaming uses neural turn detection—combining acoustic and linguistic signals—instead of silence-based VAD, which means it better distinguishes mid-sentence pauses from actual end-of-turns. It also delivers significantly better accuracy on structured entities like email addresses, phone numbers, and account numbers. Universal-3 Pro Streaming costs $0.45/hr and supports six languages; Universal-Streaming is $0.15/hr and English-only.
What is the best streaming speech-to-text API for building voice agents with LiveKit?
AssemblyAI's Universal-3 Pro Streaming model is the recommended choice for LiveKit voice agents. It has a native one-line integration with LiveKit Agents via the assemblyai plugin, delivers 307ms P50 latency (versus 516ms for Deepgram Nova-3), and includes neural turn detection that LiveKit can use directly by setting turn_detection="stt". Benchmarks from Hamming.ai across 4M+ production calls show Universal-3 Pro Streaming achieves 8.14% word error rate and 21% fewer alphanumeric errors than Deepgram Nova-3, with anti-hallucination built in.
How does neural turn detection in Universal-3 Pro Streaming work, and why does it matter?
Neural turn detection uses both acoustic signals (voice energy, pitch, cadence) and linguistic signals (sentence completion, punctuation patterns) to determine when a speaker has finished their turn—rather than simply waiting for silence. In practice, VAD-based detection creates false triggers on mid-sentence pauses and can make an agent feel like it's constantly interrupting. Universal-3 Pro Streaming's approach means fewer false triggers on noisy lines and snappier responses in natural conversation. You tune the behavior with three parameters—end_of_turn_confidence_threshold, min_turn_silence, and max_turn_silence—and can adjust them mid-session without reconnecting.
How much does it cost to build a voice agent using AssemblyAI Universal-3 Pro Streaming with LiveKit?
AssemblyAI's Universal-3 Pro Streaming model is priced at $0.45/hr, billed based on total session duration with no minimum commitments, no rate limits, and unlimited concurrent sessions included. You'll also need to budget for your LLM (e.g., OpenAI GPT-4o) and TTS provider (e.g., Cartesia) separately, as those are billed by their respective providers. AssemblyAI offers a free tier to get started with no credit card required, and volume discounts are available for high-usage workloads.
Can I swap out the LLM or TTS components in a LiveKit voice agent?
Yes—LiveKit Agents is designed around modular, swappable components. The AgentSession takes separate stt, llm, and tts parameters, each backed by a consistent plugin interface. You can swap OpenAI GPT-4o for Anthropic Claude or any OpenAI-compatible model without changing the surrounding code; similarly, Cartesia can be replaced with ElevenLabs, OpenAI TTS, or any other supported provider. The STT layer is the one you shouldn't swap casually—accuracy at the input directly affects LLM response quality, since a wrong transcript produces a wrong answer.