Build a Daily.co voice agent with AssemblyAI's Voice Agent API
A server-side Daily.co bot, written in Python, that joins a WebRTC room as a participant, listens to whoever's talking, and replies with a real voice — powered end-to-end by AssemblyAI's Voice Agent API. One Daily room. One AssemblyAI WebSocket. No separate STT, LLM, or TTS providers to wire up.



Daily.co handles the WebRTC plumbing — rooms, participants, audio tracks, NAT traversal — across web, mobile, and SIP. AssemblyAI's Voice Agent API handles the AI: speech recognition, the LLM that decides what to say, and the voice that speaks it back, all over a single connection. This tutorial bridges the two with the daily-python SDK.
Why combine Daily.co with the Voice Agent API
Most voice-agent stacks pick a transport and then bolt on a pipeline of AI services behind it. With Daily.co + the Voice Agent API, both halves collapse to a single managed dependency each.
Architecture
The system has three layers: the Daily room (the WebRTC transport), the Python bot (the bridge), and AssemblyAI's Voice Agent API (the AI). The bot resamples between Daily's 16 kHz and the Voice Agent API's 24 kHz. Both sides use PCM16 mono.
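The resampler itself isn't shown in this post; a minimal pure-Python sketch of a resample_pcm16 helper (the name matches the call used later in bot.py) could look like this — linear interpolation only, so production code would usually prefer a filtered polyphase resampler to avoid aliasing:

```python
import struct

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono PCM16 (little-endian) by linear interpolation."""
    samples = struct.unpack("<%dh" % (len(data) // 2), data)
    if src_rate == dst_rate or not samples:
        return data
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return struct.pack("<%dh" % n_out, *out)
```

Upsampling 16 kHz → 24 kHz produces 3 output samples for every 2 input samples; the 24 kHz → 16 kHz direction simply runs the same function with the rates swapped.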
Prerequisites
- Python 3.10+
- A Daily.co account with an API key — free tier includes 10,000 participant minutes/month
- An AssemblyAI API key — free tier available, no credit card
- A Daily.co room URL — create in the dashboard or with the script shown below
Quick start
1. Clone and install
git clone https://github.com/kelsey-aai/voice-agent-daily
cd voice-agent-daily
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
2. Configure your keys
cp .env.example .env
Fill in .env:
ASSEMBLYAI_API_KEY=... # https://www.assemblyai.com/dashboard/signup
DAILY_API_KEY=... # https://dashboard.daily.co/developers
DAILY_ROOM_URL=https://yourname.daily.co/voice-agent
3. Create a Daily room (optional)
If you don't have a room URL, create one via the Daily REST API:
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

HEADERS = {"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"}
r = requests.post(
    "https://api.daily.co/v1/rooms",
    headers=HEADERS,
    json={"properties": {
        "enable_prejoin_ui": False,
        # exp is a Unix timestamp, not a duration: expire one hour from now
        "exp": int(time.time()) + 3600,
    }},
)
r.raise_for_status()
print("Room URL:", r.json()["url"])
4. Run the bot
python bot.py
Open the same room URL in a browser. Speak — you'll see your transcript and the agent's reply stream to the terminal. The agent's voice plays back through the room.
How it works
The whole bot is in bot.py, in four pieces.
1. Initialize Daily and join the room
import daily

daily.Daily.init()
client = daily.CallClient(event_handler=self)
client.join(
    DAILY_ROOM_URL,
    meeting_token=DAILY_TOKEN,
    client_settings=daily.ClientSettings(
        inputs=daily.InputSettings(
            microphone=daily.MicrophoneSettings(
                is_enabled=True, device_id="vaa-mic",
            ),
            camera=daily.CameraSettings(is_enabled=False),
        ),
        publishing=daily.PublishingSettings(
            microphone=daily.MicrophonePublishingSettings(is_enabled=True),
        ),
    ),
)
Three things happen: daily.Daily.init() boots the WebRTC stack. CallClient is the per-room handle. The client_settings say: don't open a real mic or camera, but publish a virtual mic named vaa-mic where we'll write the agent's reply audio.
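The join above references a device named vaa-mic, which has to be registered before the CallClient exists. A sketch, assuming daily-python's create_microphone_device signature (verify against your SDK version):

```python
import daily

daily.Daily.init()
# Register the virtual mic before constructing CallClient; the name must
# match MicrophoneSettings(device_id="vaa-mic") in client_settings.
mic_device = daily.Daily.create_microphone_device(
    "vaa-mic",
    sample_rate=16_000,  # Daily-side rate; agent replies are resampled to it
    channels=1,
)
```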
2. Subscribe to remote audio
client.update_subscriptions(
    participant_settings={
        "*": daily.ParticipantSubscriptionSettings(
            media=daily.MediaSubscription.SUBSCRIBED_ALL,
        )
    }
)
client.set_participant_audio_renderer(
    participant_id,
    callback=self.on_audio_data,
    sample_rate=16_000,
    num_channels=1,
)
update_subscriptions tells Daily to pull all media from every remote participant. set_participant_audio_renderer converts the remote audio track into mono PCM16 at 16 kHz and calls your callback for each chunk.
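One wrinkle: the renderer callback fires on Daily's own thread, not on the asyncio loop that owns the Voice Agent API websocket, so the bridge needs a thread-safe hand-off. A hypothetical pattern (the audio_frames attribute name is an assumption about daily-python's AudioData object; check your SDK version):

```python
import asyncio

class AudioBridge:
    """Thread-safe hand-off from Daily's renderer thread into the asyncio
    loop that owns the Voice Agent API websocket."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_audio_data(self, participant_id, audio_data):
        # Invoked on Daily's renderer thread. audio_data.audio_frames is
        # assumed to hold the raw PCM16 bytes; plain bytes pass through
        # unchanged, which keeps this sketch easy to test.
        frames = getattr(audio_data, "audio_frames", audio_data)
        self.loop.call_soon_threadsafe(self.queue.put_nowait, frames)
```

A sender task on the asyncio side then awaits queue.get() and forwards each chunk to the websocket, so no SDK thread ever touches the socket directly.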
3. Bridge to the Voice Agent API
import json

import websockets

async with websockets.connect(
    "wss://agents.assemblyai.com/v1/ws",
    additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_API_KEY}"},
) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi - I'm joining the call.",
            "input": {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
After session.ready, every chunk of room audio gets resampled to 24 kHz, base64-encoded, and shipped as input.audio.
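The framing step itself isn't shown above. Assuming the same event fields this post uses elsewhere (type, data), a sketch:

```python
import base64
import json

def make_input_audio_event(pcm16_24k: bytes) -> str:
    """Wrap one resampled PCM16 chunk as an input.audio message.

    Field names mirror the events used in this post; check the Voice Agent
    API reference for the authoritative schema.
    """
    return json.dumps({
        "type": "input.audio",
        "data": base64.b64encode(pcm16_24k).decode("ascii"),
    })
```

Usage from the sender loop is then just await ws.send(make_input_audio_event(chunk)).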
4. Publish reply audio back into the room
elif t == "reply.audio":
pcm24 = base64.b64decode(event["data"])
pcm_out = resample_pcm16(pcm24, 24_000, 16_000)
await asyncio.to_thread(mic_device.write_frames, pcm_out)
mic_device is the virtual microphone registered with daily.Daily.create_microphone_device(...). Anything you write into it gets published into the Daily room as if a real human's mic produced it.
Interruption (Barge-In) handling
When a participant speaks while the agent is replying, the Voice Agent API emits reply.done with status: "interrupted". The bot doesn't need to flush anything on the Daily side — mic_device.write_frames only plays what you hand it, so stopping writes stops the agent.
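One way to make "stopping writes stops the agent" concrete is to drive mic writes from a cancellable task — a hypothetical pattern, not code from the repo:

```python
import asyncio

class ReplyPlayback:
    """Owns the task that streams reply audio to the virtual mic; an
    interruption cancels it, so any unwritten audio is simply dropped."""

    def __init__(self):
        self._task = None

    def start(self, coro):
        self.stop()  # a new reply supersedes whatever is still playing
        self._task = asyncio.ensure_future(coro)

    def stop(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None
```

On reply.done with status "interrupted", call stop(); the mic goes silent mid-word, which is exactly the barge-in behavior callers expect.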
elif t == "reply.done" and event.get("status") == "interrupted":
    pending_tools.clear()
Tuning
Pick a different voice
"output": {"voice": "james"} # conversational US male
"output": {"voice": "sophie"} # UK female
"output": {"voice": "diego"} # Latin American Spanish
"output": {"voice": "arjun"} # Hindi/Hinglish
Browse the Voices catalog. Multilingual voices code-switch automatically.
Tune the system prompt
"session": {
    "system_prompt": (
        "You are a sales-qualifying agent for Acme Corp on a video call. "
        "Ask one question at a time. Keep answers under two short sentences."
    ),
    "greeting": "Hey - thanks for hopping on. What brings you in today?",
}
Multi-participant calls
update_subscriptions with "*" subscribes the bot to every remote participant. The audio renderer fires per-participant. Today the bot mixes everyone into a single stream — for cleaner handling, either subscribe only to a designated primary participant or tag transcripts with the participant ID at the bridge.
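Tagging at the bridge can be as simple as keeping one buffer per participant — a sketch, not repo code:

```python
from collections import defaultdict

class ParticipantRouter:
    """One audio buffer per remote participant, so upstream transcripts can
    be attributed to a speaker instead of mixing everyone into one stream."""

    def __init__(self):
        self._buffers = defaultdict(bytearray)

    def on_audio(self, participant_id: str, frames: bytes) -> None:
        self._buffers[participant_id].extend(frames)

    def drain(self, participant_id: str) -> bytes:
        """Return and clear everything buffered for one participant."""
        chunk = bytes(self._buffers[participant_id])
        self._buffers[participant_id].clear()
        return chunk
```

The per-participant renderer callback feeds on_audio; the sender side can then decide whether to forward only the primary participant's buffer or interleave streams with speaker labels.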
Telephony in the same room
Daily supports SIP and PSTN dial-in, so phone callers can join the same room. Daily transcodes the carrier's 8 kHz G.711 audio for you — your renderer callback still gets PCM16 at whatever sample rate you asked for. No extra code.
Troubleshooting
Bot connects but never speaks. Check that MicrophonePublishingSettings(is_enabled=True) is set, and that you're calling mic_device.write_frames(...) with non-empty bytes.
Agent transcript is garbled. Wrong sample rate. The API expects 24 kHz for audio/pcm, not 16 kHz. Confirm your resample step is running.
Agent keeps interrupting itself. Two causes: (1) The bot is subscribing to its own virtual mic — filter self.client.local_participant().id. (2) Real humans without echo cancellation — plug in headphones.
UNAUTHORIZED close on the Voice Agent API socket. Bad or missing AssemblyAI key — check .env.
Daily join fails with 403. Room URL is wrong, the room expired, or it requires a meeting token.
Full troubleshooting guide: Voice Agent API docs.
Frequently asked questions
What is AssemblyAI's Voice Agent API?
A single WebSocket endpoint that handles the entire voice-agent pipeline server-side: speech recognition, LLM reasoning, and TTS. You send PCM audio in and get PCM audio back, with neural turn detection, barge-in, and tool calling built in.
How does the Daily.co Python SDK send audio to a voice agent?
The daily-python SDK exposes per-participant audio renderers via set_participant_audio_renderer. Daily decodes the remote WebRTC track to mono PCM16 and invokes your callback. You forward those bytes as input.audio events to the Voice Agent API.
What sample rate does it use?
The Voice Agent API defaults to 24 kHz PCM16 in both directions. Daily's renderer delivers 16 kHz. The bot resamples 16 kHz ↔ 24 kHz at the bridge.
How do I publish reply audio back into a Daily room?
Register a virtual mic with daily.Daily.create_microphone_device(...), enable it on join, and call mic_device.write_frames(pcm_bytes) with each reply chunk. Daily publishes it like a normal participant's mic.
Can the bot handle multiple humans?
Yes. update_subscriptions(participant_settings={"*": ...}) subscribes to everyone. The audio renderer fires per-participant so you know who's speaking.
Does this work with phone callers?
Yes. Daily supports SIP/PSTN dial-in. Daily transcodes the carrier audio for you — no extra code on your side.
How much does it cost?
AssemblyAI offers a free tier. Daily offers 10,000 free participant minutes/month. See the AssemblyAI pricing page.


