Build a Daily.co voice agent with AssemblyAI's Voice Agent API
A server-side Daily.co bot, written in Python, that joins a WebRTC room as a participant, listens to whoever's talking, and replies with a real voice — powered end-to-end by AssemblyAI's Voice Agent API. One Daily room. One AssemblyAI WebSocket. No separate STT, LLM, or TTS providers to wire up.



Daily.co handles the WebRTC plumbing — rooms, participants, audio tracks, NAT traversal — across web, mobile, and SIP. AssemblyAI's Voice Agent API handles the AI: speech recognition, the LLM that decides what to say, and the voice that speaks it back, all over a single connection. This tutorial bridges the two with the daily-python SDK.
Why combine Daily.co with the Voice Agent API
Most voice-agent stacks pick a transport and then bolt on a pipeline of AI services behind it. With Daily.co + the Voice Agent API, both halves collapse to a single managed dependency each.
Architecture
The system has three layers: the Daily room (the WebRTC transport), the Python bot (the bridge), and AssemblyAI's Voice Agent API (the AI). The bot resamples between Daily's 16 kHz and the Voice Agent API's 24 kHz. Both sides use PCM16 mono.
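The resampler itself isn't shown in this post; a minimal pure-Python sketch of a resample_pcm16 helper (the name matches the call used later in bot.py) could look like this — linear interpolation only, so production code would usually prefer a filtered polyphase resampler to avoid aliasing:

```python
import struct

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono PCM16 (little-endian) by linear interpolation."""
    samples = struct.unpack("<%dh" % (len(data) // 2), data)
    if src_rate == dst_rate or not samples:
        return data
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return struct.pack("<%dh" % n_out, *out)
```

Upsampling 16 kHz → 24 kHz produces 3 output samples for every 2 input samples; the 24 kHz → 16 kHz direction simply runs the same function with the rates swapped.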
Prerequisites
- Python 3.10+
- A Daily.co account with an API key — free tier includes 10,000 participant minutes/month
- An AssemblyAI API key — free tier available, no credit card
- A Daily.co room URL — create in the dashboard or with the script shown below
Quick start
1. Clone and install
git clone https://github.com/kelsey-aai/voice-agent-daily
cd voice-agent-daily
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
2. Configure your keys
cp .env.example .env
Fill in .env:
ASSEMBLYAI_API_KEY=... # https://www.assemblyai.com/dashboard/signup
DAILY_API_KEY=... # https://dashboard.daily.co/developers
DAILY_ROOM_URL=https://yourname.daily.co/voice-agent
3. Create a Daily room (optional)
If you don't have a room URL, create one via the Daily REST API:
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()

HEADERS = {"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"}
r = requests.post(
    "https://api.daily.co/v1/rooms",
    headers=HEADERS,
    json={"properties": {
        "enable_prejoin_ui": False,
        # exp is a Unix timestamp, not a duration: expire one hour from now
        "exp": int(time.time()) + 3600,
    }},
)
r.raise_for_status()
print("Room URL:", r.json()["url"])
4. Run the bot
python bot.py
Open the same room URL in a browser. Speak — you'll see your transcript and the agent's reply stream to the terminal. The agent's voice plays back through the room.
How it works
The whole bot is in bot.py, in four pieces.
1. Initialize Daily and join the room
import daily

daily.Daily.init()
client = daily.CallClient(event_handler=self)
client.join(
    DAILY_ROOM_URL,
    meeting_token=DAILY_TOKEN,
    client_settings=daily.ClientSettings(
        inputs=daily.InputSettings(
            microphone=daily.MicrophoneSettings(
                is_enabled=True, device_id="vaa-mic",
            ),
            camera=daily.CameraSettings(is_enabled=False),
        ),
        publishing=daily.PublishingSettings(
            microphone=daily.MicrophonePublishingSettings(is_enabled=True),
        ),
    ),
)
Three things happen: daily.Daily.init() boots the WebRTC stack. CallClient is the per-room handle. The client_settings say: don't open a real mic or camera, but publish a virtual mic named vaa-mic where we'll write the agent's reply audio.
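The join above references a device named vaa-mic, which has to be registered before the CallClient exists. A sketch, assuming daily-python's create_microphone_device signature (verify against your SDK version):

```python
import daily

daily.Daily.init()
# Register the virtual mic before constructing CallClient; the name must
# match MicrophoneSettings(device_id="vaa-mic") in client_settings.
mic_device = daily.Daily.create_microphone_device(
    "vaa-mic",
    sample_rate=16_000,  # Daily-side rate; agent replies are resampled to it
    channels=1,
)
```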
2. Subscribe to remote audio
client.update_subscriptions(
    participant_settings={
        "*": daily.ParticipantSubscriptionSettings(
            media=daily.MediaSubscription.SUBSCRIBED_ALL,
        )
    }
)
client.set_participant_audio_renderer(
    participant_id,
    callback=self.on_audio_data,
    sample_rate=16_000,
    num_channels=1,
)
update_subscriptions tells Daily to pull all media from every remote participant. set_participant_audio_renderer converts the remote audio track into mono PCM16 at 16 kHz and calls your callback for each chunk.
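One wrinkle: the renderer callback fires on Daily's own thread, not on the asyncio loop that owns the Voice Agent API websocket, so the bridge needs a thread-safe hand-off. A hypothetical pattern (the audio_frames attribute name is an assumption about daily-python's AudioData object; check your SDK version):

```python
import asyncio

class AudioBridge:
    """Thread-safe hand-off from Daily's renderer thread into the asyncio
    loop that owns the Voice Agent API websocket."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.queue: asyncio.Queue = asyncio.Queue()

    def on_audio_data(self, participant_id, audio_data):
        # Invoked on Daily's renderer thread. audio_data.audio_frames is
        # assumed to hold the raw PCM16 bytes; plain bytes pass through
        # unchanged, which keeps this sketch easy to test.
        frames = getattr(audio_data, "audio_frames", audio_data)
        self.loop.call_soon_threadsafe(self.queue.put_nowait, frames)
```

A sender task on the asyncio side then awaits queue.get() and forwards each chunk to the websocket, so no SDK thread ever touches the socket directly.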
3. Bridge to the Voice Agent API
import json

import websockets

async with websockets.connect(
    "wss://agents.assemblyai.com/v1/ws",
    additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_API_KEY}"},
) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi - I'm joining the call.",
            "input": {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
After session.ready, every chunk of room audio gets resampled to 24 kHz, base64-encoded, and shipped as input.audio.
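The framing step itself isn't shown above. Assuming the same event fields this post uses elsewhere (type, data), a sketch:

```python
import base64
import json

def make_input_audio_event(pcm16_24k: bytes) -> str:
    """Wrap one resampled PCM16 chunk as an input.audio message.

    Field names mirror the events used in this post; check the Voice Agent
    API reference for the authoritative schema.
    """
    return json.dumps({
        "type": "input.audio",
        "data": base64.b64encode(pcm16_24k).decode("ascii"),
    })
```

Usage from the sender loop is then just await ws.send(make_input_audio_event(chunk)).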
4. Publish reply audio back into the room
elif t == "reply.audio":
pcm24 = base64.b64decode(event["data"])
pcm_out = resample_pcm16(pcm24, 24_000, 16_000)
await asyncio.to_thread(mic_device.write_frames, pcm_out)
mic_device is the virtual microphone registered with daily.Daily.create_microphone_device(...). Anything you write into it gets published into the Daily room as if a real human's mic produced it.
Interruption (Barge-In) handling
When a participant speaks while the agent is replying, the Voice Agent API emits reply.done with status: "interrupted". The bot doesn't need to flush anything on the Daily side — mic_device.write_frames only plays what you hand it, so stopping writes stops the agent.
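One way to make "stopping writes stops the agent" concrete is to drive mic writes from a cancellable task — a hypothetical pattern, not code from the repo:

```python
import asyncio

class ReplyPlayback:
    """Owns the task that streams reply audio to the virtual mic; an
    interruption cancels it, so any unwritten audio is simply dropped."""

    def __init__(self):
        self._task = None

    def start(self, coro):
        self.stop()  # a new reply supersedes whatever is still playing
        self._task = asyncio.ensure_future(coro)

    def stop(self):
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None
```

On reply.done with status "interrupted", call stop(); the mic goes silent mid-word, which is exactly the barge-in behavior callers expect.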
elif t == "reply.done" and event.get("status") == "interrupted":
    pending_tools.clear()
Tuning
Pick a different voice
"output": {"voice": "james"} # conversational US male
"output": {"voice": "sophie"} # UK female
"output": {"voice": "diego"} # Latin American Spanish
"output": {"voice": "arjun"} # Hindi/Hinglish
Browse the Voices catalog. Multilingual voices code-switch automatically.
Tune the system prompt
"session": {
    "system_prompt": (
        "You are a sales-qualifying agent for Acme Corp on a video call. "
        "Ask one question at a time. Keep answers under two short sentences."
    ),
    "greeting": "Hey - thanks for hopping on. What brings you in today?",
}
Multi-participant calls
update_subscriptions with "*" subscribes the bot to every remote participant. The audio renderer fires per-participant. Today the bot mixes everyone into a single stream — for cleaner handling, either subscribe only to a designated primary participant or tag transcripts with the participant ID at the bridge.
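Tagging at the bridge can be as simple as keeping one buffer per participant — a sketch, not repo code:

```python
from collections import defaultdict

class ParticipantRouter:
    """One audio buffer per remote participant, so upstream transcripts can
    be attributed to a speaker instead of mixing everyone into one stream."""

    def __init__(self):
        self._buffers = defaultdict(bytearray)

    def on_audio(self, participant_id: str, frames: bytes) -> None:
        self._buffers[participant_id].extend(frames)

    def drain(self, participant_id: str) -> bytes:
        """Return and clear everything buffered for one participant."""
        chunk = bytes(self._buffers[participant_id])
        self._buffers[participant_id].clear()
        return chunk
```

The per-participant renderer callback feeds on_audio; the sender side can then decide whether to forward only the primary participant's buffer or interleave streams with speaker labels.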
Telephony in the same room
Daily supports SIP and PSTN dial-in, so phone callers can join the same room. Daily transcodes the carrier's 8 kHz G.711 audio for you — your renderer callback still gets PCM16 at whatever sample rate you asked for. No extra code.
Troubleshooting
Bot connects but never speaks. Check that MicrophonePublishingSettings(is_enabled=True) is set, and that you're calling mic_device.write_frames(...) with non-empty bytes.
Agent transcript is garbled. Wrong sample rate. The API expects 24 kHz for audio/pcm, not 16 kHz. Confirm your resample step is running.
Agent keeps interrupting itself. Two causes: (1) The bot is subscribing to its own virtual mic — filter self.client.local_participant().id. (2) Real humans without echo cancellation — plug in headphones.
UNAUTHORIZED close on the Voice Agent API socket. Bad or missing AssemblyAI key — check .env.
Daily join fails with 403. Room URL is wrong, the room expired, or it requires a meeting token.
Full troubleshooting guide: Voice Agent API docs.
Frequently asked questions
What is AssemblyAI's Voice Agent API?
A single WebSocket endpoint that handles the entire voice-agent pipeline server-side: speech recognition, LLM reasoning, and TTS. You send PCM audio in and get PCM audio back, with neural turn detection, barge-in, and tool calling built in.
How does the Daily.co Python SDK send audio to a voice agent?
The daily-python SDK exposes per-participant audio renderers via set_participant_audio_renderer. Daily decodes the remote WebRTC track to mono PCM16 and invokes your callback. You forward those bytes as input.audio events to the Voice Agent API.
What sample rate does it use?
The Voice Agent API defaults to 24 kHz PCM16 in both directions. Daily's renderer delivers 16 kHz. The bot resamples 16 kHz ↔ 24 kHz at the bridge.
How do I publish reply audio back into a Daily room?
Register a virtual mic with daily.Daily.create_microphone_device(...), enable it on join, and call mic_device.write_frames(pcm_bytes) with each reply chunk. Daily publishes it like a normal participant's mic.
Can the bot handle multiple humans?
Yes. update_subscriptions(participant_settings={"*": ...}) subscribes to everyone. The audio renderer fires per-participant so you know who's speaking.
Does this work with phone callers?
Yes. Daily supports SIP/PSTN dial-in. Daily transcodes the carrier audio for you — no extra code on your side.
How much does it cost?
AssemblyAI offers a free tier. Daily offers 10,000 free participant minutes/month. See the AssemblyAI pricing page.


