Voice Agent API

Audio format

PCM16 audio in, PCM16 audio out — how to stream and play Voice Agent API audio.

Both the audio you send (microphone → server) and the audio you receive (server → speaker) use the same format: base64-encoded PCM16, mono, 24 kHz. Send chunks of about 50 ms (1,200 samples, or 2,400 bytes of raw PCM before base64 encoding). The server buffers continuously, so the exact chunk size doesn’t matter.

Property      Value
Encoding      PCM16 (16-bit signed integer, little-endian)
Sample rate   24,000 Hz
Channels      Mono
Transport     Base64-encoded (not raw binary)
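
As a rough illustration of the chunk-size arithmetic, the sketch below splits raw PCM16 bytes into ~50 ms pieces and base64-encodes each one. The helper names (chunk_size_bytes, iter_base64_chunks) are hypothetical and not part of the API. The event that carries each encoded chunk to the server is not covered on this page, so it is deliberately left out.

```python
import base64

SAMPLE_RATE_HZ = 24_000   # from the format table
BYTES_PER_SAMPLE = 2      # PCM16 = 16 bits per sample
CHANNELS = 1              # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Raw PCM byte count for a chunk of the given duration."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def iter_base64_chunks(pcm: bytes, duration_ms: int = 50):
    """Yield base64 strings, each holding up to duration_ms of audio."""
    size = chunk_size_bytes(duration_ms)
    for start in range(0, len(pcm), size):
        # The final chunk may be shorter; the server buffers, so that is fine.
        yield base64.b64encode(pcm[start:start + size]).decode("ascii")


# 5,000 bytes of silence split into 2,400 + 2,400 + 200 bytes.
encoded_chunks = list(iter_base64_chunks(bytes(5000)))
```

Base64 expands data by a factor of 4/3, so a 2,400-byte chunk becomes a 3,200-character string on the wire.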

Playing output audio

The server streams reply.audio events containing small PCM16 chunks. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:

import base64

import numpy as np
import sounddevice as sd

# ✅ Buffer-based playback
with sd.OutputStream(samplerate=24000, channels=1, dtype="int16") as speaker:
    # In your event loop:
    if event["type"] == "reply.audio":
        # Decode base64 → raw bytes → int16 samples, then queue them for playback
        pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
        speaker.write(pcm)

speaker.write() copies samples into the OS audio buffer and returns as soon as they are queued; it blocks only while the buffer is full. The hardware drains the buffer at exactly 24 kHz, producing smooth playback. The buffer also absorbs network jitter: even if a WebSocket message arrives late, previously queued audio keeps playing.

Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps in the audio.

# ❌ Don't do this
while True:
    chunk = get_next_chunk()
    play(chunk)
    await asyncio.sleep(0.020)  # drift accumulates → audio artifacts
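
To get a feel for how quickly the drift adds up, here is a small back-of-the-envelope sketch. It assumes every sleep overshoots by a fixed amount; the 1 ms default is an illustrative guess, not a measured value, and real overshoot varies by OS and load. The function name playback_lag_ms is hypothetical.

```python
def playback_lag_ms(audio_seconds: float, chunk_ms: float = 20.0,
                    overshoot_ms: float = 1.0) -> float:
    """Lag accumulated by a sleep-paced loop after audio_seconds of audio.

    Each iteration is meant to take chunk_ms but actually takes
    chunk_ms + overshoot_ms, so the lag grows by overshoot_ms per chunk.
    """
    chunks_played = audio_seconds * 1000.0 / chunk_ms
    return chunks_played * overshoot_ms


# One minute of audio in 20 ms chunks, each sleep running 1 ms long.
lag_after_one_minute = playback_lag_ms(60)
```

Under these assumptions the loop falls 3 seconds behind after one minute of audio. The hardware keeps consuming samples at a fixed rate in the meantime, so the output runs dry between chunks, which is heard as pops and gaps.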

Stopping playback on interruption

When the user interrupts the agent, the server stops generating audio and sends reply.done with status: "interrupted". Your output buffer may still have queued audio from before the interruption. Flush it so the user doesn’t hear stale speech:

if event["type"] == "reply.done" and event.get("status") == "interrupted":
    speaker.abort()  # discard buffered audio immediately
    speaker.start()  # restart the stream for the next response
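
Playback and interruption handling can live in a single dispatch function. The sketch below is one possible shape, not a prescribed implementation: handle_event and RecordingSpeaker are hypothetical names, and RecordingSpeaker is only a stand-in that records calls so the logic can be exercised without audio hardware. In a real client you would pass any stream object that exposes write(), abort(), and start() and accepts raw bytes, for example sounddevice.RawOutputStream.

```python
import base64


class RecordingSpeaker:
    """Stand-in for an output stream; records calls instead of playing audio."""

    def __init__(self):
        self.queued = []  # PCM byte chunks waiting to be "played"
        self.calls = []   # names of methods called, in order

    def write(self, pcm):
        self.queued.append(pcm)
        self.calls.append("write")

    def abort(self):
        self.queued.clear()  # drop everything still buffered
        self.calls.append("abort")

    def start(self):
        self.calls.append("start")


def handle_event(event: dict, speaker) -> None:
    """Queue reply audio, or flush the buffer when the reply was interrupted."""
    if event["type"] == "reply.audio":
        speaker.write(base64.b64decode(event["data"]))
    elif event["type"] == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # discard stale speech
        speaker.start()  # ready for the next response


speaker = RecordingSpeaker()
silence = base64.b64encode(b"\x00\x00" * 480).decode("ascii")  # 20 ms of silence
handle_event({"type": "reply.audio", "data": silence}, speaker)
handle_event({"type": "reply.done", "status": "interrupted"}, speaker)
```

After the two events, the stand-in has an empty queue and has seen write, abort, and start in that order.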

See Session configuration for tuning barge-in behavior.