Audio format
PCM16 audio in, PCM16 audio out — how to stream and play Voice Agent API audio.
Both the audio you send (microphone → server) and the audio you receive (server → speaker) use the same format: base64-encoded PCM16, mono, 24 kHz. Send ~50 ms chunks (2,400 bytes: 24,000 samples/s × 2 bytes/sample × 0.05 s); the server buffers continuously, so the exact chunk size doesn’t matter.
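As a sketch of the sending side, slicing captured PCM16 audio into ~50 ms base64 payloads might look like the following (the `encode_chunks` helper and constant names are ours, not part of the API):

```python
import base64

SAMPLE_RATE = 24_000      # Hz, mono
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 50             # ~50 ms per message
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 2,400 bytes

def encode_chunks(pcm: bytes):
    """Yield base64 strings, one per ~50 ms slice of raw PCM16 audio."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield base64.b64encode(pcm[i:i + CHUNK_BYTES]).decode("ascii")
```

Each yielded string is one WebSocket message payload; because the server buffers continuously, a short final chunk is fine to send as-is.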
Playing output audio
The server streams reply.audio events containing small PCM16 chunks. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:
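A minimal sketch of that handler, assuming the event carries its base64 payload in an `audio` field (our assumption) and that `speaker` is any write()-able audio sink, such as an opened `sounddevice.RawOutputStream(samplerate=24_000, channels=1, dtype="int16")` — the library choice is ours, not mandated by the API:

```python
import base64
import io
import json

def play_audio_event(message: str, speaker) -> None:
    """Decode one server event; write reply.audio payloads into the sink.

    `speaker` is any object with a write(bytes) method. For real playback,
    an opened sounddevice.RawOutputStream works; the OS then drains the
    buffer at the stream's 24 kHz sample rate.
    """
    event = json.loads(message)
    if event.get("type") == "reply.audio":
        speaker.write(base64.b64decode(event["audio"]))  # "audio" field assumed

# Demo with an in-memory sink standing in for the OS audio buffer:
sink = io.BytesIO()
pcm = b"\x01\x00" * 1_200  # one 50 ms chunk of near-silence
message = json.dumps({"type": "reply.audio",
                      "audio": base64.b64encode(pcm).decode("ascii")})
play_audio_event(message, sink)
```

The key property is that the handler does no timing of its own: it hands bytes to the sink as fast as they arrive and lets the audio hardware set the pace.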
speaker.write() copies samples into the OS audio buffer and returns immediately. The hardware drains the buffer at exactly 24 kHz, producing smooth playback. Network jitter is absorbed by the buffer — even if a WebSocket message arrives late, there’s still audio playing.
Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps in the audio.
Stopping playback on interruption
When the user interrupts the agent, the server stops generating audio and sends reply.done with status: "interrupted". Your output buffer may still have queued audio from before the interruption. Flush it so the user doesn’t hear stale speech:
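One way to sketch the flush, assuming `speaker` exposes `abort()`/`start()` in the style of a `sounddevice.RawOutputStream` (where `abort()` discards buffered samples immediately and `start()` readies the stream for the next reply — adapt to your audio library):

```python
import json

def handle_done_event(message: str, speaker) -> None:
    """On an interrupted reply, drop any audio still queued for playback."""
    event = json.loads(message)
    if event.get("type") == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # discard samples already queued in the OS buffer
        speaker.start()  # restart the stream so the next reply plays cleanly
```

Replies that finish normally (any other `status`) leave the stream untouched, so their tail audio keeps draining from the buffer.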
See Session configuration for tuning barge-in behavior.