All audio exchanged over the Voice Agent API is base64-encoded, mono. The encoding you choose determines the sample rate and bit depth. Input and output encodings are configured independently.
Supported encodings
Set the encoding for input (microphone) and output (agent speech) via session.update:
{
  "type": "session.update",
  "session": {
    "input": {
      "format": { "encoding": "audio/pcm" }
    },
    "output": {
      "format": { "encoding": "audio/pcmu" }
    }
  }
}
| Encoding | Sample rate | Bit depth | Best for |
|---|---|---|---|
| audio/pcm | 24,000 Hz | 16-bit signed integer (little-endian) | Default, highest quality, ideal for most apps |
| audio/pcmu | 8,000 Hz | 8-bit μ-law | Telephony (G.711 μ-law) |
| audio/pcma | 8,000 Hz | 8-bit A-law | Telephony (G.711 A-law) |
If you omit the format blocks, both input and output default to audio/pcm (24,000 Hz).
Choose the encoding that matches your audio pipeline. For browser and desktop apps, use audio/pcm (24 kHz). For telephony integrations where audio is already in G.711 format, use audio/pcmu or audio/pcma to avoid resampling. Input and output can use different encodings. For example, receive telephony audio at 8 kHz and send high-quality agent speech at 24 kHz.
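As a rough sketch, the session.update message can be built from a pair of encoding names. The helper name build_session_update and the validation step are illustrative assumptions, not part of the API; only the message shape and the three encoding names come from this page.

```python
import json

# Encoding names listed in the table above.
SUPPORTED_ENCODINGS = {"audio/pcm", "audio/pcmu", "audio/pcma"}


def build_session_update(input_encoding="audio/pcm", output_encoding="audio/pcm"):
    """Return a session.update message (JSON string) for the given encodings.

    Hypothetical helper: rejects names that are not in the table so a typo
    fails locally instead of on the server.
    """
    for encoding in (input_encoding, output_encoding):
        if encoding not in SUPPORTED_ENCODINGS:
            raise ValueError(f"unsupported encoding: {encoding!r}")
    return json.dumps({
        "type": "session.update",
        "session": {
            "input": {"format": {"encoding": input_encoding}},
            "output": {"format": {"encoding": output_encoding}},
        },
    })


# Telephony audio in (8 kHz mu-law), high-quality agent speech out (24 kHz PCM).
message = build_session_update("audio/pcmu", "audio/pcm")
```

The resulting string would be sent over the WebSocket like any other client message.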
Sending audio
Stream microphone audio as input.audio events. Each event contains a base64-encoded audio chunk in the configured encoding. Send chunks continuously. The server buffers them, so exact chunk size doesn’t matter. ~50 ms chunks work well.
import base64
import json

# In your mic callback (runs on the audio thread):
def mic_callback(indata, *_):
    if session_ready.is_set():
        # Hand the raw bytes to the asyncio loop without blocking the audio thread.
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

# In your send loop:
async def send_audio():
    while True:
        chunk = await mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode()
        }))
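To translate the ~50 ms guideline into a byte count, multiply the sample rate by the chunk duration and the bytes per sample. The sketch below uses the figures from the encoding table; the names FORMATS, chunk_bytes, and split_chunks are hypothetical helpers, not API identifiers.

```python
# (sample rate in Hz, bytes per sample), taken from the encoding table.
FORMATS = {
    "audio/pcm": (24_000, 2),   # 16-bit samples
    "audio/pcmu": (8_000, 1),   # 8-bit mu-law
    "audio/pcma": (8_000, 1),   # 8-bit A-law
}


def chunk_bytes(encoding, chunk_ms=50):
    """Number of bytes in one mono chunk of chunk_ms milliseconds."""
    sample_rate, bytes_per_sample = FORMATS[encoding]
    return sample_rate * chunk_ms // 1000 * bytes_per_sample


def split_chunks(audio, encoding, chunk_ms=50):
    """Split a bytes object into chunk_ms-sized pieces; the last may be shorter."""
    size = chunk_bytes(encoding, chunk_ms)
    return [audio[i:i + size] for i in range(0, len(audio), size)]
```

With these numbers, a 50 ms chunk is 2,400 bytes for audio/pcm and 400 bytes for either G.711 encoding, before base64 expansion.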
Stream audio at roughly real time, not faster. The server expects roughly one second of audio per second of wall clock. If you push audio significantly faster than that (for example, replaying a recorded file as fast as the network allows), excess frames are dropped rather than buffered, and transcription will be incomplete. If you’re testing with a pre-recorded clip, pace the sends with asyncio.sleep to match the chunk duration.
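One way to pace a pre-recorded clip is sketched below. It assumes the clip is already raw audio in the configured input encoding and that send is a coroutine function such as ws.send; the function name send_paced and its parameters are illustrative. Each chunk's deadline is computed from the start time rather than by sleeping a fixed interval per iteration, so small sleep inaccuracies do not add up over a long clip.

```python
import asyncio
import base64
import json
import time


async def send_paced(send, audio, bytes_per_second=48_000, chunk_ms=50):
    """Send `audio` as input.audio events at roughly real-time speed.

    `send` is any coroutine function taking one text message (e.g. ws.send).
    The default of 48,000 bytes/s corresponds to audio/pcm (24,000 Hz x 2 bytes);
    use 8,000 for audio/pcmu or audio/pcma.
    """
    chunk_size = bytes_per_second * chunk_ms // 1000
    start = time.monotonic()
    for n, offset in enumerate(range(0, len(audio), chunk_size)):
        # Wait until this chunk is due, measured from the start of the clip.
        delay = start + n * chunk_ms / 1000 - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        chunk = audio[offset:offset + chunk_size]
        await send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode(),
        }))
```

A call such as await send_paced(ws.send, clip_bytes) then takes about as long as the clip itself.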
Noise cancellation
Server-side noise cancellation is on by default. You cannot disable it. Send raw mic audio.
Do not stack a second denoising layer on top in your client:
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { noiseSuppression: false, echoCancellation: true },
});
Skip RNNoise, Krisp, BVC, and similar. Client-side denoising tends to introduce artifacts that hurt transcription accuracy more than the original noise did.
Echo cancellation
When the agent’s speech plays through speakers, the microphone can pick it up and send it back to the server, causing the agent to interrupt itself. To prevent this:
- Browser apps: use getUserMedia with echoCancellation enabled. The browser’s built-in acoustic echo cancellation (AEC) removes speaker output from the mic signal automatically:
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: false }
  });
- Terminal / desktop apps: use headphones. Native audio APIs (PortAudio, sounddevice, etc.) don’t include echo cancellation, so the raw mic captures speaker output. Headphones eliminate the feedback path entirely.
Without echo cancellation or headphones, the agent’s own speech loops back through the microphone and triggers barge-in. Every response will be cut short with status: "interrupted". See Troubleshooting for more details.
Playing output audio
The server streams reply.audio events containing small audio chunks in the configured output encoding. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:
import base64
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # 24 kHz for audio/pcm, 8 kHz for audio/pcmu or audio/pcma
# dtype="int16" matches audio/pcm; 8-bit G.711 data must be decoded to 16-bit PCM first.
with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as speaker:
    # In your event loop:
    if event["type"] == "reply.audio":
        pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
        speaker.write(pcm)
speaker.write() copies samples into the OS audio buffer and returns as soon as they are queued, blocking only while the buffer is full. The hardware drains the buffer at the configured rate, producing smooth playback. The buffer also absorbs network jitter: even if a WebSocket message arrives late, there’s still audio playing.
Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps.
# ❌ Don't do this
while True:
    chunk = get_next_chunk()
    play(chunk)
    await asyncio.sleep(0.020)  # drift accumulates → audio artifacts
Handling interruptions
When the user speaks while the agent is responding (barge-in), the server stops generating audio and signals the interruption. Your client should:
- Flush the audio buffer: discard any queued audio so the user doesn’t hear stale speech.
- Restart the output stream: so it’s ready for the next response.
if event["type"] == "reply.done":
    if event.get("status") == "interrupted":
        speaker.abort()  # discard buffered audio
        speaker.start()  # restart stream for next response
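Playback and interruption handling can live in one small dispatcher. The sketch below is an assumption about how a client might be organized, not a prescribed structure: handle_event is a hypothetical name, and speaker is any object exposing write(), abort(), and start(), such as a started sounddevice.RawOutputStream, which accepts raw bytes.

```python
import base64


def handle_event(event, speaker):
    """Route one decoded server event to a playback stream.

    `speaker` needs write(data), abort(), and start(); a started
    sounddevice.RawOutputStream fits this shape.
    """
    kind = event.get("type")
    if kind == "reply.audio":
        # Queue the decoded chunk for playback.
        speaker.write(base64.b64decode(event["data"]))
    elif kind == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # drop whatever is still queued
        speaker.start()  # reopen so the next reply can play
```

Keeping the stream behind a three-method interface also makes the interruption path easy to exercise with a stand-in object in tests.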
The server emits two events on interruption:
Barge-in is semantic. Back-channels like “uh-huh” don’t trigger an interruption, but phrases like “wait, stop” do. See Turn detection and interruptions for how the model decides, and Session configuration for tuning barge-in sensitivity.
The speaker.abort() call above is specific to Python’s sounddevice. Each platform has its own way to flush a playback buffer:
| Platform | Flush approach |
|---|---|
| Python (sounddevice) | speaker.abort() then speaker.start() |
| Web (AudioContext) | Disconnect the source node, create a new AudioBufferSourceNode, and reconnect |
| iOS (AVAudioEngine) | Call playerNode.stop() then playerNode.play() to clear the scheduled buffer |
| Android (AudioTrack) | Call audioTrack.pause(), audioTrack.flush(), then audioTrack.play() |