Voice Agent API
Connect to the Voice Agent API to run a real-time voice conversation. The client streams PCM16 audio to the server and receives the agent’s spoken response (also PCM16), along with transcripts, tool calls, and lifecycle events.
See the Voice Agent API overview for the full event flow and a runnable quickstart.
Handshake
Headers
Query parameters
Temporary authentication token for client-side connections. Generate one with GET /v1/token on your server and pass it here so you don’t expose your permanent API key in the browser. Each token is one-time use.
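As a rough illustration of passing the temporary token as a query parameter, the sketch below appends it to a connection URL with proper URL-encoding. This reference does not give the endpoint URL or the parameter name, so `base_url` is supplied by the caller and the default `param_name="token"` is an assumption; `build_connect_url` is a hypothetical helper, not part of the API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_connect_url(base_url, token, param_name="token"):
    """Append a temporary token to a WebSocket URL as a query parameter.

    Assumption: the parameter name "token" is illustrative; use the name
    given in the Query parameters section of the real reference.
    """
    parts = urlsplit(base_url)
    # Keep any query parameters already present on the URL.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param_name, token))
    # urlencode percent-encodes characters that are unsafe in a query string.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because each token is one-time use, a client would fetch a fresh token from its own server before every new connection rather than caching the URL.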
Send
Configure the session. Send this immediately on connect, before session.ready, to set the system prompt, greeting, voice, tools, and turn detection behavior. It can also be sent mid-conversation to update any of these fields.
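A minimal sketch of building a session.update message might look like the following. The event type comes from this page, but the payload layout (a `session` object with `system_prompt`, `greeting`, `voice`, and `tools` keys) is an assumption made for illustration; check the event schema before relying on it.

```python
import json


def build_session_update(system_prompt, greeting=None, voice=None, tools=None):
    """Serialize a session.update message.

    Assumption: the "session" wrapper and its field names are hypothetical.
    Only fields that were explicitly provided are included, so a
    mid-conversation update does not overwrite settings it does not mention.
    """
    session = {"system_prompt": system_prompt}
    if greeting is not None:
        session["greeting"] = greeting
    if voice is not None:
        session["voice"] = voice
    if tools is not None:
        session["tools"] = tools
    return json.dumps({"type": "session.update", "session": session})
```

The resulting string would be sent as the first text frame on the socket, before any audio.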
Resume a previous session using the session_id from a prior session.ready. This preserves conversation context across dropped connections. Sessions are held for 30 seconds after each disconnection.
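One way to act on the 30-second hold is to track when the connection dropped and only attempt a resume inside that window. The sketch below assumes a resume message shaped as `{"type": ..., "session_id": ...}`; the event name in `RESUME_EVENT_TYPE` is a placeholder because this page does not state it, and `SessionTracker` is a hypothetical helper.

```python
import json
import time

RESUME_WINDOW_SECONDS = 30  # from the text: sessions are held for 30 seconds
RESUME_EVENT_TYPE = "session.resume"  # placeholder: event name is an assumption


class SessionTracker:
    """Remember the last session_id and decide whether a resume is still possible."""

    def __init__(self):
        self.session_id = None
        self.disconnected_at = None

    def on_session_ready(self, session_id):
        self.session_id = session_id
        self.disconnected_at = None

    def on_disconnect(self, now=None):
        self.disconnected_at = time.monotonic() if now is None else now

    def resume_message(self, now=None):
        """Return a resume message, or None if a fresh session is needed."""
        now = time.monotonic() if now is None else now
        if self.session_id is None or self.disconnected_at is None:
            return None
        # Treat the boundary conservatively: at 30 seconds, assume it expired.
        if now - self.disconnected_at >= RESUME_WINDOW_SECONDS:
            return None
        return json.dumps({"type": RESUME_EVENT_TYPE, "session_id": self.session_id})
```

When `resume_message` returns None, the client would fall back to configuring a new session from scratch.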
Stream a chunk of user audio to the agent. Only send input.audio after session.ready. See Audio format for the expected encoding (PCM16, mono, 24 kHz, base64).
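To make the encoding concrete, here is a sketch that converts float samples to PCM16 bytes and wraps them in an input.audio message. The `audio` field name and little-endian byte order are assumptions not confirmed by this page; the sample rate (24 kHz mono) is a property of the capture pipeline and is not enforced by this code.

```python
import base64
import json
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM bytes.

    Assumption: little-endian byte order. Out-of-range values are clipped.
    """
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def build_input_audio(pcm16_bytes):
    """Wrap raw PCM16 bytes in an input.audio message.

    Assumption: the base64 payload lives in a field named "audio".
    """
    encoded = base64.b64encode(pcm16_bytes).decode("ascii")
    return json.dumps({"type": "input.audio", "audio": encoded})
```

A client would call `build_input_audio` once per captured chunk, and only after session.ready has arrived.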
Return a tool result to the agent. Send this inside your reply.done handler, not immediately on tool.call. See Tool calling.
Receive
Session is established. Save session_id for reconnection and start streaming audio.
Sent after a session.update is applied successfully.
A session- or protocol-level error occurred.
Partial transcript of the user’s utterance, updated in real time.
A chunk of the agent’s spoken response (base64 PCM16). Decode and play immediately. See Audio format for playback guidance.
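Decoding a received audio chunk could look roughly like this. The function takes only the base64 string, since the event and field names for agent audio are not given on this page. Little-endian byte order and a 24 kHz output rate (matching the input format) are assumptions.

```python
import base64
import struct

SAMPLE_RATE_HZ = 24000  # assumption: output uses the same rate as input audio


def decode_pcm16_chunk(b64_audio):
    """Decode a base64 PCM16 chunk into a list of signed 16-bit samples.

    Assumption: little-endian samples. A trailing odd byte is dropped here;
    a streaming player would carry it over to the next chunk instead.
    """
    raw = base64.b64decode(b64_audio)
    usable = len(raw) - (len(raw) % 2)
    return list(struct.unpack("<%dh" % (usable // 2), raw[:usable]))


def chunk_duration_seconds(samples):
    """Playback length of a mono chunk at SAMPLE_RATE_HZ."""
    return len(samples) / SAMPLE_RATE_HZ
```

The decoded samples can then be handed to whatever audio output queue the client uses, so playback starts as soon as each chunk arrives.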
Agent has finished speaking. Send any accumulated tool.result events here.
Agent wants to invoke a registered tool. Execute the tool, then send the result with tool.result after reply.done fires.
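The deferral rule (run the tool on tool.call, send the result only once reply.done fires) can be sketched as a small buffer. The event types are taken from this page; the payload fields `call_id`, `name`, `arguments`, and `result` are assumptions, and `ToolResultBuffer` is a hypothetical helper rather than part of any SDK.

```python
import json


class ToolResultBuffer:
    """Execute tools on tool.call and hold results until reply.done."""

    def __init__(self, tools, send):
        self.tools = tools    # maps tool name -> Python callable
        self.send = send      # callable that writes one text frame
        self.pending = []     # tool.result messages waiting for reply.done

    def handle_event(self, event):
        kind = event.get("type")
        if kind == "tool.call":
            # Run the tool right away, but queue the result instead of sending.
            result = self.tools[event["name"]](**event.get("arguments", {}))
            self.pending.append(
                {"type": "tool.result", "call_id": event["call_id"], "result": result}
            )
        elif kind == "reply.done":
            # Flush everything accumulated during the agent's turn.
            queued, self.pending = self.pending, []
            for message in queued:
                self.send(json.dumps(message))
```

In use, every decoded server event is passed to `handle_event`; events of other types are ignored by this helper and can be handled elsewhere.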