input.format, output.format, and output.volume, set when you create or update it, or inline over the WebSocket via session.update.
This page covers how to configure the encoding and volume. For how to actually stream and play the audio bytes, see Stream audio.
Encoding
The encoding determines the sample rate and bit depth. Input and output encodings are independent and can differ. Both default to audio/pcm (24 kHz) if omitted.
| Encoding | Sample rate | Best for |
|---|---|---|
| audio/pcm | 24,000 Hz | Default. Highest quality, ideal for browser and app use. |
| audio/pcmu | 8,000 Hz | Telephony (G.711 μ-law). |
| audio/pcma | 8,000 Hz | Telephony (G.711 A-law). |
For telephony, use audio/pcmu or audio/pcma (8 kHz) to match the phone network and avoid resampling. See Connect to Twilio for a full phone integration.
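As a rough illustration of how the choice of encoding affects buffer sizing, the sketch below looks up the sample rate from the table above and counts the samples in a clip of a given length. The names SAMPLE_RATES_HZ, samples_in, and pick_encoding are hypothetical helpers for this example, not part of the API, and picking μ-law for phone calls is only one of the two telephony options.

```python
# Sample rates copied from the encoding table above.
SAMPLE_RATES_HZ = {
    "audio/pcm": 24_000,
    "audio/pcmu": 8_000,
    "audio/pcma": 8_000,
}


def samples_in(encoding: str, duration_ms: int) -> int:
    """Return the number of samples in duration_ms of audio for an encoding."""
    if encoding not in SAMPLE_RATES_HZ:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    return SAMPLE_RATES_HZ[encoding] * duration_ms // 1000


def pick_encoding(is_phone_call: bool) -> str:
    """Pick an 8 kHz encoding for phone calls, else the 24 kHz default.

    Assumption: mu-law is chosen here for telephony; A-law (audio/pcma)
    is equally valid if that is what your carrier uses.
    """
    return "audio/pcmu" if is_phone_call else "audio/pcm"


if __name__ == "__main__":
    for phone in (False, True):
        enc = pick_encoding(phone)
        print(enc, samples_in(enc, 20), "samples per 20 ms")
```

A 20 ms frame holds three times as many samples at 24 kHz as at 8 kHz, which is why matching the phone network's rate avoids a resampling step.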
Set format.encoding under input and output. You can also pass an explicit sample_rate inside format:
| Field | Type | Required | Notes |
|---|---|---|---|
| input.format.encoding | string | No | audio/pcm, audio/pcmu, or audio/pcma. Default audio/pcm. |
| output.format.encoding | string | No | Same values as input. Default audio/pcm. |
| format.sample_rate | integer | No | Sample rate in Hz. Determined by the encoding if omitted. |
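The fields above can be assembled into a configuration object along these lines. This is a minimal sketch that assumes the dotted field paths (input.format.encoding and so on) map to nested JSON objects; make_format and make_audio_config are hypothetical helper names, and the exact request envelope that carries this object is not shown here.

```python
import json
from typing import Optional

# Encoding values listed in the table above.
VALID_ENCODINGS = ("audio/pcm", "audio/pcmu", "audio/pcma")


def make_format(encoding: str = "audio/pcm",
                sample_rate: Optional[int] = None) -> dict:
    """Build one format object; sample_rate is left out unless given."""
    if encoding not in VALID_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    fmt = {"encoding": encoding}
    if sample_rate is not None:
        fmt["sample_rate"] = sample_rate
    return fmt


def make_audio_config(input_format: dict, output_format: dict) -> dict:
    """Nest the two format objects under input and output (assumed layout)."""
    return {
        "input": {"format": input_format},
        "output": {"format": output_format},
    }


if __name__ == "__main__":
    # Input and output encodings are independent, so they may differ.
    config = make_audio_config(
        make_format("audio/pcmu"),
        make_format("audio/pcm", sample_rate=24_000),
    )
    print(json.dumps(config, indent=2))
```

Omitting sample_rate keeps the payload minimal and lets the encoding determine the rate, as the table notes.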
Volume
Adjust the playback volume of the agent’s speech via output.volume. Accepts a number from 0 (silent) to 100 (loudest). If omitted, the voice plays at its native level.
| Field | Type | Required | Notes |
|---|---|---|---|
| output.volume | number \| null | No | 0 (silent) to 100 (loudest). null plays at native level. |
When configured inline via session.update, output.voice and output.format are immutable after session.ready and must be set on your first update. output.volume is the exception: it can be changed mid-session, and the new value applies to subsequent reply.audio chunks.
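A mid-session volume change might be built as follows. This is a sketch under stated assumptions: the message envelope (a top-level "type" key set to session.update, with output.volume nested beneath "output") is a guess at the wire format rather than something confirmed here, and volume_update is a hypothetical helper. Only the 0 to 100 range and the null behaviour come from the table above.

```python
import json
from typing import Optional, Union

Number = Union[int, float]


def volume_update(volume: Optional[Number]) -> str:
    """Serialize a volume-only update; None requests the native level.

    Only output.volume is included, since output.voice and output.format
    cannot be changed after session.ready.
    """
    if volume is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(volume, bool) or not isinstance(volume, (int, float)):
            raise TypeError("volume must be a number or None")
        if not 0 <= volume <= 100:
            raise ValueError("volume must be between 0 and 100")
    # Assumed envelope: the "type" key and nesting are illustrative only.
    message = {"type": "session.update", "output": {"volume": volume}}
    return json.dumps(message)


if __name__ == "__main__":
    print(volume_update(40))    # quieter playback for later audio chunks
    print(volume_update(None))  # back to the voice's native level
```

Validating the range on the client side surfaces mistakes before the message is sent over the WebSocket.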