Audio format - AssemblyAI

All audio exchanged over the Voice Agent API is base64-encoded and mono. The encoding you choose determines the sample rate and bit depth. Input and output encodings are configured independently.

Supported encodings

Set the encoding for input (microphone) and output (agent speech) via session.update:

{
  "type": "session.update",
  "session": {
    "input": {
      "format": { "encoding": "audio/pcm" }
    },
    "output": {
      "format": { "encoding": "audio/pcmu" }
    }
  }
}

Encoding	Sample rate	Bit depth	Best for
`audio/pcm`	24,000 Hz	16-bit signed integer (little-endian)	Default, highest quality, ideal for most apps
`audio/pcmu`	8,000 Hz	8-bit μ-law	Telephony (G.711 μ-law)
`audio/pcma`	8,000 Hz	8-bit A-law	Telephony (G.711 A-law)

If you omit the format blocks, both input and output default to audio/pcm (24,000 Hz).

Choose the encoding that matches your audio pipeline. For browser and desktop apps, use audio/pcm (24 kHz). For telephony integrations where audio is already in G.711 format, use audio/pcmu or audio/pcma to avoid resampling. Input and output can use different encodings. For example, receive telephony audio at 8 kHz and send high-quality agent speech at 24 kHz.
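As a minimal sketch, the session.update message shown earlier could be built from a small helper that rejects encodings missing from the table. The names `ENCODINGS` and `build_session_update` are hypothetical and not part of the API; only the message shape and the three encoding values come from this page.

```python
import json

# (sample rate in Hz, bytes per sample) for each encoding in the table above.
ENCODINGS = {
    "audio/pcm": (24_000, 2),
    "audio/pcmu": (8_000, 1),
    "audio/pcma": (8_000, 1),
}


def build_session_update(input_encoding="audio/pcm", output_encoding="audio/pcm"):
    """Return a session.update message as a JSON string (hypothetical helper)."""
    for encoding in (input_encoding, output_encoding):
        if encoding not in ENCODINGS:
            raise ValueError(f"unsupported encoding: {encoding!r}")
    return json.dumps({
        "type": "session.update",
        "session": {
            "input": {"format": {"encoding": input_encoding}},
            "output": {"format": {"encoding": output_encoding}},
        },
    })


# Telephony audio in (8 kHz mu-law), high-quality agent speech out (24 kHz PCM).
message = build_session_update("audio/pcmu", "audio/pcm")
```

The resulting string can be sent over the WebSocket as-is.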

Sending audio

Stream microphone audio as input.audio events. Each event contains a base64-encoded audio chunk in the configured encoding. Send chunks continuously. The server buffers them, so the exact chunk size doesn’t matter; chunks of roughly 50 ms work well.

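import base64
import json

# session_ready, loop, mic_queue, and ws are created elsewhere in your app.

# In your mic callback (runs on the audio thread):
def mic_callback(indata, *_):
    if session_ready.is_set():
        # Hand the raw bytes over to the asyncio event loop thread.
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

# In your send loop:
async def send_audio():
    while True:
        chunk = await mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode()
        }))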

Only start sending input.audio after you receive session.ready.
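To make the ~50 ms guideline concrete, the sketch below works out how many bytes that is for each encoding and splits a buffer accordingly. The helper names are hypothetical; the sample rates and bit depths are the ones listed under Supported encodings.

```python
# (sample rate in Hz, bytes per sample), taken from the encoding table.
FORMATS = {
    "audio/pcm": (24_000, 2),
    "audio/pcmu": (8_000, 1),
    "audio/pcma": (8_000, 1),
}


def chunk_size_bytes(encoding, chunk_ms=50):
    """Bytes of mono audio covering chunk_ms milliseconds."""
    sample_rate, bytes_per_sample = FORMATS[encoding]
    return sample_rate * bytes_per_sample * chunk_ms // 1000


def iter_chunks(audio, encoding, chunk_ms=50):
    """Yield consecutive chunks of raw audio bytes; the last may be shorter."""
    size = chunk_size_bytes(encoding, chunk_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

With these numbers, a 50 ms chunk is 2,400 bytes of audio/pcm or 400 bytes of audio/pcmu or audio/pcma, before base64 encoding.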

Stream audio at roughly real time, not faster. The server expects roughly one second of audio per second of wall clock. If you push audio significantly faster than that (for example, replaying a recorded file as fast as the network allows), excess frames are dropped rather than buffered, and transcription will be incomplete. If you’re testing with a pre-recorded clip, pace the sends with asyncio.sleep to match the chunk duration.
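One way to pace a pre-recorded clip is sketched below. It assumes you already have the clip split into equal-duration chunks and a coroutine `send` (hypothetical) that wraps one chunk in an input.audio event and writes it to the WebSocket. Sleeping until an absolute deadline, rather than for a fixed interval, keeps small sleep inaccuracies from accumulating.

```python
import asyncio
import time


async def send_paced(chunks, chunk_seconds, send):
    """Send each chunk, then wait until its share of wall-clock time has passed."""
    start = time.monotonic()
    for index, chunk in enumerate(chunks, start=1):
        await send(chunk)
        # Deadline for this chunk is measured from the start of the clip,
        # so a late wake-up on one chunk shortens the next sleep.
        delay = start + index * chunk_seconds - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
```

For 50 ms chunks, `chunk_seconds` would be 0.05, so a 10-second clip takes about 10 seconds to send.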

Noise cancellation

Server-side noise cancellation is on by default. You cannot disable it. Send raw mic audio. Do not stack a second denoising layer on top in your client:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { noiseSuppression: false, echoCancellation: true },
});

Skip RNNoise, Krisp, BVC, and similar. Client-side denoising tends to introduce artifacts that hurt transcription accuracy more than the original noise did.

Echo cancellation

When the agent’s speech plays through speakers, the microphone can pick it up and send it back to the server, causing the agent to interrupt itself. To prevent this:

Browser apps: use getUserMedia with echoCancellation enabled. The browser’s built-in acoustic echo cancellation (AEC) removes speaker output from the mic signal automatically:
```
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: false }
});
```
Terminal / desktop apps: use headphones. Native audio APIs (PortAudio, sounddevice, etc.) don’t include echo cancellation, so the raw mic captures speaker output. Headphones eliminate the feedback path entirely.

Without echo cancellation or headphones, the agent’s own speech loops back through the microphone and triggers barge-in. Every response will be cut short with status: "interrupted". See Troubleshooting for more details.

Playing output audio

The server streams reply.audio events containing small audio chunks in the configured output encoding. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:

import base64

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # 24 kHz for audio/pcm, 8 kHz for audio/pcmu or audio/pcma

with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as speaker:
    # In your event loop:
    if event["type"] == "reply.audio":
        pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
        speaker.write(pcm)

speaker.write() copies samples into the OS audio buffer and returns as soon as they are queued; it blocks only while the buffer is full. The hardware drains the buffer at the configured rate, producing smooth playback. The buffer also absorbs network jitter: even if a WebSocket message arrives late, there’s still audio playing.
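The playback snippet above reads each chunk as 16-bit samples, which matches audio/pcm. If the output encoding is audio/pcmu, each byte is one μ-law sample, so it would presumably need expanding to 16-bit PCM before being written to an int16 stream opened at 8,000 Hz. The following is a minimal sketch of standard G.711 μ-law expansion in pure Python; the function names are hypothetical, and audio/pcma (A-law) uses a different expansion that is not shown here.

```python
import struct


def _ulaw_byte_to_pcm16(value):
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    value = ~value & 0xFF  # mu-law bytes are stored bit-inverted
    magnitude = (((value & 0x0F) << 3) + 0x84) << ((value & 0x70) >> 4)
    magnitude -= 0x84  # remove the bias added by the encoder
    return -magnitude if value & 0x80 else magnitude


# Precompute all 256 possible byte values once.
ULAW_TABLE = [_ulaw_byte_to_pcm16(v) for v in range(256)]


def ulaw_to_pcm16(data):
    """Convert mu-law bytes to 16-bit little-endian PCM bytes."""
    return struct.pack(f"<{len(data)}h", *(ULAW_TABLE[b] for b in data))
```

The returned bytes can be passed to `np.frombuffer(..., dtype=np.int16)` in place of the raw decoded payload.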

Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps.

# ❌ Don't do this
while True:
    chunk = get_next_chunk()
    play(chunk)
    await asyncio.sleep(0.020)  # drift accumulates → audio artifacts

Handling interruptions

When the user speaks while the agent is responding (barge-in), the server stops generating audio and signals the interruption. Your client should:

Flush the audio buffer: discard any queued audio so the user doesn’t hear stale speech.
Restart the output stream: reopen it so it’s ready for the next response.

if event["type"] == "reply.done":
    if event.get("status") == "interrupted":
        speaker.abort()  # discard buffered audio
        speaker.start()  # restart stream for next response

The server emits two events on interruption:

reply.done with status: "interrupted"
transcript.agent with interrupted: true and text trimmed to what was spoken

Barge-in is semantic. Back-channels like “uh-huh” don’t trigger an interruption, but phrases like “wait, stop” do. See Turn detection and interruptions for how the model decides, and Session configuration for tuning barge-in sensitivity.
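Putting the playback and interruption handling together, a single dispatch function might look like the sketch below. It assumes `speaker` is any object with `write`, `abort`, and `start` methods that accepts raw bytes (for example, a sounddevice `RawOutputStream` opened as int16 mono), and that `event` is the already-parsed JSON message. The function name is hypothetical; the event types and fields are the ones described on this page.

```python
import base64


def handle_event(event, speaker):
    """Route one server event to the playback stream.

    Returns the trimmed agent transcript when a reply was interrupted,
    otherwise None.
    """
    kind = event["type"]
    if kind == "reply.audio":
        speaker.write(base64.b64decode(event["data"]))
    elif kind == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # discard buffered audio
        speaker.start()  # restart the stream for the next response
    elif kind == "transcript.agent" and event.get("interrupted"):
        # "text" holds only what was actually spoken before the barge-in.
        return event.get("text")
    return None
```

Keeping the handler free of any WebSocket or audio-device setup makes it easy to exercise with a stand-in speaker object.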

Platform-specific audio flush

The speaker.abort() call above is specific to Python’s sounddevice. Each platform has its own way to flush a playback buffer:

Platform	Flush approach
Python (sounddevice)	`speaker.abort()` then `speaker.start()`
Web (AudioContext)	Disconnect the source node, create a new `AudioBufferSourceNode`, and reconnect
iOS (AVAudioEngine)	Call `playerNode.stop()` then `playerNode.play()` to clear the scheduled buffer
Android (AudioTrack)	Call `audioTrack.pause()`, `audioTrack.flush()`, then `audioTrack.play()`
