Browser integration

Connect a browser to your stored agent in three steps:

1. Your server calls GET /v1/token with your API key to mint a short-lived temporary token.
2. Your browser opens the WebSocket with ?token=<token>, so no API key is exposed.
3. The browser sends one session.update with your agent_id; the agent’s stored prompt, voice, and tools load automatically.

Your API key never leaves your server. Each token is single-use: it starts exactly one session, and all usage is attributed to the key that generated it.

This page connects to a stored agent by agent_id — the recommended path. If you’d rather configure the agent inline per session instead of creating one, send system_prompt / greeting / output in the session.update and omit agent_id. The two are mutually exclusive. See Inline configuration.
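
The mutual exclusivity can be checked on the client before anything is sent. The sketch below is illustrative only: buildSessionUpdate is a hypothetical helper, and it assumes the inline fields sit inside the same session object as agent_id, which this page does not confirm. Check Inline configuration for the actual schema.

```javascript
// Hypothetical helper: builds a session.update payload for either a stored
// agent (agent_id) or an inline configuration, never both.
// Assumption: inline fields live under `session`, the same place as agent_id.
function buildSessionUpdate({ agentId, inline } = {}) {
  const hasAgent = typeof agentId === "string" && agentId.length > 0;
  const hasInline = inline != null && Object.keys(inline).length > 0;

  // Both set or both missing is an error: the two modes are mutually exclusive.
  if (hasAgent === hasInline) {
    throw new Error("Provide either agentId or an inline configuration, not both or neither");
  }

  const session = hasAgent ? { agent_id: agentId } : { ...inline };
  return { type: "session.update", session };
}
```

For example, buildSessionUpdate({ agentId: AGENT_ID }) produces the same payload the browser examples on this page send by hand.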

Browsers provide built-in acoustic echo cancellation through getUserMedia, so browser-based clients work hands-free without headphones. If you’re developing on a laptop, the browser integration is the recommended starting point.

1. Generate a token on your server

Call GET /v1/token with your API key in the Authorization header. Pick an expires_in_seconds short enough to limit replay risk (60–300s is a good default) and an optional max_session_duration_seconds to cap the session length.

These two parameters control different things and are easy to confuse:

expires_in_seconds is the token redemption window: how long the client has to use this token to open a WebSocket. If the window elapses before the WebSocket is opened, the server returns a session.error with code unauthorized on the first frame instead of session.ready. Once a session.ready has been received, this value no longer applies.
max_session_duration_seconds is the session duration cap: how long the resulting voice agent session is allowed to run after the WebSocket is open.

// server/routes/voice-token.js
import express from "express";

const router = express.Router();

router.get("/voice-token", async (_req, res) => {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300");
  url.searchParams.set("max_session_duration_seconds", "8640");

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
  });

  if (!response.ok) {
    return res.status(response.status).send(await response.text());
  }

  const { token } = await response.json();
  res.json({ token });
});

export default router;

expires_in_seconds must be between 1 and 600. max_session_duration_seconds must be between 60 and 10800 (defaults to 10800, the 3-hour maximum session duration).
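
Validating these ranges before calling the API gives a clearer failure than a rejected request. The following is a minimal sketch, not part of any SDK: buildTokenUrl is a hypothetical helper, the default of 300 is taken from the example above, and the integer check is an assumption because this page does not say whether fractional values are accepted.

```javascript
// Hypothetical helper: validates the documented ranges and builds the token URL.
const TOKEN_ENDPOINT = "https://agents.assemblyai.com/v1/token";

function assertIntInRange(name, value, min, max) {
  // Assumption: values are whole seconds.
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new RangeError(`${name} must be an integer between ${min} and ${max}`);
  }
}

function buildTokenUrl({ expiresInSeconds = 300, maxSessionDurationSeconds } = {}) {
  assertIntInRange("expires_in_seconds", expiresInSeconds, 1, 600);

  const url = new URL(TOKEN_ENDPOINT);
  url.searchParams.set("expires_in_seconds", String(expiresInSeconds));

  // Optional: when omitted, the server default (10800) applies.
  if (maxSessionDurationSeconds !== undefined) {
    assertIntInRange("max_session_duration_seconds", maxSessionDurationSeconds, 60, 10800);
    url.searchParams.set("max_session_duration_seconds", String(maxSessionDurationSeconds));
  }
  return url;
}
```

The returned URL can be passed straight to fetch in the route handler above.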

Tokens are single-use — fetch a fresh one immediately before every connection, including reconnects via session.resume. End sessions cleanly with session.end so you don’t pay for the 30-second resume grace window.
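
One way to make token reuse impossible is to route every connection attempt through a single function that fetches a token first. This is a sketch under assumptions: createConnector is a hypothetical helper, and fetchToken and openSocket are injected by the caller, so nothing here depends on a particular backend route or on how session.resume is sent.

```javascript
// Hypothetical helper: each connect() call fetches a new token, then opens
// a socket with it. A token is never cached or reused.
function createConnector({ fetchToken, openSocket }) {
  return async function connect() {
    const token = await fetchToken();
    const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
    wsUrl.searchParams.set("token", token);
    return openSocket(wsUrl.toString());
  };
}

// Browser usage (assumes the /api/voice-token route from this page):
// const connect = createConnector({
//   fetchToken: () => fetch("/api/voice-token").then((r) => r.json()).then((b) => b.token),
//   openSocket: (url) => new WebSocket(url),
// });
// const ws = await connect(); // call again for every reconnect
```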

2. Connect from the browser with the token

Fetch the token from your server, open the WebSocket with ?token=<token> (no Authorization header needed), and bind to your agent by agent_id:

// browser/voice-agent.js
const AGENT_ID = "7ad24396-b822-4dca-871a-be9cc4781cf9"; // from POST /v1/agents

const { token } = await fetch("/api/voice-token").then((r) => r.json());

const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
wsUrl.searchParams.set("token", token);
const ws = new WebSocket(wsUrl);

ws.addEventListener("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: { agent_id: AGENT_ID },
    }),
  );
});

ws.addEventListener("message", (event) => {
  const message = JSON.parse(event.data);
  // Handle session.ready, reply.audio, transcript.*, tool.call, etc.
  console.log(message);
});
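
As the number of message types grows, a small dispatcher keeps the message listener readable. The sketch below is one possible shape, not an SDK feature: createDispatcher is a hypothetical helper, and the only message fields it relies on are type and whatever your own handlers read.

```javascript
// Hypothetical dispatcher: routes each parsed message to the handler
// registered for its type; everything else goes to onUnhandled.
function createDispatcher(handlers, onUnhandled = () => {}) {
  return function dispatch(rawData) {
    let message;
    try {
      message = JSON.parse(rawData);
    } catch {
      onUnhandled({ type: "invalid_json", raw: rawData });
      return;
    }
    // Own-property check so a type like "toString" never hits Object.prototype.
    const known = message !== null && typeof message === "object" &&
      Object.prototype.hasOwnProperty.call(handlers, message.type);
    if (known) {
      handlers[message.type](message);
    } else {
      onUnhandled(message);
    }
  };
}

// Browser usage:
// const dispatch = createDispatcher({
//   "session.ready": () => console.log("ready"),
//   "session.error": (m) => console.error("session error", m),
// }, (m) => console.debug("unhandled", m));
// ws.addEventListener("message", (event) => dispatch(event.data));
```

A session.error handler is a natural place to detect the expired-token case from step 1 and fetch a new token before retrying.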

3. Browser quickstart

The following complete example captures microphone audio, streams it to the Voice Agent API, and plays back the agent’s response.

AudioWorklet processors load from a URL, so the example needs two files: an HTML page and an AudioWorklet processor. Serve them locally with npx serve . (the trailing dot means the current directory).

Create pcm-processor.js in the same directory as your HTML file:

// pcm-processor.js - AudioWorklet that captures PCM16 from the mic
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (input) {
      // Convert Float32 [-1, 1] to Int16
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
      }
      this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    }
    return true;
  }
}

registerProcessor("pcm-processor", PCMProcessor);

Then create your HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Voice Agent</title>
</head>
<body>
  <button id="start">Start conversation</button>
  <pre id="log"></pre>
  <script>
    const log = (msg) => { document.getElementById("log").textContent += msg + "\n"; };

    // Your stored agent's ID, from POST /v1/agents
    const AGENT_ID = "7ad24396-b822-4dca-871a-be9cc4781cf9";

    document.getElementById("start").addEventListener("click", async () => {
      // 1. Get token from your server (see step 1 above)
      const { token } = await fetch("/api/voice-token").then((r) => r.json());

      // 2. Force AudioContext to 24 kHz - avoids manual resampling on both
      //    capture and playback in Chromium and Firefox. Safari ignores this
      //    option (see Browser compatibility below) and runs at the hardware
      //    rate, so production code should resample inside the worklet.
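      const audioCtx = new AudioContext({ sampleRate: 24000 });
      await audioCtx.resume(); // resolves immediately if already running; see the cross-browser checklist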
      await audioCtx.audioWorklet.addModule("pcm-processor.js");

      // 3. Capture mic audio with echo cancellation enabled
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true, sampleRate: 24000 },
      });
      const source = audioCtx.createMediaStreamSource(stream);
      const worklet = new AudioWorkletNode(audioCtx, "pcm-processor");

      // 4. Connect WebSocket
      const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
      wsUrl.searchParams.set("token", token);
      const ws = new WebSocket(wsUrl);

      let ready = false;
      let playbackTime = audioCtx.currentTime;

      // Send mic audio to the server once the session is ready
      worklet.port.onmessage = (e) => {
        if (ready && ws.readyState === WebSocket.OPEN) {
          const b64 = btoa(String.fromCharCode(...new Uint8Array(e.data)));
          ws.send(JSON.stringify({ type: "input.audio", audio: b64 }));
        }
      };
      source.connect(worklet).connect(audioCtx.destination);

      ws.addEventListener("open", () => {
        ws.send(JSON.stringify({
          type: "session.update",
          session: { agent_id: AGENT_ID },
        }));
      });

      ws.addEventListener("message", (event) => {
        const msg = JSON.parse(event.data);

        if (msg.type === "session.ready") {
          ready = true;
          log("Session ready, start speaking");
        } else if (msg.type === "reply.audio") {
          // Decode base64 PCM16 and schedule playback
          const raw = atob(msg.data);
          const pcm16 = new Int16Array(raw.length / 2);
          for (let i = 0; i < pcm16.length; i++) {
            pcm16[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
          }
          const float32 = new Float32Array(pcm16.length);
          for (let i = 0; i < pcm16.length; i++) {
            float32[i] = pcm16[i] / 32768;
          }
          const buffer = audioCtx.createBuffer(1, float32.length, 24000);
          buffer.getChannelData(0).set(float32);
          const src = audioCtx.createBufferSource();
          src.buffer = buffer;
          src.connect(audioCtx.destination);
          const now = audioCtx.currentTime;
          playbackTime = Math.max(playbackTime, now);
          src.start(playbackTime);
          playbackTime += buffer.duration;
        } else if (msg.type === "reply.done" && msg.status === "interrupted") {
          // Reset playback schedule to avoid stale audio
          playbackTime = audioCtx.currentTime;
        } else if (msg.type === "transcript.user") {
          log("You: " + msg.text);
        } else if (msg.type === "transcript.agent") {
          log("Agent: " + msg.text);
        } else if (msg.type === "session.error" || msg.type === "error") {
          log("Error: " + msg.message);
        }
      });

      ws.addEventListener("close", () => log("Connection closed"));
    });
  </script>
</body>
</html>

The key line is new AudioContext({ sampleRate: 24000 }). Most browsers default to the device sample rate (usually 48 kHz), so without this you’d need to manually resample both mic input and playback output. Forcing 24 kHz on the context avoids this entirely. Safari ignores this option and runs at the hardware rate. See Browser compatibility for a Safari-safe pipeline.
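
One practical detail of the quickstart: the worklet posts one block per process() call, and a render quantum is 128 frames by default, so at 24 kHz each WebSocket message carries about 5.3 ms of audio (256 bytes), or roughly 188 messages per second. If that turns out to be too chatty, the blocks can be batched before sending. The sketch below is optional and hypothetical: PcmBatcher is not part of the API, and the 50 ms chunk size in the usage comment is an arbitrary choice, since this page does not state a preferred chunk size.

```javascript
// Hypothetical batcher: accumulates PCM16 blocks and emits fixed-size chunks.
class PcmBatcher {
  constructor(samplesPerChunk, onChunk) {
    this.buffer = new Int16Array(samplesPerChunk);
    this.filled = 0; // number of samples currently buffered
    this.onChunk = onChunk;
  }

  // block: Int16Array of any length
  push(block) {
    let offset = 0;
    while (offset < block.length) {
      const n = Math.min(block.length - offset, this.buffer.length - this.filled);
      this.buffer.set(block.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.buffer.length) {
        this.onChunk(this.buffer.slice()); // copy, so the buffer can be reused
        this.filled = 0;
      }
    }
  }
}

// Browser usage: 1200 samples = 50 ms at 24 kHz.
// const batcher = new PcmBatcher(1200, (chunk) => {
//   const b64 = btoa(String.fromCharCode(...new Uint8Array(chunk.buffer)));
//   ws.send(JSON.stringify({ type: "input.audio", audio: b64 }));
// });
// worklet.port.onmessage = (e) => batcher.push(new Int16Array(e.data));
```

Larger chunks reduce per-message overhead at the cost of added capture latency, so keep the chunk duration small for a conversational agent.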

4. Browser compatibility

The quickstart above works as-is on Chromium-based browsers (Chrome, Edge, Brave, Arc) and Firefox. Safari has a known quirk that produces silently garbled audio if you don’t account for it.

Browser	`AudioContext({ sampleRate })` honored	Recommended pipeline
Chrome / Edge	Yes	Use the quickstart as-is.
Firefox	Yes	Use the quickstart as-is.
Safari (desktop, iOS)	No, runs at hardware rate (typically 48 kHz)	Let `AudioContext` use its default rate and resample to/from 24 kHz inside the worklet (capture) and before playback.
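
Rather than sniffing the user agent, the table can be applied by inspecting the rate the context actually ended up with. The sketch below is illustrative: choosePipeline is a hypothetical helper, and it only reads audioCtx.sampleRate, a standard AudioContext property.

```javascript
const TARGET_SAMPLE_RATE = 24000;

// Hypothetical helper: decides whether resampling is needed by comparing
// the context's real sample rate with the 24 kHz the API expects.
function choosePipeline(audioCtx) {
  const actual = audioCtx.sampleRate;
  return {
    resample: actual !== TARGET_SAMPLE_RATE,
    inputSampleRate: actual,
    targetSampleRate: TARGET_SAMPLE_RATE,
  };
}

// Browser usage:
// const audioCtx = new AudioContext({ sampleRate: TARGET_SAMPLE_RATE });
// const pipeline = choosePipeline(audioCtx);
// if (pipeline.resample) { /* pass the rates to the worklet as processorOptions */ }
```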

Safari: resample inside the worklet

Safari ignores the sampleRate constructor option, so an AudioContext({ sampleRate: 24000 }) will silently run at 48 kHz on most Macs. Sending those samples to the Voice Agent API as if they were 24 kHz makes your speech arrive at half speed and an octave too low, which the agent hears as garbled audio. Detect the actual context rate at runtime, send it into the worklet, and resample there:

// browser/voice-agent.js (Safari-safe context)
const audioCtx = new AudioContext(); // let Safari pick its hardware rate
await audioCtx.audioWorklet.addModule("pcm-processor.js");
const worklet = new AudioWorkletNode(audioCtx, "pcm-processor", {
  processorOptions: { inputSampleRate: audioCtx.sampleRate, targetSampleRate: 24000 },
});

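// pcm-processor.js: linear-interpolation resample to 24 kHz before posting PCM16.
// The read position is carried across blocks so non-integer ratios
// (for example 44.1 kHz to 24 kHz) do not drift. No anti-aliasing filter is applied.
class PCMProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super();
    const { inputSampleRate, targetSampleRate } = options.processorOptions;
    this.ratio = inputSampleRate / targetSampleRate; // input samples per output sample
    this.pos = 0;  // fractional read position relative to the current block
    this.prev = 0; // last sample of the previous block (position -1)
  }
  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input?.length) return true;
    const out = [];
    while (this.pos < input.length - 1) {
      const i = Math.floor(this.pos);
      const frac = this.pos - i;
      const a = i < 0 ? this.prev : input[i];
      // Interpolate between the two neighbouring input samples
      const sample = a + (input[i + 1] - a) * frac;
      out.push(Math.max(-32768, Math.min(32767, Math.round(sample * 32767))));
      this.pos += this.ratio;
    }
    // Re-base the position onto the next block and remember the boundary sample
    this.pos -= input.length;
    this.prev = input[input.length - 1];
    if (out.length) {
      const pcm16 = Int16Array.from(out);
      this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    }
    return true;
  }
}
registerProcessor("pcm-processor", PCMProcessor);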

On the capture side, linear interpolation is good enough for speech. For playback, createBuffer(1, length, 24000) works on all current browsers because the context resamples the buffer to its own rate on output.
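
To sanity-check resampling logic outside the browser, the same interpolation can be written as a pure function over a whole buffer. This is a sketch for testing and illustration only: resampleLinear is a hypothetical helper, not the worklet itself, and like the worklet it applies no low-pass filter before downsampling.

```javascript
// Hypothetical offline resampler: linear interpolation over a whole Float32Array.
function resampleLinear(input, inputRate, targetRate) {
  if (input.length === 0) return new Float32Array(0);
  if (inputRate === targetRate) return Float32Array.from(input);

  const ratio = inputRate / targetRate; // input samples per output sample
  const outLength = Math.floor((input.length - 1) / ratio) + 1;
  const out = new Float32Array(outLength);

  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                          // fractional read position
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);  // clamp at the last sample
    const frac = pos - i0;
    out[i] = input[i0] + (input[i1] - input[i0]) * frac;
  }
  return out;
}

// 48 kHz to 24 kHz keeps every second sample:
// resampleLinear(Float32Array.from([0, 1, 2, 3]), 48000, 24000) -> [0, 2]
// 24 kHz to 48 kHz inserts midpoints:
// resampleLinear(Float32Array.from([0, 2]), 24000, 48000) -> [0, 1, 2]
```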

Cross-browser checklist

User gesture required. All major browsers gate getUserMedia and AudioContext startup behind a user gesture (Safari is strictest). Start audio inside a click or touchstart handler and call await audioCtx.resume() before connecting nodes.
HTTPS or localhost. getUserMedia only works on secure origins.
Echo cancellation. Pass echoCancellation: true to getUserMedia so the agent’s TTS playing through the speakers doesn’t get re-captured by the mic.
Audio output sink. On iOS Safari, set the playsinline attribute on your <audio> element or route playback through an AudioContext destination. Autoplay and full-screen behavior differ from desktop.
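
The first two checklist items can be verified in code before any audio work starts. The helpers below are a minimal sketch with hypothetical names (canCaptureAudio, ensureRunning); they rely only on standard browser properties: isSecureContext, navigator.mediaDevices.getUserMedia, AudioContext.state, and AudioContext.resume().

```javascript
// Hypothetical helper: true when the page may call getUserMedia.
// `env` is normally `window`; it is a parameter so the check can be tested.
function canCaptureAudio(env) {
  return Boolean(
    env.isSecureContext &&
    env.navigator &&
    env.navigator.mediaDevices &&
    typeof env.navigator.mediaDevices.getUserMedia === "function",
  );
}

// Hypothetical helper: resumes a suspended context and returns its state.
// Call it from inside a click or touchstart handler.
async function ensureRunning(audioCtx) {
  if (audioCtx.state === "suspended") {
    await audioCtx.resume();
  }
  return audioCtx.state;
}

// Browser usage:
// button.addEventListener("click", async () => {
//   if (!canCaptureAudio(window)) return alert("Microphone needs HTTPS or localhost");
//   const audioCtx = new AudioContext();
//   await ensureRunning(audioCtx);
// });
```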

5. Ending the session cleanly

Do not just close the WebSocket. Send session.end first, then close. A bare ws.close() (or the browser tearing the socket down on navigation) leaves the session in the 30-second session.resume grace window, and that window is billable.

In the browser, wire session.end to both your explicit end-call control and the pagehide event so tab close and navigation are covered. See Ending the session cleanly on the Voice Agent API overview for the full pattern and code sample.
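
As a rough illustration of that wiring, the sketch below sends session.end and then closes the socket. It is not the code sample from the overview page: endSession is a hypothetical helper, and the payload { type: "session.end" } with no further fields is an assumption based on the shape of the other messages on this page.

```javascript
const WS_OPEN = 1; // same value as WebSocket.OPEN

// Hypothetical helper: ends the session explicitly, then closes the socket.
// Assumption: session.end needs no fields besides `type`.
function endSession(ws) {
  if (ws.readyState === WS_OPEN) {
    ws.send(JSON.stringify({ type: "session.end" }));
  }
  ws.close();
}

// Browser usage:
// endCallButton.addEventListener("click", () => endSession(ws));
// window.addEventListener("pagehide", () => endSession(ws));
```

Delivery of a message sent during pagehide is best-effort, so treat the explicit end-call control as the primary path.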
