Browser integration

Connect browser-based apps to the Voice Agent API using a temporary token.

Connect a browser to the Voice Agent API in two steps:

  1. Your server calls GET /v1/token with your API key to mint a short-lived temporary token.
  2. Your browser opens the WebSocket with ?token=<token> — no API key exposed.

Your API key never leaves your server. Each token is single-use — it starts exactly one session, and all usage is attributed to the key that generated it.

Browsers provide built-in acoustic echo cancellation through getUserMedia, so browser-based clients work hands-free without headphones. If you’re developing on a laptop, the browser integration is the recommended starting point.

1. Generate a token on your server

Call GET /v1/token with your API key in the Authorization header. Pick an expires_in_seconds short enough to limit replay risk (60–300s is a good default) and an optional max_session_duration_seconds to cap the session length.

GET /v1/token

curl -G https://agents.assemblyai.com/v1/token \
  -H "Authorization: <apiKey>" \
  -d expires_in_seconds=300

// server/routes/voice-token.js
// Mount this router under /api (for example, app.use("/api", router)) so the
// browser code in step 2 can fetch /api/voice-token.
import express from "express";

const router = express.Router();

router.get("/voice-token", async (_req, res) => {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300");
  url.searchParams.set("max_session_duration_seconds", "8640");

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
  });

  // Pass upstream failures (bad key, invalid parameters) through to the caller
  if (!response.ok) {
    return res.status(response.status).send(await response.text());
  }

  const { token } = await response.json();
  res.json({ token });
});

export default router;

expires_in_seconds must be between 1 and 600. max_session_duration_seconds must be between 60 and 10800 (defaults to 10800, the 3-hour maximum session duration).
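
The limits above can be checked on your server before the token request is sent, so an out-of-range value fails fast instead of producing an upstream error. The following is a minimal sketch: the helper names `buildTokenUrl` and `assertIntegerInRange` and the option names are illustrative, and treating both values as integers is an assumption rather than documented behavior.

```javascript
// Endpoint and limits are taken from the text above; everything else is illustrative.
const TOKEN_ENDPOINT = "https://agents.assemblyai.com/v1/token";

function assertIntegerInRange(name, value, min, max) {
  // Assumption: the API expects whole seconds.
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new RangeError(`${name} must be an integer between ${min} and ${max}, got ${value}`);
  }
}

function buildTokenUrl({ expiresInSeconds, maxSessionDurationSeconds } = {}) {
  assertIntegerInRange("expires_in_seconds", expiresInSeconds, 1, 600);
  const url = new URL(TOKEN_ENDPOINT);
  url.searchParams.set("expires_in_seconds", String(expiresInSeconds));

  // Optional: when omitted, the documented default (10800) applies.
  if (maxSessionDurationSeconds !== undefined) {
    assertIntegerInRange("max_session_duration_seconds", maxSessionDurationSeconds, 60, 10800);
    url.searchParams.set("max_session_duration_seconds", String(maxSessionDurationSeconds));
  }
  return url;
}
```

The returned `URL` can be passed straight to `fetch` in the route handler in place of the hand-built one.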

Token expiry and failure modes

If a token is missing, expired, or invalid, the server rejects the handshake with an UNAUTHORIZED error (close code 1008). In browsers, this may surface as a close event with code 1006 and no body — you won’t receive a session.error event. Always fetch a fresh token immediately before each connection attempt.

If the WebSocket drops mid-session and you need to reconnect with session.resume, you’ll need a new token for the new WebSocket — the original token can’t be reused.
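
Because a rejected token may never produce a session.error event in the browser, the close code is often the only signal you get. The sketch below maps close codes to a human-readable hint for logs or UI. `describeClose` is a hypothetical helper, not part of the API; codes 1008 and 1006 come from the text above, and 1000 is the standard WebSocket normal-closure code.

```javascript
// Hypothetical helper: turn a WebSocket close code into a hint.
// sessionWasReady should be true once a session.ready message has been received.
function describeClose(code, sessionWasReady) {
  if (code === 1000) {
    return "closed normally";
  }
  if (code === 1008) {
    return "token missing, expired, or invalid";
  }
  if (code === 1006 && !sessionWasReady) {
    // Browsers can report a rejected handshake this way, with no body.
    return "closed before session.ready: possibly a rejected token or a network failure";
  }
  return "connection dropped";
}

// Usage in the browser, assuming `ws` is the WebSocket and `ready` tracks session.ready:
// ws.addEventListener("close", (event) => console.warn(describeClose(event.code, ready)));
```

Whatever the hint says, the next connection attempt still needs a freshly fetched token.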

2. Connect from the browser with the token

Fetch the token from your server, then open the WebSocket with ?token=<token>. No Authorization header is needed.

// browser/voice-agent.js
const { token } = await fetch("/api/voice-token").then((r) => r.json());

const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
wsUrl.searchParams.set("token", token);
const ws = new WebSocket(wsUrl);

ws.addEventListener("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        system_prompt: "You are a helpful voice assistant.",
        greeting: "Hi there! How can I help you today?",
        output: { voice: "ivy" },
      },
    }),
  );
});

ws.addEventListener("message", (event) => {
  const message = JSON.parse(event.data);
  // Handle session.ready, reply.audio, transcript.*, tool.call, etc.
  console.log(message);
});

Fetch a fresh token for every new WebSocket connection. Tokens are single-use — a dropped connection needs a new token to reconnect (including when using session.resume).
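
One way to make the fresh-token rule hard to violate is to put the token fetch inside the retry loop itself. The sketch below assumes two functions that you supply: `fetchToken()`, which resolves to a new token from your server, and `createSocket(token)`, which resolves to a connected WebSocket or rejects on failure. Both names, the attempt count, and the backoff values are illustrative.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tries to connect up to maxAttempts times, fetching a new token for every attempt.
async function connectWithFreshToken({
  fetchToken,
  createSocket,
  maxAttempts = 3,
  baseDelayMs = 500,
  sleep = defaultSleep,
}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Tokens are single-use, so never carry one over from an earlier attempt.
      const token = await fetchToken();
      return await createSocket(token);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: 500 ms, 1 s, 2 s, ... with the default base delay.
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

In the browser, `fetchToken` could wrap the `fetch("/api/voice-token")` call shown above, and `createSocket` could wrap `new WebSocket(wsUrl)` in a promise that settles on the first open or close event.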

3. Browser quickstart

The following is a complete working example that captures microphone audio, streams it to the Voice Agent API, and plays back the agent’s response. It requires two files: an HTML page and an AudioWorklet processor.

AudioWorklet processors must be loaded from a URL (audioContext.audioWorklet.addModule(url)), so you need at least two files. This example won’t work in a single-file environment like CodePen or JSFiddle without modifications. Use a local server (npx serve .) or a framework with static file support.
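
If you do need a single-file setup, one possible modification is to keep the processor source in a string and load it through a Blob URL. This is a sketch of that general browser technique, not something this API requires; `createWorkletUrl` is a hypothetical helper, and a strict Content Security Policy may block blob: URLs.

```javascript
// Processor source kept inline as a string instead of in a separate file.
const processorSource = `
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (input) {
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
      }
      this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    }
    return true;
  }
}
registerProcessor("pcm-processor", PCMProcessor);
`;

// Wraps the source in a Blob and returns a URL that addModule() can load.
function createWorkletUrl(source) {
  const blob = new Blob([source], { type: "application/javascript" });
  return URL.createObjectURL(blob);
}

// Usage in the page, assuming `audioCtx` is your AudioContext:
// const moduleUrl = createWorkletUrl(processorSource);
// await audioCtx.audioWorklet.addModule(moduleUrl);
// URL.revokeObjectURL(moduleUrl); // the module stays loaded after the URL is released
```

The two-file layout described next remains the simpler option when you control the hosting.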

Create pcm-processor.js in the same directory as your HTML file:

// pcm-processor.js: AudioWorklet that captures PCM16 from the mic
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (input) {
      // Convert Float32 [-1, 1] to Int16
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
      }
      this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    }
    return true;
  }
}

registerProcessor("pcm-processor", PCMProcessor);
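
The Web Audio render quantum is currently 128 frames, so at 24 kHz the processor above posts a message roughly every 5.3 ms (about 188 messages per second). If that turns out to be too chatty for your setup, you could batch samples before sending. The sketch below shows one way to do that: `floatToPcm16` and `PcmChunker` are hypothetical names, and the chunk size is your choice, since this page does not state a preferred size.

```javascript
// Same Float32 [-1, 1] to Int16 conversion as in pcm-processor.js.
function floatToPcm16(input) {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    out[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
  }
  return out;
}

// Accumulates samples and calls onChunk with a full Int16Array of chunkSamples each time.
class PcmChunker {
  constructor(chunkSamples, onChunk) {
    this.buffer = new Int16Array(chunkSamples);
    this.filled = 0;
    this.onChunk = onChunk;
  }

  push(samples) {
    let offset = 0;
    while (offset < samples.length) {
      const n = Math.min(samples.length - offset, this.buffer.length - this.filled);
      this.buffer.set(samples.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.buffer.length) {
        // Copy the data so the internal buffer can be reused for the next chunk.
        this.onChunk(this.buffer.slice());
        this.filled = 0;
      }
    }
  }
}

// Example: 1200 samples is 50 ms at 24 kHz (an illustrative value).
// const chunker = new PcmChunker(1200, (chunk) => port.postMessage(chunk.buffer, [chunk.buffer]));
// chunker.push(floatToPcm16(input));
```

Larger chunks mean fewer messages but add capture latency equal to the chunk duration, so the right size depends on how responsive the conversation needs to feel.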

Then create your HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Voice Agent</title>
</head>
<body>
  <button id="start">Start conversation</button>
  <pre id="log"></pre>
  <script>
    const log = (msg) => { document.getElementById("log").textContent += msg + "\n"; };

    document.getElementById("start").addEventListener("click", async () => {
      // 1. Get token from your server (see step 1 above)
      const { token } = await fetch("/api/voice-token").then((r) => r.json());

      // 2. Force AudioContext to 24 kHz. This avoids manual resampling on both
      //    capture and playback. Most laptops default to 48 kHz without this.
      const audioCtx = new AudioContext({ sampleRate: 24000 });
      await audioCtx.audioWorklet.addModule("pcm-processor.js");

      // 3. Capture mic audio with echo cancellation enabled
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true, sampleRate: 24000 },
      });
      const source = audioCtx.createMediaStreamSource(stream);
      const worklet = new AudioWorkletNode(audioCtx, "pcm-processor");

      // 4. Connect WebSocket
      const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
      wsUrl.searchParams.set("token", token);
      const ws = new WebSocket(wsUrl);

      let ready = false;
      let playbackTime = audioCtx.currentTime;
      // Playback nodes that are scheduled or playing, so an interruption can stop them
      const activeSources = new Set();

      // Send mic audio to the server once the session is ready
      worklet.port.onmessage = (e) => {
        if (ready && ws.readyState === WebSocket.OPEN) {
          const b64 = btoa(String.fromCharCode(...new Uint8Array(e.data)));
          ws.send(JSON.stringify({ type: "input.audio", audio: b64 }));
        }
      };
      source.connect(worklet).connect(audioCtx.destination);

      ws.addEventListener("open", () => {
        ws.send(JSON.stringify({
          type: "session.update",
          session: {
            system_prompt: "You are a helpful voice assistant. Keep responses concise.",
            greeting: "Hi! How can I help you?",
            output: { voice: "ivy" },
          },
        }));
      });

      ws.addEventListener("message", (event) => {
        const msg = JSON.parse(event.data);

        if (msg.type === "session.ready") {
          ready = true;
          log("Session ready. Start speaking.");
        } else if (msg.type === "reply.audio") {
          // Decode base64 little-endian PCM16 and schedule playback
          const raw = atob(msg.data);
          const pcm16 = new Int16Array(raw.length / 2);
          for (let i = 0; i < pcm16.length; i++) {
            pcm16[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
          }
          const float32 = new Float32Array(pcm16.length);
          for (let i = 0; i < pcm16.length; i++) {
            float32[i] = pcm16[i] / 32768;
          }
          const buffer = audioCtx.createBuffer(1, float32.length, 24000);
          buffer.getChannelData(0).set(float32);
          const src = audioCtx.createBufferSource();
          src.buffer = buffer;
          src.connect(audioCtx.destination);
          activeSources.add(src);
          src.onended = () => activeSources.delete(src);
          // Queue each chunk directly after the previous one
          const now = audioCtx.currentTime;
          playbackTime = Math.max(playbackTime, now);
          src.start(playbackTime);
          playbackTime += buffer.duration;
        } else if (msg.type === "reply.done" && msg.status === "interrupted") {
          // Stop audio that is already queued, then reset the playback schedule
          for (const s of activeSources) s.stop();
          activeSources.clear();
          playbackTime = audioCtx.currentTime;
        } else if (msg.type === "transcript.user") {
          log("You: " + msg.text);
        } else if (msg.type === "transcript.agent") {
          log("Agent: " + msg.text);
        } else if (msg.type === "session.error" || msg.type === "error") {
          log("Error: " + msg.message);
        }
      });

      ws.addEventListener("close", () => log("Connection closed"));
    });
  </script>
</body>
</html>
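
The base64 steps in the page above are written inline and sized for the small buffers an AudioWorklet produces. If you send larger buffers, spreading every byte into `String.fromCharCode` can exceed the engine's argument limit. The sketch below factors both directions into helpers; `pcm16ToBase64` and `base64ToFloat32` are hypothetical names, and the little-endian byte order matches the decoding used in the example.

```javascript
// Encodes raw PCM16 bytes as base64, processing the buffer in slices.
function pcm16ToBase64(arrayBuffer) {
  const bytes = new Uint8Array(arrayBuffer);
  const step = 0x8000; // slice size chosen to stay well below argument-count limits
  let binary = "";
  for (let i = 0; i < bytes.length; i += step) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + step));
  }
  return btoa(binary);
}

// Decodes base64 little-endian PCM16 into Float32 samples in [-1, 1).
function base64ToFloat32(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const out = new Float32Array(bytes.length >> 1);
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return out;
}
```

Using `DataView` makes the byte order explicit instead of relying on manual shifts.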

The key line is new AudioContext({ sampleRate: 24000 }). Browsers default to the device sample rate (usually 48 kHz), so without this you’d need to manually resample both mic input and playback output. Forcing 24 kHz on the context avoids this entirely.
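
For comparison, here is roughly what manual resampling would involve if the context ran at the device rate. This is a minimal linear-interpolation sketch with a hypothetical `resampleLinear` helper; it applies no anti-aliasing filter, so treat it as an illustration of the extra work rather than production-quality audio code.

```javascript
// Resamples mono Float32 audio from fromRate to toRate using linear interpolation.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) {
    return Float32Array.from(input);
  }
  const ratio = fromRate / toRate; // input samples consumed per output sample
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // clamp at the last sample
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Capture path: 48 kHz mic audio down to 24 kHz before sending.
// const toSend = resampleLinear(micSamples, 48000, 24000);
// Playback path: 24 kHz reply audio up to 48 kHz before scheduling.
// const toPlay = resampleLinear(replySamples, 24000, 48000);
```

With the context forced to 24 kHz, neither call is needed, which is why the quickstart can pass samples through unchanged.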