Build a voice assistant app with AssemblyAI’s Voice Agent API
A production-shaped browser voice assistant in under 400 lines. Click Connect, talk into your laptop mic, and the agent talks back — over a single WebSocket, with the API key safely behind a tiny Node server.



This is the “real app” version of the 5-minute quickstart: a polished UI, AudioWorklet mic capture, temporary-token auth, and full barge-in handling. The AssemblyAI Voice Agent API does the speech recognition, the LLM, and the TTS server-side — you’re just shuttling audio bytes.
Why One WebSocket Beats a Multi-Service Pipeline
A traditional browser voice agent needs you to wire up streaming STT, an LLM, and a TTS provider, then orchestrate audio routing between them in the browser. Every hop adds latency, every provider needs a key, and every glue layer adds a failure mode.
The Voice Agent API collapses all of that into one URL: wss://agents.assemblyai.com/v1/ws. Send 24 kHz PCM, get 24 kHz PCM back. That's it.
Architecture
The system has two halves: a browser client and a lightweight Node server.
Data flow: The browser gets a token from the Node server, opens a WebSocket to the Voice Agent API with that token, streams mic PCM up, and receives reply PCM + transcript events back.
Prerequisites
- Node.js 18+ (uses native fetch and ES modules)
- A modern browser (Chrome 66+, Firefox 76+, Safari 14.1+ — anything with AudioWorklet)
- An AssemblyAI API key — free tier available
The browser needs a secure origin to access the mic. http://localhost counts as secure, so you can develop locally without TLS. If you deploy elsewhere, serve over HTTPS.
Quick Start
1. Clone and Install
git clone https://github.com/kelsey-aai/voice-assistant-app
cd voice-assistant-app
npm install

2. Configure Your API Key
cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
3. Run the App
npm start

Open http://localhost:3000, pick a voice, hit Connect, grant mic permission, and start talking. You’ll see your speech transcribed live as a partial bubble, then committed to a final bubble, with the agent’s reply streaming back in audio and text.
How It Works
There are four moving parts: the token mint, the AudioWorklet that captures mic audio, the WebSocket loop that drives the conversation, and the playback scheduler that turns reply.audio events back into sound.
1. The Server Mints a Temporary Token
Your AssemblyAI API key never leaves the server. The browser asks /api/voice-token for a single-use token, valid for 5 minutes:
// server.js
app.get("/api/voice-token", async (_req, res) => {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300");
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  const { token } = await response.json();
  res.json({ token });
});

Tokens are single-use — you fetch a fresh one for every connection. The browser then opens the WebSocket with the token as a query parameter:

ws = new WebSocket(`wss://agents.assemblyai.com/v1/ws?token=${token}`);

2. The AudioWorklet Captures Mic Audio
AudioWorklet runs your PCM conversion off the main thread, which keeps it glitch-free. The worklet receives Float32 samples, clips them to range, and posts them back as a transferable Int16Array buffer:
// public/pcm-processor.js
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0]?.[0];
    if (!channel) return true;
    const pcm = new Int16Array(channel.length);
    for (let i = 0; i < channel.length; i++) {
      const s = Math.max(-1, Math.min(1, channel[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    this.port.postMessage(pcm.buffer, [pcm.buffer]);
    return true;
  }
}

// Register the processor; this name must match the one passed to the
// AudioWorkletNode constructor on the main thread.
registerProcessor("pcm-processor", PCMProcessor);

The Voice Agent API expects 24 kHz PCM by default, so we force the entire AudioContext to 24 kHz on creation — no resampling needed:

audioCtx = new AudioContext({ sampleRate: SAMPLE_RATE }); // SAMPLE_RATE = 24000

We also enable browser-level acoustic echo cancellation when grabbing the mic, so the agent doesn’t interrupt itself by hearing its own TTS through the speakers:
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    channelCount: 1,
  },
});

3. The WebSocket Drives the Conversation
session.update is the first message — it configures system prompt, greeting, and voice. After that, you stream input.audio events whenever the worklet hands you a frame:
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: prompt.value,
      greeting: "Hi there — what can I help you with?",
      output: { voice: "ivy" },
    },
  }));
};

workletNode.port.onmessage = (e) => {
  ws.send(JSON.stringify({
    type: "input.audio",
    audio: arrayBufferToBase64(e.data),
  }));
};

The server replies with a stream of events. The ones we care about for the UI are reply.audio, reply.done, and the transcript events.
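The snippet above leans on an arrayBufferToBase64 helper that isn't shown. Here is a minimal version, plus a sketch of how the reply events could be routed to handlers; the dispatcher and the `evt.audio` field name are illustrative assumptions, not code from the repo:

```javascript
// Minimal base64 helper for the worklet's Int16 buffers. Chunking avoids the
// call-stack limit that String.fromCharCode(...bytes) can hit on large buffers.
function arrayBufferToBase64(buf) {
  const bytes = new Uint8Array(buf);
  let binary = "";
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
  }
  return btoa(binary);
}

// Illustrative event routing; playPCM and onReplyDone are assumed to exist.
function handleServerEvent(evt, handlers) {
  switch (evt.type) {
    case "reply.audio":
      handlers.playPCM(evt.audio);  // base64 PCM chunk (field name assumed)
      break;
    case "reply.done":
      handlers.onReplyDone(evt);    // final (or interrupted) reply
      break;
    default:
      // transcript events update the partial/final bubbles
      break;
  }
}

// Wired up: ws.onmessage = (e) => handleServerEvent(JSON.parse(e.data), handlers);
```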
4. Reply Playback Uses a Scheduling Cursor
reply.audio chunks arrive faster than they play — sometimes the whole reply is buffered before the first sample hits the speaker. Naively calling source.start(0) would overlap the chunks. Instead, we keep a playbackTime cursor and schedule each chunk back-to-back:
function playPCM(b64) {
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  const int16 = new Int16Array(bytes.buffer, bytes.byteOffset,
    bytes.byteLength / 2);
  const float = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++)
    float[i] = int16[i] / 0x8000;
  const buffer = audioCtx.createBuffer(1, float.length, SAMPLE_RATE);
  buffer.getChannelData(0).set(float);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  const now = audioCtx.currentTime;
  if (playbackTime < now) playbackTime = now;
  source.start(playbackTime);
  playbackTime += buffer.duration;
  scheduledSources.push(source);
}

5. Barge-In Stops Scheduled Audio Cold
When the user speaks over the agent, the server emits reply.done with status: "interrupted" and trims transcript.agent to what was actually spoken. The client’s job is to drop any audio that was scheduled but hasn’t played yet:
function flushPlayback() {
  for (const src of scheduledSources) {
    try { src.stop(); } catch (_) {}
  }
  scheduledSources = [];
  playbackTime = audioCtx.currentTime;
}

That’s the entire interruption story: stop every scheduled AudioBufferSourceNode and reset the cursor to “now.”
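To make the trigger concrete, here is a self-contained sketch of a reply.done handler that calls the flush only on interruption. flushPlayback is stubbed so the snippet runs standalone, and the `evt.transcript.agent` field access is an assumption about the event shape based on this post's description:

```javascript
// Stub so the sketch is self-contained; the real flushPlayback is shown above.
let flushed = false;
const flushPlayback = () => { flushed = true; };

function onReplyDone(evt) {
  if (evt.status === "interrupted") {
    // Drop audio that was scheduled but never played.
    flushPlayback();
  }
  // The server trims the agent transcript to what was actually spoken
  // (field path assumed for illustration).
  return evt.transcript?.agent ?? "";
}
```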
Customization
Pick a Different Voice
The voice picker is wired to session.output.voice. The dropdown ships 16 popular options; 18 English voices and 16 multilingual voices are available in total. See the Voices catalog for samples of each. Multilingual voices code-switch with English automatically.
Change the Personality with the System Prompt
The textarea is bound to session.system_prompt. Tighten it for shorter replies, give it a persona, or scope it to a specific use case:
You are a customer-support agent for Acme Cloud Storage. Only answer
questions about Acme’s products, plans, and account billing. If the user
asks about anything else, politely redirect.
You can also re-send session.update mid-conversation to swap personas live. Note: greeting and output are immutable after the first apply — only system_prompt, tools, and input can change mid-session.
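A minimal sketch of that live swap, under the mutability constraint above: the mid-session session.update carries only the mutable fields. The helper name and the example prompt are ours, not from the repo:

```javascript
// Build a mid-session update. Only system_prompt, tools, and input are
// mutable after the first apply, so the payload carries nothing else.
function buildPersonaUpdate(systemPrompt) {
  return JSON.stringify({
    type: "session.update",
    session: { system_prompt: systemPrompt },
  });
}

// Usage, assuming `ws` is the open WebSocket:
// ws.send(buildPersonaUpdate("You are a terse billing-support agent."));
```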
Tune Turn Detection
Add input.turn_detection to the session.update payload to control how patient the agent feels:
session: {
  input: {
    turn_detection: {
      vad_threshold: 0.5,       // 0.0–1.0; higher = less sensitive
      min_silence: 600,         // ms; min silence before end-of-turn
      max_silence: 1500,        // ms; hard cap before forcing end-of-turn
      interrupt_response: true, // false to disable barge-in entirely
    },
  },
}
For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence.
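Those two tuning directions can be captured as presets. The preset names and specific values below are our illustrative choices, not recommendations from the docs; only the field names come from the payload above:

```javascript
// Hypothetical presets for the two environments called out above.
const TURN_DETECTION_PRESETS = {
  noisy:      { vad_threshold: 0.7, min_silence: 600,  max_silence: 1500, interrupt_response: true },
  deliberate: { vad_threshold: 0.5, min_silence: 800,  max_silence: 3000, interrupt_response: true },
};

// Wrap a preset in a session.update payload.
function withTurnDetection(preset) {
  return {
    type: "session.update",
    session: { input: { turn_detection: TURN_DETECTION_PRESETS[preset] } },
  };
}
```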
Troubleshooting
Mic blocked or no audio going up. Browsers only allow getUserMedia on secure origins. http://localhost counts; http://your-laptop.local doesn’t. If you’re testing across devices, terminate TLS in front of the Node server (Caddy, Cloudflare Tunnel, ngrok).
Agent keeps interrupting itself. Acoustic echo — the mic is picking up the speakers. Use headphones, or confirm echoCancellation: true is set on getUserMedia.
Audio sounds chipmunky or slowed-down. Sample-rate mismatch. The AudioContext must be created with { sampleRate: 24000 }. If you skip that, the browser creates the context at 44.1 or 48 kHz and the math falls apart.
UNAUTHORIZED close on connect. Token wasn’t included, expired, or was already used. Tokens are single-use — fetch a fresh one for every connection. Confirm ASSEMBLYAI_API_KEY is set on the server.
WebSocket closes with code 1006 and no error. Pre-handshake failure. In browsers, that’s usually a stale or invalid token. Re-fetch the token before reconnecting.
The full troubleshooting guide is in the Voice Agent API docs.
Frequently Asked Questions
What is AssemblyAI’s Voice Agent API?
AssemblyAI’s Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.
How do I use the Voice Agent API from the browser without exposing my API key?
Run a small backend that mints short-lived temporary tokens by calling GET https://agents.assemblyai.com/v1/token with your API key in an Authorization: Bearer header. The browser fetches a fresh token before each WebSocket connection and passes it as ?token= in the URL. Tokens are single-use and expire in 1–600 seconds.
What audio format does the browser need to send?
By default, the Voice Agent API expects audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Create the AudioContext with { sampleRate: 24000 } so no resampling is needed, then use an AudioWorklet to convert Float32 mic samples to Int16Array and base64-encode the buffer.
How do I handle interruption (barge-in)?
When the user speaks while the agent is replying, the server emits reply.done with status: "interrupted". The browser must stop any scheduled AudioBufferSourceNodes and reset its playbackTime cursor to audioCtx.currentTime so the next reply starts cleanly.
Why does my voice agent keep interrupting itself?
Almost always acoustic echo: the mic is picking up the agent’s TTS output through the speakers. Pass echoCancellation: true to getUserMedia to enable the browser’s OS-level acoustic echo cancellation, and prefer headphones during development.
Can the Voice Agent API call tools or functions from the browser?
Yes — tool calling works the same way client-side. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Execute the tool in your client code and send back a tool.result event; the agent then continues its reply using the result. See the tool calling guide for the full pattern.
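A sketch of that round trip with a hypothetical lookupOrder tool. The tool.call and tool.result event types come from this post; the field names beyond type (id, name, arguments) are assumptions about the payload shape, so check the tool calling guide for the real schema:

```javascript
// Hypothetical local tool registry; the agent can only call what you register.
const tools = {
  lookupOrder: async ({ orderId }) => ({ orderId, status: "shipped" }),
};

// Handle a tool.call event and build the tool.result reply to send back.
// Field names (id, name, arguments) are assumed for illustration.
async function handleToolCall(evt) {
  const fn = tools[evt.name];
  const result = fn ? await fn(evt.arguments) : { error: "unknown tool" };
  return JSON.stringify({ type: "tool.result", id: evt.id, result });
}

// Usage: ws.send(await handleToolCall(toolCallEvent));
```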
How much does the Voice Agent API cost?
AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the AssemblyAI pricing page.

