Voice Agent API
Build real-time voice AI agents with a single WebSocket connection. Speech in, speech out.
A complete browser-based demo. Enter your API key, pick a voice, and start talking. Echo cancellation is enabled automatically so you don’t need headphones.
This quickstart passes your API key directly in the WebSocket URL for simplicity. In production apps, never expose your API key in client-side code; generate temporary tokens on your server instead.
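As a rough sketch of the production pattern, the browser can ask your own backend for a temporary token and place it in the same token query parameter the quickstart uses for the key. The /api/voice-agent-token route and its { token } response shape are hypothetical placeholders for whatever your server exposes; how the server itself mints the token is not shown here.

```javascript
// Hypothetical route on YOUR server that returns { token: "..." }.
// The route name and response shape are assumptions for illustration only.
const TOKEN_ROUTE = '/api/voice-agent-token';

// Build the WebSocket URL the same way the quickstart does,
// but with a temporary token in place of the raw API key.
function buildAgentUrl(token) {
  const url = new URL('wss://agents.assemblyai.com/v1/ws');
  url.searchParams.set('token', token);
  return url.toString();
}

// Fetch a fresh token right before connecting so it has not expired.
// fetchImpl is injectable so the function can be exercised outside a browser.
async function getAgentUrl(fetchImpl = fetch) {
  const res = await fetchImpl(TOKEN_ROUTE, { method: 'POST' });
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const { token } = await res.json();
  if (!token) throw new Error('Token response had no "token" field');
  return buildAgentUrl(token);
}
```

With something like this in place, the quickstart's start() function would open new WebSocket(await getAgentUrl()) instead of reading the key from the input field.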
Save the following as voice-agent.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Voice Agent Quickstart | AssemblyAI</title>
<style>
  :root {
    --brand: #364DEA; --brand-dark: #2B3EC4; --brand-bg: #EEF1FE;
    --green: #12B886; --red: #FA5252;
    --s50: #F8FAFC; --s100: #F1F5F9; --s200: #E2E8F0;
    --s300: #CBD5E1; --s400: #94A3B8; --s500: #64748B;
    --s600: #475569; --s700: #334155; --s800: #1E293B; --s900: #0F172A;
  }
  *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
  html, body { height: 100%; }
  body {
    font-family: system-ui, -apple-system, sans-serif;
    color: var(--s900); display: flex; flex-direction: column;
    background:
      radial-gradient(1200px 600px at 80% -10%, #DCE3FE 0%, transparent 60%),
      radial-gradient(900px 500px at -10% 110%, #E6FCF5 0%, transparent 55%),
      var(--s50);
  }

  header {
    background: rgba(255,255,255,.85); backdrop-filter: blur(12px);
    border-bottom: 1px solid var(--s200);
    padding: 0 1.5rem; height: 3.5rem;
    display: flex; align-items: center; gap: .75rem;
    flex-shrink: 0;
  }
  .logo img { height: 22px; display: block; }
  .page-title { font-size: .875rem; color: var(--s500); padding-left: .75rem; border-left: 1px solid var(--s200); }
  .header-spacer { flex: 1; }
  .status {
    display: flex; align-items: center; gap: .5rem;
    font-size: .8125rem; color: var(--s500);
    padding: .375rem .75rem; border-radius: 999px;
    background: var(--s100); border: 1px solid var(--s200);
  }
  .dot { width: 8px; height: 8px; border-radius: 50%; background: currentColor; flex-shrink: 0; }
  .status.ok { color: var(--green); background: #E6FCF5; border-color: #C3FAE8; }
  .status.ok .dot { animation: pulse 2s ease-in-out infinite; }
  .status.err { color: var(--red); background: #FFF5F5; border-color: #FFE3E3; }
  @keyframes pulse { 0%,100% { opacity: 1; } 50% { opacity: .3; } }

  .layout { flex: 1; display: grid; grid-template-columns: 360px 1fr; min-height: 0; }
  @media (max-width: 800px) { .layout { grid-template-columns: 1fr; } }

  aside {
    border-right: 1px solid var(--s200);
    background: rgba(255,255,255,.6); backdrop-filter: blur(8px);
    padding: 1.5rem; overflow-y: auto;
    display: flex; flex-direction: column; gap: 1rem;
  }
  aside h2 {
    font-size: .6875rem; font-weight: 600; color: var(--s500);
    text-transform: uppercase; letter-spacing: .08em; margin-bottom: .5rem;
  }
  .field { display: flex; flex-direction: column; gap: .375rem; }
  label { font-size: .75rem; font-weight: 500; color: var(--s600); }
  input, select, textarea {
    width: 100%; padding: .5rem .625rem; border: 1px solid var(--s200); border-radius: 8px;
    font: inherit; font-size: .875rem; color: var(--s900); background: #fff;
    transition: border-color .15s, box-shadow .15s;
  }
  input:focus, select:focus, textarea:focus { outline: none; border-color: var(--brand); box-shadow: 0 0 0 3px rgba(54,77,234,.12); }
  textarea { resize: vertical; min-height: 96px; line-height: 1.5; }

  .btn {
    width: 100%; padding: .75rem 1rem; border: none; border-radius: 10px;
    font-size: .9375rem; font-weight: 600; cursor: pointer; color: #fff; background: var(--brand);
    transition: all .15s; display: flex; align-items: center; justify-content: center; gap: .5rem;
    box-shadow: 0 1px 2px rgba(54,77,234,.3), 0 4px 12px rgba(54,77,234,.15);
  }
  .btn:hover { background: var(--brand-dark); transform: translateY(-1px); }
  .btn:disabled { opacity: .5; cursor: default; transform: none; }
  .btn.on { background: var(--red); box-shadow: 0 1px 2px rgba(250,82,82,.3), 0 4px 12px rgba(250,82,82,.15); }
  .btn.on:hover { background: #e03131; }
  .btn svg { width: 18px; height: 18px; }

  main {
    display: flex; flex-direction: column; min-height: 0;
    padding: 1.5rem 2rem 2rem;
  }

  .transcript {
    flex: 1; min-height: 0; display: flex; flex-direction: column;
    background: #fff; border: 1px solid var(--s200); border-radius: 16px;
    overflow: hidden;
    box-shadow: 0 1px 2px rgba(15,23,42,.04), 0 4px 16px rgba(15,23,42,.04);
  }
  .transcript-hd {
    padding: .75rem 1.25rem; background: var(--s50); border-bottom: 1px solid var(--s200);
    font-size: .6875rem; font-weight: 600; color: var(--s500);
    text-transform: uppercase; letter-spacing: .08em;
    display: flex; justify-content: space-between; align-items: center;
  }
  .speakers { display: flex; gap: .375rem; }
  .speaker {
    display: flex; align-items: center; gap: .375rem;
    padding: .25rem .625rem; border-radius: 999px;
    background: var(--s100); color: var(--s400);
    font-size: .6875rem; font-weight: 600;
    text-transform: uppercase; letter-spacing: .05em;
    transition: background .2s, color .2s;
  }
  .speaker .dot { width: 6px; height: 6px; }
  .speaker.user.active { background: var(--brand-bg); color: var(--brand); }
  .speaker.agent.active { background: #E6FCF5; color: var(--green); }
  .speaker.active .dot { animation: pulse 1s ease-in-out infinite; }
  #msgs { flex: 1; overflow-y: auto; padding: 1rem 1.25rem; display: flex; flex-direction: column; gap: .5rem; }
  .empty {
    flex: 1; display: flex; align-items: center; justify-content: center;
    color: var(--s400); font-size: .875rem;
  }
  .msg {
    padding: .75rem 1rem; border-radius: 12px;
    font-size: .9375rem; line-height: 1.5;
    max-width: 85%; animation: slideIn .25s ease;
  }
  @keyframes slideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: none; } }
  .msg .who {
    font-size: .6875rem; font-weight: 600; text-transform: uppercase;
    letter-spacing: .05em; color: var(--s500); margin-bottom: .25rem;
  }
  .msg.u { background: var(--brand-bg); align-self: flex-end; }
  .msg.u .who { color: var(--brand); }
  .msg.a { background: #E6FCF5; align-self: flex-start; }
  .msg.a .who { color: var(--green); }
</style>
</head>
<body>
  <header>
    <a class="logo" href="https://www.assemblyai.com">
      <img src="https://cdn.prod.website-files.com/67a08d9d7d19f8fb63692894/67b5bd3d9e8ee1a6b2410b9e_AssemblyAI%20Logo.svg" alt="AssemblyAI">
    </a>
    <span class="page-title">Voice Agent Quickstart</span>
    <div class="header-spacer"></div>
    <div class="status" id="status"><span class="dot"></span><span id="status-text">Ready</span></div>
  </header>

  <div class="layout">
    <aside>
      <div>
        <h2>Configuration</h2>
        <div class="field">
          <label for="key">API key</label>
          <input id="key" type="password" placeholder="Your AssemblyAI API key">
        </div>
      </div>

      <div class="field">
        <label for="mic">Microphone</label>
        <select id="mic"><option value="">Default microphone</option></select>
      </div>

      <div class="field">
        <label for="voice">Voice</label>
        <select id="voice">
          <optgroup label="English">
            <option value="ivy" selected>🇺🇸 ivy</option>
            <option value="james">🇺🇸 james</option>
            <option value="tyler">🇺🇸 tyler</option>
            <option value="winter">🇺🇸 winter</option>
            <option value="sam">🇺🇸 sam</option>
            <option value="mia">🇺🇸 mia</option>
            <option value="bella">🇺🇸 bella</option>
            <option value="david">🇺🇸 david</option>
            <option value="jack">🇺🇸 jack</option>
            <option value="kyle">🇺🇸 kyle</option>
            <option value="helen">🇺🇸 helen</option>
            <option value="martha">🇺🇸 martha</option>
            <option value="river">🇺🇸 river</option>
            <option value="emma">🇺🇸 emma</option>
            <option value="victor">🇺🇸 victor</option>
            <option value="eleanor">🇺🇸 eleanor</option>
            <option value="sophie">🇬🇧 sophie</option>
            <option value="oliver">🇬🇧 oliver</option>
          </optgroup>
          <optgroup label="Multilingual">
            <option value="arjun">🇮🇳 arjun (Hindi/Hinglish)</option>
            <option value="ethan">🇨🇳 ethan (Mandarin)</option>
            <option value="dmitri">🇷🇺 dmitri (Russian)</option>
            <option value="lukas">🇩🇪 lukas (German)</option>
            <option value="lena">🇩🇪 lena (German)</option>
            <option value="pierre">🇫🇷 pierre (French)</option>
            <option value="mina">🇰🇷 mina (Korean)</option>
            <option value="ren">🇯🇵 ren (Japanese)</option>
            <option value="mei">🇨🇳 mei (Mandarin)</option>
            <option value="joon">🇰🇷 joon (Korean)</option>
            <option value="giulia">🇮🇹 giulia (Italian)</option>
            <option value="luca">🇮🇹 luca (Italian)</option>
            <option value="lucia">🇪🇸 lucia (Spanish)</option>
            <option value="hana">🇯🇵 hana (Japanese)</option>
            <option value="mateo">🇪🇸 mateo (Spanish)</option>
            <option value="diego">🇨🇴 diego (Spanish, LatAm)</option>
          </optgroup>
        </select>
      </div>

      <div class="field">
        <label for="prompt">System prompt</label>
        <textarea id="prompt">You are a friendly voice assistant having a casual conversation. Keep replies short and natural, usually one or two sentences. Speak the way a person would in real conversation: relaxed, low-key, no exclamation marks, no over-enthusiastic phrases.</textarea>
      </div>

      <div class="field">
        <label for="greeting">Greeting</label>
        <input id="greeting" value="Hey, what's on your mind?">
      </div>

      <button class="btn" id="btn">
        <svg id="btn-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round">
          <rect x="9" y="2" width="6" height="10" rx="3"/>
          <path d="M19 10v1a7 7 0 01-14 0v-1"/><path d="M12 18v4"/><path d="M8 22h8"/>
        </svg>
        <span id="btn-label">Connect</span>
      </button>
    </aside>

    <main>
      <div class="transcript" id="log">
        <div class="transcript-hd">
          <span>Transcript</span>
          <div class="speakers">
            <div class="speaker user" id="spk-user"><span class="dot"></span>You</div>
            <div class="speaker agent" id="spk-agent"><span class="dot"></span>Agent</div>
          </div>
        </div>
        <div id="msgs">
          <div class="empty" id="empty-msg">Add your API key on the left and click Connect to start the conversation</div>
        </div>
      </div>
    </main>
  </div>

<script>
  const $ = (id) => document.getElementById(id);
  const RATE = 24_000;

  // Inline AudioWorklet that captures mic as PCM16 and posts to main thread
  const workletUrl = URL.createObjectURL(new Blob([`
    class P extends AudioWorkletProcessor {
      process(inputs) {
        const ch = inputs[0]?.[0];
        if (ch) {
          const buf = new Int16Array(ch.length);
          for (let i = 0; i < ch.length; i++)
            buf[i] = Math.max(-32768, Math.min(32767, ch[i] * 32767));
          this.port.postMessage(buf.buffer, [buf.buffer]);
        }
        return true;
      }
    }
    registerProcessor("pcm", P);
  `], { type: 'application/javascript' }));

  // --- Microphone enumeration ---
  async function populateMics() {
    if (!navigator.mediaDevices?.enumerateDevices) return;
    try {
      const devices = await navigator.mediaDevices.enumerateDevices();
      const inputs = devices.filter(d => d.kind === 'audioinput');
      const sel = $('mic');
      const current = sel.value;
      while (sel.firstChild) sel.removeChild(sel.firstChild);
      const def = document.createElement('option');
      def.value = '';
      def.textContent = 'Default microphone';
      sel.appendChild(def);
      inputs.forEach((d, i) => {
        const opt = document.createElement('option');
        opt.value = d.deviceId;
        opt.textContent = d.label || `Microphone ${i + 1}`;
        sel.appendChild(opt);
      });
      if (current && inputs.some(d => d.deviceId === current)) sel.value = current;
    } catch (e) { console.warn('enumerateDevices failed', e); }
  }
  populateMics();
  navigator.mediaDevices?.addEventListener?.('devicechange', populateMics);

  // --- Voice Agent ---
  let ws, ctx, mic;

  $('btn').onclick = () => (ws?.readyState <= 1) ? stop() : start();

  async function start() {
    const key = $('key').value.trim();
    if (!key) return setStatus('Enter your API key', 'err');
    $('btn').disabled = true;
    setStatus('Connecting…');

    try {
      ctx = new AudioContext({ sampleRate: RATE });
      await ctx.resume();
      await ctx.audioWorklet.addModule(workletUrl);
      const deviceId = $('mic').value;
      mic = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: false,
          ...(deviceId ? { deviceId: { exact: deviceId } } : {}),
        },
      });
      populateMics();
      const source = ctx.createMediaStreamSource(mic);
      const worklet = new AudioWorkletNode(ctx, 'pcm');

      const url = new URL('wss://agents.assemblyai.com/v1/ws');
      url.searchParams.set('token', key);
      ws = new WebSocket(url);
      let ready = false, playT = 0;

      worklet.port.onmessage = ({ data }) => {
        if (!ready || ws.readyState !== 1) return;
        const b = new Uint8Array(data);
        let s = ''; for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
        ws.send(JSON.stringify({ type: 'input.audio', audio: btoa(s) }));
      };
      source.connect(worklet).connect(ctx.destination);

      ws.onopen = () => ws.send(JSON.stringify({
        type: 'session.update',
        session: {
          system_prompt: $('prompt').value,
          greeting: $('greeting').value,
          output: { voice: $('voice').value },
        },
      }));

      ws.onmessage = ({ data }) => {
        const m = JSON.parse(data);
        switch (m.type) {
          case 'input.speech.started':
            setSpeaker('user', true); break;
          case 'input.speech.stopped':
            setSpeaker('user', false); break;
          case 'reply.started':
            setSpeaker('agent', true); break;
          case 'session.ready':
            ready = true;
            setStatus('Connected', 'ok');
            $('btn').disabled = false;
            $('btn-label').textContent = 'Disconnect';
            $('btn').classList.add('on');
            clearEmpty();
            break;

          case 'reply.audio': {
            const raw = atob(m.data);
            const pcm = new Int16Array(raw.length / 2);
            for (let i = 0; i < pcm.length; i++)
              pcm[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
            const f32 = new Float32Array(pcm.length);
            for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
            const buf = ctx.createBuffer(1, f32.length, RATE);
            buf.getChannelData(0).set(f32);
            const src = ctx.createBufferSource();
            src.buffer = buf; src.connect(ctx.destination);
            playT = Math.max(playT, ctx.currentTime);
            src.start(playT); playT += buf.duration;
            break;
          }

          case 'reply.done':
            setSpeaker('agent', false);
            if (m.status === 'interrupted') playT = ctx.currentTime;
            break;

          case 'transcript.user':
            addMsg('You', m.text, 'u'); break;

          case 'transcript.agent':
            addMsg('Agent', m.text, 'a'); break;

          case 'session.error':
            setStatus('Error: ' + m.message, 'err'); break;
        }
      };

      ws.onclose = () => { setStatus('Disconnected'); resetUI(); };
      ws.onerror = () => { setStatus('Connection failed', 'err'); resetUI(); };
    } catch (e) {
      setStatus(e.message, 'err'); resetUI();
    }
  }

  function stop() {
    ws?.close(); mic?.getTracks().forEach(t => t.stop()); ctx?.close();
    ws = ctx = mic = null; resetUI(); setStatus('Disconnected');
  }

  function resetUI() {
    $('btn').disabled = false;
    $('btn-label').textContent = 'Connect';
    $('btn').classList.remove('on');
    setSpeaker('user', false);
    setSpeaker('agent', false);
  }

  function setStatus(msg, cls) {
    $('status-text').textContent = msg;
    $('status').className = 'status' + (cls ? ' ' + cls : '');
  }

  function setSpeaker(who, active) {
    $('spk-' + who).classList.toggle('active', active);
  }

  function clearEmpty() {
    const e = $('msgs').querySelector('.empty');
    if (e) e.remove();
  }

  function addMsg(who, text, cls) {
    clearEmpty();
    const d = document.createElement('div');
    d.className = 'msg ' + cls;
    const whoEl = document.createElement('div');
    whoEl.className = 'who';
    whoEl.textContent = who;
    const textEl = document.createElement('div');
    textEl.textContent = text;
    d.appendChild(whoEl);
    d.appendChild(textEl);
    $('msgs').appendChild(d);
    $('msgs').scrollTop = $('msgs').scrollHeight;
  }
</script>
</body>
</html>
The AudioWorklet processor is inlined using a Blob URL so this works as a single file, with no extra .js file needed. The key line new AudioContext({ sampleRate: 24000 }) forces the audio context to match the default audio/pcm encoding rate, avoiding any manual resampling.
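To make the audio format concrete, here is a minimal sketch of the conversions the quickstart performs: Float32 samples to PCM16, PCM16 bytes to base64 for input.audio, and base64 back to Float32 for playback. The helper names are illustrative and not part of the API, and the encoder assumes a little-endian platform, which is what typed arrays use on virtually all current hardware.

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit PCM, clamping out-of-range values.
function floatToPcm16(f32) {
  const pcm = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    pcm[i] = Math.max(-32768, Math.min(32767, f32[i] * 32767));
  }
  return pcm;
}

// Base64-encode the raw bytes of a PCM16 buffer (platform byte order,
// assumed little-endian here).
function pcm16ToBase64(pcm) {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let s = '';
  for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
  return btoa(s);
}

// Decode base64 little-endian PCM16 back to Float32 samples for playback.
function base64ToFloat32(b64) {
  const raw = atob(b64);
  const out = new Float32Array(Math.floor(raw.length / 2));
  for (let i = 0; i < out.length; i++) {
    const lo = raw.charCodeAt(i * 2);
    const hi = raw.charCodeAt(i * 2 + 1);
    const v = ((lo | (hi << 8)) << 16) >> 16; // sign-extend the 16-bit value
    out[i] = v / 32768;
  }
  return out;
}
```

Because the AudioContext already runs at 24 kHz, these helpers only change the sample representation; they do not resample.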
Browsers require a secure context for microphone access, so serve the file locally:
$ npx serve .
Open http://localhost:3000/voice-agent.html, enter your API key, and click Connect.
Building this with Claude Code, Cursor, or Windsurf? Drop the prompt below into your assistant’s system prompt or rules file. It encodes the non-obvious gotchas this page doesn’t lead with, points the assistant at the right reference pages for everything else, and gives it sensible defaults for audio, turn detection, and tool design.
# Voice Agent API: AI Assistant System Prompt

> Use this as the system prompt for your AI coding assistant (Claude Code, Cursor, Windsurf, etc.) when building with AssemblyAI's Voice Agent API. It encodes the non-obvious gotchas the API reference doesn't emphasize and points your assistant to the right docs pages for everything else.

## Role

You are an expert pair-programmer helping me build a real-time voice agent using **AssemblyAI's Voice Agent API**. Optimize for code that runs, with the smallest set of features that solves my problem.

**Default to a browser app** unless I tell you otherwise. Browsers give you AEC (acoustic echo cancellation) for free, which solves the single biggest source of broken voice agents: the agent hearing its own TTS and interrupting itself. Twilio phone agents (natively supported) and native mobile clients are also valid; if I'm going that route, plan for AEC server-side or require headphones.

**The docs are the source of truth.** Don't re-derive things from memory. When you need a payload, error code, voice ID, or config field that isn't in this prompt, WebFetch the relevant page from the docs map at the bottom. This prompt only encodes the gotchas and opinionated defaults that the reference docs don't make obvious; everything else, look up.

## Six non-obvious things about this API

1. **Audio is PCM16 mono at 24 kHz, base64-encoded.** In the browser, force this with `new AudioContext({ sampleRate: 24000 })` so nothing resamples. Default to Chrome/Edge. Safari ignores the constructor's `sampleRate` and needs manual resampling.

2. **Don't send `input.audio` before `session.ready`.** Buffer or drop early frames.

3. **`greeting` and `output.voice` are immutable after `session.ready`. `system_prompt`, `input.turn_detection`, `input.keyterms`, and `tools` are mutable.** Send another `session.update` with only the fields you're changing.

4. **Tool result: send it the moment your tool returns.** No buffering, no waiting on `reply.done`, no special timing dance. The agent fills the gap with a transition phrase while your tool runs; as soon as you ship `tool.result` the agent generates its next reply using the result. The `arguments` on `tool.call` is already a parsed object. The `result` on `tool.result` must be `JSON.stringify(value)`, not an object. Always echo the original `call_id`. Envelopes for reference:

   ```
   → { type:"tool.call", call_id:"c_123", name:"get_weather", arguments:{ location:"London" } }
   ← (run your tool)
   → { type:"tool.result", call_id:"c_123", result:"{\"temp_c\":22}" }
   ```

5. **On barge-in (`reply.done` with `status: "interrupted"`), flush the audio buffer immediately.** Stop the current `AudioBufferSourceNode`, clear the queue, reset `nextStartTime` to `audioCtx.currentTime`. Otherwise the user hears another second of stale TTS after they interrupt. Bonus: flushing on `input.speech.started` (not waiting for `reply.done`) makes barge-in feel ~300 ms snappier.

6. **In the browser, mint a short-lived token server-side** and pass it as `?token=...` on the WebSocket URL. Never expose the raw API key in client-side code.

## Browser audio: exact `getUserMedia` constraints

```js
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,   // ON. Stops self-interruption loops.
    noiseSuppression: false,  // OFF. Voice Focus runs server-side; double-stacking hurts ASR.
    autoGainControl: true,    // ON. Gentle volume normalization.
  },
});
```

The non-obvious one is `noiseSuppression: false`. Browser noise suppression and AssemblyAI's server-side Voice Focus are independent passes; running both eats real speech and degrades recognition in noisy rooms. Trust the server.

## Turn detection: recommended defaults

The factory defaults cut users off too fast in real conversation. Start with:

```js
session.update({ input: { turn_detection: {
  vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true
}}});
```

Erring 200-400 ms long is barely perceptible; erring short feels rude. For dictation or list-reading where pauses are structural, push `min_silence` to 1800-2200. Set `interrupt_response: false` for read-aloud / monologue agents only.

### Adaptive pattern: slow down after the agent asks a question

When `transcript.agent` ends in `?`, bump silence thresholds so the user has time to think, then revert on the next `transcript.user`:

```js
let baseline = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };
let waitingForAnswer = false;
const setTD = td => ws.send(JSON.stringify({ type:"session.update", session:{ input:{ turn_detection: td }}}));

ws.onmessage = (ev) => {
  const m = JSON.parse(ev.data);
  if (m.type === "transcript.agent" && /\?\s*$/.test(m.text || "")) {
    waitingForAnswer = true;
    setTD({ ...baseline, min_silence: 2200, max_silence: 6000 });
  }
  if (m.type === "transcript.user" && waitingForAnswer) {
    waitingForAnswer = false;
    setTD(baseline);
  }
};
```

Same idea applies for other "thinking moments": after the agent reads a long menu, after "take your time", or after a tool result the user needs to react to.

## Tools: when, and when not

**Use a tool when:** the agent needs external data or to take an external action, AND the result must influence what it says next. Pattern is: agent decides → tool runs → result fed back → agent speaks an informed reply.

**Do NOT use a tool for:**

- **Logging or analytics.** You already get every word via `transcript.user` and `transcript.agent`. Log those directly. A `log_event` tool just adds an LLM round-trip.
- **Extraction, summarization, classification of what was said.** Don't make the agent call `extract_order` mid-turn. Collect the transcript events and run a single AssemblyAI [LLM Gateway](https://www.assemblyai.com/docs/llm-gateway/overview) call against the finished (or in-progress) transcript when you actually need the structured output. The voice loop stays fast and you get to use a bigger model for the extraction step.
- **Persona or state changes the *client* can decide.** Prefer a `session.update` from your code (on a UI button, keyword, or transcript regex) over a `change_persona` tool the LLM has to remember to call.

Every extra tool is a chance for the agent to call it at the wrong moment. Ship with the smallest set that earns its keep.

### Writing tool descriptions

Treat `description` and each parameter `description` as code, not docs:

- One sentence per tool. Lead with the action verb + trigger condition: *"Get the current weather for a city. Use when the user asks about weather or conditions in a specific place."*
- Spell out the return shape and units.
- Give each parameter an example value: *"location: city only, no country, e.g. 'London'."*
- Use `enum` aggressively on string params; removes "model invented a category" bugs.
- If a description needs more than 3 sentences, the tool is doing too much. Split it or shrink it.

### Pair `keyterms` with any lookup tool

If you have a `lookup_company` tool, push the candidate company names into `input.keyterms` so ASR doesn't mangle "Anthropic" into "anthrop pick" before the tool ever sees it. Same for menus, contact lists, drug names, song titles. `keyterms` is mutable; narrow it as scope narrows.

## Voice prompt writing: what's different from chat

- **No markdown.** TTS reads asterisks and bullets literally.
- Front-load the most important rule. Long prompts dilute attention.
- Define identity ("You are X") rather than listing behaviors.
- Give explicit permissions: "Have opinions. Crack jokes if it fits."
- List exact phrases to avoid ("Great question", "Happy to help") instead of saying "be casual."
- Round numbers when speaking: "around 2 in the afternoon," not "2:14 PM."
- No exclamation marks. No decision trees.
- Keep it short to start. Persona is iterated by ear, not by writing more words.

Full guide: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/prompting-guide

## Getting a browser app running

1. Fork the official quickstart by fetching https://www.assemblyai.com/docs/voice-agents/voice-agent-api#quickstart and saving the `<!DOCTYPE html>...</html>` block as `voice-agent.html`.
2. `npx serve .` and open `http://localhost:3000/voice-agent.html` (localhost counts as a secure context, so the mic works).
3. Edit in place; reload the tab.

### Minimum-viable playback + flush (if you're not forking the quickstart)

```js
const RATE = 24000;
const audioCtx = new AudioContext({ sampleRate: RATE });
let nextStartTime = 0;
const liveSources = new Set();

function playReplyAudio(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length / 2);
  for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(i*2) | (raw.charCodeAt(i*2+1) << 8);
  const buf = audioCtx.createBuffer(1, pcm.length, RATE);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
  const src = audioCtx.createBufferSource();
  src.buffer = buf; src.connect(audioCtx.destination);
  const startAt = Math.max(audioCtx.currentTime, nextStartTime);
  src.start(startAt);
  src.onended = () => liveSources.delete(src);
  liveSources.add(src);
  nextStartTime = startAt + buf.duration;
}

function flushPlayback() {
  for (const s of liveSources) { try { s.onended = null; s.stop(0); s.disconnect(); } catch {} }
  liveSources.clear();
  nextStartTime = audioCtx.currentTime;
}
// reply.audio: playReplyAudio(msg.data)
// reply.done w/ status==="interrupted" OR input.speech.started: flushPlayback()
```

## Docs map: where to look for what

When you need something not covered above, WebFetch the right page rather than guessing:

- Full LLM-friendly dump (the firehose): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/llms-full.txt
- Every event payload, every field: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference
- Every config field, mutability rules: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration
- Tool schema, MCP integration: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling
- Voice IDs (English + multilingual): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices
- Token endpoint, browser auth: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration
- Twilio phone agents: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio
- Error codes and common failures: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting
- LLM Gateway (transcript extraction / summarization): https://www.assemblyai.com/docs/llm-gateway/overview
- LLM Gateway over transcripts (recipe): https://www.assemblyai.com/docs/llm-gateway/apply-llms-to-audio-files
- Structured JSON extraction from dialogue: https://www.assemblyai.com/docs/guides/dialogue-data

## Common errors at a glance

The three you'll hit first (full list at the troubleshooting URL above):

- `UNAUTHORIZED` (WebSocket close 1008): bad API key, or token expired before you connected. Mint a fresh token right before opening the socket.
- `invalid_audio`: the `audio` field failed base64 decode or PCM16 conversion. Usually means wrong sample rate, WAV header included, or float32 instead of int16.
- `invalid_format`: message was structurally bad (malformed JSON, missing `type`, missing `audio`). Usually a serialization bug, not an audio bug.

## When in doubt

Ask me one focused question rather than guessing. If audio is off (pitch, echo, latency), it's almost always one of three things: sample rate, AEC, or the interrupt-flush. Check those three first. For anything else, the docs map above is the source of truth.
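The tool-result rules in the prompt above (arguments arrive as a parsed object, result goes back as a JSON string, call_id is echoed) can be sketched as a small handler. The tools registry, the get_weather stub, and the error payload shape are illustrative assumptions rather than part of the API.

```javascript
// Hypothetical local tool registry; names and return shapes are illustrative.
const tools = {
  get_weather: async ({ location }) => ({ location, temp_c: 22 }),
};

// Turn an incoming tool.call message into the tool.result message to send back.
async function handleToolCall(msg, registry = tools) {
  const fn = registry[msg.name];
  let value;
  try {
    if (!fn) throw new Error(`Unknown tool: ${msg.name}`);
    // arguments is already a parsed object, so no JSON.parse is needed.
    value = await fn(msg.arguments);
  } catch (err) {
    // Reporting failures as { error } is an assumption, not a documented shape.
    value = { error: String(err.message || err) };
  }
  return {
    type: 'tool.result',
    call_id: msg.call_id,          // always echo the original call_id
    result: JSON.stringify(value), // result must be a JSON string, not an object
  };
}
```

In the quickstart's ws.onmessage switch, a case for tool.call could call handleToolCall(m) and pass the returned object through JSON.stringify to ws.send as soon as the promise resolves, without waiting for reply.done.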
reply.create