Voice Agent API

Build real-time voice AI agents with a single WebSocket connection. Speech in, speech out.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Agent Quickstart | AssemblyAI</title>
  <style>
    :root {
      --brand: #364DEA; --brand-dark: #2B3EC4; --brand-bg: #EEF1FE;
      --green: #12B886; --red: #FA5252;
      --s50: #F8FAFC; --s100: #F1F5F9; --s200: #E2E8F0;
      --s300: #CBD5E1; --s400: #94A3B8; --s500: #64748B;
      --s600: #475569; --s700: #334155; --s800: #1E293B; --s900: #0F172A;
    }
    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
    html, body { height: 100%; }
    body {
      font-family: system-ui, -apple-system, sans-serif;
      color: var(--s900); display: flex; flex-direction: column;
      background:
        radial-gradient(1200px 600px at 80% -10%, #DCE3FE 0%, transparent 60%),
        radial-gradient(900px 500px at -10% 110%, #E6FCF5 0%, transparent 55%),
        var(--s50);
    }
    header {
      background: rgba(255,255,255,.85); backdrop-filter: blur(12px);
      border-bottom: 1px solid var(--s200);
      padding: 0 1.5rem; height: 3.5rem;
      display: flex; align-items: center; gap: .75rem;
      flex-shrink: 0;
    }
    .logo img { height: 22px; display: block; }
    .page-title { font-size: .875rem; color: var(--s500); padding-left: .75rem; border-left: 1px solid var(--s200); }
    .header-spacer { flex: 1; }
    .status {
      display: flex; align-items: center; gap: .5rem;
      font-size: .8125rem; color: var(--s500);
      padding: .375rem .75rem; border-radius: 999px;
      background: var(--s100); border: 1px solid var(--s200);
    }
    .dot { width: 8px; height: 8px; border-radius: 50%; background: currentColor; flex-shrink: 0; }
    .status.ok { color: var(--green); background: #E6FCF5; border-color: #C3FAE8; }
    .status.ok .dot { animation: pulse 2s ease-in-out infinite; }
    .status.err { color: var(--red); background: #FFF5F5; border-color: #FFE3E3; }
    @keyframes pulse { 0%,100% { opacity: 1; } 50% { opacity: .3; } }
    .layout { flex: 1; display: grid; grid-template-columns: 360px 1fr; min-height: 0; }
    @media (max-width: 800px) { .layout { grid-template-columns: 1fr; } }
    aside {
      border-right: 1px solid var(--s200);
      background: rgba(255,255,255,.6); backdrop-filter: blur(8px);
      padding: 1.5rem; overflow-y: auto;
      display: flex; flex-direction: column; gap: 1rem;
    }
    aside h2 {
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em; margin-bottom: .5rem;
    }
    .field { display: flex; flex-direction: column; gap: .375rem; }
    label { font-size: .75rem; font-weight: 500; color: var(--s600); }
    input, select, textarea {
      width: 100%; padding: .5rem .625rem; border: 1px solid var(--s200); border-radius: 8px;
      font: inherit; font-size: .875rem; color: var(--s900); background: #fff;
      transition: border-color .15s, box-shadow .15s;
    }
    input:focus, select:focus, textarea:focus { outline: none; border-color: var(--brand); box-shadow: 0 0 0 3px rgba(54,77,234,.12); }
    textarea { resize: vertical; min-height: 96px; line-height: 1.5; }
    .btn {
      width: 100%; padding: .75rem 1rem; border: none; border-radius: 10px;
      font-size: .9375rem; font-weight: 600; cursor: pointer; color: #fff; background: var(--brand);
      transition: all .15s; display: flex; align-items: center; justify-content: center; gap: .5rem;
      box-shadow: 0 1px 2px rgba(54,77,234,.3), 0 4px 12px rgba(54,77,234,.15);
    }
    .btn:hover { background: var(--brand-dark); transform: translateY(-1px); }
    .btn:disabled { opacity: .5; cursor: default; transform: none; }
    .btn.on { background: var(--red); box-shadow: 0 1px 2px rgba(250,82,82,.3), 0 4px 12px rgba(250,82,82,.15); }
    .btn.on:hover { background: #e03131; }
    .btn svg { width: 18px; height: 18px; }
    main {
      display: flex; flex-direction: column; min-height: 0;
      padding: 1.5rem 2rem 2rem;
    }
    .transcript {
      flex: 1; min-height: 0; display: flex; flex-direction: column;
      background: #fff; border: 1px solid var(--s200); border-radius: 16px;
      overflow: hidden;
      box-shadow: 0 1px 2px rgba(15,23,42,.04), 0 4px 16px rgba(15,23,42,.04);
    }
    .transcript-hd {
      padding: .75rem 1.25rem; background: var(--s50); border-bottom: 1px solid var(--s200);
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em;
      display: flex; justify-content: space-between; align-items: center;
    }
    .speakers { display: flex; gap: .375rem; }
    .speaker {
      display: flex; align-items: center; gap: .375rem;
      padding: .25rem .625rem; border-radius: 999px;
      background: var(--s100); color: var(--s400);
      font-size: .6875rem; font-weight: 600;
      text-transform: uppercase; letter-spacing: .05em;
      transition: background .2s, color .2s;
    }
    .speaker .dot { width: 6px; height: 6px; }
    .speaker.user.active { background: var(--brand-bg); color: var(--brand); }
    .speaker.agent.active { background: #E6FCF5; color: var(--green); }
    .speaker.active .dot { animation: pulse 1s ease-in-out infinite; }
    #msgs { flex: 1; overflow-y: auto; padding: 1rem 1.25rem; display: flex; flex-direction: column; gap: .5rem; }
    .empty {
      flex: 1; display: flex; align-items: center; justify-content: center;
      color: var(--s400); font-size: .875rem;
    }
    .msg {
      padding: .75rem 1rem; border-radius: 12px;
      font-size: .9375rem; line-height: 1.5;
      max-width: 85%; animation: slideIn .25s ease;
    }
    @keyframes slideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: none; } }
    .msg .who {
      font-size: .6875rem; font-weight: 600; text-transform: uppercase;
      letter-spacing: .05em; color: var(--s500); margin-bottom: .25rem;
    }
    .msg.u { background: var(--brand-bg); align-self: flex-end; }
    .msg.u .who { color: var(--brand); }
    .msg.a { background: #E6FCF5; align-self: flex-start; }
    .msg.a .who { color: var(--green); }
  </style>
</head>
<body>
<header>
  <a class="logo" href="https://www.assemblyai.com">
    <img src="https://cdn.prod.website-files.com/67a08d9d7d19f8fb63692894/67b5bd3d9e8ee1a6b2410b9e_AssemblyAI%20Logo.svg" alt="AssemblyAI">
  </a>
  <span class="page-title">Voice Agent Quickstart</span>
  <div class="header-spacer"></div>
  <div class="status" id="status"><span class="dot"></span><span id="status-text">Ready</span></div>
</header>
<div class="layout">
  <aside>
    <div>
      <h2>Configuration</h2>
      <div class="field">
        <label for="key">API key</label>
        <input id="key" type="password" placeholder="Your AssemblyAI API key">
      </div>
    </div>
    <div class="field">
      <label for="mic">Microphone</label>
      <select id="mic"><option value="">Default microphone</option></select>
    </div>
    <div class="field">
      <label for="voice">Voice</label>
      <select id="voice">
        <optgroup label="English">
          <option value="ivy" selected>🇺🇸 ivy</option>
          <option value="james">🇺🇸 james</option>
          <option value="tyler">🇺🇸 tyler</option>
          <option value="winter">🇺🇸 winter</option>
          <option value="sam">🇺🇸 sam</option>
          <option value="mia">🇺🇸 mia</option>
          <option value="bella">🇺🇸 bella</option>
          <option value="david">🇺🇸 david</option>
          <option value="jack">🇺🇸 jack</option>
          <option value="kyle">🇺🇸 kyle</option>
          <option value="helen">🇺🇸 helen</option>
          <option value="martha">🇺🇸 martha</option>
          <option value="river">🇺🇸 river</option>
          <option value="emma">🇺🇸 emma</option>
          <option value="victor">🇺🇸 victor</option>
          <option value="eleanor">🇺🇸 eleanor</option>
          <option value="sophie">🇬🇧 sophie</option>
          <option value="oliver">🇬🇧 oliver</option>
        </optgroup>
        <optgroup label="Multilingual">
          <option value="arjun">🇮🇳 arjun (Hindi/Hinglish)</option>
          <option value="ethan">🇨🇳 ethan (Mandarin)</option>
          <option value="dmitri">🇷🇺 dmitri (Russian)</option>
          <option value="lukas">🇩🇪 lukas (German)</option>
          <option value="lena">🇩🇪 lena (German)</option>
          <option value="pierre">🇫🇷 pierre (French)</option>
          <option value="mina">🇰🇷 mina (Korean)</option>
          <option value="ren">🇯🇵 ren (Japanese)</option>
          <option value="mei">🇨🇳 mei (Mandarin)</option>
          <option value="joon">🇰🇷 joon (Korean)</option>
          <option value="giulia">🇮🇹 giulia (Italian)</option>
          <option value="luca">🇮🇹 luca (Italian)</option>
          <option value="lucia">🇪🇸 lucia (Spanish)</option>
          <option value="hana">🇯🇵 hana (Japanese)</option>
          <option value="mateo">🇪🇸 mateo (Spanish)</option>
          <option value="diego">🇨🇴 diego (Spanish, LatAm)</option>
        </optgroup>
      </select>
    </div>
    <div class="field">
      <label for="prompt">System prompt</label>
      <textarea id="prompt">You are a friendly voice assistant having a casual conversation. Keep replies short and natural, usually one or two sentences. Speak the way a person would in real conversation: relaxed, low-key, no exclamation marks, no over-enthusiastic phrases.</textarea>
    </div>
    <div class="field">
      <label for="greeting">Greeting</label>
      <input id="greeting" value="Hey, what's on your mind?">
    </div>
    <button class="btn" id="btn">
      <svg id="btn-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round">
        <rect x="9" y="2" width="6" height="10" rx="3"/>
        <path d="M19 10v1a7 7 0 01-14 0v-1"/><path d="M12 18v4"/><path d="M8 22h8"/>
      </svg>
      <span id="btn-label">Connect</span>
    </button>
  </aside>
  <main>
    <div class="transcript" id="log">
      <div class="transcript-hd">
        <span>Transcript</span>
        <div class="speakers">
          <div class="speaker user" id="spk-user"><span class="dot"></span>You</div>
          <div class="speaker agent" id="spk-agent"><span class="dot"></span>Agent</div>
        </div>
      </div>
      <div id="msgs">
        <div class="empty" id="empty-msg">Add your API key on the left and click Connect to start the conversation</div>
      </div>
    </div>
  </main>
</div>
<script>
const $ = (id) => document.getElementById(id);
const RATE = 24_000;
// Inline AudioWorklet that captures mic as PCM16 and posts to main thread
const workletUrl = URL.createObjectURL(new Blob([`
  class P extends AudioWorkletProcessor {
    process(inputs) {
      const ch = inputs[0]?.[0];
      if (ch) {
        const buf = new Int16Array(ch.length);
        for (let i = 0; i < ch.length; i++)
          buf[i] = Math.max(-32768, Math.min(32767, ch[i] * 32767));
        this.port.postMessage(buf.buffer, [buf.buffer]);
      }
      return true;
    }
  }
  registerProcessor("pcm", P);
`], { type: 'application/javascript' }));
// --- Microphone enumeration ---
async function populateMics() {
  if (!navigator.mediaDevices?.enumerateDevices) return;
  try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    const inputs = devices.filter(d => d.kind === 'audioinput');
    const sel = $('mic');
    const current = sel.value;
    while (sel.firstChild) sel.removeChild(sel.firstChild);
    const def = document.createElement('option');
    def.value = '';
    def.textContent = 'Default microphone';
    sel.appendChild(def);
    inputs.forEach((d, i) => {
      const opt = document.createElement('option');
      opt.value = d.deviceId;
      opt.textContent = d.label || `Microphone ${i + 1}`;
      sel.appendChild(opt);
    });
    if (current && inputs.some(d => d.deviceId === current)) sel.value = current;
  } catch (e) { console.warn('enumerateDevices failed', e); }
}
populateMics();
navigator.mediaDevices?.addEventListener?.('devicechange', populateMics);
// --- Voice Agent ---
let ws, ctx, mic;
$('btn').onclick = () => (ws?.readyState <= 1) ? stop() : start();
async function start() {
  const key = $('key').value.trim();
  if (!key) return setStatus('Enter your API key', 'err');
  $('btn').disabled = true;
  setStatus('Connecting…');
  try {
    ctx = new AudioContext({ sampleRate: RATE });
    await ctx.resume();
    await ctx.audioWorklet.addModule(workletUrl);
    const deviceId = $('mic').value;
    mic = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: false,
        ...(deviceId ? { deviceId: { exact: deviceId } } : {}),
      },
    });
    populateMics();
    const source = ctx.createMediaStreamSource(mic);
    const worklet = new AudioWorkletNode(ctx, 'pcm');
    const url = new URL('wss://agents.assemblyai.com/v1/ws');
    url.searchParams.set('token', key);
    ws = new WebSocket(url);
    let ready = false, playT = 0;
    worklet.port.onmessage = ({ data }) => {
      if (!ready || ws.readyState !== 1) return;
      const b = new Uint8Array(data);
      let s = ''; for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
      ws.send(JSON.stringify({ type: 'input.audio', audio: btoa(s) }));
    };
    source.connect(worklet).connect(ctx.destination);
    ws.onopen = () => ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        system_prompt: $('prompt').value,
        greeting: $('greeting').value,
        output: { voice: $('voice').value },
      },
    }));
    ws.onmessage = ({ data }) => {
      const m = JSON.parse(data);
      switch (m.type) {
        case 'input.speech.started':
          setSpeaker('user', true); break;
        case 'input.speech.stopped':
          setSpeaker('user', false); break;
        case 'reply.started':
          setSpeaker('agent', true); break;
        case 'session.ready':
          ready = true;
          setStatus('Connected', 'ok');
          $('btn').disabled = false;
          $('btn-label').textContent = 'Disconnect';
          $('btn').classList.add('on');
          clearEmpty();
          break;
        case 'reply.audio': {
          const raw = atob(m.data);
          const pcm = new Int16Array(raw.length / 2);
          for (let i = 0; i < pcm.length; i++)
            pcm[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
          const f32 = new Float32Array(pcm.length);
          for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
          const buf = ctx.createBuffer(1, f32.length, RATE);
          buf.getChannelData(0).set(f32);
          const src = ctx.createBufferSource();
          src.buffer = buf; src.connect(ctx.destination);
          playT = Math.max(playT, ctx.currentTime);
          src.start(playT); playT += buf.duration;
          break;
        }
        case 'reply.done':
          setSpeaker('agent', false);
          if (m.status === 'interrupted') playT = ctx.currentTime;
          break;
        case 'transcript.user':
          addMsg('You', m.text, 'u'); break;
        case 'transcript.agent':
          addMsg('Agent', m.text, 'a'); break;
        case 'session.error':
          setStatus('Error: ' + m.message, 'err'); break;
      }
    };
    ws.onclose = () => { setStatus('Disconnected'); resetUI(); };
    ws.onerror = () => { setStatus('Connection failed', 'err'); resetUI(); };
  } catch (e) {
    // Release the mic and AudioContext so a retry starts from a clean state.
    stop(); setStatus(e.message, 'err');
  }
}
function stop() {
  ws?.close(); mic?.getTracks().forEach(t => t.stop()); ctx?.close();
  ws = ctx = mic = null; resetUI(); setStatus('Disconnected');
}
function resetUI() {
  $('btn').disabled = false;
  $('btn-label').textContent = 'Connect';
  $('btn').classList.remove('on');
  setSpeaker('user', false);
  setSpeaker('agent', false);
}
function setStatus(msg, cls) {
  $('status-text').textContent = msg;
  $('status').className = 'status' + (cls ? ' ' + cls : '');
}
function setSpeaker(who, active) {
  $('spk-' + who).classList.toggle('active', active);
}
function clearEmpty() {
  const e = $('msgs').querySelector('.empty');
  if (e) e.remove();
}
function addMsg(who, text, cls) {
  clearEmpty();
  const d = document.createElement('div');
  d.className = 'msg ' + cls;
  const whoEl = document.createElement('div');
  whoEl.className = 'who';
  whoEl.textContent = who;
  const textEl = document.createElement('div');
  textEl.textContent = text;
  d.appendChild(whoEl);
  d.appendChild(textEl);
  $('msgs').appendChild(d);
  $('msgs').scrollTop = $('msgs').scrollHeight;
}
</script>
</body>
</html>

Pair with your AI coding assistant

Building this with Claude Code, Cursor, or Windsurf? Drop the prompt below into your assistant’s system prompt or rules file. It encodes the non-obvious gotchas this page doesn’t lead with, points the assistant at the right reference pages for everything else, and gives it sensible defaults for audio, turn detection, and tool design.

AI assistant system prompt (copy into Claude Code / Cursor / Windsurf)

# Voice Agent API: AI Assistant System Prompt

> Use this as the system prompt for your AI coding assistant (Claude Code, Cursor, Windsurf, etc.) when building with AssemblyAI's Voice Agent API. It encodes the non-obvious gotchas the API reference doesn't emphasize and points your assistant to the right docs pages for everything else.

## Role

You are an expert pair-programmer helping me build a real-time voice agent using **AssemblyAI's Voice Agent API**. Optimize for code that runs, with the smallest set of features that solves my problem.

**Default to a browser app** unless I tell you otherwise. Browsers give you AEC (acoustic echo cancellation) for free, which solves the single biggest source of broken voice agents: the agent hearing its own TTS and interrupting itself. Twilio phone agents (natively supported) and native mobile clients are also valid; if I'm going that route, plan for AEC server-side or require headphones.

**The docs are the source of truth.** Don't re-derive things from memory. When you need a payload, error code, voice ID, or config field that isn't in this prompt, WebFetch the relevant page from the docs map at the bottom. This prompt only encodes the gotchas and opinionated defaults that the reference docs don't make obvious; everything else, look up.

## Six non-obvious things about this API

1. **Audio is PCM16 mono at 24 kHz, base64-encoded.** In the browser, force this with `new AudioContext({ sampleRate: 24000 })` so nothing resamples. Default to Chrome/Edge. Safari ignores the constructor's `sampleRate` and needs manual resampling.

2. **Don't send `input.audio` before `session.ready`.** Buffer or drop early frames.

3. **`greeting` and `output.voice` are immutable after `session.ready`. `system_prompt`, `input.turn_detection`, `input.keyterms`, and `tools` are mutable.** Send another `session.update` with only the fields you're changing.

4. **Tool result: send it the moment your tool returns.** No buffering, no waiting on `reply.done`, no special timing dance. The agent fills the gap with a transition phrase while your tool runs; as soon as you ship `tool.result` the agent generates its next reply using the result. The `arguments` on `tool.call` is already a parsed object. The `result` on `tool.result` must be `JSON.stringify(value)`, not an object. Always echo the original `call_id`. Envelopes for reference:

   ```
   → { type:"tool.call",   call_id:"c_123", name:"get_weather", arguments:{ location:"London" } }
   ← (run your tool)
   → { type:"tool.result", call_id:"c_123", result:"{\"temp_c\":22}" }
   ```

5. **On barge-in (`reply.done` with `status: "interrupted"`), flush the audio buffer immediately.** Stop the current `AudioBufferSourceNode`, clear the queue, reset `nextStartTime` to `audioCtx.currentTime`. Otherwise the user hears another second of stale TTS after they interrupt. Bonus: flushing on `input.speech.started` (not waiting for `reply.done`) makes barge-in feel ~300 ms snappier.

6. **In the browser, mint a short-lived token server-side** and pass it as `?token=...` on the WebSocket URL. Never expose the raw API key in client-side code.
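
A minimal sketch of the tool round trip from item 4, not a definitive implementation. The `tools` registry, the stubbed `get_weather` body, and the `{ error }` payload used on failure are assumptions for illustration; the envelope fields (`call_id`, `name`, `arguments`, `result`) are the ones shown in the envelopes above.

```javascript
// Hypothetical registry: tool name -> async function taking the parsed arguments.
const tools = {
  get_weather: async ({ location }) => ({ location, temp_c: 22 }), // stubbed data
};

// Runs the requested tool and returns the tool.result envelope to send back.
async function buildToolResult(msg, registry) {
  const fn = registry[msg.name];
  let value;
  try {
    if (!fn) throw new Error(`unknown tool: ${msg.name}`);
    value = await fn(msg.arguments); // arguments arrives as an object, no JSON.parse needed
  } catch (err) {
    value = { error: String(err.message || err) }; // error shape is an assumption
  }
  // result must be a JSON string, and call_id must echo the original call.
  return { type: "tool.result", call_id: msg.call_id, result: JSON.stringify(value) };
}
```

In the socket handler this would be used as `if (m.type === "tool.call") ws.send(JSON.stringify(await buildToolResult(m, tools)));`, sending the result as soon as the tool returns.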

## Browser audio: exact `getUserMedia` constraints

```js
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,    // ON. Stops self-interruption loops.
    noiseSuppression: false,   // OFF. Voice Focus runs server-side; double-stacking hurts ASR.
    autoGainControl: true,     // ON. Gentle volume normalization.
  },
});
```

The non-obvious one is `noiseSuppression: false`. Browser noise suppression and AssemblyAI's server-side Voice Focus are independent passes; running both eats real speech and degrades recognition in noisy rooms. Trust the server.

## Turn detection: recommended defaults

The factory defaults cut users off too fast in real conversation. Start with:

```js
ws.send(JSON.stringify({ type: "session.update", session: { input: { turn_detection: {
  vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true
}}}}));
```

Erring 200-400 ms long is barely perceptible; erring short feels rude. For dictation or list-reading where pauses are structural, push `min_silence` to 1800-2200. Set `interrupt_response: false` for read-aloud / monologue agents only.

### Adaptive pattern: slow down after the agent asks a question

When `transcript.agent` ends in `?`, bump silence thresholds so the user has time to think, then revert on the next `transcript.user`:

```js
let baseline = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };
let waitingForAnswer = false;
const setTD = td => ws.send(JSON.stringify({ type:"session.update", session:{ input:{ turn_detection: td }}}));

// addEventListener (rather than assigning ws.onmessage) leaves the main message handler in place.
ws.addEventListener("message", (ev) => {
  const m = JSON.parse(ev.data);
  if (m.type === "transcript.agent" && /\?\s*$/.test(m.text || "")) {
    waitingForAnswer = true;
    setTD({ ...baseline, min_silence: 2200, max_silence: 6000 });
  }
  if (m.type === "transcript.user" && waitingForAnswer) {
    waitingForAnswer = false;
    setTD(baseline);
  }
});
```

Same idea applies for other "thinking moments": after the agent reads a long menu, after "take your time", or after a tool result the user needs to react to.

## Tools: when, and when not

**Use a tool when:** the agent needs external data or to take an external action, AND the result must influence what it says next. Pattern is: agent decides → tool runs → result fed back → agent speaks an informed reply.

**Do NOT use a tool for:**

- **Logging or analytics.** You already get every word via `transcript.user` and `transcript.agent`. Log those directly. A `log_event` tool just adds an LLM round-trip.
- **Extraction, summarization, classification of what was said.** Don't make the agent call `extract_order` mid-turn. Collect the transcript events and run a single AssemblyAI [LLM Gateway](https://www.assemblyai.com/docs/llm-gateway/overview) call against the finished (or in-progress) transcript when you actually need the structured output. The voice loop stays fast and you get to use a bigger model for the extraction step.
- **Persona or state changes the *client* can decide.** Prefer a `session.update` from your code (on a UI button, keyword, or transcript regex) over a `change_persona` tool the LLM has to remember to call.

Every extra tool is a chance for the agent to call it at the wrong moment. Ship with the smallest set that earns its keep.
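
One way to log and later extract without a tool is to collect the transcript events client-side. The sketch below is an illustration under assumptions: `createTranscriptLog` and the `role: text` output format are made-up names and conventions, and it relies only on the `type` and `text` fields of `transcript.user` / `transcript.agent`.

```javascript
// Collects transcript events into an ordered log; no tool call involved.
function createTranscriptLog() {
  const turns = [];
  return {
    turns,
    // Feed every parsed server message through this; non-transcript events are ignored.
    record(m) {
      if (m.type === "transcript.user") turns.push({ role: "user", text: m.text });
      else if (m.type === "transcript.agent") turns.push({ role: "agent", text: m.text });
    },
    // Plain-text form to hand to a later extraction or summarization step.
    toText() {
      return turns.map((t) => `${t.role}: ${t.text}`).join("\n");
    },
  };
}
```

Call `log.record(m)` from the message handler and pass `log.toText()` to the extraction step only when the structured output is actually needed.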

### Writing tool descriptions

Treat `description` and each parameter `description` as code, not docs:

- One sentence per tool. Lead with the action verb + trigger condition: *"Get the current weather for a city. Use when the user asks about weather or conditions in a specific place."*
- Spell out the return shape and units.
- Give each parameter an example value: *"location: city only, no country, e.g. 'London'."*
- Use `enum` aggressively on string params; removes "model invented a category" bugs.
- If a description needs more than 3 sentences, the tool is doing too much. Split it or shrink it.
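
These rules can be checked mechanically before a tool ships. The sketch below is a rough heuristic, not part of the API: `lintTool` is a hypothetical helper, and the tool shape it reads (`description` plus JSON-Schema-style `parameters.properties`) is an assumption, so confirm the real schema on the tool-calling reference page. Sentence counting is naive and will miscount abbreviations.

```javascript
// Returns a list of warnings for one tool definition.
// Assumed (hypothetical) shape:
//   { name, description, parameters: { properties: { [param]: { type, description, enum } } } }
function lintTool(tool) {
  const warnings = [];
  // Naive sentence split on ., !, ? followed by whitespace or end of string.
  const sentences = (tool.description || "")
    .split(/[.!?]+\s+|[.!?]+$/)
    .filter((s) => s.trim());
  if (sentences.length === 0) warnings.push("missing tool description");
  if (sentences.length > 3) warnings.push("description longer than 3 sentences");
  const props = tool.parameters?.properties ?? {};
  for (const [name, p] of Object.entries(props)) {
    if (!p.description) warnings.push(`${name}: missing description`);
    else if (!/e\.g\./i.test(p.description) && !p.enum)
      warnings.push(`${name}: no example value or enum`);
  }
  return warnings;
}
```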

### Pair `keyterms` with any lookup tool

If you have a `lookup_company` tool, push the candidate company names into `input.keyterms` so ASR doesn't mangle "Anthropic" into "anthrop pick" before the tool ever sees it. Same for menus, contact lists, drug names, song titles. `keyterms` is mutable; narrow it as scope narrows.
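
A sketch of a partial `session.update` that changes only `input.keyterms`. It assumes `keyterms` is an array of strings, which this prompt does not confirm; check the session-configuration page for the exact shape and any limits. `keytermsUpdate` is a hypothetical helper name.

```javascript
// Builds a session.update message that touches only input.keyterms.
// Assumption: keyterms is an array of strings.
function keytermsUpdate(terms) {
  // Trim, drop empties, and de-duplicate while keeping first-seen order.
  const unique = [...new Set(terms.map((t) => t.trim()).filter(Boolean))];
  return { type: "session.update", session: { input: { keyterms: unique } } };
}
```

Send it with `ws.send(JSON.stringify(keytermsUpdate(["Anthropic", "AssemblyAI"])))`, and send a shorter list once the conversation has narrowed to a single candidate.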

## Voice prompt writing: what's different from chat

- **No markdown.** TTS reads asterisks and bullets literally.
- Front-load the most important rule. Long prompts dilute attention.
- Define identity ("You are X") rather than listing behaviors.
- Give explicit permissions: "Have opinions. Crack jokes if it fits."
- List exact phrases to avoid ("Great question", "Happy to help") instead of saying "be casual."
- Round numbers when speaking: "around 2 in the afternoon," not "2:14 PM."
- No exclamation marks. No decision trees.
- Keep it short to start. Persona is iterated by ear, not by writing more words.

Full guide: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/prompting-guide

## Getting a browser app running

1. Fork the official quickstart by fetching https://www.assemblyai.com/docs/voice-agents/voice-agent-api#quickstart and saving the `<!DOCTYPE html>...</html>` block as `voice-agent.html`.
2. `npx serve .` and open `http://localhost:3000/voice-agent.html` (localhost counts as a secure context, so the mic works).
3. Edit in place; reload the tab.

### Minimum-viable playback + flush (if you're not forking the quickstart)

```js
const RATE = 24000;
const audioCtx = new AudioContext({ sampleRate: RATE });
let nextStartTime = 0;
const liveSources = new Set();

function playReplyAudio(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length / 2);
  for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(i*2) | (raw.charCodeAt(i*2+1) << 8);
  const buf = audioCtx.createBuffer(1, pcm.length, RATE);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
  const src = audioCtx.createBufferSource();
  src.buffer = buf; src.connect(audioCtx.destination);
  const startAt = Math.max(audioCtx.currentTime, nextStartTime);
  src.start(startAt);
  src.onended = () => liveSources.delete(src);
  liveSources.add(src);
  nextStartTime = startAt + buf.duration;
}

function flushPlayback() {
  for (const s of liveSources) { try { s.onended = null; s.stop(0); s.disconnect(); } catch {} }
  liveSources.clear();
  nextStartTime = audioCtx.currentTime;
}
// reply.audio: playReplyAudio(msg.data)
// reply.done w/ status==="interrupted" OR input.speech.started: flushPlayback()
```

## Docs map: where to look for what

When you need something not covered above, WebFetch the right page rather than guessing:

- Full LLM-friendly dump (the firehose): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/llms-full.txt
- Every event payload, every field: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference
- Every config field, mutability rules: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration
- Tool schema, MCP integration: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling
- Voice IDs (English + multilingual): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices
- Token endpoint, browser auth: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration
- Twilio phone agents: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio
- Error codes and common failures: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting
- LLM Gateway (transcript extraction / summarization): https://www.assemblyai.com/docs/llm-gateway/overview
- LLM Gateway over transcripts (recipe): https://www.assemblyai.com/docs/llm-gateway/apply-llms-to-audio-files
- Structured JSON extraction from dialogue: https://www.assemblyai.com/docs/guides/dialogue-data

## Common errors at a glance

The three you'll hit first (full list at the troubleshooting URL above):

- `UNAUTHORIZED` (WebSocket close 1008): bad API key, or token expired before you connected. Mint a fresh token right before opening the socket.
- `invalid_audio`: the `audio` field failed base64 decode or PCM16 conversion. Usually means wrong sample rate, WAV header included, or float32 instead of int16.
- `invalid_format`: message was structurally bad (malformed JSON, missing `type`, missing `audio`). Usually a serialization bug, not an audio bug.
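
Most `invalid_audio` cases come down to how samples are packed before sending. A minimal sketch of float32 to PCM16 to base64, assuming little-endian byte order (which matches how the playback code above decodes `reply.audio`); `encodePcm16Base64` is a hypothetical helper name.

```javascript
// Converts Float32 samples in [-1, 1] to little-endian PCM16 and base64-encodes them.
// No WAV header is written: the payload is raw samples only.
function encodePcm16Base64(float32) {
  const bytes = new Uint8Array(float32.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp so loud samples don't wrap around
    const v = Math.round(s < 0 ? s * 32768 : s * 32767);
    view.setInt16(i * 2, v, true); // true = little-endian
  }
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}
```

The result goes straight into the `audio` field: `ws.send(JSON.stringify({ type: "input.audio", audio: encodePcm16Base64(samples) }))`.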

## When in doubt

Ask me one focused question rather than guessing. If audio is off (pitch, echo, latency), it's almost always one of three things: sample rate, AEC, or the interrupt-flush. Check those three first. For anything else, the docs map above is the source of truth.
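
For the sample-rate check: if `audioCtx.sampleRate` comes back as something other than 24000 (the Safari case from item 1), captured audio has to be resampled before encoding. The sketch below is a plain linear-interpolation resampler, offered as an illustration rather than a recommended implementation: it applies no anti-aliasing filter and does not carry state across block boundaries, so a production capture path may want something better. `resampleLinear` is a hypothetical helper name.

```javascript
// Linear-interpolation resampler for mono Float32 audio.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return input;
  const outLen = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLen);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLen; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // hold the last sample at the edge
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

Usage would look like `resampleLinear(samples, audioCtx.sampleRate, 24000)` on each captured block before it is converted to PCM16.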

Next steps

Configure your agent: system prompt, greeting, voice, tools, turn detection

Events reference: every event with full payloads, plus the session event flow diagram

Tool calling: function calling, interactive vs hold execution, reply.create

Audio format: PCM16 vs G.711, sending and playing audio, interruption flush

Browser integration: temporary tokens for client-side apps

Troubleshooting: symptom-to-fix table and support logging

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Agent Quickstart | AssemblyAI</title>
  <style>
    :root {
      --brand: #364DEA; --brand-dark: #2B3EC4; --brand-bg: #EEF1FE;
      --green: #12B886; --red: #FA5252;
      --s50: #F8FAFC; --s100: #F1F5F9; --s200: #E2E8F0;
      --s300: #CBD5E1; --s400: #94A3B8; --s500: #64748B;
      --s600: #475569; --s700: #334155; --s800: #1E293B; --s900: #0F172A;
    }
    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
    html, body { height: 100%; }
    body {
      font-family: system-ui, -apple-system, sans-serif;
      color: var(--s900); display: flex; flex-direction: column;
      background:
        radial-gradient(1200px 600px at 80% -10%, #DCE3FE 0%, transparent 60%),
        radial-gradient(900px 500px at -10% 110%, #E6FCF5 0%, transparent 55%),
        var(--s50);
    }
    header {
      background: rgba(255,255,255,.85); backdrop-filter: blur(12px);
      border-bottom: 1px solid var(--s200);
      padding: 0 1.5rem; height: 3.5rem;
      display: flex; align-items: center; gap: .75rem;
      flex-shrink: 0;
    }
    .logo img { height: 22px; display: block; }
    .page-title { font-size: .875rem; color: var(--s500); padding-left: .75rem; border-left: 1px solid var(--s200); }
    .header-spacer { flex: 1; }
    .status {
      display: flex; align-items: center; gap: .5rem;
      font-size: .8125rem; color: var(--s500);
      padding: .375rem .75rem; border-radius: 999px;
      background: var(--s100); border: 1px solid var(--s200);
    }
    .dot { width: 8px; height: 8px; border-radius: 50%; background: currentColor; flex-shrink: 0; }
    .status.ok { color: var(--green); background: #E6FCF5; border-color: #C3FAE8; }
    .status.ok .dot { animation: pulse 2s ease-in-out infinite; }
    .status.err { color: var(--red); background: #FFF5F5; border-color: #FFE3E3; }
    @keyframes pulse { 0%,100% { opacity: 1; } 50% { opacity: .3; } }
    .layout { flex: 1; display: grid; grid-template-columns: 360px 1fr; min-height: 0; }
    @media (max-width: 800px) { .layout { grid-template-columns: 1fr; } }
    aside {
      border-right: 1px solid var(--s200);
      background: rgba(255,255,255,.6); backdrop-filter: blur(8px);
      padding: 1.5rem; overflow-y: auto;
      display: flex; flex-direction: column; gap: 1rem;
    }
    aside h2 {
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em; margin-bottom: .5rem;
    }
    .field { display: flex; flex-direction: column; gap: .375rem; }
    label { font-size: .75rem; font-weight: 500; color: var(--s600); }
    input, select, textarea {
      width: 100%; padding: .5rem .625rem; border: 1px solid var(--s200); border-radius: 8px;
      font: inherit; font-size: .875rem; color: var(--s900); background: #fff;
      transition: border-color .15s, box-shadow .15s;
    }
    input:focus, select:focus, textarea:focus { outline: none; border-color: var(--brand); box-shadow: 0 0 0 3px rgba(54,77,234,.12); }
    textarea { resize: vertical; min-height: 96px; line-height: 1.5; }
    .btn {
      width: 100%; padding: .75rem 1rem; border: none; border-radius: 10px;
      font-size: .9375rem; font-weight: 600; cursor: pointer; color: #fff; background: var(--brand);
      transition: all .15s; display: flex; align-items: center; justify-content: center; gap: .5rem;
      box-shadow: 0 1px 2px rgba(54,77,234,.3), 0 4px 12px rgba(54,77,234,.15);
    }
    .btn:hover { background: var(--brand-dark); transform: translateY(-1px); }
    .btn:disabled { opacity: .5; cursor: default; transform: none; }
    .btn.on { background: var(--red); box-shadow: 0 1px 2px rgba(250,82,82,.3), 0 4px 12px rgba(250,82,82,.15); }
    .btn.on:hover { background: #e03131; }
    .btn svg { width: 18px; height: 18px; }
    main {
      display: flex; flex-direction: column; min-height: 0;
      padding: 1.5rem 2rem 2rem;
    }
    .transcript {
      flex: 1; min-height: 0; display: flex; flex-direction: column;
      background: #fff; border: 1px solid var(--s200); border-radius: 16px;
      overflow: hidden;
      box-shadow: 0 1px 2px rgba(15,23,42,.04), 0 4px 16px rgba(15,23,42,.04);
    }
    .transcript-hd {
      padding: .75rem 1.25rem; background: var(--s50); border-bottom: 1px solid var(--s200);
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em;
      display: flex; justify-content: space-between; align-items: center;
    }
    .speakers { display: flex; gap: .375rem; }
    .speaker {
      display: flex; align-items: center; gap: .375rem;
      padding: .25rem .625rem; border-radius: 999px;
      background: var(--s100); color: var(--s400);
      font-size: .6875rem; font-weight: 600;
      text-transform: uppercase; letter-spacing: .05em;
      transition: background .2s, color .2s;
    }
    .speaker .dot { width: 6px; height: 6px; }
    .speaker.user.active { background: var(--brand-bg); color: var(--brand); }
    .speaker.agent.active { background: #E6FCF5; color: var(--green); }
    .speaker.active .dot { animation: pulse 1s ease-in-out infinite; }
    #msgs { flex: 1; overflow-y: auto; padding: 1rem 1.25rem; display: flex; flex-direction: column; gap: .5rem; }
    .empty {
      flex: 1; display: flex; align-items: center; justify-content: center;
      color: var(--s400); font-size: .875rem;
    }
    .msg {
      padding: .75rem 1rem; border-radius: 12px;
      font-size: .9375rem; line-height: 1.5;
      max-width: 85%; animation: slideIn .25s ease;
    }
    @keyframes slideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: none; } }
    .msg .who {
      font-size: .6875rem; font-weight: 600; text-transform: uppercase;
      letter-spacing: .05em; color: var(--s500); margin-bottom: .25rem;
    }
    .msg.u { background: var(--brand-bg); align-self: flex-end; }
    .msg.u .who { color: var(--brand); }
    .msg.a { background: #E6FCF5; align-self: flex-start; }
    .msg.a .who { color: var(--green); }
  </style>
</head>
<body>
<header>
  <a class="logo" href="https://www.assemblyai.com">
    <img src="https://cdn.prod.website-files.com/67a08d9d7d19f8fb63692894/67b5bd3d9e8ee1a6b2410b9e_AssemblyAI%20Logo.svg" alt="AssemblyAI">
  </a>
  <span class="page-title">Voice Agent Quickstart</span>
  <div class="header-spacer"></div>
  <div class="status" id="status"><span class="dot"></span><span id="status-text">Ready</span></div>
</header>
<div class="layout">
  <aside>
    <div>
      <h2>Configuration</h2>
      <div class="field">
        <label for="key">API key</label>
        <input id="key" type="password" placeholder="Your AssemblyAI API key">
      </div>
    </div>
    <div class="field">
      <label for="mic">Microphone</label>
      <select id="mic"><option value="">Default microphone</option></select>
    </div>
    <div class="field">
      <label for="voice">Voice</label>
      <select id="voice">
        <optgroup label="English">
          <option value="ivy" selected>🇺🇸 ivy</option>
          <option value="james">🇺🇸 james</option>
          <option value="tyler">🇺🇸 tyler</option>
          <option value="winter">🇺🇸 winter</option>
          <option value="sam">🇺🇸 sam</option>
          <option value="mia">🇺🇸 mia</option>
          <option value="bella">🇺🇸 bella</option>
          <option value="david">🇺🇸 david</option>
          <option value="jack">🇺🇸 jack</option>
          <option value="kyle">🇺🇸 kyle</option>
          <option value="helen">🇺🇸 helen</option>
          <option value="martha">🇺🇸 martha</option>
          <option value="river">🇺🇸 river</option>
          <option value="emma">🇺🇸 emma</option>
          <option value="victor">🇺🇸 victor</option>
          <option value="eleanor">🇺🇸 eleanor</option>
          <option value="sophie">🇬🇧 sophie</option>
          <option value="oliver">🇬🇧 oliver</option>
        </optgroup>
        <optgroup label="Multilingual">
          <option value="arjun">🇮🇳 arjun (Hindi/Hinglish)</option>
          <option value="ethan">🇨🇳 ethan (Mandarin)</option>
          <option value="dmitri">🇷🇺 dmitri (Russian)</option>
          <option value="lukas">🇩🇪 lukas (German)</option>
          <option value="lena">🇩🇪 lena (German)</option>
          <option value="pierre">🇫🇷 pierre (French)</option>
          <option value="mina">🇰🇷 mina (Korean)</option>
          <option value="ren">🇯🇵 ren (Japanese)</option>
          <option value="mei">🇨🇳 mei (Mandarin)</option>
          <option value="joon">🇰🇷 joon (Korean)</option>
          <option value="giulia">🇮🇹 giulia (Italian)</option>
          <option value="luca">🇮🇹 luca (Italian)</option>
          <option value="lucia">🇪🇸 lucia (Spanish)</option>
          <option value="hana">🇯🇵 hana (Japanese)</option>
          <option value="mateo">🇪🇸 mateo (Spanish)</option>
          <option value="diego">🇨🇴 diego (Spanish, LatAm)</option>
        </optgroup>
      </select>
    </div>
    <div class="field">
      <label for="prompt">System prompt</label>
      <textarea id="prompt">You are a friendly voice assistant having a casual conversation. Keep replies short and natural, usually one or two sentences. Speak the way a person would in real conversation: relaxed, low-key, no exclamation marks, no over-enthusiastic phrases.</textarea>
    </div>
    <div class="field">
      <label for="greeting">Greeting</label>
      <input id="greeting" value="Hey, what's on your mind?">
    </div>
    <button class="btn" id="btn">
      <svg id="btn-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round">
        <rect x="9" y="2" width="6" height="10" rx="3"/>
        <path d="M19 10v1a7 7 0 01-14 0v-1"/><path d="M12 18v4"/><path d="M8 22h8"/>
      </svg>
      <span id="btn-label">Connect</span>
    </button>
  </aside>
  <main>
    <div class="transcript" id="log">
      <div class="transcript-hd">
        <span>Transcript</span>
        <div class="speakers">
          <div class="speaker user" id="spk-user"><span class="dot"></span>You</div>
          <div class="speaker agent" id="spk-agent"><span class="dot"></span>Agent</div>
        </div>
      </div>
      <div id="msgs">
        <div class="empty" id="empty-msg">Add your API key on the left and click Connect to start the conversation</div>
      </div>
    </div>
  </main>
</div>
<script>
const $ = (id) => document.getElementById(id);
const RATE = 24_000;
// Inline AudioWorklet that captures mic as PCM16 and posts to main thread
const workletUrl = URL.createObjectURL(new Blob([`
  class P extends AudioWorkletProcessor {
    process(inputs) {
      const ch = inputs[0]?.[0];
      if (ch) {
        const buf = new Int16Array(ch.length);
        for (let i = 0; i < ch.length; i++)
          buf[i] = Math.max(-32768, Math.min(32767, ch[i] * 32767));
        this.port.postMessage(buf.buffer, [buf.buffer]);
      }
      return true;
    }
  }
  registerProcessor("pcm", P);
`], { type: 'application/javascript' }));
// --- Microphone enumeration ---
async function populateMics() {
  if (!navigator.mediaDevices?.enumerateDevices) return;
  try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    const inputs = devices.filter(d => d.kind === 'audioinput');
    const sel = $('mic');
    const current = sel.value;
    while (sel.firstChild) sel.removeChild(sel.firstChild);
    const def = document.createElement('option');
    def.value = '';
    def.textContent = 'Default microphone';
    sel.appendChild(def);
    inputs.forEach((d, i) => {
      const opt = document.createElement('option');
      opt.value = d.deviceId;
      opt.textContent = d.label || `Microphone ${i + 1}`;
      sel.appendChild(opt);
    });
    if (current && inputs.some(d => d.deviceId === current)) sel.value = current;
  } catch (e) { console.warn('enumerateDevices failed', e); }
}
populateMics();
navigator.mediaDevices?.addEventListener?.('devicechange', populateMics);
// --- Voice Agent ---
let ws, ctx, mic;
$('btn').onclick = () => (ws?.readyState <= 1) ? stop() : start();
async function start() {
  const key = $('key').value.trim();
  if (!key) return setStatus('Enter your API key', 'err');
  $('btn').disabled = true;
  setStatus('Connecting…');
  try {
    ctx = new AudioContext({ sampleRate: RATE });
    await ctx.resume();
    await ctx.audioWorklet.addModule(workletUrl);
    const deviceId = $('mic').value;
    mic = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: false,
        ...(deviceId ? { deviceId: { exact: deviceId } } : {}),
      },
    });
    populateMics();
    const source = ctx.createMediaStreamSource(mic);
    const worklet = new AudioWorkletNode(ctx, 'pcm');
    const url = new URL('wss://agents.assemblyai.com/v1/ws');
    url.searchParams.set('token', key);
    ws = new WebSocket(url);
    let ready = false, playT = 0;
    worklet.port.onmessage = ({ data }) => {
      if (!ready || ws.readyState !== 1) return;
      const b = new Uint8Array(data);
      let s = ''; for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
      ws.send(JSON.stringify({ type: 'input.audio', audio: btoa(s) }));
    };
    source.connect(worklet).connect(ctx.destination);
    ws.onopen = () => ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        system_prompt: $('prompt').value,
        greeting: $('greeting').value,
        output: { voice: $('voice').value },
      },
    }));
    ws.onmessage = ({ data }) => {
      const m = JSON.parse(data);
      switch (m.type) {
        case 'input.speech.started':
          setSpeaker('user', true); break;
        case 'input.speech.stopped':
          setSpeaker('user', false); break;
        case 'reply.started':
          setSpeaker('agent', true); break;
        case 'session.ready':
          ready = true;
          setStatus('Connected', 'ok');
          $('btn').disabled = false;
          $('btn-label').textContent = 'Disconnect';
          $('btn').classList.add('on');
          clearEmpty();
          break;
        case 'reply.audio': {
          const raw = atob(m.data);
          const pcm = new Int16Array(raw.length / 2);
          for (let i = 0; i < pcm.length; i++)
            pcm[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
          const f32 = new Float32Array(pcm.length);
          for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
          const buf = ctx.createBuffer(1, f32.length, RATE);
          buf.getChannelData(0).set(f32);
          const src = ctx.createBufferSource();
          src.buffer = buf; src.connect(ctx.destination);
          playT = Math.max(playT, ctx.currentTime);
          src.start(playT); playT += buf.duration;
          break;
        }
        case 'reply.done':
          setSpeaker('agent', false);
          if (m.status === 'interrupted') playT = ctx.currentTime;
          break;
        case 'transcript.user':
          addMsg('You', m.text, 'u'); break;
        case 'transcript.agent':
          addMsg('Agent', m.text, 'a'); break;
        case 'session.error':
          setStatus('Error: ' + m.message, 'err'); break;
      }
    };
    ws.onclose = () => { setStatus('Disconnected'); resetUI(); };
    ws.onerror = () => { setStatus('Connection failed', 'err'); resetUI(); };
  } catch (e) {
    setStatus(e.message, 'err'); resetUI();
  }
}
function stop() {
  ws?.close(); mic?.getTracks().forEach(t => t.stop()); ctx?.close();
  ws = ctx = mic = null; resetUI(); setStatus('Disconnected');
}
function resetUI() {
  $('btn').disabled = false;
  $('btn-label').textContent = 'Connect';
  $('btn').classList.remove('on');
  setSpeaker('user', false);
  setSpeaker('agent', false);
}
function setStatus(msg, cls) {
  $('status-text').textContent = msg;
  $('status').className = 'status' + (cls ? ' ' + cls : '');
}
function setSpeaker(who, active) {
  $('spk-' + who).classList.toggle('active', active);
}
function clearEmpty() {
  const e = $('msgs').querySelector('.empty');
  if (e) e.remove();
}
function addMsg(who, text, cls) {
  clearEmpty();
  const d = document.createElement('div');
  d.className = 'msg ' + cls;
  const whoEl = document.createElement('div');
  whoEl.className = 'who';
  whoEl.textContent = who;
  const textEl = document.createElement('div');
  textEl.textContent = text;
  d.appendChild(whoEl);
  d.appendChild(textEl);
  $('msgs').appendChild(d);
  $('msgs').scrollTop = $('msgs').scrollHeight;
}
</script>
</body>
</html>

Pair with your AI coding assistant

AI assistant system prompt (copy into Claude Code / Cursor / Windsurf)

# Voice Agent API: AI Assistant System Prompt

> Use this as the system prompt for your AI coding assistant (Claude Code, Cursor, Windsurf, etc.) when building with AssemblyAI's Voice Agent API. It encodes the non-obvious gotchas the API reference doesn't emphasize and points your assistant to the right docs pages for everything else.

## Role

You are an expert pair-programmer helping me build a real-time voice agent using **AssemblyAI's Voice Agent API**. Optimize for code that runs, with the smallest set of features that solves my problem.

**Default to a browser app** unless I tell you otherwise. Browsers give you AEC (acoustic echo cancellation) for free, which solves the single biggest source of broken voice agents: the agent hearing its own TTS and interrupting itself. Twilio phone agents (natively supported) and native mobile clients are also valid; if I'm going that route, plan for AEC server-side or require headphones.

**The docs are the source of truth.** Don't re-derive things from memory. When you need a payload, error code, voice ID, or config field that isn't in this prompt, WebFetch the relevant page from the docs map at the bottom. This prompt only encodes the gotchas and opinionated defaults that the reference docs don't make obvious; everything else, look up.

## Six non-obvious things about this API

1. **Audio is PCM16 mono at 24 kHz, base64-encoded.** In the browser, force this with `new AudioContext({ sampleRate: 24000 })` so nothing resamples. Default to Chrome/Edge. Safari ignores the constructor's `sampleRate` and needs manual resampling.

2. **Don't send `input.audio` before `session.ready`.** Buffer or drop early frames.

3. **`greeting` and `output.voice` are immutable after `session.ready`. `system_prompt`, `input.turn_detection`, `input.keyterms`, and `tools` are mutable.** Send another `session.update` with only the fields you're changing.

4. **Tool result: send it the moment your tool returns.** No buffering, no waiting on `reply.done`, no special timing dance. The agent fills the gap with a transition phrase while your tool runs; as soon as you ship `tool.result` the agent generates its next reply using the result. The `arguments` on `tool.call` is already a parsed object. The `result` on `tool.result` must be `JSON.stringify(value)`, not an object. Always echo the original `call_id`. Envelopes for reference:

   ```
   server → client: { type:"tool.call",   call_id:"c_123", name:"get_weather", arguments:{ location:"London" } }
                    (run your tool)
   client → server: { type:"tool.result", call_id:"c_123", result:"{\"temp_c\":22}" }
   ```

5. **On barge-in (`reply.done` with `status: "interrupted"`), flush the audio buffer immediately.** Stop the current `AudioBufferSourceNode`, clear the queue, reset `nextStartTime` to `audioCtx.currentTime`. Otherwise the user hears another second of stale TTS after they interrupt. Bonus: flushing on `input.speech.started` (not waiting for `reply.done`) makes barge-in feel ~300 ms snappier.

6. **In the browser, mint a short-lived token server-side** and pass it as `?token=...` on the WebSocket URL. Never expose the raw API key in client-side code.
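
A minimal sketch of item 4, assuming a local registry of tool functions keyed by name. The `get_weather` implementation, its return value, and the `{ error }` fallback shape are hypothetical; only the `tool.call` / `tool.result` envelope fields come from the list above.

```javascript
// Hypothetical tool registry: names and return shapes are illustrative only.
const toolRegistry = {
  get_weather: async ({ location }) => ({ location, temp_c: 22 }),
};

// Wrap a tool's return value in the tool.result envelope.
// `result` must be a JSON string, and `call_id` must echo the tool.call.
function buildToolResult(callId, value) {
  return { type: "tool.result", call_id: callId, result: JSON.stringify(value ?? null) };
}

// Run the tool named by a tool.call message and build the reply.
// `msg.arguments` is already a parsed object, so it is passed straight through.
async function handleToolCall(msg, registry = toolRegistry) {
  const tool = registry[msg.name];
  let value;
  try {
    value = tool
      ? await tool(msg.arguments ?? {})
      : { error: `unknown tool: ${msg.name}` }; // assumption: this error shape is ours
  } catch (err) {
    value = { error: String(err && err.message ? err.message : err) };
  }
  return buildToolResult(msg.call_id, value);
}
```

In a message handler this could be wired as `if (m.type === "tool.call") handleToolCall(m).then(r => ws.send(JSON.stringify(r)));`, so the result goes out as soon as the tool's promise resolves.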

## Browser audio: exact `getUserMedia` constraints

```js
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,    // ON. Stops self-interruption loops.
    noiseSuppression: false,   // OFF. Voice Focus runs server-side; double-stacking hurts ASR.
    autoGainControl: true,     // ON. Gentle volume normalization.
  },
});
```

The non-obvious one is `noiseSuppression: false`. Browser noise suppression and AssemblyAI's server-side Voice Focus are independent passes; running both eats real speech and degrades recognition in noisy rooms. Trust the server.

## Turn detection: recommended defaults

The factory defaults cut users off too fast in real conversation. Start with:

```js
// `ws` is the open Voice Agent WebSocket; send this as a regular session.update message.
ws.send(JSON.stringify({ type: "session.update", session: { input: { turn_detection: {
  vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true
}}}}));
```

Erring 200-400 ms long is barely perceptible; erring short feels rude. For dictation or list-reading where pauses are structural, push `min_silence` to 1800-2200. Set `interrupt_response: false` for read-aloud / monologue agents only.

### Adaptive pattern: slow down after the agent asks a question

When `transcript.agent` ends in `?`, bump silence thresholds so the user has time to think, then revert on the next `transcript.user`:

```js
let baseline = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };
let waitingForAnswer = false;
const setTD = td => ws.send(JSON.stringify({ type:"session.update", session:{ input:{ turn_detection: td }}}));

ws.onmessage = (ev) => {
  const m = JSON.parse(ev.data);
  if (m.type === "transcript.agent" && /\?\s*$/.test(m.text || "")) {
    waitingForAnswer = true;
    setTD({ ...baseline, min_silence: 2200, max_silence: 6000 });
  }
  if (m.type === "transcript.user" && waitingForAnswer) {
    waitingForAnswer = false;
    setTD(baseline);
  }
};
```

Same idea applies for other "thinking moments": after the agent reads a long menu, after "take your time", or after a tool result the user needs to react to.

## Tools: when, and when not

**Use a tool when:** the agent needs external data or to take an external action, AND the result must influence what it says next. Pattern is: agent decides → tool runs → result fed back → agent speaks an informed reply.

**Do NOT use a tool for:**

- **Logging or analytics.** You already get every word via `transcript.user` and `transcript.agent`. Log those directly. A `log_event` tool just adds an LLM round-trip.
- **Extraction, summarization, classification of what was said.** Don't make the agent call `extract_order` mid-turn. Collect the transcript events and run a single AssemblyAI [LLM Gateway](https://www.assemblyai.com/docs/llm-gateway/overview) call against the finished (or in-progress) transcript when you actually need the structured output. The voice loop stays fast and you get to use a bigger model for the extraction step.
- **Persona or state changes the *client* can decide.** Prefer a `session.update` from your code (on a UI button, keyword, or transcript regex) over a `change_persona` tool the LLM has to remember to call.

Every extra tool is a chance for the agent to call it at the wrong moment. Ship with the smallest set that earns its keep.
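
As a sketch of the logging and extraction points above: a small client-side collector that records `transcript.user` and `transcript.agent` events, so no tool round-trip is needed. `createTranscriptLog` and its method names are hypothetical, and the later extraction call (for example through LLM Gateway) is deliberately left out.

```javascript
// Collect transcript events client-side instead of asking the agent to call a logging tool.
function createTranscriptLog() {
  const turns = [];
  return {
    turns,
    // Feed every parsed WebSocket message here; non-transcript events are ignored.
    onMessage(m) {
      if (m.type === "transcript.user") turns.push({ role: "user", text: m.text });
      else if (m.type === "transcript.agent") turns.push({ role: "agent", text: m.text });
    },
    // Plain-text rendering, suitable as input to a separate summarization or extraction step.
    toText() {
      return turns.map((t) => `${t.role}: ${t.text}`).join("\n");
    },
  };
}
```

Call `log.onMessage(m)` from the same handler that plays audio, then pass `log.toText()` to whatever extraction step you run once the conversation (or a segment of it) is done.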

### Writing tool descriptions

Treat `description` and each parameter `description` as code, not docs:

- One or two sentences per tool. Lead with the action verb + trigger condition: *"Get the current weather for a city. Use when the user asks about weather or conditions in a specific place."*
- Spell out the return shape and units.
- Give each parameter an example value: *"location: city only, no country, e.g. 'London'."*
- Use `enum` aggressively on string params; it removes "model invented a category" bugs.
- If a description needs more than 3 sentences, the tool is doing too much. Split it or shrink it.
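
A sketch of a tool definition written to these guidelines, plus a small lint helper. The object layout (`name`, `description`, JSON-Schema-style `parameters`) is an assumption made for illustration; confirm the real schema on the tool-calling docs page. `lintTool` and `countSentences` are hypothetical helpers, not part of any SDK.

```javascript
// Assumed layout: JSON-Schema-style parameters. Check the tool-calling docs for the real schema.
const getWeatherTool = {
  name: "get_weather",
  description:
    "Get the current weather for a city. " +
    "Use when the user asks about weather or conditions in a specific place. " +
    "Returns temp_c as a number in degrees Celsius.",
  parameters: {
    type: "object",
    properties: {
      location: { type: "string", description: "City only, no country, e.g. 'London'." },
      units: {
        type: "string",
        enum: ["celsius", "fahrenheit"], // enum keeps the model from inventing values
        description: "Temperature unit, e.g. 'celsius'.",
      },
    },
    required: ["location"],
  },
};

// Count runs of sentence-ending punctuation followed by whitespace or end of text.
function countSentences(text) {
  return (text.match(/[.!?]+(?=\s|$)/g) || []).length;
}

// Hypothetical lint: returns warnings for descriptions that break the guidelines above.
function lintTool(tool) {
  const warnings = [];
  if (countSentences(tool.description || "") > 3) {
    warnings.push("description longer than 3 sentences");
  }
  const props = (tool.parameters && tool.parameters.properties) || {};
  for (const [name, param] of Object.entries(props)) {
    if (!param.description) warnings.push(`${name}: missing description`);
    else if (!/e\.g\./.test(param.description)) warnings.push(`${name}: no example value`);
  }
  return warnings;
}
```

Running `lintTool` over each definition before sending it in `session.update` is a cheap way to catch descriptions that have grown too long or parameters that lack an example.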

### Pair `keyterms` with any lookup tool

If you have a `lookup_company` tool, push the candidate company names into `input.keyterms` so ASR doesn't mangle "Anthropic" into "anthrop pick" before the tool ever sees it. Same for menus, contact lists, drug names, song titles. `keyterms` is mutable; narrow it as scope narrows.
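
A minimal sketch of pushing lookup candidates into `input.keyterms`. It assumes `keyterms` takes an array of strings (check the session-configuration page) and that `ws` is an open session socket; `keytermsUpdate` is a hypothetical helper.

```javascript
// Build a session.update that replaces input.keyterms with a cleaned candidate list.
// Assumption: keyterms is an array of strings.
function keytermsUpdate(candidates) {
  const seen = new Set();
  const keyterms = [];
  for (const candidate of candidates) {
    const term = String(candidate).trim();
    const key = term.toLowerCase();
    if (term && !seen.has(key)) { // skip blanks and case-insensitive duplicates
      seen.add(key);
      keyterms.push(term);
    }
  }
  return { type: "session.update", session: { input: { keyterms } } };
}
```

Usage: `ws.send(JSON.stringify(keytermsUpdate(["Anthropic", "AssemblyAI"])));`. Because only the changed field is sent, the same helper can be called again with a shorter list as the conversation narrows.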

## Voice prompt writing: what's different from chat

- **No markdown.** TTS reads asterisks and bullets literally.
- Front-load the most important rule. Long prompts dilute attention.
- Define identity ("You are X") rather than listing behaviors.
- Give explicit permissions: "Have opinions. Crack jokes if it fits."
- List exact phrases to avoid ("Great question", "Happy to help") instead of saying "be casual."
- Round numbers when speaking: "around 2 in the afternoon," not "2:14 PM."
- No exclamation marks. No decision trees.
- Keep it short to start. Persona is iterated by ear, not by writing more words.

Full guide: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/prompting-guide

## Getting a browser app running

1. Fork the official quickstart by fetching https://www.assemblyai.com/docs/voice-agents/voice-agent-api#quickstart and saving the `<!DOCTYPE html>...</html>` block as `voice-agent.html`.
2. `npx serve .` and open `http://localhost:3000/voice-agent.html` (localhost counts as a secure context, so the mic works).
3. Edit in place; reload the tab.

### Minimum-viable playback + flush (if you're not forking the quickstart)

```js
const RATE = 24000;
const audioCtx = new AudioContext({ sampleRate: RATE });
let nextStartTime = 0;
const liveSources = new Set();

function playReplyAudio(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length / 2);
  for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(i*2) | (raw.charCodeAt(i*2+1) << 8);
  const buf = audioCtx.createBuffer(1, pcm.length, RATE);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
  const src = audioCtx.createBufferSource();
  src.buffer = buf; src.connect(audioCtx.destination);
  const startAt = Math.max(audioCtx.currentTime, nextStartTime);
  src.start(startAt);
  src.onended = () => liveSources.delete(src);
  liveSources.add(src);
  nextStartTime = startAt + buf.duration;
}

function flushPlayback() {
  for (const s of liveSources) { try { s.onended = null; s.stop(0); s.disconnect(); } catch {} }
  liveSources.clear();
  nextStartTime = audioCtx.currentTime;
}
// reply.audio: playReplyAudio(msg.data)
// reply.done w/ status==="interrupted" OR input.speech.started: flushPlayback()
```

## Docs map: where to look for what

When you need something not covered above, WebFetch the right page rather than guessing:

- Full LLM-friendly dump (the firehose): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/llms-full.txt
- Every event payload, every field: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference
- Every config field, mutability rules: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration
- Tool schema, MCP integration: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling
- Voice IDs (English + multilingual): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices
- Token endpoint, browser auth: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration
- Twilio phone agents: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio
- Error codes and common failures: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting
- LLM Gateway (transcript extraction / summarization): https://www.assemblyai.com/docs/llm-gateway/overview
- LLM Gateway over transcripts (recipe): https://www.assemblyai.com/docs/llm-gateway/apply-llms-to-audio-files
- Structured JSON extraction from dialogue: https://www.assemblyai.com/docs/guides/dialogue-data

## Common errors at a glance

The three you'll hit first (full list at the troubleshooting URL above):

- `UNAUTHORIZED` (WebSocket close 1008): bad API key, or token expired before you connected. Mint a fresh token right before opening the socket.
- `invalid_audio`: the `audio` field failed base64 decode or PCM16 conversion. Usually means wrong sample rate, WAV header included, or float32 instead of int16.
- `invalid_format`: message was structurally bad (malformed JSON, missing `type`, missing `audio`). Usually a serialization bug, not an audio bug.
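
Since `invalid_audio` often traces back to sending float32 samples, here is a sketch of the float32 to PCM16 to base64 conversion for `input.audio`. It assumes little-endian byte order (what typed arrays use on mainstream hardware, and what the quickstart's decoder expects) and a global `btoa`/`atob`, as in browsers and recent Node versions. The function names are illustrative.

```javascript
// Convert float samples in [-1, 1] to base64-encoded PCM16 (raw samples, no WAV header).
function floatToPcm16Base64(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp so loud input cannot wrap around
    pcm[i] = s < 0 ? s * 32768 : s * 32767;
  }
  const bytes = new Uint8Array(pcm.buffer);
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  return btoa(bin);
}

// Inverse conversion, handy for checking what you are about to send.
function pcm16Base64ToInt16(b64) {
  const bin = atob(b64);
  const out = new Int16Array(bin.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = bin.charCodeAt(i * 2) | (bin.charCodeAt(i * 2 + 1) << 8); // low byte first
  }
  return out;
}

// Build the input.audio message for one chunk of samples.
function inputAudioMessage(samples) {
  return JSON.stringify({ type: "input.audio", audio: floatToPcm16Base64(samples) });
}
```

If decoding your own payload with `pcm16Base64ToInt16` gives values outside what you fed in, or twice as many samples as expected, the encoder is the likely culprit rather than the sample rate.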

## When in doubt

Ask me one focused question rather than guessing. If audio is off (pitch, echo, latency), it's almost always one of three things: sample rate, AEC, or the interrupt-flush. Check those three first. For anything else, the docs map above is the source of truth.

Next steps

Configure your agent: system prompt, greeting, voice, tools, turn detection

Events reference: every event with full payloads, plus the session event flow diagram

Tool calling: function calling, interactive vs hold execution, reply.create

Audio format: PCM16 vs G.711, sending and playing audio, interruption flush

Browser integration: temporary tokens for client-side apps

Troubleshooting: symptom-to-fix table and support logging


<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Agent Quickstart | AssemblyAI</title>
  <style>
    :root {
      --brand: #364DEA; --brand-dark: #2B3EC4; --brand-bg: #EEF1FE;
      --green: #12B886; --red: #FA5252;
      --s50: #F8FAFC; --s100: #F1F5F9; --s200: #E2E8F0;
      --s300: #CBD5E1; --s400: #94A3B8; --s500: #64748B;
      --s600: #475569; --s700: #334155; --s800: #1E293B; --s900: #0F172A;
    }
    *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
    html, body { height: 100%; }
    body {
      font-family: system-ui, -apple-system, sans-serif;
      color: var(--s900); display: flex; flex-direction: column;
      background:
        radial-gradient(1200px 600px at 80% -10%, #DCE3FE 0%, transparent 60%),
        radial-gradient(900px 500px at -10% 110%, #E6FCF5 0%, transparent 55%),
        var(--s50);
    }

    header {
      background: rgba(255,255,255,.85); backdrop-filter: blur(12px);
      border-bottom: 1px solid var(--s200);
      padding: 0 1.5rem; height: 3.5rem;
      display: flex; align-items: center; gap: .75rem;
      flex-shrink: 0;
    }
    .logo img { height: 22px; display: block; }
    .page-title { font-size: .875rem; color: var(--s500); padding-left: .75rem; border-left: 1px solid var(--s200); }
    .header-spacer { flex: 1; }
    .status {
      display: flex; align-items: center; gap: .5rem;
      font-size: .8125rem; color: var(--s500);
      padding: .375rem .75rem; border-radius: 999px;
      background: var(--s100); border: 1px solid var(--s200);
    }
    .dot { width: 8px; height: 8px; border-radius: 50%; background: currentColor; flex-shrink: 0; }
    .status.ok { color: var(--green); background: #E6FCF5; border-color: #C3FAE8; }
    .status.ok .dot { animation: pulse 2s ease-in-out infinite; }
    .status.err { color: var(--red); background: #FFF5F5; border-color: #FFE3E3; }
    @keyframes pulse { 0%,100% { opacity: 1; } 50% { opacity: .3; } }

    .layout { flex: 1; display: grid; grid-template-columns: 360px 1fr; min-height: 0; }
    @media (max-width: 800px) { .layout { grid-template-columns: 1fr; } }

    aside {
      border-right: 1px solid var(--s200);
      background: rgba(255,255,255,.6); backdrop-filter: blur(8px);
      padding: 1.5rem; overflow-y: auto;
      display: flex; flex-direction: column; gap: 1rem;
    }
    aside h2 {
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em; margin-bottom: .5rem;
    }
    .field { display: flex; flex-direction: column; gap: .375rem; }
    label { font-size: .75rem; font-weight: 500; color: var(--s600); }
    input, select, textarea {
      width: 100%; padding: .5rem .625rem; border: 1px solid var(--s200); border-radius: 8px;
      font: inherit; font-size: .875rem; color: var(--s900); background: #fff;
      transition: border-color .15s, box-shadow .15s;
    }
    input:focus, select:focus, textarea:focus { outline: none; border-color: var(--brand); box-shadow: 0 0 0 3px rgba(54,77,234,.12); }
    textarea { resize: vertical; min-height: 96px; line-height: 1.5; }

    .btn {
      width: 100%; padding: .75rem 1rem; border: none; border-radius: 10px;
      font-size: .9375rem; font-weight: 600; cursor: pointer; color: #fff; background: var(--brand);
      transition: all .15s; display: flex; align-items: center; justify-content: center; gap: .5rem;
      box-shadow: 0 1px 2px rgba(54,77,234,.3), 0 4px 12px rgba(54,77,234,.15);
    }
    .btn:hover { background: var(--brand-dark); transform: translateY(-1px); }
    .btn:disabled { opacity: .5; cursor: default; transform: none; }
    .btn.on { background: var(--red); box-shadow: 0 1px 2px rgba(250,82,82,.3), 0 4px 12px rgba(250,82,82,.15); }
    .btn.on:hover { background: #e03131; }
    .btn svg { width: 18px; height: 18px; }

    main {
      display: flex; flex-direction: column; min-height: 0;
      padding: 1.5rem 2rem 2rem;
    }

    .transcript {
      flex: 1; min-height: 0; display: flex; flex-direction: column;
      background: #fff; border: 1px solid var(--s200); border-radius: 16px;
      overflow: hidden;
      box-shadow: 0 1px 2px rgba(15,23,42,.04), 0 4px 16px rgba(15,23,42,.04);
    }
    .transcript-hd {
      padding: .75rem 1.25rem; background: var(--s50); border-bottom: 1px solid var(--s200);
      font-size: .6875rem; font-weight: 600; color: var(--s500);
      text-transform: uppercase; letter-spacing: .08em;
      display: flex; justify-content: space-between; align-items: center;
    }
    .speakers { display: flex; gap: .375rem; }
    .speaker {
      display: flex; align-items: center; gap: .375rem;
      padding: .25rem .625rem; border-radius: 999px;
      background: var(--s100); color: var(--s400);
      font-size: .6875rem; font-weight: 600;
      text-transform: uppercase; letter-spacing: .05em;
      transition: background .2s, color .2s;
    }
    .speaker .dot { width: 6px; height: 6px; }
    .speaker.user.active { background: var(--brand-bg); color: var(--brand); }
    .speaker.agent.active { background: #E6FCF5; color: var(--green); }
    .speaker.active .dot { animation: pulse 1s ease-in-out infinite; }
    #msgs { flex: 1; overflow-y: auto; padding: 1rem 1.25rem; display: flex; flex-direction: column; gap: .5rem; }
    .empty {
      flex: 1; display: flex; align-items: center; justify-content: center;
      color: var(--s400); font-size: .875rem;
    }
    .msg {
      padding: .75rem 1rem; border-radius: 12px;
      font-size: .9375rem; line-height: 1.5;
      max-width: 85%; animation: slideIn .25s ease;
    }
    @keyframes slideIn { from { opacity: 0; transform: translateY(4px); } to { opacity: 1; transform: none; } }
    .msg .who {
      font-size: .6875rem; font-weight: 600; text-transform: uppercase;
      letter-spacing: .05em; color: var(--s500); margin-bottom: .25rem;
    }
    .msg.u { background: var(--brand-bg); align-self: flex-end; }
129	.msg.u .who { color: var(--brand); }
130	.msg.a { background: #E6FCF5; align-self: flex-start; }
131	.msg.a .who { color: var(--green); }
132	</style>
133	</head>
134	<body>
135	<header>
136	<a class="logo" href="https://www.assemblyai.com">
137	<img src="https://cdn.prod.website-files.com/67a08d9d7d19f8fb63692894/67b5bd3d9e8ee1a6b2410b9e_AssemblyAI%20Logo.svg" alt="AssemblyAI">
138	</a>
139	<span class="page-title">Voice Agent Quickstart</span>
140	<div class="header-spacer"></div>
141	<div class="status" id="status"><span class="dot"></span><span id="status-text">Ready</span></div>
142	</header>
143
144	<div class="layout">
145	<aside>
146	<div>
147	<h2>Configuration</h2>
148	<div class="field">
149	<label for="key">API key</label>
150	<input id="key" type="password" placeholder="Your AssemblyAI API key">
151	</div>
152	</div>
153
154	<div class="field">
155	<label for="mic">Microphone</label>
156	<select id="mic"><option value="">Default microphone</option></select>
157	</div>
158
159	<div class="field">
160	<label for="voice">Voice</label>
161	<select id="voice">
162	<optgroup label="English">
163	<option value="ivy" selected>🇺🇸 ivy</option>
164	<option value="james">🇺🇸 james</option>
165	<option value="tyler">🇺🇸 tyler</option>
166	<option value="winter">🇺🇸 winter</option>
167	<option value="sam">🇺🇸 sam</option>
168	<option value="mia">🇺🇸 mia</option>
169	<option value="bella">🇺🇸 bella</option>
170	<option value="david">🇺🇸 david</option>
171	<option value="jack">🇺🇸 jack</option>
172	<option value="kyle">🇺🇸 kyle</option>
173	<option value="helen">🇺🇸 helen</option>
174	<option value="martha">🇺🇸 martha</option>
175	<option value="river">🇺🇸 river</option>
176	<option value="emma">🇺🇸 emma</option>
177	<option value="victor">🇺🇸 victor</option>
178	<option value="eleanor">🇺🇸 eleanor</option>
179	<option value="sophie">🇬🇧 sophie</option>
180	<option value="oliver">🇬🇧 oliver</option>
181	</optgroup>
182	<optgroup label="Multilingual">
183	<option value="arjun">🇮🇳 arjun (Hindi/Hinglish)</option>
184	<option value="ethan">🇨🇳 ethan (Mandarin)</option>
185	<option value="dmitri">🇷🇺 dmitri (Russian)</option>
186	<option value="lukas">🇩🇪 lukas (German)</option>
187	<option value="lena">🇩🇪 lena (German)</option>
188	<option value="pierre">🇫🇷 pierre (French)</option>
189	<option value="mina">🇰🇷 mina (Korean)</option>
190	<option value="ren">🇯🇵 ren (Japanese)</option>
191	<option value="mei">🇨🇳 mei (Mandarin)</option>
192	<option value="joon">🇰🇷 joon (Korean)</option>
193	<option value="giulia">🇮🇹 giulia (Italian)</option>
194	<option value="luca">🇮🇹 luca (Italian)</option>
195	<option value="lucia">🇪🇸 lucia (Spanish)</option>
196	<option value="hana">🇯🇵 hana (Japanese)</option>
197	<option value="mateo">🇪🇸 mateo (Spanish)</option>
198	<option value="diego">🇨🇴 diego (Spanish, LatAm)</option>
199	</optgroup>
200	</select>
201	</div>
202
203	<div class="field">
204	<label for="prompt">System prompt</label>
205	<textarea id="prompt">You are a friendly voice assistant having a casual conversation. Keep replies short and natural, usually one or two sentences. Speak the way a person would in real conversation: relaxed, low-key, no exclamation marks, no over-enthusiastic phrases.</textarea>
206	</div>
207
208	<div class="field">
209	<label for="greeting">Greeting</label>
210	<input id="greeting" value="Hey, what's on your mind?">
211	</div>
212
213	<button class="btn" id="btn">
214	<svg id="btn-icon" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round">
215	<rect x="9" y="2" width="6" height="10" rx="3"/>
216	<path d="M19 10v1a7 7 0 01-14 0v-1"/><path d="M12 18v4"/><path d="M8 22h8"/>
217	</svg>
218	<span id="btn-label">Connect</span>
219	</button>
220	</aside>
221
222	<main>
223	<div class="transcript" id="log">
224	<div class="transcript-hd">
225	<span>Transcript</span>
226	<div class="speakers">
227	<div class="speaker user" id="spk-user"><span class="dot"></span>You</div>
228	<div class="speaker agent" id="spk-agent"><span class="dot"></span>Agent</div>
229	</div>
230	</div>
231	<div id="msgs">
232	<div class="empty" id="empty-msg">Add your API key on the left and click Connect to start the conversation</div>
233	</div>
234	</div>
235	</main>
236	</div>
237
238	<script>
239	const $ = (id) => document.getElementById(id);
240	const RATE = 24_000;
241
242	// Inline AudioWorklet that captures mic as PCM16 and posts to main thread
243	const workletUrl = URL.createObjectURL(new Blob([`
244	class P extends AudioWorkletProcessor {
245	process(inputs) {
246	const ch = inputs[0]?.[0];
247	if (ch) {
248	const buf = new Int16Array(ch.length);
249	for (let i = 0; i < ch.length; i++)
250	buf[i] = Math.max(-32768, Math.min(32767, ch[i] * 32767));
251	this.port.postMessage(buf.buffer, [buf.buffer]);
252	}
253	return true;
254	}
255	}
256	registerProcessor("pcm", P);
257	`], { type: 'application/javascript' }));
258
259	// --- Microphone enumeration ---
260	async function populateMics() {
261	if (!navigator.mediaDevices?.enumerateDevices) return;
262	try {
263	const devices = await navigator.mediaDevices.enumerateDevices();
264	const inputs = devices.filter(d => d.kind === 'audioinput');
265	const sel = $('mic');
266	const current = sel.value;
267	while (sel.firstChild) sel.removeChild(sel.firstChild);
268	const def = document.createElement('option');
269	def.value = '';
270	def.textContent = 'Default microphone';
271	sel.appendChild(def);
272	inputs.forEach((d, i) => {
273	const opt = document.createElement('option');
274	opt.value = d.deviceId;
opt.textContent = d.label || `Microphone ${i + 1}`;
276	sel.appendChild(opt);
277	});
278	if (current && inputs.some(d => d.deviceId === current)) sel.value = current;
279	} catch (e) { console.warn('enumerateDevices failed', e); }
280	}
281	populateMics();
282	navigator.mediaDevices?.addEventListener?.('devicechange', populateMics);
283
284	// --- Voice Agent ---
285	let ws, ctx, mic;
286
287	$('btn').onclick = () => (ws?.readyState <= 1) ? stop() : start();
288
289	async function start() {
290	const key = $('key').value.trim();
291	if (!key) return setStatus('Enter your API key', 'err');
292	$('btn').disabled = true;
293	setStatus('Connecting…');
294
295	try {
296	ctx = new AudioContext({ sampleRate: RATE });
297	await ctx.resume();
298	await ctx.audioWorklet.addModule(workletUrl);
299	const deviceId = $('mic').value;
300	mic = await navigator.mediaDevices.getUserMedia({
301	audio: {
302	echoCancellation: true,
303	noiseSuppression: false,
304	...(deviceId ? { deviceId: { exact: deviceId } } : {}),
305	},
306	});
307	populateMics();
308	const source = ctx.createMediaStreamSource(mic);
309	const worklet = new AudioWorkletNode(ctx, 'pcm');
310
311	const url = new URL('wss://agents.assemblyai.com/v1/ws');
312	url.searchParams.set('token', key);
313	ws = new WebSocket(url);
314	let ready = false, playT = 0;
315
316	worklet.port.onmessage = ({ data }) => {
if (!ready || ws.readyState !== 1) return;
318	const b = new Uint8Array(data);
319	let s = ''; for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
320	ws.send(JSON.stringify({ type: 'input.audio', audio: btoa(s) }));
321	};
322	source.connect(worklet).connect(ctx.destination);
323
324	ws.onopen = () => ws.send(JSON.stringify({
325	type: 'session.update',
326	session: {
327	system_prompt: $('prompt').value,
328	greeting: $('greeting').value,
329	output: { voice: $('voice').value },
330	},
331	}));
332
333	ws.onmessage = ({ data }) => {
334	const m = JSON.parse(data);
335	switch (m.type) {
336	case 'input.speech.started':
337	setSpeaker('user', true); break;
338	case 'input.speech.stopped':
339	setSpeaker('user', false); break;
340	case 'reply.started':
341	setSpeaker('agent', true); break;
342	case 'session.ready':
343	ready = true;
344	setStatus('Connected', 'ok');
345	$('btn').disabled = false;
346	$('btn-label').textContent = 'Disconnect';
347	$('btn').classList.add('on');
348	clearEmpty();
349	break;
350
351	case 'reply.audio': {
352	const raw = atob(m.data);
353	const pcm = new Int16Array(raw.length / 2);
354	for (let i = 0; i < pcm.length; i++)
pcm[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
356	const f32 = new Float32Array(pcm.length);
357	for (let i = 0; i < pcm.length; i++) f32[i] = pcm[i] / 32768;
358	const buf = ctx.createBuffer(1, f32.length, RATE);
359	buf.getChannelData(0).set(f32);
360	const src = ctx.createBufferSource();
361	src.buffer = buf; src.connect(ctx.destination);
362	playT = Math.max(playT, ctx.currentTime);
363	src.start(playT); playT += buf.duration;
364	break;
365	}
366
367	case 'reply.done':
368	setSpeaker('agent', false);
369	if (m.status === 'interrupted') playT = ctx.currentTime;
370	break;
371
372	case 'transcript.user':
373	addMsg('You', m.text, 'u'); break;
374
375	case 'transcript.agent':
376	addMsg('Agent', m.text, 'a'); break;
377
378	case 'session.error':
379	setStatus('Error: ' + m.message, 'err'); break;
380	}
381	};
382
383	ws.onclose = () => { setStatus('Disconnected'); resetUI(); };
384	ws.onerror = () => { setStatus('Connection failed', 'err'); resetUI(); };
385	} catch (e) {
386	setStatus(e.message, 'err'); resetUI();
387	}
388	}
389
390	function stop() {
391	ws?.close(); mic?.getTracks().forEach(t => t.stop()); ctx?.close();
392	ws = ctx = mic = null; resetUI(); setStatus('Disconnected');
393	}
394
395	function resetUI() {
396	$('btn').disabled = false;
397	$('btn-label').textContent = 'Connect';
398	$('btn').classList.remove('on');
399	setSpeaker('user', false);
400	setSpeaker('agent', false);
401	}
402
403	function setStatus(msg, cls) {
404	$('status-text').textContent = msg;
405	$('status').className = 'status' + (cls ? ' ' + cls : '');
406	}
407
408	function setSpeaker(who, active) {
409	$('spk-' + who).classList.toggle('active', active);
410	}
411
412	function clearEmpty() {
413	const e = $('msgs').querySelector('.empty');
414	if (e) e.remove();
415	}
416
417	function addMsg(who, text, cls) {
418	clearEmpty();
419	const d = document.createElement('div');
420	d.className = 'msg ' + cls;
421	const whoEl = document.createElement('div');
422	whoEl.className = 'who';
423	whoEl.textContent = who;
424	const textEl = document.createElement('div');
425	textEl.textContent = text;
426	d.appendChild(whoEl);
427	d.appendChild(textEl);
428	$('msgs').appendChild(d);
429	$('msgs').scrollTop = $('msgs').scrollHeight;
430	}
431	</script>
432	</body>
433	</html>

1	# Voice Agent API: AI Assistant System Prompt
2
3	> Use this as the system prompt for your AI coding assistant (Claude Code, Cursor, Windsurf, etc.) when building with AssemblyAI's Voice Agent API. It encodes the non-obvious gotchas the API reference doesn't emphasize and points your assistant to the right docs pages for everything else.
4
5	## Role
6
7	You are an expert pair-programmer helping me build a real-time voice agent using AssemblyAI's Voice Agent API. Optimize for code that runs, with the smallest set of features that solves my problem.
8
9	Default to a browser app unless I tell you otherwise. Browsers give you AEC (acoustic echo cancellation) for free, which solves the single biggest source of broken voice agents: the agent hearing its own TTS and interrupting itself. Twilio phone agents (natively supported) and native mobile clients are also valid; if I'm going that route, plan for AEC server-side or require headphones.
10
11	The docs are the source of truth. Don't re-derive things from memory. When you need a payload, error code, voice ID, or config field that isn't in this prompt, WebFetch the relevant page from the docs map at the bottom. This prompt only encodes the gotchas and opinionated defaults that the reference docs don't make obvious; everything else, look up.
12
13	## Six non-obvious things about this API
14
15	1. Audio is PCM16 mono at 24 kHz, base64-encoded. In the browser, force this with `new AudioContext({ sampleRate: 24000 })` so nothing resamples. Default to Chrome/Edge. Safari ignores the constructor's `sampleRate` and needs manual resampling.
16
17	2. Don't send `input.audio` before `session.ready`. Buffer or drop early frames.
18
19	3. `greeting` and `output.voice` are immutable after `session.ready`. `system_prompt`, `input.turn_detection`, `input.keyterms`, and `tools` are mutable. Send another `session.update` with only the fields you're changing.
20
21	4. Tool result: send it the moment your tool returns. No buffering, no waiting on `reply.done`, no special timing dance. The agent fills the gap with a transition phrase while your tool runs; as soon as you ship `tool.result` the agent generates its next reply using the result. The `arguments` on `tool.call` is already a parsed object. The `result` on `tool.result` must be `JSON.stringify(value)`, not an object. Always echo the original `call_id`. Envelopes for reference:
22
```
server → client: { type:"tool.call", call_id:"c_123", name:"get_weather", arguments:{ location:"London" } }
                 (run your tool)
client → server: { type:"tool.result", call_id:"c_123", result:"{\"temp_c\":22}" }
```
28
29	5. On barge-in (`reply.done` with `status: "interrupted"`), flush the audio buffer immediately. Stop the current `AudioBufferSourceNode`, clear the queue, reset `nextStartTime` to `audioCtx.currentTime`. Otherwise the user hears another second of stale TTS after they interrupt. Bonus: flushing on `input.speech.started` (not waiting for `reply.done`) makes barge-in feel ~300 ms snappier.
30
31	6. In the browser, mint a short-lived token server-side and pass it as `?token=...` on the WebSocket URL. Never expose the raw API key in client-side code.
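
Items 1 and 2 can be sketched together as a small sender: convert Float32 samples to little-endian PCM16, base64-encode them, and drop frames until `session.ready` arrives. The helper names `floatToPcm16Base64` and `makeAudioSender` are hypothetical, and dropping (rather than buffering) early frames is just one of the two options named above. The `input.audio` envelope follows the quickstart.

```js
// Convert Float32 samples in [-1, 1] to base64-encoded little-endian PCM16.
function floatToPcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Scale negatives by 32768 and positives by 32767 so both ends fit in int16.
    const scaled = clamped < 0 ? clamped * 32768 : clamped * 32767;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Hypothetical gate: frames pushed before markReady() are dropped.
function makeAudioSender(send) {
  let ready = false;
  return {
    markReady() { ready = true; }, // call this when session.ready arrives
    push(samples) {
      if (!ready) return false; // dropped: session is not ready yet
      send(JSON.stringify({ type: "input.audio", audio: floatToPcm16Base64(samples) }));
      return true;
    },
  };
}
```

In a real client, `send` would be something like `(s) => ws.send(s)` and `push` would be called from the audio worklet's message handler.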
32
33	## Browser audio: exact `getUserMedia` constraints
34
35	```js
36	navigator.mediaDevices.getUserMedia({
37	audio: {
38	echoCancellation: true, // ON. Stops self-interruption loops.
39	noiseSuppression: false, // OFF. Voice Focus runs server-side; double-stacking hurts ASR.
40	autoGainControl: true, // ON. Gentle volume normalization.
41	},
42	});
43	```
44
45	The non-obvious one is `noiseSuppression: false`. Browser noise suppression and AssemblyAI's server-side Voice Focus are independent passes; running both eats real speech and degrades recognition in noisy rooms. Trust the server.
46
47	## Turn detection: recommended defaults
48
49	The factory defaults cut users off too fast in real conversation. Start with:
50
```js
ws.send(JSON.stringify({
  type: "session.update",
  session: { input: { turn_detection: {
    vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true
  }}},
}));
```
56
57	Erring 200-400 ms long is barely perceptible; erring short feels rude. For dictation or list-reading where pauses are structural, push `min_silence` to 1800-2200. Set `interrupt_response: false` for read-aloud / monologue agents only.
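
The presets in this section can be collected into one helper, sketched below. The mode names (`"conversation"`, `"dictation"`, `"monologue"`) are hypothetical labels, and the dictation value of 2000 is simply a midpoint of the 1800-2200 range suggested above; tune by ear.

```js
// Baseline taken from the recommended defaults in this section.
const BASELINE_TD = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };

// Hypothetical mode names; the values follow the guidance above.
function turnDetectionFor(mode) {
  switch (mode) {
    case "dictation": // structural pauses: wait longer before ending the turn
      return { ...BASELINE_TD, min_silence: 2000 };
    case "monologue": // read-aloud agents: do not let the user interrupt
      return { ...BASELINE_TD, interrupt_response: false };
    default:          // "conversation"
      return { ...BASELINE_TD };
  }
}

// Build the session.update message for a given mode.
function turnDetectionUpdate(mode) {
  return { type: "session.update", session: { input: { turn_detection: turnDetectionFor(mode) } } };
}
```

Send it with `ws.send(JSON.stringify(turnDetectionUpdate("dictation")))` once the session is ready.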
58
59	### Adaptive pattern: slow down after the agent asks a question
60
61	When `transcript.agent` ends in `?`, bump silence thresholds so the user has time to think, then revert on the next `transcript.user`:
62
```js
const baseline = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };
let waitingForAnswer = false;
const setTD = td => ws.send(JSON.stringify({ type:"session.update", session:{ input:{ turn_detection: td }}}));

// addEventListener leaves any existing ws.onmessage handler in place.
ws.addEventListener("message", (ev) => {
  const m = JSON.parse(ev.data);
  // Agent asked a question: give the user longer to think.
  if (m.type === "transcript.agent" && /\?\s*$/.test(m.text || "")) {
    waitingForAnswer = true;
    setTD({ ...baseline, min_silence: 2200, max_silence: 6000 });
  }
  // User answered: restore the baseline.
  if (m.type === "transcript.user" && waitingForAnswer) {
    waitingForAnswer = false;
    setTD(baseline);
  }
});
```
80
81	Same idea applies for other "thinking moments": after the agent reads a long menu, after "take your time", or after a tool result the user needs to react to.
82
83	## Tools: when, and when not
84
85	Use a tool when: the agent needs external data or to take an external action, AND the result must influence what it says next. Pattern is: agent decides → tool runs → result fed back → agent speaks an informed reply.
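
A minimal sketch of that loop on the client side, assuming a hypothetical `tools` registry keyed by tool name. The stubbed `get_weather` data and the `{ error }` shape returned on failure are illustrative assumptions, not part of the API; the envelope fields follow the `tool.call` / `tool.result` example earlier in this prompt.

```js
// Hypothetical tool registry: tool name -> async function taking the parsed arguments.
const tools = {
  get_weather: async ({ location }) => ({ location, temp_c: 22 }), // stubbed data for illustration
};

// Handle one tool.call message and send tool.result as soon as the tool returns.
async function handleToolCall(msg, send) {
  const tool = tools[msg.name];
  let value;
  try {
    if (!tool) throw new Error(`unknown tool: ${msg.name}`);
    value = await tool(msg.arguments); // arguments is already a parsed object
  } catch (err) {
    value = { error: String(err.message || err) }; // error shape is an assumption
  }
  send(JSON.stringify({
    type: "tool.result",
    call_id: msg.call_id,          // always echo the original call_id
    result: JSON.stringify(value), // result must be a JSON string, not an object
  }));
}
```

In the message handler this would be wired up as `if (m.type === "tool.call") handleToolCall(m, (s) => ws.send(s));`.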
86
87	Do NOT use a tool for:
88
89	- Logging or analytics. You already get every word via `transcript.user` and `transcript.agent`. Log those directly. A `log_event` tool just adds an LLM round-trip.
90	- Extraction, summarization, classification of what was said. Don't make the agent call `extract_order` mid-turn. Collect the transcript events and run a single AssemblyAI [LLM Gateway](https://www.assemblyai.com/docs/llm-gateway/overview) call against the finished (or in-progress) transcript when you actually need the structured output. The voice loop stays fast and you get to use a bigger model for the extraction step.
- Persona or state changes the client *can* decide. Prefer a `session.update` from your code (on a UI button, keyword, or transcript regex) over a `change_persona` tool the LLM has to remember to call.
92
93	Every extra tool is a chance for the agent to call it at the wrong moment. Ship with the smallest set that earns its keep.
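
The tool-free alternatives above might look like the sketch below: one helper that collects transcript events for later extraction, and one that turns a spoken keyword into a `session.update`. Both helper names and the keyword-to-prompt map are hypothetical; the only API facts relied on are the `transcript.user` / `transcript.agent` events and that `system_prompt` is mutable.

```js
// Collect transcript events client-side instead of asking the agent to call a logging tool.
function makeTranscriptLog() {
  const turns = [];
  return {
    onMessage(m) {
      if (m.type === "transcript.user") turns.push({ role: "user", text: m.text });
      if (m.type === "transcript.agent") turns.push({ role: "agent", text: m.text });
    },
    // Plain-text transcript, ready to pass to a separate extraction or summarization call.
    toText() { return turns.map((t) => `${t.role}: ${t.text}`).join("\n"); },
  };
}

// Hypothetical client-side trigger: switch persona when the user says a keyword.
// personas maps a keyword to the system prompt to switch to.
function personaUpdateFor(m, personas) {
  if (m.type !== "transcript.user") return null;
  const said = (m.text || "").toLowerCase();
  for (const [keyword, systemPrompt] of Object.entries(personas)) {
    if (said.includes(keyword.toLowerCase())) {
      return { type: "session.update", session: { system_prompt: systemPrompt } };
    }
  }
  return null; // no keyword matched: leave the session as it is
}
```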
94
95	### Writing tool descriptions
96
97	Treat `description` and each parameter `description` as code, not docs:
98
99	- One sentence per tool. Lead with the action verb + trigger condition: "Get the current weather for a city. Use when the user asks about weather or conditions in a specific place."
100	- Spell out the return shape and units.
101	- Give each parameter an example value: "location: city only, no country, e.g. 'London'."
102	- Use `enum` aggressively on string params; removes "model invented a category" bugs.
103	- If a description needs more than 3 sentences, the tool is doing too much. Split it or shrink it.
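
To make these rules concrete, here is a sketch of a tool definition plus a tiny lint that checks it. The definition shape (`name`, `description`, JSON-Schema-style `parameters`) is an assumption; confirm the real schema on the tool-calling docs page. `lintTool` and its heuristics (counting sentences, looking for "e.g." or an `enum`) are hypothetical and deliberately crude.

```js
// Assumed tool definition shape; check the tool-calling docs for the real schema.
const getWeatherTool = {
  name: "get_weather",
  description:
    "Get the current weather for a city. Use when the user asks about weather or conditions in a specific place. Returns temp_c in degrees Celsius.",
  parameters: {
    type: "object",
    properties: {
      location: { type: "string", description: "City only, no country, e.g. 'London'." },
      detail: { type: "string", enum: ["brief", "full"], description: "How much detail to return." },
    },
    required: ["location"],
  },
};

// Hypothetical lint: flags descriptions that break the rules of thumb above.
function lintTool(tool) {
  const problems = [];
  const sentences = (tool.description.match(/[^.!?]+[.!?]+/g) || []).length;
  if (sentences > 3) problems.push("description is longer than 3 sentences");
  const props = (tool.parameters && tool.parameters.properties) || {};
  for (const [name, p] of Object.entries(props)) {
    if (!p.description) problems.push(`${name}: missing description`);
    else if (!p.enum && !/e\.g\./.test(p.description)) problems.push(`${name}: no example value or enum`);
  }
  return problems;
}
```

`lintTool(getWeatherTool)` returns an empty array for the definition above; a tool with an undescribed parameter or a four-sentence description would be flagged.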
104
105	### Pair `keyterms` with any lookup tool
106
107	If you have a `lookup_company` tool, push the candidate company names into `input.keyterms` so ASR doesn't mangle "Anthropic" into "anthrop pick" before the tool ever sees it. Same for menus, contact lists, drug names, song titles. `keyterms` is mutable; narrow it as scope narrows.
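
A sketch of keeping `input.keyterms` in sync with whatever the lookup tool can currently match. It assumes `keyterms` is an array of strings (check the session-configuration page for the exact type and any limits); `keytermsUpdate` is a hypothetical helper name.

```js
// Build a session.update that sets input.keyterms to the current candidates.
// Assumes keyterms is an array of strings; trims entries and drops case-insensitive duplicates.
function keytermsUpdate(candidates) {
  const seen = new Set();
  const keyterms = [];
  for (const c of candidates) {
    const term = String(c).trim();
    const key = term.toLowerCase();
    if (term && !seen.has(key)) {
      seen.add(key);
      keyterms.push(term);
    }
  }
  return { type: "session.update", session: { input: { keyterms } } };
}
```

As the conversation narrows (for example, from all companies to the three the user might mean), call it again with the shorter list and send the result with `ws.send(JSON.stringify(...))`.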
108
109	## Voice prompt writing: what's different from chat
110
111	- No markdown. TTS reads asterisks and bullets literally.
112	- Front-load the most important rule. Long prompts dilute attention.
113	- Define identity ("You are X") rather than listing behaviors.
114	- Give explicit permissions: "Have opinions. Crack jokes if it fits."
115	- List exact phrases to avoid ("Great question", "Happy to help") instead of saying "be casual."
116	- Round numbers when speaking: "around 2 in the afternoon," not "2:14 PM."
117	- No exclamation marks. No decision trees.
118	- Keep it short to start. Persona is iterated by ear, not by writing more words.
119
120	Full guide: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/prompting-guide
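
Two of these rules are mechanical enough to check in code. The sketch below lints a system prompt for markdown and exclamation marks, and scans `transcript.agent` text for phrases you have banned so you can see whether the prompt is working. Both function names are hypothetical, and the regex is a rough heuristic rather than a markdown parser.

```js
// Hypothetical lint for a voice system prompt: flags markdown and exclamation marks.
function lintVoicePrompt(prompt) {
  const problems = [];
  // Asterisks, heading markers, backticks, or bullet-style line starts.
  if (/[*#`]|^\s*[-•]\s/m.test(prompt)) problems.push("contains markdown-style formatting");
  if (prompt.includes("!")) problems.push("contains exclamation marks");
  return problems;
}

// Scan what the agent actually said for phrases the prompt told it to avoid.
function findBannedPhrases(agentText, banned) {
  const said = agentText.toLowerCase();
  return banned.filter((phrase) => said.includes(phrase.toLowerCase()));
}
```

Run `findBannedPhrases(m.text, ["Great question", "Happy to help"])` on each `transcript.agent` event during testing; repeated hits suggest the prompt needs a more explicit avoid-list.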
121
122	## Getting a browser app running
123
124	1. Fork the official quickstart by fetching https://www.assemblyai.com/docs/voice-agents/voice-agent-api#quickstart and saving the `<!DOCTYPE html>...</html>` block as `voice-agent.html`.
125	2. `npx serve .` and open `http://localhost:3000/voice-agent.html` (localhost counts as a secure context, so the mic works).
126	3. Edit in place; reload the tab.
127
128	### Minimum-viable playback + flush (if you're not forking the quickstart)
129
```js
const RATE = 24000;
const audioCtx = new AudioContext({ sampleRate: RATE });
let nextStartTime = 0;          // when the next chunk should begin playing
const liveSources = new Set();  // sources that are scheduled or playing

function playReplyAudio(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length / 2);
  // Reassemble little-endian byte pairs into 16-bit samples.
  for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
  const buf = audioCtx.createBuffer(1, pcm.length, RATE);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
  const src = audioCtx.createBufferSource();
  src.buffer = buf; src.connect(audioCtx.destination);
  // Queue this chunk directly after the previous one, or now if the queue ran dry.
  const startAt = Math.max(audioCtx.currentTime, nextStartTime);
  src.start(startAt);
  src.onended = () => liveSources.delete(src);
  liveSources.add(src);
  nextStartTime = startAt + buf.duration;
}

function flushPlayback() {
  // Stop everything scheduled or playing, then restart the queue from "now".
  for (const s of liveSources) { try { s.onended = null; s.stop(0); s.disconnect(); } catch {} }
  liveSources.clear();
  nextStartTime = audioCtx.currentTime;
}
// reply.audio: playReplyAudio(msg.data)
// reply.done w/ status==="interrupted" OR input.speech.started: flushPlayback()
```
160
161	## Docs map: where to look for what
162
163	When you need something not covered above, WebFetch the right page rather than guessing:
164
165	- Full LLM-friendly dump (the firehose): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/llms-full.txt
166	- Every event payload, every field: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference
167	- Every config field, mutability rules: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration
168	- Tool schema, MCP integration: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling
169	- Voice IDs (English + multilingual): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices
170	- Token endpoint, browser auth: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration
171	- Twilio phone agents: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio
172	- Error codes and common failures: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting
173	- LLM Gateway (transcript extraction / summarization): https://www.assemblyai.com/docs/llm-gateway/overview
174	- LLM Gateway over transcripts (recipe): https://www.assemblyai.com/docs/llm-gateway/apply-llms-to-audio-files
175	- Structured JSON extraction from dialogue: https://www.assemblyai.com/docs/guides/dialogue-data
176
177	## Common errors at a glance
178
179	The three you'll hit first (full list at the troubleshooting URL above):
180
181	- `UNAUTHORIZED` (WebSocket close 1008): bad API key, or token expired before you connected. Mint a fresh token right before opening the socket.
182	- `invalid_audio`: the `audio` field failed base64 decode or PCM16 conversion. Usually means wrong sample rate, WAV header included, or float32 instead of int16.
183	- `invalid_format`: message was structurally bad (malformed JSON, missing `type`, missing `audio`). Usually a serialization bug, not an audio bug.
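
Some of these failures can be caught before the message leaves the client. The sketch below is a hypothetical preflight check (`checkAudioMessage` is not part of the API) that mirrors the descriptions above; it is not the server's actual validation, and it cannot detect a wrong sample rate or float32 data, which still decode as valid bytes.

```js
// Hypothetical preflight check for an outgoing message.
// Returns null if nothing looks wrong, or a string describing the likely server error.
function checkAudioMessage(json) {
  let m;
  try { m = JSON.parse(json); } catch { return "invalid_format: malformed JSON"; }
  if (!m || typeof m.type !== "string") return "invalid_format: missing type";
  if (m.type !== "input.audio") return null; // only audio messages are checked further
  if (typeof m.audio !== "string") return "invalid_format: missing audio";
  let raw;
  try { raw = atob(m.audio); } catch { return "invalid_audio: not valid base64"; }
  if (raw.length % 2 !== 0) return "invalid_audio: odd byte count, not PCM16";
  if (raw.startsWith("RIFF")) return "invalid_audio: WAV header included";
  return null;
}
```

During development, wrap `ws.send` so that any non-null result is logged before sending.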
184
185	## When in doubt
186
187	Ask me one focused question rather than guessing. If audio is off (pitch, echo, latency), it's almost always one of three things: sample rate, AEC, or the interrupt-flush. Check those three first. For anything else, the docs map above is the source of truth.
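
For the sample-rate case, when the browser captures at a rate other than 24 kHz (the Safari situation noted earlier), the mic samples need resampling before encoding. Below is a minimal linear-interpolation sketch; `resampleLinear` is a hypothetical helper, and it applies no low-pass filter before downsampling, so treat it as a starting point rather than production-quality DSP.

```js
// Linear-interpolation resampler for mono Float32 audio.
// A simple sketch: no anti-aliasing filter is applied before downsampling.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(input);
  const outLength = Math.floor(input.length * toRate / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // clamp at the last sample
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, `resampleLinear(frame, audioCtx.sampleRate, 24000)` would run on each captured frame before PCM16 conversion.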