Insights & Use Cases
May 7, 2026

Build a voice agent that can make outbound calls with AssemblyAI

The AssemblyAI Voice Agent API handles speech recognition, the language model, and text-to-speech inside a single WebSocket. Twilio's Calls API dials the target number and opens a Media Stream that carries the call audio in both directions. This tutorial connects the two with a small Express + ws bridge, so there are no separate STT, LLM, or TTS services to wire up.

Kelsey Foster
Growth

Why outbound voice agents matter

A voice agent that can dial out, not just answer, unlocks workflows that text channels drop the ball on:

| Use case | What the agent does | Why outbound beats inbound |
| --- | --- | --- |
| Appointment reminders | Calls the patient 24 h before, confirms or reschedules | Reaches people who never read the SMS |
| Lead qualification | Calls a fresh inbound lead, qualifies, books with sales | Engages while interest is still hot |
| Survey + NPS | Reads the prompt, captures freeform answers | Higher response rate than email |
| Past-due collections | Calls account, takes payment via tool call | Lower agent cost than a human dialer |
| Recall and renewal | Notifies of a recall, prescription refill, or expiring policy | Cuts through inbox noise |
| Customer winback | Reaches lapsed customers with a personalized offer | More personal than a marketing email |

In every case the win is the same: the agent reaches the customer through the channel they actually pick up, holds a real conversation, and writes the outcome to your system of record.

Architecture

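The system has three components connected by two WebSockets:

  • Twilio, which dials the number through the Calls API and carries the call audio over a Media Stream
  • The bridge server (Express + ws), which accepts Twilio's Media Streams WebSocket and opens a second WebSocket to AssemblyAI
  • The AssemblyAI Voice Agent API, which handles speech recognition, the language model, and text-to-speech

The agent's turn-taking behavior is controlled by these session parameters (see "Turn detection" under "Tuning the agent"):

| Parameter | Type | Description |
| --- | --- | --- |
| vad_threshold | 0.0–1.0 | Voice activity detection sensitivity. Raise for noisy phone lines. |
| min_silence | ms | Minimum silence before the end-of-turn check fires. Raise for deliberate speech. |
| max_silence | ms | Hard cap on silence before forcing end-of-turn. |
| interrupt_response | boolean | Set to false to disable barge-in entirely. |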

The key insight: both legs use audio/pcmu (G.711 μ-law at 8 kHz). Twilio Media Streams already deliver base64-encoded μ-law audio, and the Voice Agent API accepts and emits the same format natively. That means zero resampling — bytes pass through end-to-end.
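
To make the pass-through claim concrete, here is a small sketch of the arithmetic. With one byte per μ-law sample at 8 kHz, a payload's duration follows directly from its decoded length. The helper name ulawDurationMs is hypothetical, and the 160-byte frame in the usage line assumes 20 ms frames, a common telephony frame size that this tutorial does not itself state.

```javascript
// G.711 μ-law: 8000 samples per second, 1 byte per sample.
const ULAW_BYTES_PER_SECOND = 8000;

// Hypothetical helper: duration in ms of one base64-encoded μ-law payload.
function ulawDurationMs(base64Payload) {
  const byteLength = Buffer.from(base64Payload, "base64").length;
  return (byteLength * 1000) / ULAW_BYTES_PER_SECOND;
}

// Assumed example frame: 160 bytes of 0xff (μ-law silence) lasts 20 ms.
const frame = Buffer.alloc(160, 0xff).toString("base64");
console.log(ulawDurationMs(frame)); // 20
```

Because the bridge never decodes the payload, a helper like this is only useful for logging or metering; the audio itself is forwarded as the same base64 string.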

Prerequisites

  • Node.js 18+ and npm
  • An AssemblyAI API key — free tier available
  • A Twilio account plus a voice-capable phone number in your console
  • ngrok (or any public HTTPS tunnel) so Twilio can reach your dev machine

Consent matters. Automated outbound calls are regulated almost everywhere — TCPA in the US, the various state DNC registries, GDPR in the EU, two-party-consent rules for recording, and more. Disclose that the call is automated in the opener, honor “remove me from the list” requests, and consult counsel before dialing real prospects.

Quick start

1. Clone and install

git clone https://github.com/kelsey-aai/voice-agent-outbound-calls
cd voice-agent-outbound-calls
npm install

2. Configure your environment

cp .env.example .env
# Fill in:
#   ASSEMBLYAI_API_KEY     — from the AssemblyAI dashboard
#   TWILIO_ACCOUNT_SID     — from console.twilio.com
#   TWILIO_AUTH_TOKEN      — from console.twilio.com
#   TWILIO_FROM_NUMBER     — your Twilio voice number, e.g. +15551234567
#   PUBLIC_URL             — leave blank for now; we'll fill it after ngrok
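
A missing or mistyped variable tends to surface later as a confusing call failure, so it can help to check the configuration at startup. The following is a minimal sketch, not code from the repository: the checkEnv helper is hypothetical, and only the variable names come from .env.example. PUBLIC_URL will be reported as missing until you fill it in after the ngrok step.

```javascript
// Variable names come from .env.example; the helper itself is hypothetical.
const REQUIRED_ENV = [
  "ASSEMBLYAI_API_KEY",
  "TWILIO_ACCOUNT_SID",
  "TWILIO_AUTH_TOKEN",
  "TWILIO_FROM_NUMBER",
  "PUBLIC_URL",
];

// Returns a list of human-readable problems; an empty list means the config looks usable.
function checkEnv(env) {
  const value = (name) => (env[name] || "").trim();
  const problems = REQUIRED_ENV
    .filter((name) => value(name) === "")
    .map((name) => `${name} is missing`);
  // An http:// or localhost PUBLIC_URL is a common cause of TwiML fetch failures.
  if (value("PUBLIC_URL") !== "" && !value("PUBLIC_URL").startsWith("https://")) {
    problems.push("PUBLIC_URL should start with https://");
  }
  return problems;
}

console.log(checkEnv(process.env));
```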

3. Run the server

npm start
# → Listening on http://localhost:3000

4. Expose it with ngrok

In a second terminal:

ngrok http 3000

Copy the HTTPS forwarding URL (e.g. https://ab12cd34.ngrok-free.app), paste it into .env as PUBLIC_URL, then stop the server and run npm start again so it picks up the new value.

5. Place a call

curl -X POST http://localhost:3000/call \
  -H 'content-type: application/json' \
  -d '{"to":"+15551234567"}'

Use your own phone number for the first call so you become the prospect. The phone rings, the agent greets you with the disclosure, and you can talk to it as you would to a person.

How it works

1. Place the call

POST /call receives a JSON body with the target number and asks Twilio to dial it. Twilio's Calls API does the actual dialing and, when the recipient picks up, fetches the URL we passed as url: to get TwiML instructions for the call.

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
});
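
Before handing the number to Twilio, the /call handler can reject bodies that are clearly malformed. This is a sketch under the assumption that callers send E.164 numbers, as in the curl example; parseCallRequest is a hypothetical helper, not part of the repository.

```javascript
// E.164: a leading +, then up to 15 digits, the first of which is non-zero.
const E164 = /^\+[1-9]\d{1,14}$/;

// Hypothetical helper for the POST /call handler: returns the number to dial or throws.
function parseCallRequest(body) {
  const to = typeof body?.to === "string" ? body.to.trim() : "";
  if (!E164.test(to)) {
    throw new Error('"to" must be an E.164 number such as +15551234567');
  }
  return to;
}

console.log(parseCallRequest({ to: "+15551234567" })); // +15551234567
```

In an Express handler you would typically catch the error and respond with a 400 status instead of calling twilioClient.calls.create.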

2. Return TwiML that opens a media stream

When Twilio fetches /twiml, the server returns a tiny piece of XML that wraps the live call in a <Connect><Stream> verb. That verb tells Twilio to open a WebSocket back to our server and pipe the call audio over it.

app.post("/twiml", (_req, res) => {
  const wsUrl = PUBLIC_URL.replace(/^http/, "ws") + "/twilio-stream";
  res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="${wsUrl}" />
  </Connect>
</Response>`);
});
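
The URL rewrite in that handler is easy to get subtly wrong, for example when PUBLIC_URL ends in a slash. One way to make it testable is to pull the TwiML generation into a pure function. The sketch below assumes the same /twilio-stream path; buildTwiml and escapeXmlAttr are hypothetical names.

```javascript
// Escape the characters that are unsafe inside an XML attribute value.
function escapeXmlAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: derive the stream URL from PUBLIC_URL and wrap it in TwiML.
function buildTwiml(publicUrl) {
  // https:// becomes wss://, http:// becomes ws://; trailing slashes are dropped.
  const wsUrl = publicUrl.replace(/\/+$/, "").replace(/^http/, "ws") + "/twilio-stream";
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="${escapeXmlAttr(wsUrl)}" />
  </Connect>
</Response>`;
}

console.log(buildTwiml("https://ab12cd34.ngrok-free.app/"));
```

The route handler then reduces to res.type("text/xml").send(buildTwiml(PUBLIC_URL)).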

3. Bridge two WebSockets

When Twilio connects to /twilio-stream, we open a second WebSocket to AssemblyAI and shuttle messages between them. The first message we send to AssemblyAI is session.update — it configures the agent's personality, voice, and audio formats.

aaiWs.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: SYSTEM_PROMPT,
    greeting: GREETING,
    input:  { format: { encoding: "audio/pcmu" } },
    output: { voice: "ivy", format: { encoding: "audio/pcmu" } },
  },
}));

Both formats are audio/pcmu. Twilio Media Streams already deliver base64-encoded μ-law 8 kHz audio. AssemblyAI accepts that format natively and can emit it back, which means we never decode, resample, or re-encode any audio in this server.

Greeting is set in session.update. Outbound calls need the agent to speak first — the prospect has no idea why their phone is ringing. Setting session.greeting makes the agent open the conversation as soon as the session is ready.
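
Since the two encodings must always match, it can be convenient to build the message in one place. This is an illustrative sketch that uses only the fields shown above; buildSessionUpdate is a hypothetical helper, and the prompt and greeting strings are made-up examples.

```javascript
// Hypothetical helper: builds the session.update message shown above.
// Both encodings are fixed to audio/pcmu so Twilio audio passes through unchanged.
function buildSessionUpdate({ systemPrompt, greeting, voice = "ivy" }) {
  if (!greeting) {
    // Outbound calls need the agent to speak first.
    throw new Error("greeting is required for outbound calls");
  }
  return {
    type: "session.update",
    session: {
      system_prompt: systemPrompt,
      greeting,
      input: { format: { encoding: "audio/pcmu" } },
      output: { voice, format: { encoding: "audio/pcmu" } },
    },
  };
}

// Example values only; write your own prompt and disclosure.
const message = buildSessionUpdate({
  systemPrompt: "You are a friendly appointment-reminder agent. Keep replies short.",
  greeting: "Hi, this is an automated call from the clinic about your appointment.",
});
console.log(JSON.stringify(message, null, 2));
```

The result is what you would pass to aaiWs.send(JSON.stringify(...)) once the AssemblyAI socket is open.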

4. Forward audio in both directions

The Twilio side emits connected, start, media, and stop events. We capture streamSid from start, forward media payloads to AssemblyAI as input.audio events, and close the AAI socket on stop.

case "media": {
  const payload = msg.media.payload;  // already base64 μ-law 8 kHz
  aaiWs.send(JSON.stringify({ type: "input.audio", audio: payload }));
  break;
}

Each reply.audio chunk from AssemblyAI is base64 μ-law that we wrap in a Twilio media event and ship straight back to the call:

case "reply.audio":
  twilioWs.send(JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: evt.data },
  }));
  break;
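
The Twilio-side handling described at the start of this step can be sketched as a pure function that takes a raw message plus the current streamSid and returns what the bridge should do next. The function name and the shape of the returned action object are assumptions made for illustration; only the event names and the input.audio message come from the text above.

```javascript
// Hypothetical pure translator for the Twilio side of the bridge.
// Returns the next streamSid plus an action for the caller to perform.
function handleTwilioMessage(raw, streamSid) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "start":
      // Twilio announces the stream; remember its ID for outbound media.
      return { streamSid: msg.start.streamSid, action: null };
    case "media":
      // Payload is already base64 μ-law 8 kHz, so forward it untouched.
      return {
        streamSid,
        action: { send: { type: "input.audio", audio: msg.media.payload } },
      };
    case "stop":
      // The call ended; the caller should close the AssemblyAI socket.
      return { streamSid, action: { close: true } };
    default:
      // "connected" and anything unrecognized need no action.
      return { streamSid, action: null };
  }
}

// Usage with made-up messages:
const started = handleTwilioMessage(
  JSON.stringify({ event: "start", start: { streamSid: "MZ123" } }),
  null
);
console.log(started.streamSid); // MZ123
```

Keeping the translation free of socket calls makes it straightforward to unit-test without a live call.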

5. Handle barge-in cleanly

When the user speaks while the agent is talking, AssemblyAI emits reply.done with status: "interrupted". On a phone call we also need to flush whatever audio Twilio still has buffered. Twilio supports a clear event for exactly this:

case "reply.done":
  if (evt.status === "interrupted" && streamSid) {
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
  break;
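
Taken together, the reply.audio and reply.done cases amount to a small mapping from AssemblyAI events to Twilio messages. A sketch of that mapping as a pure function follows; toTwilioMessage is a hypothetical name, and returning null before streamSid is known is a design choice of this sketch rather than behavior the tutorial prescribes.

```javascript
// Hypothetical pure translator for the AssemblyAI side of the bridge.
// Returns the Twilio message to send, or null when there is nothing to forward.
function toTwilioMessage(evt, streamSid) {
  if (!streamSid) return null; // Twilio has not sent "start" yet
  switch (evt.type) {
    case "reply.audio":
      // Agent audio is base64 μ-law; wrap it in a Twilio media event.
      return { event: "media", streamSid, media: { payload: evt.data } };
    case "reply.done":
      // On barge-in, flush the audio Twilio still has buffered.
      return evt.status === "interrupted" ? { event: "clear", streamSid } : null;
    default:
      return null;
  }
}

const clearMessage = toTwilioMessage({ type: "reply.done", status: "interrupted" }, "MZ123");
console.log(JSON.stringify(clearMessage)); // {"event":"clear","streamSid":"MZ123"}
```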

6. Echo cancellation is the carrier's job

On a phone call you don't have to think about acoustic echo cancellation — the carrier and the handset handle it. That's a meaningful difference from terminal-based clients, which need headphones to keep the agent from interrupting itself.

Tuning the agent

Voice

Drop any voice ID from the Voices catalog into session.output.voice. There are 18 English voices and 16 multilingual voices available; multilingual voices code-switch with English automatically.

output: { voice: "james",  format: { encoding: "audio/pcmu" } }
output: { voice: "sophie", format: { encoding: "audio/pcmu" } }
output: { voice: "diego",  format: { encoding: "audio/pcmu" } }

System prompt and greeting

Both live near the top of server.js. Keep them short — phone-call replies should be one or two sentences. Always disclose that the call is automated in the first sentence; several US states require it.

Turn detection

Outbound calls often run on noisier lines than browser-based agents. The defaults in server.js are tuned a little tighter:

turn_detection: {
  vad_threshold: 0.5,        // 0.0–1.0; raise for noisy lines
  min_silence: 400,          // ms; raise for deliberate speech
  max_silence: 1200,         // ms; max wait before forcing end-of-turn
  interrupt_response: true,  // false to disable barge-in
}

See the session configuration reference for every available knob.
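
When experimenting with these values it is easy to type a threshold outside the documented range or to swap the two silence settings. The validator below is a sketch, not part of the API: validateTurnDetection is a hypothetical helper, and the rules that the silence values are whole milliseconds and that min_silence should not exceed max_silence are assumptions inferred from the parameter descriptions.

```javascript
// Hypothetical validator for the turn_detection block shown above.
function validateTurnDetection(td) {
  const errors = [];
  if (typeof td.vad_threshold !== "number" || td.vad_threshold < 0 || td.vad_threshold > 1) {
    errors.push("vad_threshold must be a number between 0.0 and 1.0");
  }
  for (const key of ["min_silence", "max_silence"]) {
    // Assumption: silence values are whole, non-negative milliseconds.
    if (!Number.isInteger(td[key]) || td[key] < 0) {
      errors.push(`${key} must be a non-negative integer (ms)`);
    }
  }
  // Assumption: the hard cap should not be shorter than the minimum.
  if (td.min_silence > td.max_silence) {
    errors.push("min_silence should not exceed max_silence");
  }
  if (typeof td.interrupt_response !== "boolean") {
    errors.push("interrupt_response must be a boolean");
  }
  return errors;
}

const defaults = { vad_threshold: 0.5, min_silence: 400, max_silence: 1200, interrupt_response: true };
console.log(validateTurnDetection(defaults)); // []
```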

Recording, machine detection, and time limits

Twilio's Calls API takes optional flags that you almost certainly want in production:

const call = await twilioClient.calls.create({
  to,
  from: TWILIO_FROM_NUMBER,
  url: `${PUBLIC_URL}/twiml`,
  record: true,
  machineDetection: "Enable",
  timeLimit: 600,  // hard cap in seconds
});

record: true saves the call to Twilio's media store. machineDetection: "Enable" lets you branch on voicemail vs. live human. timeLimit puts a ceiling on a single call so a stuck LLM can't burn budget.
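
As a sketch of the voicemail branch: with machine detection enabled, Twilio includes an AnsweredBy parameter in the request it makes to your TwiML URL, with values such as human, machine_start, fax, or unknown. The helper below is hypothetical, and treating unknown as a live person is a design choice of this example. In Express you would read the value from req.body.AnsweredBy, which assumes a URL-encoded body parser is installed.

```javascript
// AnsweredBy values that indicate no live person when machineDetection is "Enable".
const MACHINE_RESULTS = new Set(["machine_start", "fax"]);

// Hypothetical helper: choose TwiML based on the AnsweredBy request parameter.
function twimlForAnsweredBy(answeredBy, streamTwiml) {
  if (MACHINE_RESULTS.has(answeredBy)) {
    // Voicemail or fax: hang up instead of starting an agent session.
    return '<?xml version="1.0" encoding="UTF-8"?>\n<Response><Hangup/></Response>';
  }
  // "human", "unknown", or a missing value: connect the agent as usual.
  return streamTwiml;
}

console.log(twimlForAnsweredBy("machine_start", "<Response>...</Response>"));
```

Hanging up is the simplest policy; leaving a voicemail message would need different TwiML and is outside the scope of this sketch.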

Tools (Function Calling)

Once the conversation works, add tools to let the agent do things — book a meeting, look up an account, mark the lead as DNC. Tools register on the same session.update you already send. The full pattern is covered in the tool-calling guide.

Troubleshooting

The phone rings but the call drops immediately. Check the Twilio console call log. Most often it's a TwiML fetch failure — Twilio couldn't reach PUBLIC_URL/twiml because ngrok died, the URL still says localhost, or the protocol is http:// instead of https://.

Twilio connects but the agent never speaks. Look for [aai] session.ready in your server logs. If you see UNAUTHORIZED, your AssemblyAI key is wrong. If you see no AAI logs at all, your environment variables aren't loaded — confirm .env is next to server.js.

The agent's voice sounds chipmunky or muffled. Both session.input.format.encoding and session.output.format.encoding must be audio/pcmu. If either is left at the default audio/pcm (24 kHz), the formats won't match and Twilio will play the audio at the wrong rate.

The agent keeps talking over me after I interrupt. Make sure you forward the clear event to Twilio when you receive reply.done with status: "interrupted". Without it, Twilio plays out the rest of its buffered audio.

Twilio trial accounts can only call verified numbers. That's a Twilio limitation, not a bug in this code. Verify the recipient number in the Twilio console, or upgrade the account.

The full troubleshooting guide is in the Voice Agent API docs.

Frequently asked questions

How do I make a voice agent that places outbound phone calls?

Use Twilio's Calls API to dial the target number and pass it TwiML that opens a <Connect><Stream> to your server. On your server, accept the resulting Media Streams WebSocket and bridge it to AssemblyAI's Voice Agent API. Configure both session.input.format.encoding and session.output.format.encoding as audio/pcmu so Twilio's μ-law 8 kHz audio passes through without resampling.

What audio format should I use for a Twilio voice agent?

Use audio/pcmu (G.711 μ-law, 8 kHz) on both the input and output of the Voice Agent API. Twilio Media Streams emit base64-encoded μ-law 8 kHz audio natively, and the Voice Agent API accepts and emits the same format. That means no decoding, no resampling, and no re-encoding.

How does the Voice Agent API handle barge-in over a phone call?

When the user speaks while the agent is talking, the Voice Agent API emits reply.done with status: "interrupted". On a Twilio call you also need to flush Twilio's outbound buffer by sending {event: "clear", streamSid} over the Media Streams WebSocket.

Do I need separate STT, LLM, and TTS for an outbound voice agent?

No. The AssemblyAI Voice Agent API bundles speech recognition, the language model, and text-to-speech behind a single WebSocket. You stream telephony audio in and get the agent's spoken audio back, with neural turn detection, barge-in, and tool calling built in.

How do I authenticate from a Node.js server?

Pass your AssemblyAI API key as a Bearer token in the Authorization HTTP header on the WebSocket upgrade request: new WebSocket(url, { headers: { Authorization: "Bearer YOUR_API_KEY" } }).

Is it legal to call prospects with an AI voice agent?

It depends on jurisdiction and use case. In the US, the TCPA and state DNC registries restrict automated calls. Several states require AI disclosure in the opener. The EU's GDPR and ePrivacy rules add their own requirements. Disclose that the call is automated, honor opt-out requests, and consult counsel before dialing real prospects.

How much does it cost?

AssemblyAI offers a free tier so you can prototype without a credit card. For current pricing, see the AssemblyAI pricing page. Twilio bills separately for outbound minutes and the phone number itself.
