Connect to Twilio | AssemblyAI

Connect Twilio Programmable Voice to the Voice Agent API so callers can have real-time conversations with your agent over the phone. Twilio handles the phone network, your server bridges audio between Twilio Media Streams and the Voice Agent API, and the agent handles speech-to-speech.

Because Twilio’s native G.711 μ-law format is byte-compatible with the Voice Agent API’s audio/pcmu encoding, the server forwards audio as-is with zero transcoding.

Caller ↔ Twilio Media Streams ↔ Your server ↔ Voice Agent API

Before you begin

To complete this guide, you need:

An AssemblyAI API key with Voice Agent access.
A Twilio account with a phone number. You can buy one if you don’t have one.
Node.js 20+.
ngrok for exposing your local server to the internet.

Quickstart

Clone the example repo and get a working Twilio voice agent in minutes.

Clone the repo and install dependencies

$ git clone https://github.com/AssemblyAI/voice-agent-api-twilio-example.git
$ cd voice-agent-api-twilio-example
$ npm install

Start an ngrok tunnel

Twilio needs a public URL to reach your local server. In a separate terminal, start ngrok:

$ ngrok http 3000

Copy the https://...ngrok.app URL from the output.

Configure environment variables

$ cp .env.example .env

Open .env and fill in your keys:

ASSEMBLYAI_API_KEY=YOUR_API_KEY
HOSTNAME=https://your-ngrok-domain.ngrok.app
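
A startup check can catch a missing key or a malformed HOSTNAME before the first call arrives. The following is a minimal sketch, not code from the example repo: `loadConfig` and the `Config` type are hypothetical names, and the rule that HOSTNAME must reduce to a bare domain is an assumption based on how the value is used later in this guide.

```typescript
// Hypothetical helper (not part of the example repo): validates the two
// required variables and derives the host used to build wss:// URLs.
interface Config {
  apiKey: string;
  publicHost: string; // host only, without scheme or trailing slash
}

function loadConfig(env: Record<string, string | undefined>): Config {
  const apiKey = (env.ASSEMBLYAI_API_KEY ?? "").trim();
  if (apiKey === "" || apiKey === "YOUR_API_KEY") {
    throw new Error("ASSEMBLYAI_API_KEY is not set");
  }

  const raw = (env.HOSTNAME ?? "").trim();
  // Strip the scheme and any trailing slashes copied from the ngrok output.
  const publicHost = raw.replace(/^https?:\/\//, "").replace(/\/+$/, "");
  if (publicHost === "" || publicHost.includes("/")) {
    throw new Error("HOSTNAME must be a bare domain, with or without https://");
  }

  return { apiKey, publicHost };
}
```

Calling `loadConfig(process.env)` once at startup turns a silent misconfiguration into an immediate, readable error.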

Run the server

$ npm run dev

You should see Server running on http://localhost:3000.

Point Twilio at your server

In the Twilio Console, open your phone number’s Voice configuration and set:

A call comes in → Webhook → POST → https://<your-ngrok-domain>/twiml
Call status changes (optional) → Webhook → POST → https://<your-ngrok-domain>/call-status

Call your number

Dial your Twilio number from any phone. You should hear the agent’s greeting, then have a real-time conversation. Watch the server logs to see the event stream.

How it works

When a call comes in, the following sequence happens:

1. A caller dials your Twilio number.
2. Twilio sends a webhook to POST /twiml on your server. The server returns TwiML containing a <Stream> element pointed at your WebSocket endpoint.
3. Twilio opens a Media Streams WebSocket and starts sending the caller’s audio (G.711 μ-law, 8 kHz).
4. Your server opens a parallel WebSocket to the Voice Agent API and sends a session.update with the system prompt, voice, greeting, tools, and audio format set to audio/pcmu.
5. Once session.ready fires, the server forwards audio in both directions:
   - Caller → Agent: Each Twilio media event becomes an input.audio event.
   - Agent → Caller: Each reply.audio event becomes a Twilio media action.
6. When the caller barges in (input.speech.started), the server sends a Twilio clear action so the agent stops talking immediately.
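
Twilio can start sending caller audio before session.ready arrives, so the bridge needs a rule for those early frames. The sequence above does not say whether the example repo drops or buffers them. The sketch below assumes a small bounded buffer that is flushed once the session is ready. `ReadyGate` and its default of 50 frames are hypothetical choices for illustration.

```typescript
// Hypothetical sketch: hold caller audio until the agent session is ready.
type Send = (payload: string) => void;

class ReadyGate {
  private ready = false;
  private pending: string[] = [];
  private readonly send: Send;
  private readonly maxPending: number;

  // maxPending bounds memory; 50 is an arbitrary illustrative default.
  constructor(send: Send, maxPending = 50) {
    this.send = send;
    this.maxPending = maxPending;
  }

  // Call for every base64 audio payload received from Twilio.
  push(payload: string): void {
    if (this.ready) {
      this.send(payload);
      return;
    }
    this.pending.push(payload);
    // Drop the oldest frame so the buffer never grows past maxPending.
    if (this.pending.length > this.maxPending) this.pending.shift();
  }

  // Call once when session.ready arrives; flushes buffered frames in order.
  open(): void {
    this.ready = true;
    for (const payload of this.pending) this.send(payload);
    this.pending = [];
  }
}
```

Dropping early frames instead of buffering them is an equally reasonable design, since the agent speaks its greeting first.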

Return TwiML with a stream

When Twilio receives a call, it hits your /twiml endpoint. The server responds with TwiML that opens a Media Streams WebSocket:

app.post("/twiml", (req, res) => {
  const callId = newCallId();
  const hostname = process.env.HOSTNAME.replace(/^https?:\/\//, "");
  const streamUrl = `wss://${hostname}/media-stream/${callId}`;

  res.type("text/xml").status(200).send(
    `<Response>
  <Connect>
    <Stream url="${streamUrl}" />
  </Connect>
</Response>`,
  );
});
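
Because the stream URL is interpolated into an XML attribute, it can help to build the TwiML in a small pure function that escapes the value and tolerates a trailing slash in HOSTNAME. This is an illustrative sketch only: `escapeXml` and `buildStreamTwiml` are hypothetical helpers, not functions from the example repo or the Twilio SDK.

```typescript
// Escape the five XML special characters so a value is safe in an attribute.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: builds the same <Connect><Stream> response as the
// route handler above, with the URL normalized and escaped.
function buildStreamTwiml(hostname: string, callId: string): string {
  const host = hostname.replace(/^https?:\/\//, "").replace(/\/+$/, "");
  const streamUrl = `wss://${host}/media-stream/${encodeURIComponent(callId)}`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(streamUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

A pure builder like this can be unit-tested without starting the HTTP server.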

Connect to the Voice Agent API

When Twilio opens the Media Streams WebSocket, the server creates a parallel connection to the Voice Agent API and sends the session configuration:

import WebSocket from "ws";

const aaiWs = new WebSocket("wss://agents.assemblyai.com/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
});

// Wait for the socket to open; sending while it is still connecting throws.
aaiWs.on("open", () => {
  aaiWs.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a helpful voice assistant.",
      greeting: "Hi, thanks for calling. How can I help?",
      input: { type: "audio", format: { encoding: "audio/pcmu" } },
      output: {
        type: "audio",
        voice: "ivy",
        format: { encoding: "audio/pcmu" },
      },
      tools: [/* your tool definitions */],
    },
  }));
});

Both input and output use audio/pcmu (G.711 μ-law at 8 kHz) to match Twilio’s native codec. This means no transcoding or resampling is needed.
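
G.711 μ-law stores one byte per sample at 8,000 samples per second, so the duration of any payload follows directly from its size. The sketch below makes that arithmetic concrete. `base64ByteLength` and `pcmuDurationMs` are hypothetical helper names, and the code assumes standard padded base64 payloads.

```typescript
// G.711 μ-law: 8,000 samples per second, one byte per sample.
const SAMPLE_RATE_HZ = 8000;

// Number of decoded bytes in a padded base64 string, without decoding it.
function base64ByteLength(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return (b64.length / 4) * 3 - padding;
}

// Playback duration in milliseconds of a base64-encoded μ-law payload.
function pcmuDurationMs(b64Payload: string): number {
  return (base64ByteLength(b64Payload) * 1000) / SAMPLE_RATE_HZ;
}
```

A 160-byte payload therefore represents 20 ms of audio, which is useful when logging how much audio has been forwarded in each direction.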

Bridge audio between Twilio and the Voice Agent API

Once session.ready fires, forward audio payloads in both directions:

1 // Twilio → Voice Agent API
2 tw.on("media", (msg) => {
3   if (msg.media.track !== "inbound") return;
4   aaiWs.send(JSON.stringify({
5     type: "input.audio",
6     audio: msg.media.payload,
7   }));
8 });
9 
10 // Voice Agent API → Twilio
11 aaiWs.on("message", (data) => {
12   const event = JSON.parse(data.toString());
13 
14   if (event.type === "reply.audio" && event.data) {
15     tw.send({
16       event: "media",
17       streamSid: tw.streamSid,
18       media: { payload: event.data },
19     });
20   }
21 });
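
If you work with the raw Media Streams WebSocket instead of a wrapper like `tw`, the Twilio → Voice Agent API direction reduces to a small translation function. The sketch below is one possible shape and not the example repo's code. `toInputAudio` is a hypothetical name, and only the message fields used in the snippet above are modeled.

```typescript
// The subset of a Twilio Media Streams message that this sketch reads.
interface TwilioMessage {
  event: string;
  streamSid?: string;
  media?: { track?: string; payload: string };
}

interface InputAudioEvent {
  type: "input.audio";
  audio: string; // base64 μ-law, forwarded unchanged
}

// Returns the Voice Agent API event for a raw Twilio frame, or null when the
// frame is not inbound caller audio (for example "connected", "start", "stop").
function toInputAudio(raw: string): InputAudioEvent | null {
  let msg: TwilioMessage;
  try {
    msg = JSON.parse(raw) as TwilioMessage;
  } catch {
    return null; // ignore malformed frames rather than crash the call
  }
  if (msg.event !== "media" || !msg.media) return null;
  if (msg.media.track !== "inbound") return null;
  return { type: "input.audio", audio: msg.media.payload };
}
```

The caller then sends `JSON.stringify(result)` to the agent socket whenever the result is not null.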

Handle barge-in

When the caller starts speaking while the agent is talking, clear the Twilio audio buffer so the agent stops immediately:

if (event.type === "input.speech.started") {
  tw.send({ event: "clear", streamSid: tw.streamSid });
}
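
The barge-in check belongs in the same message handler as the reply.audio branch. One way to keep that handler easy to test is to map each agent event to at most one Twilio action in a pure function. This is a sketch under that assumption. `toTwilioAction` is a hypothetical name, and the event shapes mirror only the fields used in the snippets above.

```typescript
interface AgentEvent {
  type: string;
  data?: string; // base64 audio on reply.audio, as in the bridge snippet
}

type TwilioAction =
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "clear"; streamSid: string };

// Maps one Voice Agent API event to the Twilio action it should trigger,
// or null when the event needs no action on the Twilio side.
function toTwilioAction(event: AgentEvent, streamSid: string): TwilioAction | null {
  switch (event.type) {
    case "reply.audio":
      // Forward agent audio to the caller unchanged.
      return event.data
        ? { event: "media", streamSid, media: { payload: event.data } }
        : null;
    case "input.speech.started":
      // Barge-in: discard audio Twilio has buffered but not yet played.
      return { event: "clear", streamSid };
    default:
      return null;
  }
}
```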

Make outbound calls

The example repo also supports outbound calling. Set the Twilio credentials in .env:

TWILIO_ACCOUNT_SID=YOUR_TWILIO_ACCOUNT_SID
TWILIO_AUTH_TOKEN=YOUR_TWILIO_AUTH_TOKEN
TWILIO_PHONE_NUMBER=+15551234567
TARGET_PHONE_NUMBER=+15557654321

With the server still running, open a new terminal and run:

$ npm run outbound

This places a call from your Twilio number to the target. Twilio fetches /outbound-twiml, which connects the call to /outbound-stream. The agent speaks first using the configured greeting.
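
To see what placing the call involves, the sketch below builds a request for Twilio's REST Calls endpoint from the same environment variables. It is an illustration only. The example repo's outbound script may use the Twilio SDK or a different structure, and `buildCallRequest` and the two interfaces are hypothetical names.

```typescript
interface OutboundEnv {
  TWILIO_ACCOUNT_SID: string;
  TWILIO_AUTH_TOKEN: string;
  TWILIO_PHONE_NUMBER: string;
  TARGET_PHONE_NUMBER: string;
  HOSTNAME: string;
}

interface CallRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a "create call" request. Twilio fetches the
// TwiML at Url once the target answers.
function buildCallRequest(env: OutboundEnv): CallRequest {
  const host = env.HOSTNAME.replace(/^https?:\/\//, "").replace(/\/+$/, "");
  const form = new URLSearchParams({
    To: env.TARGET_PHONE_NUMBER,
    From: env.TWILIO_PHONE_NUMBER,
    Url: `https://${host}/outbound-twiml`,
  });
  // HTTP Basic auth with the account SID as username and auth token as password.
  const credentials = btoa(`${env.TWILIO_ACCOUNT_SID}:${env.TWILIO_AUTH_TOKEN}`);
  return {
    url: `https://api.twilio.com/2010-04-01/Accounts/${env.TWILIO_ACCOUNT_SID}/Calls.json`,
    headers: {
      Authorization: `Basic ${credentials}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: form.toString(),
  };
}
```

Sending it is then a single `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })` call on Node.js 20+.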

Add custom tools

The example includes one tool, generate_random_number. To add your own tools:

Define the tool in the TOOLS array in src/bot.ts:

export const TOOLS = [
  {
    type: "function",
    name: "generate_random_number",
    description: "Generate a random integer between min and max (inclusive).",
    parameters: {
      type: "object",
      properties: {
        min: { type: "number", description: "Minimum value (inclusive)." },
        max: { type: "number", description: "Maximum value (inclusive)." },
      },
      required: ["min", "max"],
    },
  },
];

Add the handler in runTool:

export async function runTool(
  name: string,
  args: Record<string, any>,
): Promise<string> {
  switch (name) {
    case "generate_random_number": {
      const min = Math.ceil(args.min);
      const max = Math.floor(args.max);
      const result = Math.floor(Math.random() * (max - min + 1)) + min;
      return JSON.stringify({ result, min: args.min, max: args.max });
    }
    default:
      return JSON.stringify({ error: `Unknown tool: ${name}` });
  }
}

When the agent calls a tool, the Voice Agent API sends a tool.call event. The server runs the tool and sends back a tool.result event with the same call_id. The agent then continues the conversation naturally.
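
The round trip can be sketched as a function that always produces a tool.result, even when the tool throws, so the agent is never left waiting. Only the event types and `call_id` come from this guide. The `name`, `args`, and `result` field names are assumptions for illustration, as are `handleToolCall` and `ToolRunner`, so check the Voice Agent API reference for the actual payload shape.

```typescript
// Field names other than type and call_id are assumptions for illustration.
interface ToolCallEvent {
  type: "tool.call";
  call_id: string;
  name: string;
  args: Record<string, unknown>;
}

interface ToolResultEvent {
  type: "tool.result";
  call_id: string;
  result: string;
}

type ToolRunner = (name: string, args: Record<string, unknown>) => Promise<string>;

// Runs the tool and echoes call_id back. A thrown error becomes an error
// payload instead of a missing reply.
async function handleToolCall(
  event: ToolCallEvent,
  runTool: ToolRunner,
): Promise<ToolResultEvent> {
  let result: string;
  try {
    result = await runTool(event.name, event.args);
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    result = JSON.stringify({ error: message });
  }
  return { type: "tool.result", call_id: event.call_id, result };
}
```

The returned object is what the server would serialize and send back over the Voice Agent API socket.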

For more on tool calling, see Add tools to your agent.

Troubleshooting

Call connects but no audio — Check that HOSTNAME matches your ngrok domain and that your server is reachable. Watch ngrok’s request log for the incoming Media Streams WebSocket.
session.error with invalid_value on the voice field — Voice names are case-sensitive. Use lowercase (ivy, claire, dawn, etc.). See Choose a voice for available voices.
Greeting plays but later replies don’t — Make sure your tool handler always sends a tool.result back. The agent waits for it before continuing.
Audio is choppy or echoey — Twilio handles echo cancellation on the carrier side. If you hear echo during testing, it’s likely your speakerphone — use a headset.

Next steps

Configure your agent — Customize the system prompt, greeting, and turn detection.
Choose a voice — Pick a voice for your agent.
Add tools to your agent — Give your agent the ability to call functions.
Audio format — Learn more about supported encodings.
Twilio Media Streams — Twilio’s documentation on Media Streams.