Insights & Use Cases
May 12, 2026

Build an AI voice agent for customer support that can look up orders

A step-by-step tutorial for building a Python voice agent that handles order lookups, account verification, and human escalation—all on a single WebSocket connection.

Kelsey Foster
Growth

Tier-1 customer support is mostly the same five conversations on repeat: where's my order, can I change my address, can I get a refund, when does this ship, can I talk to a human. They're predictable, they're high-volume, and they don't need a person—they need a voice agent that can actually look things up.

This tutorial walks you through building one. By the end, you'll have a Python voice agent that answers calls, listens for an order ID or email, calls into your backend to check the status, and reads the result back to the customer in real time. When something goes off-script, it transfers to a human with the full conversation context attached.

We're using AssemblyAI's Voice Agent API—one WebSocket that handles the speech understanding, LLM reasoning, voice generation, turn detection, and tool calling in a single connection. Total time to a working prototype: about an afternoon.

Why most support voice agents fail

Before we build, it's worth knowing where these things break. The pattern is almost always the same:

  1. Customer says "my order ID is A-B-3-7-9-2"
  2. STT mishears it as "a b 37 92" or "ABE 379 to"
  3. The LLM calls get_order_status("ab3792") or, worse, asks the customer to repeat
  4. Customer hangs up

The agent didn't fail because the LLM was wrong. It failed because the speech-to-text layer couldn't capture the entity correctly. This is why entity accuracy on alphanumerics, emails, and phone numbers matters more than overall word error rate (WER) for support agents—and why we're building on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models.

The second-most-common failure: dead air during tool calls. The customer asks a question, the agent calls a backend, and there's a 2–3 second silence while the lookup runs. The Voice Agent API solves this by speaking a natural transition phrase ("let me check that for you") while the tool runs—no dead air, no awkward pauses.

What you'll build

A Python voice support agent that handles three real workflows:

  1. Order status lookup—customer says "where's my order?" then the agent asks for the ID, looks it up, and reads back status, ETA, and tracking number
  2. Customer info verification—customer provides email or phone number, the agent looks up the account, and confirms identity before proceeding
  3. Human escalation—customer asks for a person, or the agent gets stuck, and a graceful transfer happens with conversation context preserved

Stack:

  • AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)
  • Python 3.9+
  • A backend with order data—we'll mock it; replace with your real CRM or order management system

Setup

pip install websockets pyaudio python-dotenv

Create .env:

ASSEMBLYAI_API_KEY=your_key_here

The Voice Agent API uses a single endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection, no separate STT or TTS providers to wire in.

Get started with the Voice Agent API

Build a customer support voice agent with speech-to-text, LLM reasoning, and voice generation on a single WebSocket. Sign up free and get $50 in starter credits.

Sign up free

Step 1: Define the support tools

Tools are the agent's interface to your backend. The Voice Agent API uses standard JSON Schema, so anything you can describe with a schema, the agent can call.

For a support agent, you typically want four tools:

import json

TOOLS = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order's current status, shipping ETA, and tracking number by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order ID, e.g. ORD-12345 or 78231-ABC.",
                },
            },
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "lookup_account_by_email",
        "description": "Find a customer account using their email address.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "The customer's email address."},
            },
            "required": ["email"],
        },
    },
    {
        "type": "function",
        "name": "list_recent_orders",
        "description": "List the customer's most recent orders. Use after the account is verified.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "limit": {"type": "integer", "description": "Max number of orders to return.", "default": 5},
            },
            "required": ["account_id"],
        },
    },
    {
        "type": "function",
        "name": "transfer_to_human",
        "description": "Transfer the call to a human agent. Use when the customer asks, when you can't help, or when the issue is sensitive.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Short reason for the transfer."},
                "summary": {"type": "string", "description": "Brief summary of the conversation so far."},
            },
            "required": ["reason", "summary"],
        },
    },
]

Now implement the actual functions. Replace these stubs with calls to your real backend:

ORDERS_DB = {
    "ORD-12345": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
    "ORD-67890": {"status": "processing", "eta": "2026-05-12", "tracking": None},
}

ACCOUNTS_DB = {
    "jane@example.com": {"account_id": "ACC-001", "name": "Jane Doe"},
}

ACCOUNT_ORDERS = {
    "ACC-001": [
        {"order_id": "ORD-12345", "date": "2026-05-01", "total": "$84.99"},
        {"order_id": "ORD-12100", "date": "2026-04-22", "total": "$42.00"},
    ],
}

def run_tool(name: str, args: dict) -> dict:
    if name == "get_order_status":
        order = ORDERS_DB.get(args["order_id"].upper())
        if not order:
            return {"error": "order_not_found", "order_id": args["order_id"]}
        return order

    if name == "lookup_account_by_email":
        account = ACCOUNTS_DB.get(args["email"].lower())
        if not account:
            return {"error": "account_not_found"}
        return account

    if name == "list_recent_orders":
        orders = ACCOUNT_ORDERS.get(args["account_id"], [])
        return {"orders": orders[: args.get("limit", 5)]}

    if name == "transfer_to_human":
        # In production: trigger your call routing / queue handoff here
        return {"transferred": True, "queue": "support-tier-2"}

    return {"error": f"unknown_tool: {name}"}

The error-shape pattern matters. When get_order_status can't find an order, it returns a structured error rather than throwing—that gives the LLM the context it needs to apologize and ask the customer to verify the ID, instead of crashing the conversation.
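
You can see the shape the agent receives by calling the stub directly:

print(run_tool("get_order_status", {"order_id": "ORD-99999"}))
# {'error': 'order_not_found', 'order_id': 'ORD-99999'}

print(run_tool("get_order_status", {"order_id": "ord-12345"}))
# {'status': 'shipped', 'eta': '2026-05-09', 'tracking': '1Z999AA10123456784'}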

Step 2: Write the system prompt

The system prompt is where you encode the agent's behavior. For support, you want a few things every time:

  • Identity and tone
  • When to ask for verification before sharing details
  • When to use which tool
  • When to transfer to a human
  • Specific phrasing for transition moments (the "let me check that" line)

SYSTEM_PROMPT = """
You are Avery, a customer support agent for Acme Corp. Your goal is to help customers
quickly and accurately. You have access to tools that let you look up orders and accounts.

Behavior rules:
- Greet warmly and ask how you can help.
- For order questions, ask for the order ID first if the customer hasn't given it.
- If a customer gives an email or phone number, use lookup_account_by_email to verify.
- Read order status, ETA, and tracking number clearly. Don't read raw timestamps —
  say dates naturally (e.g., "Friday, May 9th").
- When you need to call a tool, say a brief transition like "Let me check on that"
  or "One moment while I pull that up."
- If the customer asks for a human, sounds frustrated, or has a complex issue
  (refund disputes, damaged product, billing errors), use transfer_to_human and
  include a short summary.
- Never make up an order ID, status, or tracking number. If a tool returns an error,
  apologize, ask the customer to verify the ID, and try again.
- Keep replies short and conversational. This is a phone call, not an email.
"""

The "never make up" line is the most important sentence in the prompt. Without it, LLMs sometimes invent plausible-sounding tracking numbers when the lookup fails. With it, they ask for clarification instead.

See the Voice Agent API in action

Test a live voice agent with tool calling, turn detection, and natural conversation flow—right in your browser. No code required.

Try playground

Step 3: Connect to the Voice Agent API

Now the WebSocket connection. The pattern is:

  1. Open wss://agents.assemblyai.com/v1/ws with your API key
  2. Send session.update with the system prompt, tools, voice, and greeting
  3. Wait for session.ready, then start streaming microphone audio
  4. Handle incoming events—tool.call, reply.audio, transcript.user, reply.done

import asyncio
import base64
import os

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()  # read ASSEMBLYAI_API_KEY from .env

API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_agent():
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Avery from Acme support. How can I help?",
                "output": {"voice": "ivy"},
                "tools": TOOLS,
            },
        }))

        # Set up microphone capture and speaker playback
        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)

        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            await ready.wait()
            while True:
                audio = mic.read(1024, exception_on_overflow=False)
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(audio).decode(),
                }))
                await asyncio.sleep(0)

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()
                    print("Agent ready. Start speaking.")

                elif t == "transcript.user":
                    print(f"\nUser: {event['text']}")

                elif t == "transcript.agent":
                    print(f"Agent: {event['text']}")

                elif t == "reply.audio":
                    speaker.write(base64.b64decode(event["data"]))

                elif t == "tool.call":
                    name = event["name"]
                    args = event.get("arguments", {})
                    print(f"  [tool] {name}({args})")
                    result = run_tool(name, args)
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_agent())

A few details the docs flag that you'd otherwise spend an hour debugging:

  • Don't send tool.result immediately when you receive tool.call. Accumulate results and send them inside the reply.done handler. Sending too early causes timing issues.
  • Discard pending tool results on interruption. If the user speaks while the agent is generating a transition phrase, you'll get reply.done with status: "interrupted"—clear the buffer and wait for the next turn.
  • Voice names are case-sensitive. Use lowercase: ivy, claire, dawn. An invalid voice returns session.error.

Step 4: Test the three workflows

Run the script and walk through each support scenario. You should hear:

Workflow 1—Order lookup:

You: "Hi, I'm trying to check on order O-R-D 1-2-3-4-5"
Agent: "Sure, let me check on that... I see order ORD-12345. It shipped and is
        on its way — you should have it by Friday, May 9th. The tracking number
        is 1Z999AA10123456784."

Workflow 2—Email-based account lookup:

You: "I forgot my order ID. Can you look me up by email?"
Agent: "Of course. What's the email on the account?"
You: "It's jane at example dot com."
Agent: "One moment... Got it, you're Jane Doe. I see two recent orders:
        ORD-12345 from May 1st for $84.99, and ORD-12100 from April 22nd
        for $42.00. Which one are you asking about?"

Workflow 3—Human transfer:

You: "I just want to talk to a person."
Agent: "I understand. Let me get you over to a teammate now."
[tool.call: transfer_to_human({"reason": "user requested human", "summary": "..."})]

Speak the order ID with hesitation, mumbles, accents, and natural disfluencies—that's where Universal-3 Pro Streaming earns its keep. The agent should still extract the ID correctly because it's tuned for the alphanumeric tokens that voice agents act on.

Step 5: Take it to the phone

This works on your laptop through your microphone, but real customer support runs on phones. Twilio Media Streams is the standard bridge—your server accepts the inbound call from Twilio and opens a parallel connection to the Voice Agent API, forwarding audio in both directions.

The Voice Agent API supports audio/pcmu (G.711 u-law at 8 kHz) natively, which matches Twilio's codec exactly. No transcoding, no resampling. The Twilio integration guide walks through the full bridge in about 100 lines of TypeScript.
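
Here's a minimal sketch of that bridge in Python, reusing the WS_URL and API_KEY constants from above. The Twilio message shapes follow the Media Streams protocol; the session.update that configures the agent for audio/pcmu is omitted, so check the integration guide for the exact audio-format fields:

async def bridge(twilio_ws):
    # twilio_ws is the server-side connection Twilio opens to your media-stream endpoint
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as agent_ws:
        # (send session.update here, configured for audio/pcmu)
        stream_sid = None

        async def twilio_to_agent():
            nonlocal stream_sid
            async for raw in twilio_ws:
                msg = json.loads(raw)
                if msg["event"] == "start":
                    stream_sid = msg["start"]["streamSid"]
                elif msg["event"] == "media":
                    # Twilio payloads are already base64 u-law; forward as-is
                    await agent_ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": msg["media"]["payload"],
                    }))

        async def agent_to_twilio():
            async for raw in agent_ws:
                event = json.loads(raw)
                if event.get("type") == "reply.audio" and stream_sid:
                    await twilio_ws.send(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid,
                        "media": {"payload": event["data"]},
                    }))

        await asyncio.gather(twilio_to_agent(), agent_to_twilio())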

What to harden before production

Three things you'll want to nail down before pointing this at real customers:

  • Replace the in-memory mocks with calls to your actual CRM or order management system. Add timeouts and error handling so a slow backend doesn't kill the conversation (a minimal timeout wrapper is sketched after this list).
  • Log everything. Save user transcripts, tool calls, results, and the agent's responses tied to a session ID (see the logging sketch below). Conversation logs are your debugging tool when something goes wrong on call #4,712. Conversation intelligence features like speaker diarization can help you analyze these logs at scale.
  • Tune turn detection for your acoustic environment. The defaults work for most use cases. For phone audio with background noise, you may want to raise min_end_of_turn_silence_ms slightly so the agent doesn't cut off thoughtful pauses.
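
For the timeout point, one pattern: wrap the backend call and convert a timeout into the same structured-error shape the mocks already return, so the agent apologizes instead of going silent. The get_order_status_from_crm helper here is hypothetical; substitute your real async client:

async def get_order_status_safe(order_id: str) -> dict:
    try:
        # get_order_status_from_crm is a placeholder for your real backend call
        return await asyncio.wait_for(get_order_status_from_crm(order_id), timeout=2.0)
    except asyncio.TimeoutError:
        return {"error": "lookup_timed_out", "order_id": order_id}
    except Exception:
        return {"error": "backend_unavailable", "order_id": order_id}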
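
And for logging, an append-only JSONL file per session is enough to start: one line per event, easy to grep when a call goes sideways:

import time

os.makedirs("logs", exist_ok=True)

def log_event(session_id: str, kind: str, payload: dict) -> None:
    # One line per event: transcripts, tool calls, results, agent replies
    with open(f"logs/{session_id}.jsonl", "a") as f:
        f.write(json.dumps({"ts": time.time(), "kind": kind, **payload}) + "\n")

Call it from the handle_messages branches, e.g. log_event(session_id, "tool.call", {"name": name, "args": args}), with whatever session identifier you track.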

Where to go from there

Once the basic order-lookup loop works, the same tool-calling pattern extends to every other support workflow you have: cancel an order, update a shipping address, request a refund, schedule a callback, fetch FAQ answers from a knowledge base. Add the function, describe it in the system prompt, and the agent picks it up—no new infrastructure.
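
As a sketch of that pattern, here's a hypothetical cancel_order tool against the same mock data; swap in your real cancellation endpoint:

TOOLS.append({
    "type": "function",
    "name": "cancel_order",
    "description": "Cancel an order that hasn't shipped yet. Confirm with the customer before calling.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order to cancel, e.g. ORD-12345."},
        },
        "required": ["order_id"],
    },
})

def cancel_order(args: dict) -> dict:
    order = ORDERS_DB.get(args["order_id"].upper())
    if not order:
        return {"error": "order_not_found", "order_id": args["order_id"]}
    if order["status"] == "shipped":
        return {"error": "already_shipped", "order_id": args["order_id"]}
    order["status"] = "cancelled"
    return {"cancelled": True, "order_id": args["order_id"]}

Add a matching branch in run_tool and a line in the system prompt describing when to use it, and the agent handles cancellations on the next call.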

The compounding win: every conversation goes through the same Voice Agent API connection, the same transcription model, the same billing relationship. You're not assembling a new vendor stack; you're adding tools to an agent that already works.

Try the Voice Agent API live on the product page—it's the same API you'd ship with—or grab a free API key with $50 in starter credits and have your first agent answering calls by end of day.

Build your customer support voice agent today

One WebSocket handles speech-to-text, LLM reasoning, voice generation, and tool calling. Get started free and ship your first support agent by end of day.

Sign up free

Frequently asked questions

How do I build an AI voice agent for customer support that can look up orders?

Build it on AssemblyAI's Voice Agent API, register a get_order_status function as a tool with JSON Schema, and connect to the WebSocket at wss://agents.assemblyai.com/v1/ws. The agent transcribes the customer's speech, decides when to call your function, executes it through your backend, and speaks the result back—all on a single connection. Most developers ship a working agent in an afternoon because there's no SDK to learn and no separate STT, LLM, or TTS providers to wire together.

Why does speech-to-text accuracy matter so much for support voice agents?

Support agents constantly need to capture alphanumeric tokens—order IDs, account numbers, email addresses, phone numbers—and a single transcription error breaks the workflow. If the STT layer mishears "ORD-12345" as "or 12 three 45," your get_order_status function gets a garbled ID and returns nothing. AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models—that's the difference between tool calls that succeed and tool calls that silently fail.

How does tool calling work with the AssemblyAI Voice Agent API?

You register tools by passing an array of function definitions in session.tools on a session.update event. When the agent decides to call a tool, it emits a tool.call event with the function name and arguments. You execute the function and accumulate results, then send tool.result events inside your reply.done handler—not immediately on tool.call. While the tool runs, the agent speaks a brief transition phrase like "let me check that for you" so the conversation never goes silent.

Can I connect AssemblyAI's Voice Agent API to phone calls with Twilio?

Yes—the Voice Agent API supports audio/pcmu (G.711 u-law at 8 kHz) natively, which matches Twilio's codec exactly with no transcoding needed. You set up a server that accepts the inbound Twilio Media Streams call, opens a parallel WebSocket to the Voice Agent API, and forwards audio in both directions. The official Twilio integration guide walks through inbound and outbound calling in about 100 lines of TypeScript.

What's the best way to handle escalation to a human in a customer support voice agent?

Register a transfer_to_human tool with parameters for reason and summary, and instruct the agent in the system prompt to call it when the customer asks for a person, sounds frustrated, or has a complex issue (refund disputes, billing errors, damaged products). The agent generates a short summary of the conversation that you forward to your human queue, so the receiving agent doesn't have to ask the customer to repeat themselves. This is one of the most important workflows to design well—a poor handoff feels worse than no AI at all.

How much does it cost to run a customer support voice agent on AssemblyAI?

The Voice Agent API is $4.50/hr flat—covering speech understanding, LLM reasoning, voice generation, turn detection, and tool calling all in one bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for STT, LLM, and TTS providers. Pricing is billed by the minute on actual conversation duration, and a free tier is available for testing.

Do voice agents built with AssemblyAI work with healthcare workflows?

AssemblyAI offers a BAA for HIPAA workloads and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. For clinical use cases (medical front-office voice agents, healthcare contact centers), enable Medical Mode with domain="medical-v1" to improve transcription accuracy on medication names, procedures, conditions, and dosages. Do not point the agent at real PHI without a signed BAA in place.

AI voice agents
Customer Success