Build a voice agent with function calling



Your voice agent just told a customer their order shipped to "123 Fake Street" because the STT misheard their address. The LLM called check_order_status with a garbled ID. The function returned an error. The customer is frustrated.
That's the core problem with function-calling voice agents that nobody talks about: garbage in, garbage function call out. If the speech-to-text layer can't accurately capture phone numbers, order IDs, and email addresses—the exact entities your functions need as parameters—your agent will fail not because of bad code, but because of bad transcription.
This tutorial walks you through building a customer support voice agent with function calling using AssemblyAI's Universal-3 Pro Streaming model as the STT layer, OpenAI GPT-4o for LLM orchestration, and ElevenLabs for voice output. By the end, you'll have an agent that can check order status, schedule callbacks, and escalate to a human—all triggered by voice.
Why STT accuracy is the foundation of function calling
Most developers treat the STT layer as a commodity when building function-calling voice agents. They focus on prompt engineering and tool definitions, then wire in whatever transcription API is cheapest. That's a mistake.
Function calling puts specific demands on transcription that conversational use cases don't. When someone says "My order number is A-B-3-7-9-2," your LLM needs to receive exactly AB3792, not ABE 37 92 or a b 37 92. When a customer says "Call me back at 415-555-0193," that phone number has to be transcribed correctly or your schedule_callback function will store a wrong number.
Universal-3 Pro Streaming has a 34.79% missed entity rate on phone numbers—already lower than competitors. For emails, it's 59.64% missed vs. 89.09% for the previous Universal Streaming model. It's also the #1 model on the Hugging Face Open ASR Leaderboard. These aren't just benchmark wins—they're the difference between function calls that work and ones that silently corrupt your data.
So before we write a single tool definition, let's get the transcription layer right.
What you'll build
A Python voice agent that handles three customer support scenarios:
- Check order status — Customer says their order ID; agent calls get_order_status(order_id)
- Schedule a callback — Customer provides name and phone number; agent calls schedule_callback(name, phone_number)
- Transfer to human — Customer asks to speak to someone; agent calls transfer_to_human(reason)
Stack:
- AssemblyAI Universal-3 Pro Streaming (speech-to-text)
- OpenAI GPT-4o (LLM with function calling)
- ElevenLabs (text-to-speech)
- Python 3.9+
Setup
Install dependencies:
pip install websockets openai elevenlabs pyaudio python-dotenv
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
Step 1: Connect to Universal-3 Pro Streaming
Universal-3 Pro Streaming streams over WebSocket. You send audio chunks and receive Turn messages back with transcripts and end-of-turn signals. Here's the connection setup:
import asyncio
import json
import os
import websockets
import pyaudio
from dotenv import load_dotenv

load_dotenv()

ASSEMBLYAI_WS_URL = "wss://streaming.assemblyai.com/v3/ws"
SAMPLE_RATE = 16000
CHUNK_SIZE = 8000  # 500ms at 16kHz

async def connect_assemblyai():
    params = {
        "speech_model": "u3-rt-pro",
        "sample_rate": SAMPLE_RATE,
        "format_turns": "true",
    }
    query = "&".join(f"{k}={v}" for k, v in params.items())
    url = f"{ASSEMBLYAI_WS_URL}?{query}"
    ws = await websockets.connect(
        url,
        extra_headers={"Authorization": os.getenv("ASSEMBLYAI_API_KEY")}
    )
    return ws
The speech_model: "u3-rt-pro" parameter selects Universal-3 Pro Streaming. Setting format_turns: true gives you formatted transcripts (proper casing, punctuation) on each turn—which matters for LLM input quality.
Handling turn messages
The API sends three message types: Begin (session started), Turn (transcript update), and Termination (session ended). The Turn message includes an end_of_turn boolean—when that's true, you have a complete utterance ready to send to the LLM.
async def receive_transcripts(ws, on_turn_complete):
    async for message in ws:
        data = json.loads(message)
        msg_type = data.get("type")
        if msg_type == "Begin":
            print(f"Session started: {data['id']}")
        elif msg_type == "Turn":
            transcript = data.get("transcript", "")
            if data.get("end_of_turn") and transcript.strip():
                print(f"User said: {transcript}")
                await on_turn_complete(transcript)
        elif msg_type == "Termination":
            print("Session ended.")
            break
The on_turn_complete callback is where you'll plug in the LLM call. Clean separation: transcription does its job, then hands off.
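For reference, a parsed Turn payload looks roughly like the dict below. Only type, transcript, and end_of_turn are relied on in the handler above; treat any other fields as assumptions to verify against the API reference.

```python
import json

# A hypothetical Turn payload as it might arrive over the WebSocket.
# Only "type", "transcript", and "end_of_turn" are used by the handler.
raw = json.dumps({
    "type": "Turn",
    "transcript": "My order number is AB3792.",
    "end_of_turn": True,
})

data = json.loads(raw)
# The same gating condition used in receive_transcripts
is_complete = (
    data.get("type") == "Turn"
    and data.get("end_of_turn")
    and data.get("transcript", "").strip()
)
print(bool(is_complete))  # True: ready to hand off to the LLM
```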
Step 2: Define your functions
GPT-4o uses JSON Schema to understand what tools are available and what parameters each tool needs. Here are the three functions for your support agent:
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The customer's order ID, e.g. AB3792"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback for a customer who wants to be called back.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "The customer's full name"
                    },
                    "phone_number": {
                        "type": "string",
                        "description": "The customer's phone number in any format"
                    }
                },
                "required": ["name", "phone_number"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the customer to a human agent when requested or when the issue can't be resolved.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Brief reason for the transfer"
                    }
                },
                "required": ["reason"]
            }
        }
    }
]
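One gotcha worth guarding against: a tool whose required list names a parameter that isn't declared in properties fails only at call time. A quick structural check is cheap insurance. Here's a self-contained sketch using an abbreviated copy of the first tool definition:

```python
# Minimal structural check: every "required" parameter must be declared
# in "properties", and every tool needs a name and a description.
TOOLS = [  # abbreviated copy of the first definition above
    {"type": "function", "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order by order ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        }}},
]

for tool in TOOLS:
    fn = tool["function"]
    assert fn.get("name") and fn.get("description")
    params = fn["parameters"]
    for req in params.get("required", []):
        assert req in params["properties"], f"{fn['name']}: undeclared {req}"
print("Tool schemas look structurally valid.")
```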
Now implement stub handlers. In production these would connect to your CRM, ticketing system, or call center platform:
def get_order_status(order_id: str) -> str:
    # Replace with real order lookup
    mock_orders = {
        "AB3792": "Shipped — expected delivery April 15",
        "CD1204": "Processing — ships within 2 business days",
    }
    result = mock_orders.get(order_id.upper())
    return result if result else f"No order found with ID {order_id}"

def schedule_callback(name: str, phone_number: str) -> str:
    # Replace with real scheduling logic
    print(f"[SYSTEM] Callback scheduled: {name} at {phone_number}")
    return f"Got it. We'll call {name} at {phone_number} within 2 hours."

def transfer_to_human(reason: str) -> str:
    print(f"[SYSTEM] Transferring to human. Reason: {reason}")
    return "Transferring you now. Please hold for a moment."

FUNCTION_MAP = {
    "get_order_status": get_order_status,
    "schedule_callback": schedule_callback,
    "transfer_to_human": transfer_to_human,
}
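Before wiring in the LLM, you can exercise the dispatch pattern on its own: GPT-4o returns function arguments as a JSON string, so handling a call is always json.loads followed by a FUNCTION_MAP lookup. A self-contained sketch with a trimmed copy of the first stub:

```python
import json

def get_order_status(order_id: str) -> str:
    # Trimmed copy of the stub handler above
    mock_orders = {"AB3792": "Shipped — expected delivery April 15"}
    result = mock_orders.get(order_id.upper())
    return result if result else f"No order found with ID {order_id}"

FUNCTION_MAP = {"get_order_status": get_order_status}

# What the model hands back: a function name plus JSON-encoded arguments
fn_name = "get_order_status"
fn_args = json.loads('{"order_id": "ab3792"}')

# Dispatch exactly as in process_with_llm; .upper() absorbs the lowercase ID
result = FUNCTION_MAP[fn_name](**fn_args)
print(result)  # Shipped — expected delivery April 15
```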
Step 3: Wire up the LLM with function calling
This is where speech accuracy directly impacts reliability. The transcript from Universal-3 Pro Streaming goes into the message history; GPT-4o decides whether to call a function and what parameters to use.
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are a helpful customer support voice agent. "
            "Keep responses short—this is a phone call. "
            "When a customer mentions an order ID, phone number, or name, "
            "use the appropriate tool. Always confirm details before acting."
        )
    }
]

async def process_with_llm(transcript: str) -> str:
    conversation_history.append({"role": "user", "content": transcript})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history,
        tools=TOOLS,
        tool_choice="auto"
    )
    message = response.choices[0].message

    # No function call — just a conversational reply
    if not message.tool_calls:
        reply = message.content
        conversation_history.append({"role": "assistant", "content": reply})
        return reply

    # Handle function calls
    conversation_history.append(message)
    results = []
    for tool_call in message.tool_calls:
        fn_name = tool_call.function.name
        fn_args = json.loads(tool_call.function.arguments)
        print(f"[TOOL] Calling {fn_name} with {fn_args}")
        fn_result = FUNCTION_MAP[fn_name](**fn_args)
        results.append({
            "tool_call_id": tool_call.id,
            "role": "tool",
            "content": fn_result
        })
    conversation_history.extend(results)

    # Get the final spoken response after tool results
    follow_up = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history
    )
    reply = follow_up.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply
Notice the two-pass pattern: first call decides whether to invoke a tool, second call generates what the agent actually says based on the tool result. That second call is what gets spoken aloud—it lets the LLM formulate a natural-sounding response rather than reading raw function output to the customer.
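Concretely, one tool-calling turn adds four entries to the history. The sketch below shows just the role sequence; the assistant entry is a plain dict standing in for the SDK's message object, and the IDs and contents are illustrative:

```python
# Role sequence for a single tool-calling turn (values are illustrative)
history = [
    {"role": "user", "content": "Where's order AB3792?"},
    {"role": "assistant",            # pass 1: model requests a tool call
     "tool_calls": [{"id": "call_1",
                     "function": {"name": "get_order_status",
                                  "arguments": '{"order_id": "AB3792"}'}}]},
    {"role": "tool",                 # your code ran the function
     "tool_call_id": "call_1",
     "content": "Shipped — expected delivery April 15"},
    {"role": "assistant",            # pass 2: the reply that gets spoken
     "content": "Order AB3792 shipped and should arrive April 15."},
]
print([m["role"] for m in history])
# ['user', 'assistant', 'tool', 'assistant']
```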
Step 4: Add text-to-speech
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as el_stream

el_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

def speak(text: str):
    print(f"Agent: {text}")
    audio = el_client.text_to_speech.convert(
        text=text,
        voice_id="EXAVITQu4vr4xnSDxMaL",  # Replace with your preferred voice
        model_id="eleven_turbo_v2",
        output_format="pcm_16000"
    )
    el_stream(audio)
Step 5: Putting it all together
Now connect all the pieces: microphone → AssemblyAI → GPT-4o → ElevenLabs.
async def run_agent():
    ws = await connect_assemblyai()

    async def on_turn_complete(transcript: str):
        reply = await process_with_llm(transcript)
        speak(reply)

    # Open the microphone input stream
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK_SIZE
    )

    async def send_audio():
        try:
            while True:
                chunk = stream.read(CHUNK_SIZE, exception_on_overflow=False)
                await ws.send(chunk)
                await asyncio.sleep(0)
        except websockets.ConnectionClosed:
            pass

    await asyncio.gather(
        send_audio(),
        receive_transcripts(ws, on_turn_complete)
    )

if __name__ == "__main__":
    print("Voice agent ready. Speak to begin.")
    asyncio.run(run_agent())
Save the code as agent.py and run it with python agent.py. The agent will listen, transcribe, decide whether to call a function, execute it, and speak back a response—all in a single voice turn.
Why entity accuracy matters here more than anywhere
Consider what happens when a customer says: "My order number is A-B-3-7-9-2."
With a lower-accuracy STT model, you might get a b 3792 or ABE 37 92. Your function lookup fails. The LLM tells the customer it can't find their order. They repeat it. The loop continues. That's not a function calling problem—it's a transcription problem.
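One cheap defense, whatever STT model you use, is normalizing spoken entities before lookup. Here's a hypothetical helper (not part of the agent code above) that strips separators and case, which recovers both of those mis-transcriptions:

```python
import re

def normalize_order_id(raw: str) -> str:
    # Keep only letters and digits, then uppercase:
    # "a b 3792" and "AB-37-92" both become "AB3792"
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

print(normalize_order_id("a b 3792"))   # AB3792
print(normalize_order_id("AB-37-92"))   # AB3792
```

Calling this inside get_order_status before the dictionary lookup makes the agent tolerant of spacing and hyphenation artifacts, though it can't recover a genuinely misheard character like B becoming BE.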
Universal-3 Pro Streaming is specifically optimized for this. Its missed entity rate for phone numbers is 34.79% vs. 37.11% for the previous model—and much lower than most alternatives. For emails, the improvement is dramatic: 59.64% vs. 89.09%. When your schedule_callback function needs to store a phone number accurately, those numbers translate directly to agent reliability.
The turn detection model also helps here. It uses both acoustic and semantic signals—not just silence detection—so when a customer is reciting a long order number or slowly reading a phone number, the agent doesn't interrupt mid-digit. You can configure end_of_turn_confidence_threshold to tune how long the agent waits before considering a turn complete.
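That threshold goes in as a connection parameter, the same way speech_model was set in Step 1. A sketch with an illustrative value; check the streaming API reference for the accepted range before tuning it:

```python
from urllib.parse import urlencode

# Connection parameters from Step 1, plus the turn-detection tunable.
# The 0.7 here is an illustrative value, not a recommendation.
params = {
    "speech_model": "u3-rt-pro",
    "sample_rate": 16000,
    "format_turns": "true",
    "end_of_turn_confidence_threshold": 0.7,
}
url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"
print(url)
```

A higher threshold makes the model wait for more confidence before closing a turn, which helps with slowly recited digits at the cost of slightly slower turn-taking.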
What to build next
This setup gives you a working foundation. A few directions worth exploring from here:
- Add keyterms prompting — If your order IDs have a known format, pass them as keyterms_prompt to boost recognition accuracy for those patterns
- Streaming TTS — Stream the TTS output while the function is still executing for lower perceived latency
- Interruption handling — Let the customer barge in mid-response; Universal-3 Pro Streaming's partial transcripts make this possible without restarting turns
- Production audio — Swap PyAudio for Twilio or LiveKit for phone-based deployment; AssemblyAI has native integrations for both
The thing is, function calling in voice agents is only as good as the words going in. Get the transcription right, and the rest of the stack does its job.
Get started with Universal-3 Pro Streaming. Try the API free — no card required, $50 in credits included.



