Build a voice agent with function calling



Your voice agent just told a customer their order shipped to "123 Fake Street" because the STT misheard their address. The LLM called check_order_status with a garbled ID. The function returned an error. The customer is frustrated.
That's the core problem with function-calling voice agents that nobody talks about: garbage in, garbage function call out. If the speech-to-text layer can't accurately capture phone numbers, order IDs, and email addresses—the exact entities your functions need as parameters—your agent will fail not because of bad code, but because of bad transcription.
This tutorial walks you through building a customer support voice agent with function calling using AssemblyAI's Universal-3 Pro Streaming model as the STT layer, OpenAI GPT-4o for LLM orchestration, and ElevenLabs for voice output. By the end, you'll have an agent that can check order status, schedule callbacks, and escalate to a human—all triggered by voice.
Why STT accuracy is the foundation of function calling
Most developers treat the STT layer as a commodity when building function-calling voice agents. They focus on prompt engineering and tool definitions, then wire in whatever transcription API is cheapest. That's a mistake.
Function calling puts specific demands on transcription that conversational use cases don't. When someone says "My order number is A-B-3-7-9-2," your LLM needs to receive exactly AB3792, not ABE 37 92 or a b 37 92. When a customer says "Call me back at 415-555-0193," that phone number has to be transcribed correctly or your schedule_callback function will store a wrong number.
Universal-3 Pro Streaming has a 34.79% missed entity rate on phone numbers—already lower than competitors. For emails, it's 59.64% missed vs. 89.09% for the previous Universal Streaming model. It's also the #1 model on the Hugging Face Open ASR Leaderboard. These aren't just benchmark wins—they're the difference between function calls that work and ones that silently corrupt your data.
So before we write a single tool definition, let's get the transcription layer right.
What you'll build
A Python voice agent that handles three customer support scenarios:
- Check order status — Customer says their order ID; agent calls get_order_status(order_id)
- Schedule a callback — Customer provides name and phone number; agent calls schedule_callback(name, phone_number)
- Transfer to human — Customer asks to speak to someone; agent calls transfer_to_human(reason)
Stack:
- AssemblyAI Universal-3 Pro Streaming (speech-to-text)
- OpenAI GPT-4o (LLM with function calling)
- ElevenLabs (text-to-speech)
- Python 3.9+
Setup
Install dependencies:
pip install websockets openai elevenlabs pyaudio python-dotenv
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
Step 1: Connect to Universal-3 Pro Streaming
Universal-3 Pro Streaming streams over WebSocket. You send audio chunks and receive Turn messages back with transcripts and end-of-turn signals. Here's the connection setup:
import asyncio
import json
import os
import websockets
import pyaudio
from dotenv import load_dotenv

load_dotenv()

ASSEMBLYAI_WS_URL = "wss://streaming.assemblyai.com/v3/ws"
SAMPLE_RATE = 16000
CHUNK_SIZE = 8000  # 500ms at 16kHz

async def connect_assemblyai():
    params = {
        "speech_model": "u3-rt-pro",
        "sample_rate": SAMPLE_RATE,
        "format_turns": "true",
    }
    query = "&".join(f"{k}={v}" for k, v in params.items())
    url = f"{ASSEMBLYAI_WS_URL}?{query}"
    ws = await websockets.connect(
        url,
        extra_headers={"Authorization": os.getenv("ASSEMBLYAI_API_KEY")}
    )
    return ws
The speech_model: "u3-rt-pro" parameter selects Universal-3 Pro Streaming. Setting format_turns: true gives you formatted transcripts (proper casing, punctuation) on each turn—which matters for LLM input quality.
Handling turn messages
The API sends three message types: Begin (session started), Turn (transcript update), and Termination (session ended). The Turn message includes an end_of_turn boolean—when that's true, you have a complete utterance ready to send to the LLM.
async def receive_transcripts(ws, on_turn_complete):
    async for message in ws:
        data = json.loads(message)
        msg_type = data.get("type")
        if msg_type == "Begin":
            print(f"Session started: {data['id']}")
        elif msg_type == "Turn":
            transcript = data.get("transcript", "")
            if data.get("end_of_turn") and transcript.strip():
                print(f"User said: {transcript}")
                await on_turn_complete(transcript)
        elif msg_type == "Termination":
            print("Session ended.")
            break
The on_turn_complete callback is where you'll plug in the LLM call. Clean separation: transcription does its job, then hands off.
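For reference, a parsed Turn payload looks roughly like the dict below. Only type, transcript, and end_of_turn are relied on in the handler above; treat any other fields as assumptions to verify against the API reference.

```python
import json

# A hypothetical Turn payload as it might arrive over the WebSocket.
# Only "type", "transcript", and "end_of_turn" are used by the handler.
raw = json.dumps({
    "type": "Turn",
    "transcript": "My order number is AB3792.",
    "end_of_turn": True,
})

data = json.loads(raw)
# The same gating condition used in receive_transcripts
is_complete = (
    data.get("type") == "Turn"
    and data.get("end_of_turn")
    and data.get("transcript", "").strip()
)
print(bool(is_complete))  # True: ready to hand off to the LLM
```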
Step 2: Define your functions
GPT-4o uses JSON Schema to understand what tools are available and what parameters each tool needs. Here are the three functions for your support agent:
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The customer's order ID, e.g. AB3792"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback for a customer who wants to be called back.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "The customer's full name"
                    },
                    "phone_number": {
                        "type": "string",
                        "description": "The customer's phone number in any format"
                    }
                },
                "required": ["name", "phone_number"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the customer to a human agent when requested or when the issue can't be resolved.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Brief reason for the transfer"
                    }
                },
                "required": ["reason"]
            }
        }
    }
]
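One gotcha worth guarding against: a tool whose required list names a parameter that isn't declared in properties fails only at call time. A quick structural check is cheap insurance. Here's a self-contained sketch using an abbreviated copy of the first tool definition:

```python
# Minimal structural check: every "required" parameter must be declared
# in "properties", and every tool needs a name and a description.
TOOLS = [  # abbreviated copy of the first definition above
    {"type": "function", "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order by order ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        }}},
]

for tool in TOOLS:
    fn = tool["function"]
    assert fn.get("name") and fn.get("description")
    params = fn["parameters"]
    for req in params.get("required", []):
        assert req in params["properties"], f"{fn['name']}: undeclared {req}"
print("Tool schemas look structurally valid.")
```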
Now implement stub handlers. In production these would connect to your CRM, ticketing system, or call center platform:
def get_order_status(order_id: str) -> str:
    # Replace with real order lookup
    mock_orders = {
        "AB3792": "Shipped — expected delivery April 15",
        "CD1204": "Processing — ships within 2 business days",
    }
    result = mock_orders.get(order_id.upper())
    return result if result else f"No order found with ID {order_id}"

def schedule_callback(name: str, phone_number: str) -> str:
    # Replace with real scheduling logic
    print(f"[SYSTEM] Callback scheduled: {name} at {phone_number}")
    return f"Got it. We'll call {name} at {phone_number} within 2 hours."

def transfer_to_human(reason: str) -> str:
    print(f"[SYSTEM] Transferring to human. Reason: {reason}")
    return "Transferring you now. Please hold for a moment."

FUNCTION_MAP = {
    "get_order_status": get_order_status,
    "schedule_callback": schedule_callback,
    "transfer_to_human": transfer_to_human,
}
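Before wiring in the LLM, you can exercise the dispatch pattern on its own: GPT-4o returns function arguments as a JSON string, so handling a call is always json.loads followed by a FUNCTION_MAP lookup. A self-contained sketch with a trimmed copy of the first stub:

```python
import json

def get_order_status(order_id: str) -> str:
    # Trimmed copy of the stub handler above
    mock_orders = {"AB3792": "Shipped — expected delivery April 15"}
    result = mock_orders.get(order_id.upper())
    return result if result else f"No order found with ID {order_id}"

FUNCTION_MAP = {"get_order_status": get_order_status}

# What the model hands back: a function name plus JSON-encoded arguments
fn_name = "get_order_status"
fn_args = json.loads('{"order_id": "ab3792"}')

# Dispatch exactly as in process_with_llm; .upper() absorbs the lowercase ID
result = FUNCTION_MAP[fn_name](**fn_args)
print(result)  # Shipped — expected delivery April 15
```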
Step 3: Wire up the LLM with function calling
This is where speech accuracy directly impacts reliability. The transcript from Universal-3 Pro Streaming goes into the message history; GPT-4o decides whether to call a function and what parameters to use.
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are a helpful customer support voice agent. "
            "Keep responses short—this is a phone call. "
            "When a customer mentions an order ID, phone number, or name, "
            "use the appropriate tool. Always confirm details before acting."
        )
    }
]

async def process_with_llm(transcript: str) -> str:
    conversation_history.append({"role": "user", "content": transcript})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history,
        tools=TOOLS,
        tool_choice="auto"
    )
    message = response.choices[0].message

    # No function call — just a conversational reply
    if not message.tool_calls:
        reply = message.content
        conversation_history.append({"role": "assistant", "content": reply})
        return reply

    # Handle function calls
    conversation_history.append(message)
    results = []
    for tool_call in message.tool_calls:
        fn_name = tool_call.function.name
        fn_args = json.loads(tool_call.function.arguments)
        print(f"[TOOL] Calling {fn_name} with {fn_args}")
        fn_result = FUNCTION_MAP[fn_name](**fn_args)
        results.append({
            "tool_call_id": tool_call.id,
            "role": "tool",
            "content": fn_result
        })
    conversation_history.extend(results)

    # Get the final spoken response after tool results
    follow_up = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history
    )
    reply = follow_up.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply
Notice the two-pass pattern: first call decides whether to invoke a tool, second call generates what the agent actually says based on the tool result. That second call is what gets spoken aloud—it lets the LLM formulate a natural-sounding response rather than reading raw function output to the customer.
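Concretely, one tool-calling turn adds four entries to the history. The sketch below shows just the role sequence; the assistant entry is a plain dict standing in for the SDK's message object, and the IDs and contents are illustrative:

```python
# Role sequence for a single tool-calling turn (values are illustrative)
history = [
    {"role": "user", "content": "Where's order AB3792?"},
    {"role": "assistant",            # pass 1: model requests a tool call
     "tool_calls": [{"id": "call_1",
                     "function": {"name": "get_order_status",
                                  "arguments": '{"order_id": "AB3792"}'}}]},
    {"role": "tool",                 # your code ran the function
     "tool_call_id": "call_1",
     "content": "Shipped — expected delivery April 15"},
    {"role": "assistant",            # pass 2: the reply that gets spoken
     "content": "Order AB3792 shipped and should arrive April 15."},
]
print([m["role"] for m in history])
# ['user', 'assistant', 'tool', 'assistant']
```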
Step 4: Add text-to-speech
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as el_stream

el_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

def speak(text: str):
    print(f"Agent: {text}")
    audio = el_client.text_to_speech.convert(
        text=text,
        voice_id="EXAVITQu4vr4xnSDxMaL",  # Replace with your preferred voice
        model_id="eleven_turbo_v2",
        output_format="pcm_16000"
    )
    el_stream(audio)
Step 5: Putting it all together
Now connect all the pieces: microphone → AssemblyAI → GPT-4o → ElevenLabs.
async def run_agent():
    ws = await connect_assemblyai()

    async def on_turn_complete(transcript: str):
        reply = await process_with_llm(transcript)
        speak(reply)

    # Open the microphone input stream
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=CHUNK_SIZE
    )

    async def send_audio():
        try:
            while True:
                chunk = stream.read(CHUNK_SIZE, exception_on_overflow=False)
                await ws.send(chunk)
                await asyncio.sleep(0)
        except websockets.ConnectionClosed:
            pass

    await asyncio.gather(
        send_audio(),
        receive_transcripts(ws, on_turn_complete)
    )

if __name__ == "__main__":
    print("Voice agent ready. Speak to begin.")
    asyncio.run(run_agent())
Save the code as agent.py and run it with python agent.py. The agent will listen, transcribe, decide whether to call a function, execute it, and speak back a response—all in a single voice turn.
Why entity accuracy matters here more than anywhere
Consider what happens when a customer says: "My order number is A-B-3-7-9-2."
With a lower-accuracy STT model, you might get a b 3792 or ABE 37 92. Your function lookup fails. The LLM tells the customer it can't find their order. They repeat it. The loop continues. That's not a function calling problem—it's a transcription problem.
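One cheap defense, whatever STT model you use, is normalizing spoken entities before lookup. Here's a hypothetical helper (not part of the agent code above) that strips separators and case, which recovers both of those mis-transcriptions:

```python
import re

def normalize_order_id(raw: str) -> str:
    # Keep only letters and digits, then uppercase:
    # "a b 3792" and "AB-37-92" both become "AB3792"
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()

print(normalize_order_id("a b 3792"))   # AB3792
print(normalize_order_id("AB-37-92"))   # AB3792
```

Calling this inside get_order_status before the dictionary lookup makes the agent tolerant of spacing and hyphenation artifacts, though it can't recover a genuinely misheard character like B becoming BE.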
Universal-3 Pro Streaming is specifically optimized for this. Its missed entity rate for phone numbers is 34.79% vs. 37.11% for the previous model—and much lower than most alternatives. For emails, the improvement is dramatic: 59.64% vs. 89.09%. When your schedule_callback function needs to store a phone number accurately, those numbers translate directly to agent reliability.
The turn detection model also helps here. It uses both acoustic and semantic signals—not just silence detection—so when a customer is reciting a long order number or slowly reading a phone number, the agent doesn't interrupt mid-digit. You can configure end_of_turn_confidence_threshold to tune how long the agent waits before considering a turn complete.
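That threshold goes in as a connection parameter, the same way speech_model was set in Step 1. A sketch with an illustrative value; check the streaming API reference for the accepted range before tuning it:

```python
from urllib.parse import urlencode

# Connection parameters from Step 1, plus the turn-detection tunable.
# The 0.7 here is an illustrative value, not a recommendation.
params = {
    "speech_model": "u3-rt-pro",
    "sample_rate": 16000,
    "format_turns": "true",
    "end_of_turn_confidence_threshold": 0.7,
}
url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"
print(url)
```

A higher threshold makes the model wait for more confidence before closing a turn, which helps with slowly recited digits at the cost of slightly slower turn-taking.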
What to build next
This setup gives you a working foundation. A few directions worth exploring from here:
- Add keyterms prompting — If your order IDs have a known format, pass them as keyterms_prompt to boost recognition accuracy for those patterns
- Streaming TTS — Stream the TTS output while the function is still executing for lower perceived latency
- Interruption handling — Let the customer barge in mid-response; Universal-3 Pro Streaming's partial transcripts make this possible without restarting turns
- Production audio — Swap PyAudio for Twilio or LiveKit for phone-based deployment; AssemblyAI has native integrations for both
The thing is, function calling in voice agents is only as good as the words going in. Get the transcription right, and the rest of the stack does its job.
Get started with Universal-3 Pro Streaming. Try the API free — no card required, $50 in credits included.



