Building a voice-powered e-commerce shopping assistant
A Python tutorial for building a voice shopping assistant that searches products, manages carts, confirms checkout with explicit verbal consent, and tracks orders—all on one WebSocket.



Voice shopping has crossed an inflection point. Search by typing is being replaced by search by saying—"show me waterproof hiking boots under $150 in size 10," "add the second one to my cart," "what's the return policy on these." For e-commerce teams, that's both an opportunity and a problem: the existing product search and checkout flow was designed for clicks and keystrokes, not natural language.
This tutorial walks through building a voice-powered shopping assistant that customers can actually talk to. By the end, you'll have a Python voice agent that handles four real shopping workflows—product search, add-to-cart, order tracking, and checkout assistance—all on top of a single WebSocket connection using AssemblyAI's Voice Agent API.
The same pattern works whether you're embedding voice into a mobile shopping app, an in-store kiosk, a smart speaker integration, or a phone-based ordering line.
Why voice e-commerce is different from voice support
If you've built a customer support Voice AI agent, the shopping use case looks similar—but the constraints are sharper:
- Entity accuracy is everything. Sizes ("size ten and a half"), SKUs ("SKU 9-9-2-1-A"), prices ("under one fifty"), quantities ("get me three of those"). Mishear any of those and you've added the wrong item, the wrong size, or the wrong quantity to a cart someone is about to check out with.
- Conversations are exploratory. Support calls have a clear job-to-be-done; shopping conversations meander. The customer browses, narrows, compares, asks about returns, gets distracted, comes back. The agent has to track all of that without losing context.
- Stakes shift mid-conversation. "Tell me about this jacket" is low-stakes. "Charge my saved card for $284.50" is not. The agent needs to know when to ask for confirmation and when to just answer.
- Accent and code-switching show up. Shoppers globally pronounce brand names, colors, and product categories differently. The agent needs to handle that gracefully.
The Voice Agent API addresses these directly: built on Universal-3 Pro Streaming for high entity accuracy (16.7% mixed-entity error rate vs. 23–25% for competitors), with mid-conversation system prompt updates so you can tighten or loosen the agent's behavior as the customer moves from browsing to buying.
What you'll build
A Python voice shopping assistant that handles four workflows:
- Product search—"show me wireless headphones under $200" → searches your catalog → reads back top results
- Cart management—"add the second one in black" → adds to cart → confirms
- Order tracking—"where's my order from last week?" → looks up customer orders → reads back status
- Checkout assistance—guides the user through review and confirmation, never executing payment without explicit verbal "yes"
Stack:
- AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)
- Python 3.9+
- A product catalog and order DB. We'll mock both; replace with your real Shopify, Commerce Cloud, or BigCommerce backend
Setup
pip install websockets pyaudio python-dotenv
# .env
ASSEMBLYAI_API_KEY=your_key_here
Endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection; the same key works for all AssemblyAI products.
Step 1: Define the shopping tools
The toolset shapes what your agent can do. Start with the four core shopping verbs and grow from there. The Voice Agent API supports tool calling natively, so each tool is defined as a JSON function schema.
import json

TOOLS = [
    {
        "type": "function",
        "name": "search_products",
        "description": "Search the product catalog. Use whenever the customer is browsing or asking what's available.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text query, e.g. 'waterproof hiking boots'"},
                "max_price": {"type": "number"},
                "size": {"type": "string"},
                "color": {"type": "string"},
                "limit": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "get_product_details",
        "description": "Get full details on a specific product, including return policy and stock.",
        "parameters": {
            "type": "object",
            "properties": {"product_id": {"type": "string"}},
            "required": ["product_id"],
        },
    },
    {
        "type": "function",
        "name": "add_to_cart",
        "description": "Add a product to the customer's cart. Confirm size, color, and quantity before calling.",
        "parameters": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "variant_id": {"type": "string", "description": "Specific size/color variant"},
                "quantity": {"type": "integer", "default": 1},
            },
            "required": ["product_id", "variant_id"],
        },
    },
    {
        "type": "function",
        "name": "view_cart",
        "description": "Read back the customer's current cart with subtotal.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
    {
        "type": "function",
        "name": "remove_from_cart",
        "description": "Remove an item from the cart by line item ID.",
        "parameters": {
            "type": "object",
            "properties": {"line_item_id": {"type": "string"}},
            "required": ["line_item_id"],
        },
    },
    {
        "type": "function",
        "name": "checkout",
        "description": "Submit the order using the customer's saved payment and shipping. ONLY call after explicit verbal 'yes' to a clear confirmation prompt.",
        "parameters": {
            "type": "object",
            "properties": {
                "confirmation_phrase": {
                    "type": "string",
                    "description": "The exact phrase the customer said to confirm, e.g. 'yes, place the order'.",
                }
            },
            "required": ["confirmation_phrase"],
        },
    },
    {
        "type": "function",
        "name": "track_order",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]
The confirmation_phrase parameter on checkout is the trick that prevents accidental orders. The system prompt tells the agent it can only call checkout if the customer literally says yes, and the parameter forces the agent to record what was said. Your backend can additionally enforce that only a list of accepted phrases ("yes", "place the order", "go ahead") triggers the actual payment.
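Step 2 implements that backend check as a simple substring match. A stricter variant could look like the sketch below. The function name is_clear_confirmation and the ACCEPTED and HEDGES word lists are assumptions made for illustration, not part of the Voice Agent API; tune both lists against real transcripts.

```python
import re

# Hypothetical phrase lists; adjust them to what your customers actually say.
ACCEPTED = ("yes", "place the order", "go ahead", "confirm", "buy it")
HEDGES = {"think", "maybe", "guess", "no", "not", "wait", "don't", "can't", "cannot"}


def is_clear_confirmation(phrase: str) -> bool:
    """Accept only phrases that contain an accepted phrase and no hedge word."""
    # Lowercase and replace everything except letters, apostrophes, and spaces.
    words = re.sub(r"[^a-z' ]", " ", phrase.lower()).split()
    if any(word in HEDGES for word in words):
        return False
    text = " ".join(words)
    # Word-boundary match, so "yesterday" does not count as "yes".
    return any(re.search(rf"\b{re.escape(p)}\b", text) for p in ACCEPTED)


for phrase in ("Yes, place the order.", "sure I think so", "yesterday"):
    print(phrase, "->", is_clear_confirmation(phrase))
# prints:
# Yes, place the order. -> True
# sure I think so -> False
# yesterday -> False
```

The check deliberately errs toward rejecting: a false rejection costs one extra "Should I place the order?", while a false acceptance costs a charge the customer did not intend.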
Step 2: Implement the backend (mocked)
Replace these stubs with calls to your real catalog and order systems.
CATALOG = [
    {
        "product_id": "SKU-2201",
        "name": "Trail Runner 3 hiking boots",
        "price": 139.00,
        "category": "footwear",
        "tags": ["waterproof", "hiking"],
        "variants": [
            {"variant_id": "SKU-2201-BK-10", "size": "10", "color": "black", "stock": 4},
            {"variant_id": "SKU-2201-BK-11", "size": "11", "color": "black", "stock": 0},
            {"variant_id": "SKU-2201-BR-10", "size": "10", "color": "brown", "stock": 7},
        ],
        "return_policy": "30-day free returns",
    },
    {
        "product_id": "SKU-3104",
        "name": "Summit Pro waterproof boots",
        "price": 199.00,
        "category": "footwear",
        "tags": ["waterproof", "hiking", "premium"],
        "variants": [
            {"variant_id": "SKU-3104-BK-10", "size": "10", "color": "black", "stock": 2},
        ],
        "return_policy": "30-day free returns",
    },
]

CART = []  # In production, scope this per session

ORDERS = {
    "ORD-9981": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
}
def run_tool(name: str, args: dict) -> dict:
    # Declared up front: Python rejects a `global` statement placed after the name is used.
    global CART

    if name == "search_products":
        # Match when every query word appears in the name or tags, so
        # "waterproof hiking boots" finds "Trail Runner 3 hiking boots".
        # The size and color filters are ignored in this mock.
        terms = args["query"].lower().split()
        results = []
        for p in CATALOG:
            haystack = (p["name"] + " " + " ".join(p["tags"])).lower()
            if not all(term in haystack for term in terms):
                continue
            if args.get("max_price") and p["price"] > args["max_price"]:
                continue
            results.append(p)
        return {"results": results[: args.get("limit", 5)]}

    if name == "get_product_details":
        for p in CATALOG:
            if p["product_id"] == args["product_id"]:
                return p
        return {"error": "product_not_found"}

    if name == "add_to_cart":
        quantity = args.get("quantity", 1)
        for p in CATALOG:
            for v in p["variants"]:
                if v["variant_id"] == args["variant_id"]:
                    if v["stock"] < quantity:
                        return {"error": "out_of_stock", "available": v["stock"]}
                    line = {
                        "line_item_id": f"LI-{len(CART) + 1}",
                        "product_id": p["product_id"],
                        "variant_id": v["variant_id"],
                        "name": p["name"],
                        "size": v["size"],
                        "color": v["color"],
                        "quantity": quantity,
                        "price": p["price"],
                    }
                    CART.append(line)
                    return {"added": line, "cart_size": len(CART)}
        return {"error": "variant_not_found"}

    if name == "view_cart":
        subtotal = sum(item["price"] * item["quantity"] for item in CART)
        return {"items": CART, "subtotal": round(subtotal, 2)}

    if name == "remove_from_cart":
        CART = [item for item in CART if item["line_item_id"] != args["line_item_id"]]
        return {"removed": args["line_item_id"], "cart_size": len(CART)}

    if name == "checkout":
        # In production: charge saved payment, create order, return order_id
        if not CART:
            return {"error": "cart_empty"}
        accepted = ["yes", "place the order", "go ahead", "confirm", "buy it"]
        if not any(phrase in args["confirmation_phrase"].lower() for phrase in accepted):
            return {"error": "confirmation_unclear", "phrase_received": args["confirmation_phrase"]}
        total = sum(i["price"] * i["quantity"] for i in CART)
        return {"order_id": "ORD-9982", "total": round(total, 2)}

    if name == "track_order":
        order = ORDERS.get(args["order_id"].upper())
        return order or {"error": "order_not_found"}

    return {"error": f"unknown_tool: {name}"}
Step 3: Write a shopping-aware system prompt
Shopping system prompts should encode three patterns: how to describe products on a phone (terse, scannable), how to gather variant info (size, color, quantity) before adding to cart, and how to confirm checkout.
SYSTEM_PROMPT = """
You are Riley, a friendly voice shopping assistant for Trailgear, an outdoor retailer.
Behavior rules:
PRODUCT SEARCH
- When customers ask about products, call search_products with a clean query.
- Read back top 2-3 results conversationally. Don't list more than 3 unless asked.
- Format prices naturally: "one hundred thirty-nine dollars" not "139.00".
- Mention only the most relevant detail per product (price + key feature). Save the rest for follow-ups.
VARIANT SELECTION
- Before adding to cart, confirm size, color, and quantity. Never assume.
- If a variant is out of stock, say so immediately and offer the closest alternative.
- Read sizes naturally: "size ten" not "size 10".
CART MANAGEMENT
- After adding, briefly confirm what was added and the new cart size.
- If the customer asks "what's in my cart," call view_cart and read it back with subtotal.
CHECKOUT
- Before calling checkout, summarize the cart and explicitly ask: "Should I place the order?"
- ONLY call checkout if the customer responds with a clear yes (e.g., "yes," "place the order," "go ahead," "confirm").
- If the response is ambiguous, ask again. Do not interpret "sure I think so" as confirmation.
- After checkout succeeds, read back the order ID slowly so the customer can write it down.
ORDER TRACKING
- For order status questions, ask for the order ID.
- When reading a tracking number, slow down and group digits in pairs.
GENERAL
- Keep replies short and conversational. This is voice, not chat.
- When you call a tool, say a brief transition like "Let me look that up."
- Never invent products, prices, or stock. If the catalog doesn't have it, say so.
"""
The "explicit yes" pattern is what makes this safe to point at production payments. The agent's prompt forbids it from calling checkout on ambiguous responses, and the backend independently validates the confirmation phrase. Belt and suspenders.
Step 4: Wire the WebSocket
This is essentially the same WebSocket loop as a support agent—the difference is in the tools and prompt, not the protocol. If you've already built a voice agent with function calling, this will look familiar.
import asyncio
import base64
import os

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()  # Load ASSEMBLYAI_API_KEY from the .env file created in Setup
API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_assistant():
    # Recent websockets releases take `additional_headers`; older ones call it `extra_headers`.
    async with websockets.connect(
        WS_URL, additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: prompt, greeting, voice, and tools
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Riley from Trailgear. What can I help you find today?",
                "output": {"voice": "mia"},
                "tools": TOOLS,
            },
        }))

        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)
        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            await ready.wait()
            while True:
                # mic.read blocks, so run it in a worker thread to keep the event loop free
                data = await asyncio.to_thread(mic.read, 1024, exception_on_overflow=False)
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(data).decode(),
                }))

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")
                if t == "session.ready":
                    ready.set()
                    print("Riley is ready.")
                elif t == "transcript.user":
                    print(f"\nCustomer: {event['text']}")
                elif t == "transcript.agent":
                    print(f"Riley: {event['text']}")
                elif t == "reply.audio":
                    speaker.write(base64.b64decode(event["data"]))
                elif t == "tool.call":
                    print(f"  [tool] {event['name']}({event.get('arguments', {})})")
                    result = run_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})
                elif t == "reply.done":
                    # Drop queued results if the customer interrupted; otherwise send them back
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_assistant())
Step 5: Test the four workflows
Run the script and walk through each shopping flow:
Search:
You: "I'm looking for waterproof hiking boots under one fifty."
Riley: "Let me check our catalog... I've got the Trail Runner 3 at $139.
       It's waterproof and comes with 30-day free returns. Want to hear
       more about it?"
Add to cart:
You: "The Trail Runner threes, size ten, in black."
Riley: "Got it, the Trail Runner 3 in black, size ten. How many?"
You: "Just one."
Riley: "One moment... Added. That's one item in your cart."
Checkout:
You: "Okay let's check out."
Riley: "Sure. You have one Trail Runner 3 in black, size ten, for $139.
       Should I place the order?"
You: "Yes, place it."
Riley: "Order placed. Your order ID is O-R-D 9-9-8-2."
Track order:
You: "Where's my order from last week? It's ORD 9-9-8-1."
Riley: "Let me check... ORD-9981 has shipped and should arrive Saturday,
       May 9th. Tracking is 1Z 99 9A A1 01 23 45 67 84."
The tracking number readback is intentional: grouping characters in pairs is a common voice pattern that makes long alphanumerics easier to write down.
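Rather than relying on the prompt alone, the backend can pre-group the tracking number before returning it from track_order. A minimal sketch; group_in_pairs is a hypothetical helper, and returning its output in an extra field of the tool result is an assumption about how you shape that result:

```python
def group_in_pairs(code: str) -> str:
    """Split an alphanumeric code into space-separated two-character groups."""
    # Drop separators such as dashes or spaces before grouping.
    compact = "".join(ch for ch in code if ch.isalnum()).upper()
    return " ".join(compact[i:i + 2] for i in range(0, len(compact), 2))


print(group_in_pairs("1Z999AA10123456784"))
# prints 1Z 99 9A A1 01 23 45 67 84
```

A code with an odd number of characters ends in a single-character group, which still reads naturally aloud.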
Where this gets harder in production
Two patterns to plan for once the basic loop works:
- Personalization. Authenticated shoppers expect the agent to know their saved address, recent purchases, and size preferences. Add a get_customer_profile() tool gated on session auth. Use the result in the system prompt via mid-conversation session.update so the agent personalizes without re-asking.
- Multi-turn refinement. "Show me hiking boots" → "in waterproof" → "size ten only" → "under $150." Each refinement should narrow the same result set rather than triggering a fresh search. Pass a session_filters object as a tool parameter and have the agent accumulate filters across turns.
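The filter accumulation described for multi-turn refinement can be sketched as a small merge step that runs before each search. The name merge_filters and the filter keys are hypothetical; the only assumption carried over from the tutorial is that the agent passes new filters as tool arguments on each turn.

```python
def merge_filters(session_filters: dict, new_filters: dict) -> dict:
    """Return a new filter dict: None clears a filter, any other value overwrites it."""
    merged = dict(session_filters)
    for key, value in new_filters.items():
        if value is None:
            merged.pop(key, None)  # e.g. "any color is fine" clears the color filter
        else:
            merged[key] = value
    return merged


# Four turns that narrow the same search.
filters = {}
for turn in (
    {"query": "hiking boots"},
    {"tags": ["waterproof"]},
    {"size": "10"},
    {"max_price": 150},
):
    filters = merge_filters(filters, turn)

print(filters)
# prints {'query': 'hiking boots', 'tags': ['waterproof'], 'size': '10', 'max_price': 150}
```

Keeping the accumulated filters on the server, keyed by session, means a dropped or rephrased turn cannot silently reset the search.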
Where to take it from here
The same architecture extends to:
- In-store kiosks for hands-free product search
- Phone-based ordering lines for restaurants, takeout, and reorders (Twilio Media Streams + the Voice Agent API Twilio integration)
- Mobile shopping apps with a press-and-hold voice button
- Smart speaker integrations that hand off to your agent when the user wants to shop
What stays the same across all of those: one WebSocket, one system prompt, one tool registry. The voice agent is the same regardless of the front-end channel.
Voice shopping isn't replacing search bars or product pages—it's running alongside them, picking up the conversational moments those interfaces can't handle. Build for the conversational moments and the rest of your funnel benefits from it. For teams building AI-powered customer service workflows, the same voice agent architecture handles both pre-sale shopping and post-sale support.
Frequently asked questions
How do I build a voice-powered shopping assistant for e-commerce?
Build it on AssemblyAI's Voice Agent API and register the four core shopping verbs as tools: search_products, add_to_cart, view_cart, and checkout. The agent transcribes the customer's speech, calls your catalog and cart functions, and speaks the result back—all on a single WebSocket. Most developers have a working voice shopping assistant running the same day, with no SDK to install and no separate STT, LLM, or TTS providers to manage.
Can a voice shopping assistant handle product variants like size, color, and quantity?
Yes—define the variant fields as parameters on your add_to_cart tool (e.g., variant_id, quantity) and instruct the agent in the system prompt to confirm size, color, and quantity before calling the function. The Voice Agent API is built on Universal-3 Pro Streaming, which has industry-leading accuracy on alphanumeric tokens like sizes, SKUs, and quantities—that's what makes "size ten and a half" reliably parse as 10.5 instead of 10 or 1010.
How do I prevent accidental orders in a voice checkout flow?
Use a two-layer pattern: the system prompt instructs the agent to only call the checkout tool after an explicit verbal "yes" to a clear confirmation prompt, and the checkout function itself accepts a confirmation_phrase parameter that your backend independently validates against an accepted list ("yes," "place the order," "go ahead," "confirm"). This belt-and-suspenders design ensures ambiguous responses like "sure I think so" never trigger a real charge.
What channels can I deploy a voice shopping assistant on?
The same Voice Agent API connection powers in-app voice (mobile or web with a press-and-hold button), in-store kiosks for hands-free product search, phone-based ordering lines via Twilio Media Streams, and smart speaker integrations that hand off to your agent for shopping. The system prompt and tool registry stay the same across channels—only the front-end audio path changes.
How does the AssemblyAI Voice Agent API compare to Vapi or Retell for e-commerce?
The Voice Agent API is infrastructure rather than a platform—it gives you a standard JSON WebSocket with full control over conversation design, tool integrations, and agent behavior, so your voice shopping experience can feel uniquely yours instead of like every other agent built on a no-code platform. Vapi and Retell are higher-level platforms that work well for non-technical configuration but constrain custom integrations and agent personality. For e-commerce teams that already have engineering capacity, the Voice Agent API is typically a better fit.
How do I personalize a voice shopping assistant for authenticated customers?
Add a get_customer_profile tool that returns the customer's saved address, payment, recent purchases, and size preferences, gated on session auth. The Voice Agent API supports mid-conversation session.update events, so you can update the system prompt with the customer's context after they authenticate without dropping the connection. The agent can then personalize recommendations, default to known sizes, and skip questions like "what's your shipping address?"
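As a rough sketch of that update, the helper below builds a session.update message that appends customer context to the base prompt. The message shape mirrors the session.update used in Step 4; the profile field names are hypothetical, and whether a mid-conversation update may carry only system_prompt is an assumption to verify against the API reference.

```python
import json


def build_personalized_update(base_prompt: str, profile: dict) -> str:
    """Build a session.update message that appends customer context to the prompt."""
    context_lines = [
        "CUSTOMER CONTEXT (verified for this session):",
        f"- Name: {profile['name']}",
        f"- Usual shoe size: {profile['shoe_size']}",
        f"- Default shipping city: {profile['shipping_city']}",
    ]
    prompt = base_prompt.rstrip() + "\n" + "\n".join(context_lines) + "\n"
    return json.dumps({
        "type": "session.update",
        "session": {"system_prompt": prompt},
    })


message = build_personalized_update(
    "You are Riley, a friendly voice shopping assistant.",
    {"name": "Sam", "shoe_size": "10", "shipping_city": "Denver"},
)
print(json.loads(message)["session"]["system_prompt"])
```

Inside the Step 4 loop this would be sent with await ws.send(message) once authentication succeeds. Put only what the agent needs to speak into the prompt, and keep payment details and full addresses behind tools.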
How much does it cost to run a voice shopping assistant on AssemblyAI?
The Voice Agent API is $4.50/hr flat-rate, covering STT, LLM, voice generation, turn detection, and tool calling on a single bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for the STT, LLM, and TTS layers—pricing is billed by the minute on actual conversation duration. A free tier with $50 in starter credits is available for testing.
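For budgeting, the flat rate converts to a per-conversation figure with simple arithmetic. This sketch assumes cost is pro-rated linearly by the minute and ignores any rounding rules or credits, which the pricing page would govern; estimate_cost is a hypothetical helper.

```python
HOURLY_RATE_USD = 4.50  # flat rate quoted above


def estimate_cost(minutes: float) -> float:
    """Pro-rated cost in USD for a conversation of the given length in minutes."""
    return round(HOURLY_RATE_USD / 60 * minutes, 4)


print(estimate_cost(6))  # a six-minute shopping session
# prints 0.45
```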

