Voice agents in noisy environments (Drive-Thrus, Contact Centers, Field)
Voice agents in noisy environments: learn how they work, where they fit, and how to build them for drive-thrus, contact centers, and field teams at scale.



A recent survey of voice agent builders found that 82% feel confident in their technology—but 55% of end users are frustrated with interruptions, mishearing, and repetition. That gap between builder confidence and user reality gets wider in noisy environments: drive-thru lanes with engine noise and wind, contact centers with dozens of overlapping conversations, and field service calls where a technician is shouting over heavy machinery.
This article explains how the voice agent pipeline works, why noise is the hardest unsolved problem in production deployments, and what to consider when building voice agents for environments where clean audio is a luxury. It covers real-world use cases, the core build decision every team faces, and how features like noise suppression, configurable VAD, and dynamic mid-session settings change what's possible in these conditions.
What are voice agents?
A voice agent is an AI-powered system that holds real-time spoken conversations—it listens to what you say, understands it, and talks back. You can call a company, speak naturally ("I need to reschedule my appointment to Thursday"), and the system handles it without pressing a single button or waiting for a human.
That's a meaningful shift from the old way. Legacy IVR systems—Interactive Voice Response—forced you through rigid menus: "Press 1 for billing, press 2 for support." Voice agents skip all of that. They understand natural language, maintain context across a conversation, and take action.
The result is a system that can handle thousands of calls simultaneously, around the clock, without dropping in quality.
How do voice agents work?
Voice agents run on a three-stage pipeline. Each stage feeds directly into the next, which means a failure at any point breaks the whole conversation.
Here's the chain:
- Speech-to-text (STT): Your voice is converted to text in real time
- Large Language Model (LLM): The text is processed, intent is understood, and a response is generated
- Text-to-speech (TTS): That response is converted back into spoken audio
Seems simple—but the details of each stage determine whether a voice agent feels natural or frustrating.
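To make the chain concrete, here's a minimal sketch of one conversational turn in Python. All three provider calls are stubbed out; `transcribe`, `generate_response`, and `synthesize` stand in for whichever STT, LLM, and TTS services you wire in. The sequencing is the point: each stage's output is the next stage's input.

```python
# Minimal sketch of the STT -> LLM -> TTS chain. All three provider
# calls are stand-ins; swap in your real clients.

def transcribe(audio: bytes) -> str:
    """Stand-in for a streaming STT call (audio in, text out)."""
    return "I need to reschedule my appointment to Thursday"

def generate_response(transcript: str) -> str:
    """Stand-in for an LLM call that reasons over the transcript."""
    return "Sure, I can move your appointment to Thursday."

def synthesize(text: str) -> bytes:
    """Stand-in for a TTS call (text in, audio bytes out)."""
    return text.encode("utf-8")  # placeholder for real audio

def handle_turn(audio: bytes) -> bytes:
    # Strict sequencing: a misheard word at the STT stage corrupts
    # everything downstream, because the LLM never sees the audio.
    transcript = transcribe(audio)
    reply = generate_response(transcript)
    return synthesize(reply)

print(handle_turn(b"\x00" * 320))  # one simulated 20 ms audio frame
```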
Speech-to-text
Speech-to-text is the first stage, and it's the most consequential. If the system mishears "fifteen" as "fifty," everything downstream is wrong—the LLM reasons from bad input, and the agent gives the user a nonsensical answer.
Most speech recognition models handle clean, studio-quality audio well. The hard part is real-world audio: background noise from a busy call center, regional accents, phone compression that strips audio quality, or a customer rattling off an alphanumeric order code. These conditions expose the gap between a model that scores well on benchmarks and one that actually works when deployed.
There's another variable most people overlook: turn detection. This is how the system knows you've finished speaking versus pausing mid-sentence. Poor turn detection is the most common reason voice agents feel unnatural—they either cut you off or sit in awkward silence. The best implementations handle turn detection at the model level, not as an afterthought bolted on separately. AssemblyAI's Universal-3 Pro Streaming model includes acoustic turn detection built directly into the model, designed for real-time voice agent pipelines where this precision matters.
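To see why this is hard, consider the naive bolt-on approach: an energy threshold plus a silence timer. The sketch below is that generic heuristic, not AssemblyAI's method, and the threshold values are illustrative. It roughly works in a quiet room and falls apart in noise, which is why acoustic turn detection at the model level matters.

```python
from array import array

FRAME_MS = 20              # evaluate energy per 20 ms frame
ENERGY_THRESHOLD = 500.0   # RMS level treated as "speech"; tune per microphone
END_OF_TURN_MS = 700       # this much continuous silence ends the turn

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit signed PCM frame."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def turn_ended(frames: list[bytes]) -> bool:
    """Naive endpointing: enough consecutive low-energy frames ends the turn.

    In a drive-thru, engine noise keeps RMS above the threshold and the
    agent never hears the turn end; a lull in wind dips below it and cuts
    the speaker off mid-sentence. Model-level acoustic turn detection
    learns what end-of-utterance sounds like instead.
    """
    silent_ms = 0
    for frame in frames:
        if rms(frame) < ENERGY_THRESHOLD:
            silent_ms += FRAME_MS
            if silent_ms >= END_OF_TURN_MS:
                return True
        else:
            silent_ms = 0  # any loud frame (speech OR noise) resets the timer
    return False
```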
Language model (LLM)
The LLM is the reasoning layer. Once the transcript arrives from the STT stage, the LLM figures out what the user wants, decides how to respond, and—in most deployments—calls external tools to take action.
- Intent understanding: The LLM interprets what the user actually means, not just what they said literally
- Tool calling: The LLM can query a database, book an appointment, or update a CRM record mid-conversation
- Context management: The LLM tracks what was said earlier so the user doesn't have to repeat themselves
The LLM is only as effective as the transcript it receives. A single misrecognized word can completely change the intent—which is why teams that underinvest in STT quality often blame their LLM for errors that started one stage earlier.
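As an illustration of the tool-calling step, here's a sketch of the dispatch layer that sits between the LLM and your backend. The `{"name": ..., "arguments": ...}` shape follows a common provider convention; your LLM's exact format may differ, and `check_order` is a made-up stand-in for a real lookup.

```python
import json

def check_order(order_id: str) -> dict:
    """Made-up stand-in for a real database or API lookup."""
    return {"order_id": order_id, "status": "shipped", "eta": "Thursday"}

TOOLS = {"check_order": check_order}

def dispatch_tool_call(tool_call: dict) -> str:
    # Route an LLM-emitted tool call to a backend function, then
    # serialize the result so the LLM can phrase it for the user.
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return json.dumps(fn(**args))

print(dispatch_tool_call(
    {"name": "check_order", "arguments": '{"order_id": "A-1042"}'}
))
```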
Text-to-speech (TTS)
TTS converts the LLM's text response into audio the user hears. Two things matter most: how natural it sounds and how fast the first audio byte arrives.
The standard approach is sentence-chunked streaming—TTS starts generating audio as soon as the first sentence is ready, rather than waiting for the full response. This cuts perceived latency significantly and makes conversations feel more responsive.
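A sketch of that idea, assuming the LLM yields tokens incrementally: buffer tokens, and the moment a sentence boundary appears, hand that sentence to TTS instead of waiting for the whole reply. The `synthesize` stub stands in for a real TTS call.

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call."""
    return text.encode("utf-8")

def stream_tts(llm_tokens):
    """Yield audio per completed sentence, not per completed reply.

    First audio arrives after the first sentence boundary, which is
    what cuts perceived latency.
    """
    buffer = ""
    for token in llm_tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:     # every fully terminated sentence
            yield synthesize(sentence)
        buffer = parts[-1]              # keep the unfinished remainder
    if buffer.strip():
        yield synthesize(buffer)        # flush whatever is left

tokens = ["Your order ", "shipped today. ", "It arrives ", "Thursday."]
for chunk in stream_tts(tokens):
    print(chunk)
```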
Why noise changes everything
In a quiet office, most voice AI pipelines work well enough. But production environments are rarely quiet. A drive-thru window has idling engines, wind, rain on the speaker box, and a customer shouting over traffic. A contact center floor has dozens of agents talking simultaneously, with phone compression stripping the audio down to 8 kHz. A field technician might be diagnosing equipment while generators run in the background.
Noise doesn't just reduce transcription accuracy—it breaks the entire conversation loop. Background sounds trigger false voice activity detection (VAD) events, causing the agent to think the user is speaking when they're not. That leads to empty turn finalizations, premature interruptions, and the agent responding to nothing. The user experience feels broken even if the underlying STT model is accurate on the words it does capture.
This is why noise handling can't be an afterthought. The most effective approach addresses it at multiple levels: noise suppression on the audio input, VAD tuning calibrated for the specific environment, and turn detection logic that distinguishes between ambient sound and actual speech.
AssemblyAI's Universal-3 Pro Streaming model handles background noise well out of the box—internal testing shows it outperforms preprocessing pipelines, which often introduce artifacts that degrade accuracy. For extreme environments like drive-thrus, AssemblyAI also offers a hosted noise suppression endpoint with standard and aggressive modes, so teams can dial in the right level of filtering for their specific conditions without building a separate denoising pipeline.
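For teams that do use the hosted suppression endpoint, the call pattern is roughly a request carrying raw audio plus a mode selector. Everything concrete below (the URL, header, and field names) is an assumption for illustration; check AssemblyAI's documentation for the actual interface.

```python
import requests

# Illustrative only: the URL and parameter names here are assumptions,
# not AssemblyAI's documented interface. The shape of the idea: send
# raw audio, pick a suppression mode, get cleaned audio back.
HYPOTHETICAL_URL = "https://api.example.com/denoise"

def denoise(audio: bytes, mode: str = "standard") -> bytes:
    """mode: 'standard' for mild filtering, 'aggressive' for extreme
    environments like a drive-thru lane."""
    resp = requests.post(
        HYPOTHETICAL_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        params={"mode": mode},
        data=audio,
    )
    resp.raise_for_status()
    return resp.content
```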
What are voice agents used for?
The highest-value deployments share two traits: high call volume and repeatable, structured workflows. Voice agents aren't well-suited for highly ambiguous or emotionally complex interactions—but for predictable, high-frequency tasks, they're hard to beat.
Customer service and support
Customer service is where most organizations deploy voice agents first. The use case is handling tier-1 inquiries—order status, account questions, basic troubleshooting—without routing every call to a human agent.
The business logic is straightforward: if a customer asks "where's my package?" and the answer requires looking up a tracking number, a voice agent can do that in under two seconds at any hour of the day.
What separates good customer service voice agents from frustrating ones is the escalation path. When a conversation becomes too complex or emotionally charged, the agent needs to hand off to a human—with the full conversation context included. A handoff that makes the customer repeat everything they just said is a failed experience, full stop.
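One way to make that concrete is to treat the handoff as a payload, not an event. The sketch below is illustrative (the field names and escalation triggers are assumptions, not a prescribed schema): whatever your stack, the human agent should receive the full turn history and everything already collected.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Everything a human agent needs to pick up mid-conversation."""
    reason: str                                           # e.g. "customer_request"
    transcript: list[dict] = field(default_factory=list)  # full turn history
    collected: dict = field(default_factory=dict)         # order id, name, intent...

def should_escalate(turn: dict) -> bool:
    # Illustrative triggers only; production systems typically combine
    # sentiment signals, explicit requests, and repeated failed turns.
    text = turn.get("text", "").lower()
    return "speak to a human" in text or turn.get("failed_turns", 0) >= 2

turn = {"text": "Let me speak to a human, please", "failed_turns": 1}
if should_escalate(turn):
    payload = Handoff(
        reason="customer_request",
        transcript=[turn],
        collected={"order_id": "A-1042"},
    )
    print(payload)  # route this to your live-agent queue, context included
```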
Sales and lead qualification
Outbound sales is another strong fit. Voice agents can call a list of prospects, ask qualification questions, and book meetings with human reps when someone meets the criteria. This works because the conversation is structured—there's a predictable flow that doesn't require deep reasoning.
The specific accuracy challenge in sales is proper noun recognition. Company names, people's names, and product titles need to be transcribed correctly for the CRM data to be useful. A lead record that says "works at Corel" instead of "works at Coral" creates downstream problems that are hard to catch and harder to fix. AssemblyAI's streaming speech-to-text is specifically optimized for entity accuracy—emails, phone numbers, and proper nouns—which is where most competing models break down.
Healthcare scheduling and intake
Healthcare is one of the most demanding environments for voice agents. The workflows—appointment scheduling, medication reminders, patient intake—are highly repeatable, which makes them a good structural fit. But the accuracy requirements are stricter than general customer service.
Drug names, dosage instructions, and medical condition names need to be transcribed precisely. AssemblyAI's Medical Mode is designed for this: it delivers significantly better accuracy on medical terminology including medications, procedures, conditions, and dosages, and is available in English, Spanish, German, and French.
Healthcare deployments also require careful handling of protected health information (PHI). AssemblyAI enables covered entities and their business associates subject to HIPAA to use its services to process PHI: it is considered a business associate under HIPAA and offers the Business Associate Addendum (BAA) required to ensure appropriate safeguarding of PHI.
How do you build a voice agent?
The fundamental build decision is this: assemble your own multi-vendor stack or use a unified API that handles the pipeline for you.
The DIY multi-vendor approach means choosing a separate STT provider, LLM provider, and TTS provider, then writing the orchestration layer that connects them, handles errors, manages latency, and reconciles three separate billing systems. This gives you granular control but introduces significant complexity. Debugging is painful because a broken response could originate at any of the three layers.
The unified API approach collapses that complexity. One connection handles the full pipeline—one bill, one set of logs, one integration point. AssemblyAI's Voice Agent API is built on this model: a single WebSocket API that handles STT, LLM reasoning, and TTS for $4.50/hr flat—no token math, no separate input/output charges. It's built on Universal-3 Pro Streaming as the speech accuracy foundation, with turn detection, interruption handling, and tool calling included. The design philosophy is invisible infrastructure—you configure your agent's behavior, connect your backend tools, and build your product without managing the voice plumbing underneath. Your customers should feel like you built it from scratch.
The setup is minimal. Connect to the WebSocket, send a configuration message, and start streaming audio:
```python
import asyncio
import websockets
import json

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": "You are a helpful customer service agent.",
                "output": {
                    "voice": "ivy"
                },
                "tools": [{
                    "type": "function",
                    "name": "check_order",
                    "description": "Look up order status by order ID",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "order_id": {"type": "string"}
                        },
                        "required": ["order_id"]
                    }
                }]
            }
        }))
        # Stream audio in, get audio back
        ...

asyncio.run(run_agent())
```
Key elements you'll configure regardless of approach:
- System prompt: Defines the agent's personality, scope, and behavior—and can be updated mid-conversation without reconnecting
- Tool calling: Connects the agent to your backend systems—databases, CRMs, booking platforms—via JSON Schema
- Voice selection: Chooses the TTS voice that fits your brand's tone, configurable under session.output.voice
- Turn detection settings: Controls how the agent handles pauses and interruptions, including configurable VAD threshold, min/max turn silence, and interrupt duration
For noisy environments specifically, the Voice Agent API supports dynamic mid-session configuration via UpdateConfiguration. You can change VAD thresholds, turn silence timing, and domain vocabulary on the fly—increasing max_turn_silence to 3000ms while a customer dictates a credit card number, then dropping back to 1000ms for conversational turns. The keyterms_prompt parameter lets you pass up to 100 domain-specific terms (menu items, product names, medical terms) and update them dynamically as the conversation progresses through different stages.
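Here's what that mid-session switch might look like over the same WebSocket as the setup example. The parameter names (`max_turn_silence`, `keyterms_prompt`) come from above, but the exact message schema in this sketch is an assumption; consult the API reference for the real shape.

```python
import json

async def enter_dictation_mode(ws, terms: list[str]):
    # Loosen turn detection while the caller dictates digit-by-digit,
    # and load stage-specific vocabulary. Message shape is illustrative.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {"max_turn_silence": 3000},  # ms
            "keyterms_prompt": terms,                      # up to 100 terms
        },
    }))

async def exit_dictation_mode(ws):
    # Snap back to snappy conversational turn-taking.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"turn_detection": {"max_turn_silence": 1000}},  # ms
    }))
```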
Session resumption is another critical feature for unreliable environments: if the WebSocket drops—common in field deployments with spotty connectivity—you can reconnect within 30 seconds and pick up where the conversation left off, with full context preserved.
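A reconnect loop for that scenario might look like the sketch below. The 30-second window is from the docs; how the prior session is referenced on reconnect (here, a `session_id` query parameter) is an assumption for illustration.

```python
import asyncio
import websockets

async def reconnect_with_resume(url: str, headers: dict, session_id: str):
    """Retry with backoff, staying well inside the 30 s resumption window.

    The session_id query parameter is illustrative; check the docs for
    the real resumption handshake.
    """
    for attempt in range(4):
        try:
            return await websockets.connect(
                f"{url}?session_id={session_id}",
                additional_headers=headers,
            )
        except OSError:
            await asyncio.sleep(min(2 ** attempt, 8))  # 1, 2, 4, 8 s
    raise ConnectionError("gave up before the resumption window closed")
```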
The Voice Agent API works with LiveKit and Pipecat for teams already using those orchestration frameworks. It also works well with Claude Code—simple enough that you can copy the docs, paste them in, and build a working agent the same afternoon.
What are the limitations of voice agents?
Voice agents are genuinely capable—but setting realistic expectations before you build matters. Data from AssemblyAI's Voice Agent Report paints a clear picture: while 82% of builders feel confident in their technology, 55% of end users report frustration with interruptions. The top user complaints—repetition, mishearing, and interruptions—are all symptoms of the same root cause: insufficient speech accuracy in real-world conditions. On the builder side, the top challenges are accuracy (50%), integration complexity (45%), and costs (42%).
- Real-world accuracy is harder than benchmarks suggest: Clean test data doesn't reflect what happens when you add phone compression, background noise, or fast regional speech. Always evaluate with audio from your actual deployment environment—a model that scores well on LibriSpeech may struggle with your contact center audio.
- Latency has a floor you can't engineer away: Total response time—from the user stopping speaking to the agent starting to respond—typically falls between one and two seconds for well-optimized pipelines (see the worked budget after this list). That feels natural for structured workflows but can feel slightly mechanical in rapid back-and-forth exchanges.
- Complex reasoning requires guardrails: LLMs handle conversational language well but can struggle with strict procedural logic. Production systems often pair LLM-generated responses with deterministic logic for anything where precision is critical—payment confirmation, dosage instructions, legal disclosures.
- User acceptance varies by context: Many users will happily interact with voice agents for routine tasks. But when topics become sensitive, the expectation shifts—low-friction escalation to a human isn't optional, it's required.
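As a rough worked example of where those one to two seconds go, here's an illustrative per-turn budget. Every line item is an assumption to benchmark your own stack against, not a quoted vendor figure.

```python
# Illustrative per-turn latency budget in milliseconds. Every number
# here is an assumption to measure against, not a quoted figure.
budget_ms = {
    "endpointing (silence before the turn is final)": 500,
    "STT final transcript": 150,
    "LLM time-to-first-token": 400,
    "TTS time-to-first-audio-byte": 200,
    "network round trips": 150,
}
total = sum(budget_ms.values())
print(f"first audio after ~{total} ms")  # ~1400 ms, inside the 1-2 s range
```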
Building with these constraints in mind from day one produces significantly better outcomes than retrofitting solutions after launch.
Final words
Voice agents combine speech-to-text, a large language model, and text-to-speech into a pipeline that handles real spoken conversations at scale. Speech accuracy sets the ceiling for every stage that follows—and in noisy environments, that ceiling drops fast unless your infrastructure is built for it.
For teams building voice agents that need to work in drive-thrus, contact centers, and the field, AssemblyAI's Voice Agent API offers a unified infrastructure path—one WebSocket, one bill at $4.50/hr, and speech accuracy built on Universal-3 Pro Streaming with acoustic turn detection at the model level. Add in built-in noise suppression, configurable VAD, dynamic mid-session settings, and session resumption for unreliable connections, and you have infrastructure designed for the environments where voice agents are hardest to get right—and most valuable when they work.
Frequently asked questions about voice agents
What is the difference between a voice agent and a chatbot?
A voice agent processes spoken audio in real time using a speech-to-text → LLM → text-to-speech pipeline, while a chatbot handles text-based input and output. The core difference is modality—voice introduces real-time audio processing and latency constraints that text-based systems don't face.
What makes a voice agent sound natural rather than robotic?
Naturalness comes from three things working well together: accurate speech recognition, contextually appropriate LLM responses, and TTS with human-like prosody. Turn detection is often the biggest factor—poor turn detection is why many agents feel abrupt or stilted, cutting users off mid-thought or sitting in dead silence.
How accurate does speech-to-text need to be for a voice agent to work well?
It depends on the use case—general customer service tolerates slightly higher word error rates, while medical or financial applications require near-perfect recognition of terminology like drug names and account numbers. The practical rule: test with audio from your actual deployment environment, not clean recordings.
What is the typical end-to-end latency for a voice agent?
End-to-end latency—from the user finishing speaking to the agent beginning to respond—typically falls between one and two seconds for well-optimized pipelines. Streaming architectures, where STT delivers partial transcripts and TTS starts generating audio before the full LLM response is ready, are the standard approach for keeping latency within that range.
Can voice agents handle multiple languages in a single conversation?
Yes—with the right underlying speech-to-text model. AssemblyAI's Universal-3 Pro Streaming covers six languages—English, Spanish, French, German, Italian, and Portuguese—with regional dialect recognition for higher accuracy within those languages. For broader language coverage, AssemblyAI's Universal-2 model supports over 99 languages with automatic language detection and code switching for speakers who move between languages mid-conversation.
What is a conversational AI voice agent?
A conversational AI voice agent is a voice agent capable of multi-turn dialogue—maintaining context across several exchanges rather than treating each utterance as a standalone query. This is what separates modern voice agents from older IVR systems, which reset context with every menu selection and had no memory of what you said ten seconds earlier.
How do voice agents handle noisy environments like drive-thrus or contact centers?
Noise handling requires a multi-layered approach: noise suppression on the audio input to clean the signal, VAD tuning calibrated for the specific environment to prevent false speech detections from background sounds, and turn detection logic that distinguishes between ambient noise and actual speech. The most effective systems handle noise at the model level rather than relying solely on client-side preprocessing, which can introduce artifacts that degrade transcription accuracy.