What actually makes a good Voice Agent

Insights from 455 voice agent builders about accuracy, ROI, and the gap between confidence and execution
01
The Voice Agent Boom
02
The Experience Gap
03
The Confidence Trap
04
What Winners Do Differently
05
What's Next

Executive Summary

Voice agent investment is surging, with 8x funding growth in 2024 and a projected $47.5B market by 2034. Yet user satisfaction remains stubbornly low. We surveyed 455 builders to understand this gap between expectations and experience.

The core finding: The teams winning with voice agents aren't the ones with the biggest budgets or most advanced models. They're the ones who figured out that success requires solving fundamental user experience problems first.

Other Key Findings
Actively building voice agents
87.5%
Top user frustration: repeating themselves
55%
Rate speech-to-text accuracy as critical
76%
Choose hybrid build approaches
44%
Measure ROI primarily by cost savings
50%

This report breaks down where teams struggle, what successful implementations look like, and why accuracy beats cost optimization every time.

Chapter one

The Voice Agent Boom

Technical and economic decisions that matter
13%
Only 13% of those surveyed aren't building or implementing voice agents
*
If you aren't (at the very least) experimenting with building today, you're behind.

The numbers don’t lie. The voice agent market is exploding.

  1. Voice agent market will grow from $2.4B in 2024 to $47.5B by 2034, a 34.8% compound annual growth rate.
  2. Voice AI funding exploded 8x in 2024, reaching $2.1B.
  3. 22% of Y Combinator's recent cohort is building with voice, up 70% from the previous year.

Yet user satisfaction rates remain surprisingly low. This report focuses on that gap between promise and performance.

Voice agent experience level
Built & Implemented
40%
Built/Tried to Build
30%
Implemented Third-Party
17.5%
Evaluated Options
7.5%
Research Only
5%
We didn't survey spectators. 87.5% are already in the game.

This isn't a survey of curious observers. These are practitioners deep in the work.

Voice agents will become table-stakes within 12–24 months and the primary interface for our product within 3–5 years.
Voice agent lead, Nexlify AI
See why AssemblyAI is the platform powering the next generation of voice agents.
Try it now
Chapter Two

The Experience Gap

55%
cite "having to repeat themselves" as their top frustration and the #1 barrier to adoption
32.5%
say their users prefer human interactions over voice agents

The barriers keeping people from using voice agents are experiential, not theoretical. Here’s what makes users abandon voice agents and what these frustrations cost companies.

The repeat-yourself problem

Top user frustrations
55%
Having to repeat themselves
45%
Frequently misheard words
When users have to say it twice, they start wondering why they're not just typing.

This cluster affects more than half of users and represents the foundational problem that undermines everything else. When users speak, the system mishears. Users rephrase. The system still doesn't understand. Users give up or demand a human.

When the core promise is convenience and efficiency, forcing users to repeat themselves destroys all value.

When the system won’t shut up

The interruption frustration
47.5%
System interrupts mid-sentence
It's easy to ignore a pop-up. It's harder to ignore getting cut off mid-sentence.

You can scroll past a visual interruption, but you can't unhear a voice agent cutting you off mid-thought. That intrusion breaks conversational flow and feels disrespectful.

Well-designed turn detection must distinguish breath pauses, thinking pauses, and background noise from an actual end of turn, and acknowledgment sounds ("uh-huh") from actual responses. Get this wrong in either direction (cut users off or wait too long) and you destroy conversational flow, derailing the entire interaction.
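That trade-off can be sketched as a toy end-of-turn heuristic: trailing silence alone is not enough, so the transcript itself has to inform how long to wait. The thresholds and word lists below are illustrative assumptions, not tuned production values.

```python
# Toy end-of-turn heuristic combining trailing silence with a crude
# "incomplete thought" check on the partial transcript.
# All thresholds and word lists are illustrative assumptions.

BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "okay"}
TRAILING_FILLERS = {"uh", "um", "so", "and", "but", "because"}

def is_end_of_turn(partial_transcript: str, silence_ms: int,
                   base_threshold_ms: int = 700) -> bool:
    words = partial_transcript.lower().rstrip(".?!").split()
    if not words:
        return False
    # Pure backchannels ("uh-huh") are acknowledgments, not turns to answer.
    if all(w in BACKCHANNELS for w in words):
        return False
    # A trailing filler or conjunction suggests an unfinished clause:
    # wait longer before treating the pause as an end of turn.
    threshold = base_threshold_ms
    if words[-1] in TRAILING_FILLERS:
        threshold *= 2
    return silence_ms >= threshold
```

A real system would layer acoustic cues (pitch, energy) and a learned model on top, but even this sketch shows why a single fixed silence timeout fails in both directions.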

What else goes wrong

Other user issues
Background noise issues
40%
Robotic/unnatural voice
37.5%
Long pauses
35%
Limited context understanding
30%
Accent/dialect difficulty
27.5%
Accuracy gets the headlines. These issues do the quiet damage.
Users hate repeating themselves (e.g., rephrasing because the agent misses accents or slang). Rigid scripts feel robotic. Vague error messages ('I don't understand') leave users stuck.
EHS Specialist, Exxon Mobil Corporation

The price of frustration

Nearly a third of respondents say their users prefer human interactions over AI. That's a problem you pay for.

Customer churn
Nearly everyone (95% of respondents) has been frustrated with voice agents at some point.
Support escalations
Poor accuracy drives human handoffs, which eliminates cost savings.
Reputation damage
Users share bad experiences, creating adoption resistance and harming customer trust.
Lost revenue
Bad experiences train users to avoid agents altogether, killing ROI before it starts.
Chapter Three

The Confidence Trap

82.5%
feel confident building voice agents
75%
struggle with technical reliability barriers

Here's the disconnect: 82.5% of builders feel confident building voice agents. Yet 75% of those same teams report struggling with technical reliability barriers like accuracy issues, integration challenges, and cost overruns that compound into the user frustrations we just covered.

Confidence isn't the problem. Teams know how to build. What they underestimate is how accuracy failures, integration headaches, and budget constraints reinforce each other in production.

Three problems, one vicious cycle

Top building challenges
Accuracy/misunderstandings
52.5%
Integration difficulty
45%
High costs
42.5%
Fix one, and the other two pull you back. 
Successful teams tackle all three at once.

Why these compound: Teams that try to solve them sequentially (first make it accurate, then integrate, then reduce costs) consistently fail.

  • Accuracy failures (52.5%) directly correlate with user frustration. Real-world WER often runs 2–3x worse than on clean benchmark data, and critical tokens (names, emails, addresses, jargon) matter more than generic benchmark scores. When accuracy fails, costs spike through human escalations, customer churn, and rework.
  • Integration difficulty (45%) extends timelines and inflates costs. Voice agents must preserve context when escalating to humans and work seamlessly within existing workflows; systems built in isolation fail.
  • High costs (42.5%) prevent teams from solving accuracy. Hidden costs multiply quickly: system integration ($1K–50K), training ($500–2K), compliance add-ons, and MVP development ($40K–100K+ for a basic agent). Cost optimization too early creates agents users avoid.
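The point about critical tokens versus generic scores is easy to demonstrate. In this sketch, two hypothetical transcripts are scored with standard word error rate (WER); the one with the *lower* WER is the one that loses the name that actually matters.

```python
# Why generic WER can hide critical-token failures.
# The transcripts below are invented examples, not survey data.

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

ref   = "please email the invoice to priya at acme dot com today"
hyp_a = "please email the invoice to prius at acme dot com today"  # name lost
hyp_b = "please email that invoice to priya at acme dot com now"   # name kept

# hyp_a scores better on WER (one error vs two), yet it dropped the one
# token the downstream action depends on.
assert wer(ref, hyp_a) < wer(ref, hyp_b)
assert "priya" in hyp_b.split() and "priya" not in hyp_a.split()
```

This is why teams that only chase an aggregate accuracy number still ship agents that make users repeat themselves: the errors cluster on exactly the tokens that drive actions.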
[Our] design came from a failed prototype. We built something too flashy and it collapsed under latency. So we stripped it back, kept only what made callers feel 'heard,' and rebuilt from a simple streaming ASR → NLU → action loop. That failure taught us clarity beats cleverness.
CEO, FreightSignal Systems
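The "simple streaming ASR → NLU → action loop" that quote describes can be sketched in a few lines. The intent rules and handler replies below are hypothetical stand-ins, not any vendor's API.

```python
# Minimal sketch of a streaming ASR -> NLU -> action loop.
# classify_intent and HANDLERS are hypothetical stand-ins for a real
# NLU model and business-logic layer.

def classify_intent(utterance: str) -> str:
    u = utterance.lower()
    if "where" in u or "track" in u:
        return "track_shipment"
    if "cancel" in u:
        return "cancel_order"
    return "fallback"

HANDLERS = {
    "track_shipment": lambda u: "Your shipment is in transit.",
    "cancel_order":   lambda u: "Okay, cancelling that order.",
    "fallback":       lambda u: "Could you tell me a bit more?",
}

def run_agent(asr_stream):
    """Consume finalized transcripts from a streaming ASR, act on each turn."""
    replies = []
    for utterance in asr_stream:  # each item: one finalized user turn
        intent = classify_intent(utterance)
        replies.append(HANDLERS[intent](utterance))
    return replies
```

The structural point survives the simplification: each stage is small and replaceable, and the fallback path is explicit, which is what makes callers feel heard when the NLU misses.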
More hurdles ahead
Skills gaps
25%
Even among confident teams, one in four admits they lack necessary skills spanning NLP engineering, voice UX design, and conversational AI architecture.
Security & compliance
62.5%
Voice data contains sensitive personal information. HIPAA and SOC-2 certification add significant expenses. Teams must solve this before deployment, not after.
Reliability and uptime
27.5%
Voice agents must work consistently, not just during demos. Cascading failures mean if speech-to-text, LLM, or text-to-speech fails, the entire agent fails.
See why AssemblyAI is the platform powering the next generation of voice agents.
Try it now
Chapter Four

What Winners Do Differently

76%
of participants chose speech-to-text accuracy and 74% chose conversational understanding as non-negotiable requirements
62%
of participants cited cost-effectiveness as a primary decision driver, making it the lowest-ranked priority
What teams prioritize
Speech-to-text accuracy
76%
Conversational understanding
74%
Response speed/low latency
74%
Integration capabilities
70%
Background noise handling
66%
Natural-sounding voice
66%
Accent/dialect support
62%
Cost-effectiveness
62%
Cost-effectiveness ranks dead last. Teams know what actually moves the needle.

Teams prioritize quality over price. A cheap but inaccurate agent creates more problems than it solves. Users who repeat themselves 3x don't care that the interaction only costs $0.05.

Build, buy, or both?

How teams build
44%
Hybrid (Custom + Vendor)
30%
Third-Party Platform
22.5%
In-House (Fully Custom)
3.5%
Still Evaluating
Hybrid wins. Let vendors solve the hard infrastructure problems; save your engineering hours for differentiation.

Hybrid approaches lead the pack, which suggests teams want the flexibility of custom logic with the reliability of vendor infrastructure. The 30% using third-party platforms prioritize speed-to-market, while the 22.5% going fully custom value maximum control.

What gets measured
Accuracy targets (>95%)
47.5%
First call resolution
40%
CSAT/NPS improvement
37.5%
Low-latency targets
35%
High containment rate
32.5%
Cost per interaction
30%
Reduced handling time
27.5%
Nearly half of teams target >95% accuracy as their north star. Everything else follows from getting transcription right.

Nearly half of the respondents target >95% transcription accuracy as a key performance indicator. 92.5% measure ROI through either cost savings or customer satisfaction, the two primary value drivers.

We chose a hybrid approach because fully custom was too slow to build, and vendor-only solutions were too limited. The mix gave us both speed and flexibility.
CTO, NextWave Solutions
The ROI divide
Teams finding ROI with voice agents:
  • Invest in accuracy first because it reduces long-term costs through fewer escalations
  • Track multiple metrics, not just one dimension of success
  • Balance cost efficiency with user experience
  • Demonstrate improvements within 60–90 days
Teams struggling to find ROI with voice agents:
  • Deploy without clear success criteria
  • Optimize only for cost while ignoring satisfaction
  • Measure vanity metrics that don't tie to business outcomes
  • Try to cut costs before solving accuracy
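The escalation point can be made concrete with back-of-envelope arithmetic. All figures below are hypothetical assumptions, not survey data: they only illustrate how handoffs erase per-interaction savings.

```python
# Back-of-envelope: a "cheap" agent that escalates often can cost more per
# interaction than a pricier, more accurate one.
# All dollar figures and containment rates are hypothetical assumptions.

def blended_cost(agent_cost: float, human_cost: float,
                 containment: float) -> float:
    """Expected cost per interaction when (1 - containment) escalate."""
    return agent_cost + (1 - containment) * human_cost

# Cheap but inaccurate: $0.05/call, but only 60% of calls stay contained.
cheap = blended_cost(agent_cost=0.05, human_cost=6.00, containment=0.60)
# Pricier but accurate: $0.20/call, 90% containment.
accurate = blended_cost(agent_cost=0.20, human_cost=6.00, containment=0.90)

# cheap  = 0.05 + 0.40 * 6.00 = 2.45
# accurate = 0.20 + 0.10 * 6.00 = 0.80
assert accurate < cheap
```

Under these assumptions the "expensive" agent is roughly a third of the blended cost, which is the arithmetic behind "invest in accuracy first."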
Chapter Five

What's Next


The vast majority describe voice agents as "critical" or "a game-changer" with a 2–5 year horizon for mainstream adoption.

Where the market is headed

We asked builders how important voice agents will be for their products and customers moving forward. Their answers paint a clear picture.

Competitive differentiation
From alternative to primary interface
It will be indispensable for customer satisfaction and loyalty: Our clients rely on personalized, fast service to compete with larger brands. A robust voice agent will let them offer 24/7 support, instant query resolution, and tailored interactions—turning one-time customers into repeat buyers.
Conversational AI Engineer, Vida Global Inc
Voice agents will be huge. As expectations for instant, human-like support grow, they'll become the frontline handling scale, saving time, and keeping things personal without burning out teams.
Machine Learning Engineer, Replicant
Operational transformation
Accessibility and inclusion
It will drive operational scalability for our business—automating high-volume, repetitive customer interactions will free up our team to focus on complex tasks, while expanding our ability to serve more customers without proportional increases in resources.
Head of Engineering, Cartesia Labs
Voice will make AI accessible to the 1+ billion people with low literacy.
Manager, Page Parkes
Conclusion

So what actually makes a good voice agent?


The market has reached an inflection point: widespread adoption but persistent disappointment. 82.5% of builders feel confident, yet users remain frustrated with the fundamentals. What does it actually take to get this right?

Straight from the builders

A good voice agent is one that is easily promptable, one that has good end-of-thought so that it does not interrupt the customer but knows when to talk, has reliable transitions between phrases, and has good pronunciation of various phrases with decent pauses and good time to first audio.
Senior Founding Engineer, AviaryAI

It understands you without making you work for it. You don't have to speak formally or repeat yourself 3 times. It gets slang, accents, mumbles, or even messy follow-ups. And it remembers context: if you ask a question then say 'Do they deliver?', it doesn't make you clarify—it connects the dots, just like a person would.
EHS Specialist, Exxon Mobil Corporation

A good voice agent understands what you mean, not just what you say, and gets things done without making you repeat yourself. It's smart, but never robotic. Just smooth, reliable, and easy to trust.
Machine Learning Engineer, Replicant

A good voice agent feels like a competent, patient human who actually listens and never makes you repeat yourself.
AI Product Lead, Acme Health
The non-negotiables
Technical foundation 
(necessary but not sufficient)
Accurate speech-to-text, low latency, and reliable uptime. Necessary, but not enough on their own: 95% of respondents have still been frustrated with voice agents at some point.
User experience foundation 
(where most fail)
Don't make users repeat themselves. Get turn-taking right so agents aren’t interrupting. Understand intent, not just words. Retain context and respond with appropriate tone.
Organizational foundation
Track ROI metrics that cover both cost and satisfaction. Solve accuracy, integration, and cost simultaneously rather than sequentially. Set realistic expectations about what voice agents can and can't do today.
The bottom line
A good voice agent is one that people want to use.
AssemblyAI is the leading platform for Voice AI. Build new AI products with voice data leveraging AssemblyAI’s industry-leading Voice AI models for accurate speech-to-text, speaker detection, sentiment analysis, chapter detection, PII redaction, and more.

Methodology

455 survey responses
Conducted Q4 2024 and Q1 2025
Industries represented:
Technology/Software (dominant), Healthcare, Automotive, Financial Services, Telecommunications, Energy, Retail/E-commerce
Respondents ranged from Fortune 500 companies to voice AI specialists.
Respondent profile
Hands-on building experience
70%
Actively building (not just evaluating)
87.5%
Seniority distribution
Director-level or above
38%
Engineering/Technical Leads
30%
Managers
22%
Individual contributors
10%
A special thank you to Rime for helping distribute the survey.