July 8, 2026

How accurate is speech-to-text in 2026?

Discover speech-to-text accuracy rates in 2026, measurement methods, real-world benchmarks, and optimization strategies for developers building voice-enabled applications.

Kelsey Foster

Growth

Speech-to-Text

Automatic Speech Recognition

Reviewed by

Table of contents

[Visible on live site]

Speech-to-text accuracy tops out around 95-98% word accuracy on clean audio for the best models in 2026, but that headline number hides most of the real story. Accuracy swings dramatically once you move to accented speech, background noise, code-switching, and the messy conversational audio that voice agents actually hear. This guide covers where accuracy really stands, how it's measured, how the leading models benchmark head-to-head, and how to optimize transcription quality in your own production environment.

Speech-to-text accuracy determines whether AI applications succeed or fail in production. As early research demonstrated, there's a direct correlation between lower error rates and a user's ability to complete tasks. Whether you're building meeting transcription, contact center analytics, or voice agents, accuracy directly shapes how users experience your product—and whether they stick with it.

The most accurate speech-to-text model we ship today is Universal-3.5 Pro, our new flagship async model, and its realtime counterpart, Universal-3.5 Pro Realtime. Both cover 18 languages with mid-sentence code-switching, and both lead the field on the benchmarks that matter for real audio—not just clean read speech. We'll get to the numbers below.

What is speech-to-text accuracy?

Speech-to-text accuracy is the measure of how precisely a model converts spoken words into written text—expressed as a percentage, where 100% means a perfect, error-free transcript. It's the foundational metric for evaluating any speech recognition system, and it directly determines whether your application produces output that's useful or frustrating. A difference of just 5-10 percentage points can be the difference between a transcript users trust and one they have to manually correct.

But here's where it gets interesting—accuracy isn't just about getting words right. Modern systems have to handle punctuation, capitalization, speaker changes, background noise, and context-dependent phrases. A model might correctly transcribe "there," "their," and "they're" phonetically but still fail if it picks the wrong spelling for the context.

An 85% accurate system produces about 15 errors per 100 words, making transcripts hard to read and requiring significant manual cleanup. A 95% accurate system produces only 5 errors per 100 words—often just minor punctuation or formatting issues that don't impede understanding.

How is speech-to-text accuracy measured?

Understanding how accuracy is measured helps you evaluate providers, set realistic expectations, and pick the right metric for your use case.

Word Error Rate (WER)

The industry standard for measuring speech recognition accuracy is Word Error Rate (WER). It calculates the percentage of words that are incorrectly transcribed—substituted, inserted, or deleted.

WER formula: (Substitutions + Insertions + Deletions) / Total words in reference × 100

Example calculation:

Reference transcript: "The quick brown fox jumps over the lazy dog" (9 words)
AI transcript: "The quick brown fox jumped over a lazy dog" (9 words)
Errors: 1 substitution ("jumps" → "jumped"), 1 substitution ("the" → "a")
WER: (2 errors ÷ 9 total words) × 100 = 22.2%
Accuracy: 100% - 22.2% = 77.8%

Test Speech-to-Text Accuracy on Your Audio

Upload a file to see how our models transcribe real-world accents, noise, and conversational speech—not just clean benchmarks. No code required.

Try playground

Beyond WER: real-world accuracy metrics

WER gives you a standardized comparison, but it doesn't tell the complete story. Other metrics matter:

Character Error Rate (CER): Measures accuracy at the character level rather than the word level. Useful for languages without clear word boundaries.

Semantic accuracy: Evaluates whether meaning is preserved, even if specific words differ. "Cannot" vs "can't" might register as a WER error but conveys identical meaning.

Semantic WER: An emerging metric that uses an LLM as a judge to evaluate whether meaning is preserved, rather than checking word-for-word. Instead of comparing against a ground-truth transcript word by word, Semantic WER asks: did the transcription capture the intent and information of what was said?

Diarization accuracy (cpWER): For multi-speaker audio, concatenated minimum-permutation WER measures how well the model transcribes and attributes words to the right speaker. It's a tougher, more useful signal than raw diarization error rate (DER) because it penalizes both transcription and speaker-attribution mistakes together.

Domain-specific accuracy: How well the system handles specialized terminology in medical, legal, or technical domains—often measured with keyword WER or a missed-entity rate.

This distinction matters enormously for AI-native applications. When a voice agent passes a transcript to an LLM, a substitution like "yep" for "yes" or "cannot" for "can't" has zero impact on what the LLM understands—but both register as errors in traditional WER. Frameworks like Pipecat's open-source STT benchmark are standardizing Semantic WER, using reasoning models as judges to reduce scoring bias.

One practical insight: in voice agent contexts, a substitution (a plausible guess) is often preferable to a deletion (a missed word). A deletion can cause a "hanging" turn where the agent receives nothing and the conversation stalls. Traditional WER treats both identically (S=1, D=1), but Semantic WER and use-case-specific weighting can reflect that difference.

Metric	What it measures	Best used for
Word Error Rate (WER)	Percentage of incorrectly transcribed words	General comparisons, human-readable transcripts
Character Error Rate (CER)	Character-level accuracy	Languages without clear word boundaries
Semantic WER	Meaning preservation regardless of exact words	Voice agents, LLM-powered pipelines
Keyword WER (KW_WER)	Accuracy on critical domain terms	Medical, legal, and technical transcription
cpWER	Transcription plus speaker attribution	Multi-speaker diarization (meetings, calls)

We've argued at length that leaning on WER alone gives you a misleading picture of quality—there's a full breakdown in Word error rate is broken, and our published head-to-head numbers live on the benchmarks page.

The most accurate speech-to-text models in 2026: a head-to-head

The most accurate speech-to-text model depends on what you're transcribing, but on the hardest real-world conditions—multilingual, code-switched, and multi-speaker audio—Universal-3.5 Pro leads the models we've measured. Here's how it stacks up against Deepgram Nova-3, OpenAI's GPT-4o Transcribe, and ElevenLabs Scribe v2 on the benchmarks that reflect production audio rather than clean read speech.

Async: code-switching normalized WER

Code-switching—switching languages mid-sentence, like Hinglish or Spanglish—is one of the harshest tests of a transcription model. This is normalized WER averaged across five language pairs (lower is better).

Model	Average normalized WER
Universal-3.5 Pro	7.69%
ElevenLabs Scribe v2	8.77%
Universal-3 Pro	9.07%
Soniox v5 Async	10.01%
Deepgram Nova-3 Multilingual	12.22%
Voxtral Mini Transcribe V2	17.91%
Grok STT Batch	35.43%
OpenAI GPT-4o Transcribe	44.58%

Two things stand out. Universal-3.5 Pro leads the field, and the spread between models is enormous—GPT-4o Transcribe scores 44.58% here, roughly six times the error rate of Universal-3.5 Pro on the same task. If your audio involves any language mixing, the model you pick matters far more than a clean-English benchmark would suggest.

Async: speaker diarization (cpWER)

For meetings, interviews, and multi-speaker calls, you need the model to attribute words to the right speaker. Universal-3.5 Pro uses a joint transcript-plus-speaker model optimized for cpWER (averaged across conditions, lower is better).

Model	Average cpWER
Universal-3.5 Pro	30.17
ElevenLabs Scribe v2	35.26
Gladia	36.87
Voxtral Mini Transcribe V2	37.52
Deepgram Nova-3 English	37.92
Grok STT Batch	42.58
Soniox v5 Async	44.58

Realtime: WER on real agent conversations (Pipecat)

For streaming and voice agents, we measure against Pipecat's open STT benchmark, which uses real agent conversations rather than read speech. Universal-3.5 Pro Realtime posts a pooled WER of 6.99%—market-leading among the realtime models we've tested (lower is better).

Metric	Universal-3.5 Pro RT	Deepgram Flux	ElevenLabs Scribe v2	Google Chirp3
Word error rate	6.99%	15.58%	9.76%	9.04%
Entity error rate	15.31%	50.50%	39.70%	21.51%
Names	16.92%	39.21%	38.03%	22.10%
Places	6.28%	14.86%	34.06%	10.04%
Phone numbers	3.55%	10.41%	4.78%	4.95%

The entity numbers are where the gap gets practical. When a caller reads out a phone number or a name, the difference between a 3.55% and a 10%+ error rate is the difference between a voice agent that gets the booking right and one that has to ask the caller to repeat themselves. We deliberately avoid anchoring accuracy claims on any single clean-read dataset—these benchmarks use conversational and code-switched audio because that's what your users actually produce. The full methodology and updated tables live on our benchmarks page.

See the Numbers on Your Own Audio

Run Universal-3.5 Pro against a file that represents your real use case and compare the accuracy yourself. Start free with clear docs and developer-friendly APIs.

Accuracy on accents, noise, and real-world audio

Clean benchmarks are the easy part. The accuracy you'll actually experience depends on accents, background noise, cross-talk, and microphone quality—the conditions benchmarks rarely capture. Universal-3.5 Pro ships several features built specifically to hold accuracy up in messy conditions.

Context carryover: Universal-3.5 Pro Realtime keeps a rolling memory of the conversation so far, on by default with nothing to configure. It uses earlier turns to disambiguate later ones—so once a caller has said an unusual name or product, the model is far less likely to mangle it the second time.

Agent context: For voice agents, you can pass the agent's question to the model with agent_context, so it interprets the reply through the lens of what was just asked. Across 20,000 voice-agent files, passing agent context cut WER by 10.2%, with the biggest gains on exactly the hard cases: fabrications down 18.3%, hallucinations down 17.2%, place entities down 15.5%, and short utterances down 13.7%. One team that paired agent context with prompting cut its utterance error rate from 26% to 9% on production audio.

Voice focus: The voice_focus feature isolates the primary speaker in noisy environments—a near_field mode for headsets and phones, and a far_field mode for rooms, kiosks, and drive-thrus.

Contextual prompting: You can prime the async model with domain or prior context. In an internal healthcare test, feeding prior-visit notes cut missed medical terms by 31%. One customer, Metaview, saw low-confidence tokens fall roughly 47% by supplying meeting metadata as context.

The practical takeaway hasn't changed: always test any model with audio that represents your actual use case. Benchmark scores are a starting point, not a verdict.

WER vs real-world accuracy: reading benchmarks honestly

There's often a significant gap between the accuracy numbers in marketing materials and what you'll experience in production. Benchmarks use clean, standardized audio; the real world is messy. Companies like Veed and CallSource that build products on Voice AI know real-world performance is the only metric that matters for user satisfaction. Your users aren't in a recording studio—they're on conference calls with spotty internet, in noisy cars, or on low-quality mics.

WER also penalizes formatting choices. A model that transcribes "I cannot" scores differently than one that outputs "I can't," even when both are correct. Models that add punctuation, capitalize proper nouns, or annotate speaker labels can score a higher WER against a bare ground truth—despite producing a more useful transcript.

Ground-truth quality is another hidden variable. Human reference files contain inconsistencies—missed disfluencies, formatting preferences, ordering differences—that inflate WER for models that transcribe more faithfully. When reviewing benchmark claims, ask: how was the ground truth generated, and was it normalized before scoring? Those choices can shift WER by several percentage points and make cross-provider comparisons misleading without an identical, controlled setup. We dig into why the metric breaks down and what to use instead in Word error rate is broken.

Scenario	Typical accuracy range	Key challenges
Clean studio recording	95-98%	Minimal background noise, clear speech
Video conference calls	85-92%	Network compression, microphone quality
Phone conversations	80-88%	Audio compression, line quality
Noisy environments	70-85%	Background noise, multiple speakers
Heavily accented speech	75-90%	Varies by accent and training data
Domain-specific content	80-95%	Technical terminology, proper nouns

Human vs AI accuracy: setting realistic expectations

When evaluating speech-to-text models, human transcription is the ultimate benchmark. Professional transcriptionists achieve near-perfect accuracy in optimal conditions—they bring a lifetime of contextual knowledge that lets them decipher mumbled words, navigate cross-talk, and interpret heavy background noise.

Modern Voice AI models have closed this gap dramatically. Universal-3.5 Pro is engineered for high accuracy on challenging audio—getting names, account numbers, medical terms, and accented speech right where other models approximate.

Attribute	Human transcription	AI transcription
Accuracy (optimal conditions)	Near-perfect	Near-human with leading models
Speed	Slower than real-time	Real-time or faster
Scale	Limited by workforce	Virtually unlimited
Cost	Higher per hour	Significantly lower per hour
Contextual understanding	Excellent	Improving rapidly
Consistency	Varies by transcriptionist	Highly consistent

This distinction becomes critical for real-time applications. If you're building a voice agent, you can't wait for a human—you need immediate, highly accurate speech understanding so your LLM responds to what was actually said. That's where Universal-3.5 Pro Realtime comes in: you can integrate our high-accuracy streaming STT with third-party LLM and text-to-speech services to build a complete pipeline, or use the Voice Agent API, which now runs on Universal-3.5 Pro Realtime as its speech foundation. Getting the STT input right is critical—if the transcription is wrong, the entire downstream agent responds incorrectly.

Current accuracy landscape and realistic expectations

Industry-standard datasets

Most accuracy claims reference performance on standardized datasets:

LibriSpeech: Clean, read speech from audiobooks. Models typically achieve 95%+ accuracy here, but it doesn't reflect real-world conditions.

Earnings21: Long-form earnings-call audio with real business terminology, proper nouns, and financial jargon—much closer to production conditions than read speech.

Common Voice: More diverse speakers and accents. Accuracy is generally 5-10 percentage points lower than LibriSpeech.

Switchboard: Conversational telephone speech, significantly harder due to cross-talk, hesitations, and informal language.

Understanding these datasets matters when interpreting claims. A model that performs well on LibriSpeech may struggle with your contact center audio—and vice versa. Always test any model against audio that represents your actual use case.

Factors that impact speech-to-text accuracy

Audio quality factors

Microphone quality: Higher-quality mics capture clearer signals. Built-in laptop mics typically produce lower accuracy than dedicated USB mics or headsets.
Background noise: Even moderate noise from traffic, air conditioning, or office chatter causes errors—especially for quieter speakers.
Audio compression: Heavily compressed formats like low-bitrate MP3s introduce artifacts that confuse models.
Recording environment: Hard surfaces create echo and reverberation; soft furnishings absorb sound and improve clarity.

Speaker-related factors

Accent and dialect: Models trained on limited accent data may struggle with regional variation, though modern systems handle diverse accents far better than earlier generations.
Speaking pace: Very fast or very slow speech reduces accuracy. Most systems perform best at natural, conversational speeds.
Pronunciation clarity: Mumbling or slurred speech hurts accuracy regardless of model quality.
Voice characteristics: Pitch, tone, and speech patterns affect how easily a model processes a given voice.

Content and context factors

Vocabulary complexity: Simple conversational language scores higher than technical jargon or specialized terminology.
Proper nouns: Names of people, companies, and places often cause errors—especially if they're outside the model's training vocabulary.
Numbers and dates: Disambiguating "fifteen" vs "50" or date formats requires context models don't always have.
Language mixing: Code-switching between languages within a conversation reduces accuracy for most models—though, as the benchmark above shows, this is exactly where Universal-3.5 Pro's native code-switching across 18 languages pulls ahead.

Industry applications and accuracy requirements

Different use cases have different accuracy requirements based on their tolerance for errors and the cost of mistakes. The right metric to measure matters as much as the target number.

Use case	Primary metric	Why
Voice agents	Semantic WER, critical-word accuracy	Transcripts pass to LLMs; meaning > exact words; deletions cause turn hangs
Medical / Legal	Keyword WER (KW_WER), missed-entity rate	Drug names and clinical terms matter more than filler words
Meeting transcription	Traditional WER (normalized) + cpWER	Humans read the output directly; speaker attribution matters
Contact center	WER + sentiment accuracy	Word accuracy and tone classification both drive downstream analytics

Contact centers and customer service

Accuracy requirement: 90%+ for automated systems, 85%+ for agent assistance.

Contact centers processing thousands of calls daily need high accuracy for sentiment analysis, compliance monitoring, and automated responses. A McKinsey report found that deploying speech analytics can lift customer satisfaction scores by 10 percent or more and cut operational costs 20 to 30 percent. Entity accuracy is decisive here—when HotelPlanner switched to AssemblyAI, it cited our accuracy on spelled-out data like credit-card and phone numbers and saw a 10% lift in checkout conversion. See more in our contact center solutions.

Meeting transcription and note-taking

Accuracy requirement: 88%+ for readable transcripts, 92%+ for searchable archives.

Meeting tools must balance accuracy with real-time performance, and multi-speaker attribution (cpWER) matters as much as raw word accuracy. McKinsey research notes automated transcription can accelerate analysis nearly 400 percent versus traditional methods.

Voice assistants and commands

Accuracy requirement: 95%+ for critical commands, 90%+ for general queries.

Voice assistants need extremely high accuracy for important actions like purchases or messages. For agents that pass transcripts to LLMs, Semantic WER is often the better metric—meaning preservation matters more than word-level perfection, and a missed word that causes a turn hang is far more damaging than a minor substitution.

Legal and medical transcription

Accuracy requirement: 98%+ due to regulatory and safety requirements.

High-stakes domains need near-perfect accuracy because errors carry legal or medical consequences. One study found one in every 250 words in an AI-assisted clinical document contained a clinically significant error. Medical teams increasingly rely on keyword WER and missed-entity rate alongside traditional WER, and specialized models help: AssemblyAI's Medical Mode is purpose-built for medical terminology, and contextual prompting with prior-visit notes cut missed medical terms by 31% in internal testing. For teams with compliance requirements, AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum (BAA), available to sign without a sales call.

Improving speech-to-text accuracy in your applications

Pre-processing optimization

Audio enhancement: Reduce background noise, normalize volume, and filter out artifacts before transcription. With Universal-3.5 Pro Realtime, voice_focus handles primary-speaker isolation for you.

Format optimization: Use uncompressed or lightly compressed audio when possible. WAV files typically beat heavily compressed MP3s.

Segmentation: Break long files into smaller segments to improve processing and accuracy, particularly for batch transcription.

Implementation best practices

Keyterm prompting: Provide domain-specific terms—product names, acronyms, proper nouns—to improve recognition. Example: Terms: Nguyen, Beauchamp, GERD, AFib. Before: "afib/gerd" → After: "AFib/GERD."
Context carryover and agent context: On Universal-3.5 Pro Realtime, rolling conversation memory is on by default, and passing agent_context cut WER by 10.2% across 20,000 voice-agent files.
Contextual prompting: Prime the model with the audio's domain, formatting preferences, and prior context—no retraining required.
Confidence scoring: Use per-word confidence scores to flag potentially inaccurate transcriptions for human review or extra processing.

Confidence scoring and accuracy monitoring

No model is perfect, so confidence scores help you handle the inevitable errors. For each word, a model can provide a confidence score—typically between 0.0 and 1.0—representing its certainty. You can flag low-confidence words below a threshold (say 0.85) in your UI, route low-confidence transcripts to a human-in-the-loop workflow (critical for high-stakes applications like JusticeText's legal-evidence review), and monitor which audio consistently produces low confidence to spot audio-quality or custom-vocabulary opportunities.

Measuring and monitoring accuracy in production

Establish baselines: Test with representative audio to set accuracy baselines for your specific use case.
Choose the right metric: Traditional WER when humans read the output; Semantic WER for voice agents and LLM pipelines; cpWER for multi-speaker audio.
Track confidence distributions: Shifting patterns over time may indicate audio-quality changes or model drift.
Integrate user feedback: Collect corrections to understand where your system struggles most.
A/B test: Compare models, settings, or preprocessing with controlled tests on identical audio.

The future of speech-to-text accuracy

The most interesting shift in speech-to-text accuracy isn't a bigger number on a clean dataset—it's that the model is stopping being a passive listener. Universal-3.5 Pro already reads the agent's question, carries context across turns, and isolates the speaker it's supposed to hear. The next gains won't come from squeezing another point off LibriSpeech; they'll come from models that understand the situation they're transcribing. As transcripts feed directly into LLMs and agents, evaluation is moving toward meaning preservation (Semantic WER) and task success rather than word-level matching—and the accuracy that wins will be the accuracy measured on your audio, in your conditions, doing your job.

Ready to Test Accuracy With Your Own Audio?

See how Universal-3.5 Pro handles your specific use case, or explore the full head-to-head numbers on our benchmarks page. Sign up free to get started.

Frequently asked questions

How accurate is AssemblyAI speech-to-text?

AssemblyAI's flagship model, Universal-3.5 Pro, reaches 95-98% word accuracy on clean audio and leads competing models on harder real-world conditions—7.69% average normalized WER on code-switched audio and 30.17 average cpWER on multi-speaker diarization, both the best in our published benchmarks. Its realtime counterpart posts a 6.99% pooled WER on Pipecat's open agent-conversation benchmark. Actual accuracy depends on your audio, so the best approach is to test on a file that represents your use case.

What's the most accurate speech-to-text model in 2026?

On the hardest real-world conditions—multilingual, code-switched, and multi-speaker audio—Universal-3.5 Pro is the most accurate model we've measured, beating ElevenLabs Scribe v2, Deepgram Nova-3, and OpenAI GPT-4o Transcribe on code-switching WER, and leading realtime WER against Deepgram Flux, ElevenLabs Scribe v2, and Google Chirp3. No single model wins every scenario, so the "most accurate" model for you is whichever one performs best on your specific audio.

How is speech-to-text accuracy measured (WER)?

Accuracy is most commonly measured with Word Error Rate (WER), calculated as (substitutions + insertions + deletions) divided by the total words in the reference transcript, times 100. Accuracy is 100% minus WER. For AI-native applications, teams increasingly supplement WER with Semantic WER (meaning preservation), cpWER (speaker attribution), and keyword WER (critical domain terms), because raw WER penalizes harmless formatting differences and treats every error the same regardless of impact.

How accurate is speech-to-text on accents and noisy audio?

Accuracy on accented and noisy audio typically ranges from about 70% to 90%, well below the 95%+ you'll see on clean studio recordings. Universal-3.5 Pro narrows that gap with features built for messy conditions: voice_focus isolates the primary speaker in noisy rooms and drive-thrus, context carryover uses earlier turns to disambiguate later ones, and contextual prompting primes the model with domain vocabulary. The only reliable way to know your number is to test the model on audio that matches your real environment.

How accurate is multilingual or code-switched transcription?

Code-switched transcription is one of the toughest tasks in speech-to-text, and accuracy varies wildly by model—from Universal-3.5 Pro's 7.69% average normalized WER to over 44% for OpenAI GPT-4o Transcribe on the same benchmark. Universal-3.5 Pro handles native code-switching across 18 languages, including mid-sentence switches like Hinglish, with no configuration required. If your audio involves any language mixing, the model you choose matters far more than a clean-English score would suggest.

Is Universal-3.5 Pro more accurate than Universal-3 Pro?

Yes. Universal-3.5 Pro is our new flagship and improves on Universal-3 Pro across the board—on code-switching, it scores 7.69% average normalized WER versus 9.07% for Universal-3 Pro, and it expands language coverage from 6 to 18 languages while adding context carryover, agent context, and a more accurate joint diarization model. Universal-3 Pro remains available as a pinnable snapshot, but Universal-3.5 Pro is the default and the more accurate choice for new projects.

‍