Speech-to-Text Evals

Is your WER benchmark lying to you?

The right response isn't to throw out metric-based evaluation. It's to make the evaluation more accurate. That means better truth files, smarter normalization, and a clearer understanding of what WER is and isn't measuring.
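To make the normalization point concrete, here is a minimal sketch of a WER calculation with and without text normalization. The `normalize` and `wer` helpers are illustrative names written for this example, not part of any AssemblyAI tooling, and the normalization rule (lowercase, strip punctuation, keep apostrophes) is an assumption; real evaluations typically also handle numbers, spellings, and abbreviations.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), and split into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)


# Illustrative truth file line and model output with identical spoken content.
truth = "I take Ramipril. Okay. And I take Metformin."
output = "i take ramipril okay and i take metformin"

raw_score = wer(truth.split(), output.split())
normalized_score = wer(normalize(truth), normalize(output))
print(f"raw WER: {raw_score:.2f}, normalized WER: {normalized_score:.2f}")
```

Here the two strings contain the same words, yet the raw comparison scores 0.75 because casing and punctuation differ on six of the eight words. After normalization the score drops to 0.00, which is the kind of gap a benchmark can hide or exaggerate depending on how the truth file was prepared.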

Try our newest models live

Today’s top Voice AI companies rely on AssemblyAI’s speech-to-text and speech understanding models to launch groundbreaking products fast and scale with ease.


Try stating information like names, dates, and addresses, along with technical data like codes, commands, formulas, and special formatting to see how our model performs...

Universal-3 Pro Streaming
Clinical evaluation history:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?
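The difference between these two transcripts can be listed mechanically. The sketch below uses Python's standard `difflib` to find the words that appear only in the verbatim version; the `words` helper and its normalization rule are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase and split into words, dropping punctuation but keeping apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


clean = words(
    "Do you and Quentin still socialize when you come to Los Angeles, "
    "or is it like he's so used to having you here? "
    "No, no, no, we're friends. What do you do with him?"
)
verbatim = words(
    "Do you and Quentin still socialize, uh, when you come to Los Angeles, "
    "or is it like he's so used to having you here? "
    "No, no, no, we, we, we're friends. What do you do with him?"
)

# Collect words present in the verbatim transcript with no counterpart in the clean one.
inserted = []
matcher = SequenceMatcher(None, clean, verbatim, autojunk=False)
for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
    if tag == "insert":
        inserted.extend(verbatim[j1:j2])

print(inserted)
print(f"{len(inserted)} extra words against {len(clean)} clean-transcript words")
```

The only differences are "uh" and two repetitions of "we". If a truth file omitted those disfluencies, a WER script would count all three as insertion errors, roughly 3 of 34 words, which is why the truth file's transcription style needs to match what the model was asked to produce.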

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
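Role tags in the transcript text can be turned into structured turns with a small parser. This is a minimal sketch that assumes the raw output follows the `[Speaker:ROLE]` format requested in the prompt's example; the rendering shown above is formatted differently, so the pattern may need adjusting for real output. `parse_turns` is a hypothetical helper name.

```python
import re

# Matches a [Speaker:ROLE] tag and the text that follows it, up to the next tag or the end.
TURN_PATTERN = re.compile(r"\[Speaker:([^\]]+)\]\s*(.*?)(?=\[Speaker:|$)", re.DOTALL)


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Return (role, utterance) pairs in the order they appear in the transcript."""
    return [(role.strip(), text.strip()) for role, text in TURN_PATTERN.findall(transcript)]


# The example format given in the prompt.
example = (
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
turns = parse_turns(example)
for role, text in turns:
    print(f"{role}: {text}")
```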

Spanish and English audio:
"language_detection": True
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").
Without code-switching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
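The demos above show option names (`prompt`, `keyterms_prompt`, `language_detection`) but not how they are sent. As a rough sketch, assuming they travel as top-level fields of a JSON request body, the options could be assembled as below. The `build_request_body` helper, the `audio_url` field, and the example URLs are hypothetical and used only for illustration; check the API documentation for the actual request shape and for which options can be combined.

```python
import json
from typing import Optional


def build_request_body(
    audio_url: str,
    prompt: Optional[str] = None,
    keyterms_prompt: Optional[list[str]] = None,
    language_detection: Optional[bool] = None,
) -> dict:
    """Assemble a request body, leaving out any option that was not set."""
    body: dict = {"audio_url": audio_url}  # assumed field name for the audio location
    if prompt is not None:
        body["prompt"] = prompt
    if keyterms_prompt is not None:
        body["keyterms_prompt"] = keyterms_prompt
    if language_detection is not None:
        body["language_detection"] = language_detection
    return body


# Proper noun spelling demo: only the key terms list is set.
keyterms_body = build_request_body(
    "https://example.com/voicemail.wav",  # placeholder URL
    keyterms_prompt=["Kelly Byrne-Donoghue"],
)

# Code-switching demo: a prompt plus language detection.
code_switch_body = build_request_body(
    "https://example.com/interview.wav",  # placeholder URL
    prompt="Preserve natural code-switching between English and Spanish.",
    language_detection=True,
)

print(json.dumps(keyterms_body, indent=2))
print(json.dumps(code_switch_body, indent=2))
```

Note that Python's `True` is serialized as JSON `true`, so the `True` shown in the demo snippet reads as Python-style notation.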

Everything you need to build voice apps that outpace the competition

Insanely accurate & fast transcription

Best-in-class WER and low latency on real-world audio and live streams.

Real-time diarization

Identify and label speakers on all audio files.

Formatted transcripts

Directly integrate with LLMs for summarization, topic extraction, or CS workflows.

Secure & compliant

Guardrails including PII redaction and profanity filtering. Build HIPAA-ready apps with BAA, SOC 2.

Usage-based pricing

Pay-as-you-go from $0.15/hr with no commitments or minimums.

Get started for free

Speech recognition that understands context

Accuracy is more than just the right words—it’s trust in your data. Our speech recognition API lets users spend less time filling in the gaps and more time putting insight into action.
The industry’s highest Word Accuracy Rate

| Model | Overall Word Accuracy Rate | Alphanumerics Missed (lower is better) | Medical Missed (lower is better) |
| --- | --- | --- | --- |
| AssemblyAI Universal-3 Pro | 94.07% | 7.5% | 13.61% |
| Deepgram Nova-3 | 92.01% | 18.69% | 16.95% |
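Accuracy percentages that sit close together can hide a larger relative gap in errors. The short calculation below converts the two overall accuracy figures into error rates and compares them. It assumes that the error rate is simply 100 minus the Word Accuracy Rate, which this page does not state explicitly.

```python
# Overall Word Accuracy Rate figures from the comparison above, in percent.
accuracy = {"Universal-3 Pro": 94.07, "Nova-3": 92.01}

# Assumption: error rate (%) = 100 - word accuracy rate (%).
error_rate = {model: 100.0 - value for model, value in accuracy.items()}

# Absolute gap in percentage points, and the same gap relative to the higher error rate.
gap_points = error_rate["Nova-3"] - error_rate["Universal-3 Pro"]
relative_reduction = gap_points / error_rate["Nova-3"]

for model, rate in error_rate.items():
    print(f"{model}: {rate:.2f}% of words wrong")
print(f"gap: {gap_points:.2f} points, {relative_reduction:.1%} fewer errors")
```

Under that assumption, a gap of about two accuracy points corresponds to roughly a quarter fewer word errors, which is the more useful number when estimating how much manual review a transcript will need.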

Quality that speaks for itself.

AssemblyAI’s developer-first API lets you start testing in under a minute. Join 200k+ developers building next-generation Voice AI apps. Transcribe speech-to-text, identify and label speakers, redact sensitive info, and integrate with LLMs. All in one stack.

If you have an hour of content, the difference between 99% accuracy and 97% accuracy, it's a lot of time for that person to review. So you could cut down their workflow from taking half an hour, taking 20 minutes, taking 15 minutes– it's huge, right?
Joshua Grossberg
CTO, Kapwing
The 30-40% reduction in speech-to-text errors has significantly improved our production efficiency and client satisfaction. We've achieved industry-leading word error rates for non-English audio, which is critical for serving our enterprise clients.
Ebru Yildirim
Founder and CEO, Ollang
The cost saving is literally the difference between being profitable or not for us, but beyond the economics, AssemblyAI gave us something invaluable: peace of mind. We can focus on building our product instead of worrying about infrastructure limits.
Mark Barbir
CEO, Earmark
On 10 out of 10 onboarding calls, our customers are at some point telling us 'wow that insight was crisp'—and that's because of the accuracy we're getting from AssemblyAI.


Jake Cronin
Co-founder and CEO, Siro

Learn to actually evaluate speech-to-text models

Evaluation Docs

Review our comprehensive getting started guide and technical documentation for detailed implementation information.

Read the docs

AssemblyAI Playground

Use our no-code Playground to test how each model performs on your own audio and use cases.

Try our Playground

Pricing

At just $0.15 per hour, builders can ship voice agents that feel more natural, finish tasks successfully, and scale without surprise fees.

Get our pricing

Get started for free

Get $50 in free credits and production-ready Voice AI infrastructure from day one.