customers
All customer stories
Top Voice AI companies are building with Assembly.
resources
Latest Release
Voice Agent API
Voice agents that get it right, respond instantly, and ship the same day with our new Voice Agent API
resources
See how our models stack up on accuracy, entity recognition, diarization, multilingual support, latency, and cost.
WER is calculated as (substitutions + insertions + deletions) / total words in the reference transcript. It is the standard metric for evaluating speech-to-text accuracy.
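As a rough illustration of the formula, the sketch below computes WER with a word-level edit distance. The function name `word_error_rate` is illustrative and not part of any SDK. It assumes both transcripts have already been normalized and split on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of nine reference words.
reference = "and then i take ramipril for the blood pressure"
hypothesis = "and then i take olanopril for the blood pressure"
print(f"{word_error_rate(reference, hypothesis):.2%}")
```

Because insertions count as errors, WER is not capped at 100%.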
Average normalized WER across selected datasets
*lower is better*
| Dataset | AssemblyAI Universal-3 Pro | ElevenLabs Scribe V2 | Mistral Voxtral Mini | OpenAI GPT-4o Transcribe | Cohere Transcribe | Qwen3 ASR | Gladia | Deepgram Nova-3 | Azure Batch | Grok | Soniox |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Synthetic medical | 0.25% | 0.41% | 1.25% | 0.55% | 1.33% | 1.15% | 1.23% | 0.51% | 1.55% | 1.38% | 0.72% |
| Accented English (India) | 5.62% | 5.92% | 6.44% | 6.49% | 6.39% | 6.61% | 6.78% | 7.77% | 8.04% | 6.60% | 53.48% |
| General speech | 6.29% | 7.19% | 6.64% | 7.45% | 8.35% | 7.01% | 11.08% | 8.81% | 8.42% | 9.76% | 7.55% |
| Webinar speech | 5.86% | 9.96% | 6.65% | 6.87% | 5.91% | 8.85% | 6.79% | 9.55% | 10.07% | 10.90% | 7.79% |
| Average | 4.50% | 5.87% | 5.24% | 5.34% | 5.50% | 5.91% | 6.47% | 6.66% | 7.02% | 7.16% | 17.39% |
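The Average row appears to be the unweighted mean of the four per-dataset WERs. A quick check under that assumption, using two columns copied from the table above (the helper name `macro_average` is illustrative, not part of any benchmark tooling):

```python
from statistics import mean

# Per-dataset WER (%) copied from the table, in row order: synthetic medical,
# accented English (India), general speech, webinar speech.
per_dataset_wer = {
    "AssemblyAI Universal-3 Pro": [0.25, 5.62, 6.29, 5.86],
    "Soniox": [0.72, 53.48, 7.55, 7.79],
}


def macro_average(values):
    """Unweighted mean: every dataset counts equally, regardless of its size."""
    return mean(values)


for provider, values in per_dataset_wer.items():
    print(f"{provider}: {macro_average(values):.3f}%")
```

Both results agree with the published averages (4.50% and 17.39%) to within rounding. An unweighted mean also lets a single outlier dataset dominate, as the 53.48% accented-English result does for Soniox.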
Missed entity rate measures errors on named entity categories including names, organizations, locations, medical terms, money, occupations, temporal expressions, and URLs. Unlike aggregate WER, it isolates the tokens that carry the most semantic weight in downstream applications.
Missed entity rate by provider
*lower is better*
| Entity type | AssemblyAI Universal-3 Pro | Mistral Voxtral Mini | AssemblyAI Universal-2 | OpenAI GPT-4o Transcribe | ElevenLabs Scribe V2 | Deepgram Nova-3 | NVIDIA Canary 1B |
|---|---|---|---|---|---|---|---|
| Name | 23.63% | 25.89% | 25.37% | 29.23% | 20.87% | 25.58% | 36.08% |
| Organization | 17.02% | 20.80% | 20.96% | 23.40% | 16.57% | 19.25% | 31.60% |
| Location | 8.61% | 9.98% | 12.40% | 12.21% | 11.54% | 13.57% | 24.27% |
| Medical | 13.15% | 19.78% | 18.43% | 16.07% | 10.80% | 15.69% | 31.07% |
| Money | 43.56% | 39.22% | 30.54% | 37.30% | 76.88% | 76.82% | 80.40% |
| Occupation | 9.03% | 9.43% | 9.86% | 13.01% | 9.12% | 9.67% | 10.59% |
| Temporal | 8.17% | 6.21% | 9.82% | 13.78% | 20.42% | 19.42% | 29.70% |
| Url | 50.49% | 65.05% | 71.84% | 65.53% | 47.09% | 54.79% | 98.06% |
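The page does not spell out how a miss is scored. The sketch below assumes an entity counts as missed when its normalized surface form does not appear in the normalized hypothesis. The names `missed_entity_rate` and `_norm` are illustrative, and the entity annotations are hand-written for the medication example on this page.

```python
import re
from collections import defaultdict


def _norm(text: str) -> str:
    """Lowercase and replace every non-alphanumeric run with a single space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def missed_entity_rate(entities, hypothesis):
    """entities: (category, surface_form) pairs taken from the reference annotation.

    Assumption: an entity is missed if its normalized form is absent
    from the normalized hypothesis (whole-word match).
    """
    hyp = f" {_norm(hypothesis)} "
    total = defaultdict(int)
    missed = defaultdict(int)
    for category, surface in entities:
        total[category] += 1
        if f" {_norm(surface)} " not in hyp:
            missed[category] += 1
    return {category: missed[category] / total[category] for category in total}


entities = [("Medical", "Lipitor"), ("Medical", "Ramipril")]
competitor = (
    "I take this medication called like, it's like a Lipitor. "
    "And then I take olanopril for the blood pressure."
)
print(missed_entity_rate(entities, competitor))  # one of two medical terms missed
```

Scoring per category is what lets a transcript with a low aggregate WER still show a high miss rate on the words that matter most downstream.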
Diarization segments audio by speaker. We report cpWER (concatenated minimum-permutation word error rate), which jointly evaluates transcription and speaker assignment.
Average cpWER across DiPCo and NOTSOFAR
*lower is better*
| Provider | Average cpWER |
|---|---|
| AssemblyAI Universal-3 Pro | 33.34% |
| Deepgram | 43.21% |
| Gladia | 44.04% |
| Speechmatics | 46.11% |
| Grok | 46.23% |
| (unnamed in source) | 57.14% |
| Soniox | 112.28% |
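A minimal sketch of cpWER, assuming the usual definition: concatenate each speaker's words, score every one-to-one mapping between reference and hypothesis speakers, and keep the mapping with the fewest word errors. Brute-force permutation is only practical for a handful of speakers; larger speaker counts call for an assignment algorithm such as the Hungarian method. Function names are illustrative.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Both arguments map a speaker label to that speaker's list of utterances."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]

    # Pad the shorter side with empty speakers so every speaker gets a partner.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference must contain at least one word")

    # Try every speaker mapping and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


reference = {
    "Chris": ["that's also a great idea"],
    "Jenny": ["yes i can take care of that"],
}
relabelled = {"A": ["yes i can take care of that"], "B": ["that's also a great idea"]}
collapsed = {"0": ["that's also a great idea", "yes i can take care of that"]}

print(cp_wer(reference, relabelled))  # labels differ but the turns are intact
print(cp_wer(reference, collapsed))   # every word is correct, yet speakers are merged
```

The collapsed case shows why cpWER penalizes diarization failures even when the words are right, and why the score can exceed 100% once insertions pile up.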
Global WER across the evaluated multilingual benchmark suite, with language-level breakdowns where available.
Global multilingual WER
*lower is better*
| Provider | Global WER | Evaluated languages | German | Spanish | French | Italian | Portuguese |
|---|---|---|---|---|---|---|---|
| Speechmatics Enhanced | 8.22% | 5/20 | 10.30% | 5.84% | 10.09% | 7.88% | 7.01% |
| AssemblyAI Universal-3 Pro | 8.23% | 5/20 | 9.19% | 6.21% | 10.63% | 7.79% | 7.34% |
| OpenAI GPT-4o Transcribe | 9.52% | 15/20 | — | — | — | — | — |
| Mistral Voxtral Mini | 10.50% | 10/20 | — | — | — | — | — |
| ElevenLabs Scribe V2 | 11.13% | 19/20 | — | — | — | — | — |
| Cohere Transcribe | 13.36% | — | — | — | — | — | — |
| OpenAI Whisper-1 | 14.39% | — | — | — | — | — | — |
| Deepgram Nova-3 | 15.71% | — | — | — | — | — | — |
Code-switching benchmarks test Common Voice-derived English paired with German, Spanish, French, Italian, and Portuguese.
Average code-switching WER
*lower is better*
AssemblyAI Universal-3 Pro
ElevenLabs Scribe V2
Soniox
Deepgram Nova-3
Qwen3 ASR
Mistral Voxtral Mini
VibeVoice
Cohere Transcribe
Speechmatics
Grok
OpenAI GPT-4o Transcribe
Gladia
AWS Transcribe
Azure Batch
Gemini
WER by language pair and condition
*lower is better*
Choose a metric to see how each provider's accuracy compares at their price point. Bottom-left is best.
Pricing from Artificial Analysis.
The examples below compare the reference transcript with each provider's output side by side.
Medication name changes
Ramipril becomes a different medication, which can change the patient record.

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    config = aai.TranscriptionConfig(
        speech_model="universal-3-pro",
    )
    transcript = aai.Transcriber().transcribe(
        "patient-intake.mp3", config=config
    )
    print(transcript.text)

Truth
I take this medication called, I think it's like a Lipitor. And then I take Ramipril for the blood pressure.
AssemblyAI
I take this medication called, like, it's like a Lipitor. And then I take, um, Ramipril for the blood pressure.
Competitor
I take this medication called like, it's like a Lipitor. And then I take olanopril for the blood pressure.
Email address changes
jonjay@freemail.com becomes variants that would bounce or route to the wrong inbox.

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    config = aai.TranscriptionConfig(
        speech_model="universal-3-pro",
        format_text=True,
    )
    transcript = aai.Transcriber().transcribe(
        "support-call.mp3", config=config
    )
    print(transcript.text)

Truth
Just use my personal email. It's jonjay@freemail.com. That's J-O-N-J-A-Y@freemail.com. So jonjay@freemail.com.
AssemblyAI
Just use my personal email. It's jonjay@freemail.com. That's J-O-N-J-A-Y@freemail.com. So jonjay@freemail.com.
Competitor
Just use my personal email. It's johnjay@female.com. That's jonjay@freemail.com. So johnj@freemail.com.
Multiple speakers collapse into one label
Distinct turns from Chris, Sally, Donald, and Jenny are merged under speaker 0.

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    config = aai.TranscriptionConfig(
        speech_model="universal-3-pro",
        speaker_labels=True,
    )
    transcript = aai.Transcriber().transcribe(
        "meeting.mp3", config=config
    )
    # Print one line per diarized utterance.
    for u in transcript.utterances:
        print(f"Speaker {u.speaker}: {u.text}")

Truth
Chris: yeah
Sally: mmm
Donald: i was thinking in about uh four months they're having the county fair perhaps you can set a booth up at the county fair
Chris: that's also a great idea um we would have to coordinate that with the city would anyone um be willing to take care of that
Jenny: yes i can take care of that and talk to the city coordinate it with them i think it's a great idea that we have it at the fair like uh donald mentioned
Chris: um ok so uh what else uh does anybody have any ideas of what
AssemblyAI
A: Yeah, I was thinking about, uh, 4 months they're having the county fair. Perhaps we could set a booth up at the county fair.
B: That's also a great idea. Um, we would have to coordinate that with the city. Would anyone be willing to take care of that?
C: Yes, I can take care of that and talk to the city, coordinate it with them. I think it's a great idea that we have it at the fair like Donald mentioned.
B: Um, okay, so, uh, what else? Uh, does anybody have any ideas of what
Competitor
Deepgram diarization
0: Yeah. Mhmm. I was thinking that in about four months, they're having the county fair. Perhaps you could set a booth up at the county fair. That's also a great idea. We would have to coordinate that with the city. Would anyone be willing to take care of that? Yes. I can take care of that and talk to the city, coordinate it with them. I think it's a great idea that we have it at the fair, like Donald mentioned.
0: Okay. So what else? Does anybody have any ideas of what
A band name changes meaning
NOFX becomes No Effect, turning a name into a phrase.

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    config = aai.TranscriptionConfig(
        speech_model="universal-3-pro",
        language_detection=True,
    )
    transcript = aai.Transcriber().transcribe(
        "interview.mp3", config=config
    )
    print(transcript.text)

Truth
Pennywise, NOFX, MC Red, Bad Religion, são as minhas favoritas.
AssemblyAI
Pennywise, No FX, MC Red, Bad Religion, são as minhas favoritas.
Competitor
ElevenLabs Scribe V2
Pennywise, No Effect, MC Red, Bad Religion, são as minhas favoritas.
Airline name changes
Nigeria Airways becomes Ligera Airways inside a Spanish sentence.

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    config = aai.TranscriptionConfig(
        speech_model="universal-3-pro",
        language_detection=True,
    )
    transcript = aai.Transcriber().transcribe(
        "travel-show.mp3", config=config
    )
    print(transcript.text)

Truth
La aerolínea vino a reemplazar a Nigeria Airways.
AssemblyAI
La aerolínea vino a reemplazar a Nigeria Airways.
Competitor
ElevenLabs Scribe V2
La aerolínea vino a reemplazar a Ligera Airways.
Test on your own audio. Want to do it right? Read our benchmarking guide in the docs.
WER is calculated as (substitutions + insertions + deletions) / total words in the reference transcript. Streaming models transcribe with less context than pre-recorded. Accuracy evaluated on final transcripts.
Average WER
*lower is better*
AssemblyAI Universal-3 Pro Streaming
Deepgram Flux
AssemblyAI Universal Streaming
Deepgram Nova 3
Cartesia
Deepgram Nova 3 Multi
Semantic WER applies text normalization before scoring: equivalent forms like "fifteen" vs "15" or "Dr." vs "Doctor" are treated as matches. This is the key metric for voice agent pipelines, where meaning matters more than surface formatting.
Semantic WER
*lower is better*
Azure STT
Soniox
AssemblyAI Universal-3 Pro Streaming
Deepgram Nova 3
AWS Transcribe
Google STT
AssemblyAI Universal Streaming
OpenAI GPT-4o Transcribe
ElevenLabs Scribe V2
Source: Pipecat STT Benchmark.
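To make the idea of normalization before scoring concrete, here is a toy sketch. It is not the normalizer used by the benchmark; the `EQUIVALENTS` table and the `normalize` function are illustrative and cover only the two examples mentioned above.

```python
import re

# Toy equivalence table; a real normalizer covers far more forms.
EQUIVALENTS = {"fifteen": "15", "dr": "doctor"}


def normalize(text):
    """Lowercase, strip punctuation, then map equivalent forms to one spelling."""
    words = re.sub(r"[^a-z0-9\s]", "", text.lower()).split()
    return [EQUIVALENTS.get(word, word) for word in words]


spoken_form = "Dr. Smith will see you in fifteen minutes."
written_form = "Doctor Smith will see you in 15 minutes"

# After normalization the two transcripts are word-for-word identical,
# so neither "Dr." nor "fifteen" would be counted as an error.
print(normalize(spoken_form) == normalize(written_form))
```

Scoring the normalized word lists instead of the raw strings is what keeps formatting choices from being counted as recognition errors.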
Missed entity rate measures WER on specific entity types during real-time transcription: proper nouns, alphanumerics, email addresses, and postal codes.
Missed entity rate (proper nouns, alphanumerics, emails, addresses)
*lower is better*
AssemblyAI Universal-3 Pro Streaming
Deepgram Nova 3
AssemblyAI Universal Streaming
Deepgram Nova 3 Multi
OpenAI GPT-4o Transcribe
Medical missed entity rate (normalized average)
*lower is better*
AssemblyAI Universal-3 Pro Streaming
Deepgram Nova 3
Azure
AssemblyAI Universal Streaming
Deepgram Nova 3 Multi
OpenAI GPT-4o Transcribe
Latency is measured as the interval from when a speaker stops talking to when the final transcript segment arrives. This is the latency that determines how responsive a voice agent feels in a turn-based conversation. The reported figure is the median across all test utterances.
Median TTCT (ms)
*lower is better*
Deepgram Nova 3
Soniox
AssemblyAI Universal Streaming
Cartesia Ink
ElevenLabs Scribe V2
AssemblyAI Universal-3 Pro Streaming
Speechmatics
OpenAI GPT-4o Transcribe
Google STT
Azure STT
AWS Transcribe
Source: Pipecat STT Benchmark (time to final segment). For production voice agents, P95 latency is often more telling than the median; Universal-3 Pro Streaming measures 534 ms at P95.
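A small sketch of why the two statistics can diverge, using made-up latencies rather than benchmark data. The function name `latency_summary` is illustrative, and the P95 here uses the nearest-rank method, which is one of several common percentile definitions.

```python
from statistics import median


def latency_summary(latencies_ms):
    """Return (median, P95) for a list of per-utterance latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank P95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return median(ordered), ordered[rank - 1]


# Hypothetical samples: nine fast turns and one slow one.
samples_ms = [210, 220, 230, 240, 250, 260, 270, 280, 290, 900]
mid, p95 = latency_summary(samples_ms)
print(f"median={mid} ms, p95={p95} ms")
```

The median hides the single slow turn, while P95 surfaces it, which is why tail latency tends to matter more for how a conversation feels.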
Each audio file was sent to every provider's production API using default settings. No custom model tuning or prompt engineering was applied. All providers were tested on identical audio files under the same conditions.
Transcription outputs were normalized using the OpenAI Whisper text normalizer before WER computation. This removes formatting differences (casing, punctuation, number formats) so that accuracy comparisons reflect transcription quality, not output styling.
We evaluate across benchmark datasets spanning general speech, entity recognition, diarization, multilingual audio, and code-switching.
Word error rate: Selected speech sets covering synthetic medical, accented English, general speech, and webinar audio.
Missed entity rate: PrivateAI Named Entities.
Multilingual: Common Voice, FLEURS, and VoxPopuli multilingual speech datasets.
Code-switching: Common Voice-derived English paired with German, Spanish, French, Italian, and Portuguese.
Last updated June 8, 2026.