Speech-to-text models built for Voice AI apps

Multilingual speech-to-text with speaker diarization, real-time transcription, and LLM integrations out of the box. Best-in-class word error rate, with pricing from $0.15/hr.

Start building free

interview.mp3 Done · 12:30

Speaker A 0:04

Welcome back to the show—today we're joined by the team behind Northwind.

Speaker B 0:11

Thanks for having me. We process about 2 million minutes of audio a month.

Speaker A 0:19

That's a lot. Let's get into how you keep accuracy high at that scale.

99 languages · Speaker labels · LLM-ready

Accurate and fast

Best-in-class word error rate and low latency on real-world audio and live streams, across 99 languages.

Real-time diarization

Identify and label speakers in any audio, whether streaming or pre-recorded.

Formatted for LLMs

Pipe clean, formatted transcripts straight into LLMs for summarization, topic extraction, and downstream workflows.
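As a rough illustration of what "formatted for LLMs" can look like, the sketch below turns diarized utterances into one labeled line per speaker turn and wraps them in a summarization prompt. The utterance shape (`speaker`, `start` in milliseconds, `text`) and the helper names are assumptions made for this example, not a documented response schema.

```python
from typing import Iterable, Mapping


def format_timestamp(ms: int) -> str:
    """Render milliseconds as M:SS, for example 4000 -> '0:04'."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


def format_transcript(utterances: Iterable[Mapping]) -> str:
    """Emit one 'Speaker X [M:SS]: text' line per utterance."""
    return "\n".join(
        f"Speaker {u['speaker']} [{format_timestamp(u['start'])}]: {u['text']}"
        for u in utterances
    )


def build_summary_prompt(utterances: Iterable[Mapping]) -> str:
    """Prepend a short instruction so the text can be sent to any LLM."""
    return (
        "Summarize the following conversation and list its main topics.\n\n"
        + format_transcript(utterances)
    )


# Hypothetical utterances, shaped like the sample transcript on this page.
utterances = [
    {"speaker": "A", "start": 4000, "text": "Welcome back to the show."},
    {"speaker": "B", "start": 11000, "text": "Thanks for having me."},
]

print(build_summary_prompt(utterances))
```

Keeping speaker labels and timestamps on every line lets the model attribute statements to the right person and point back to a position in the audio.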

36% improvement in close rate

“The new Universal-3.5 Pro speech model from AssemblyAI is best so far in terms of accuracy, latency, and language switching.”

80% increase in customer satisfaction

“Assembly has saved us countless hours managing models, and provided exceptional accuracy.”

Quickstart

Start transcribing minutes after you sign up

Create a free account and transcribe any file with one API call. Test it in the no-code playground, then copy a ready-made request into your app. Usage-based pricing from $0.15/hr, no minimums.

No credit card required
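A minimal sketch of what a transcription request might look like, using only the Python standard library. The endpoint path, the `speaker_labels` field, the status values, and the `ASSEMBLYAI_API_KEY` and `AUDIO_URL` environment variable names are assumptions for this example; check them against the current API reference before relying on them. The sketch submits a job and then polls for the result, a flow that an official SDK may wrap in a single call.

```python
import json
import os
import time
import urllib.request

# Assumed base URL; confirm against the current API reference.
API_BASE = "https://api.assemblyai.com/v2"


def build_transcript_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST that submits a publicly reachable audio URL."""
    payload = {"audio_url": audio_url, "speaker_labels": True}
    return urllib.request.Request(
        f"{API_BASE}/transcript",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def transcribe(audio_url: str, api_key: str, poll_seconds: float = 3.0) -> dict:
    """Submit the job, then poll until it completes or fails."""
    with urllib.request.urlopen(build_transcript_request(audio_url, api_key)) as resp:
        job = json.load(resp)

    status_url = f"{API_BASE}/transcript/{job['id']}"
    while True:
        req = urllib.request.Request(status_url, headers={"Authorization": api_key})
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Only touches the network when both variables are set.
    key = os.environ.get("ASSEMBLYAI_API_KEY")
    audio = os.environ.get("AUDIO_URL")
    if key and audio:
        print(transcribe(audio, key)["text"])
```

Separating request construction from the network call keeps the payload easy to inspect and test without spending any usage.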

Word error rate

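Word error rate (WER) is the share of words the model gets wrong against a human reference transcript: the substitutions, deletions, and insertions needed to turn the model's output into the reference, divided by the number of words in the reference. It is the standard measure of transcription accuracy on pre-recorded audio.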
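To make the metric concrete, here is a small sketch that computes word error rate as word-level edit distance divided by the reference length. It only lowercases and splits on whitespace; published benchmarks typically apply heavier text normalization (punctuation, numerals, spelling variants), so treat this as an illustration of the formula rather than a reproduction of any benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "we process about two million minutes of audio a month"
hypothesis = "we process about two million minute of audio month"

# One substitution ("minutes" -> "minute") and one deletion ("a")
# over 10 reference words.
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.2%}")  # prints 20.00%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.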

Pre-recorded word error rate on English audio.

*Lower is better*

| Model | Word error rate |
| --- | --- |
| AssemblyAI Universal-3 Pro | 4.50% |
| Mistral Voxtral Mini | 5.24% |
| OpenAI GPT-4o Transcribe | 5.34% |
| Deepgram Nova-3 | 6.66% |
| Azure Batch | 7.02% |

Source: AssemblyAI published benchmarks — assemblyai.com/benchmarks.