INTRODUCING UNIVERSAL-3 Pro Streaming

The most accurate real-time transcription model for voice agents

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale — rare word recognition, turn detection, context memory, and more.

Try Universal-3 Pro Streaming

See the difference in real time

Speak naturally. Universal-3 Pro Streaming captures what other models miss — try credit card numbers, email addresses, passwords, or company names.

Clinical evaluation history:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?
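
For conversational analysis, the disfluencies preserved by a prompt like the one above can be tallied downstream. The following is a minimal Python sketch, assuming a plain-text transcript as input. The function name `disfluency_stats` and the filler set (the single-word fillers listed in the prompt) are illustrative and not part of any AssemblyAI API.

```python
import re
from collections import Counter

# Single-word fillers taken from the prompt text; multi-word fillers such as
# "you know" are left out because they are ambiguous without context.
FILLERS = {"um", "uh", "er", "erm", "ah", "hmm", "mhm"}


def disfluency_stats(transcript):
    """Count filler words and immediate word repetitions in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    fillers = Counter(token for token in tokens if token in FILLERS)
    # A repetition is any token identical to the token directly before it.
    repetitions = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return {"fillers": dict(fillers), "repetitions": repetitions}


stats = disfluency_stats(
    "Do you and Quentin still socialize, uh, when you come to Los Angeles, "
    "or is it like he's so used to having you here? No, no, no, we, we, "
    "we're friends. What do you do with him?"
)
print(stats)
```

Note that a simple adjacency check also counts emphatic repeats such as "No, no, no", so the numbers are a rough signal rather than a linguistic annotation.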

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
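
If the raw transcript text carries the inline `[Speaker:ROLE]` tags requested in the prompt, an application may want to split it into structured turns. The sketch below shows one way to do that in Python. It assumes the tag format from the prompt's example; the helper name `split_turns` is hypothetical and the actual output shape of the service may differ.

```python
import re

# Matches tags such as [Speaker:NURSE] and captures the role name.
TURN_TAG = re.compile(r"\[Speaker:([^\]]+)\]\s*")


def split_turns(transcript):
    """Split a role-tagged transcript into (role, text) tuples."""
    # With one capturing group, re.split returns
    # [text_before_first_tag, role_1, text_1, role_2, text_2, ...].
    parts = TURN_TAG.split(transcript)
    roles = parts[1::2]
    texts = parts[2::2]
    return [(role.strip().upper(), text.strip()) for role, text in zip(roles, texts)]


turns = split_turns(
    "[Speaker:NURSE] Hello there. How can I help you today? "
    "[Speaker:PATIENT] I'm feeling unwell. I have a headache."
)
for role, text in turns:
    print(f"{role}: {text}")
```

Any text that appears before the first tag is ignored here; a production parser would probably want to keep it under an "unknown" role.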

Spanish and English audio:
"language_detection": True
"prompt": "Preserve natural code-switching between English and Spanish. Retain spoken language as-is (correct: 'I was hablando con mi manager')."
Without code-switching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
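
The settings shown in these examples can be collected into a single configuration object. The sketch below is illustrative only: the field names `prompt` and `language_detection` come from the examples on this page, while `build_session_config` and the way the configuration reaches the service are assumptions. Check the documentation for the actual request format.

```python
import json


def build_session_config(prompt, language_detection=False):
    """Assemble a hypothetical session configuration as a plain dict."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "prompt": prompt.strip(),
        "language_detection": language_detection,
    }


config = build_session_config(
    prompt=(
        "Preserve natural code-switching between English and Spanish. "
        "Retain spoken language as-is."
    ),
    language_detection=True,
)
# Serialize for logging or for sending with whatever transport the API expects.
print(json.dumps(config, indent=2))
```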

Built with the capabilities that make or break voice agent deployments

Audio-contextual turn detection, seamless interruption handling, and high reliability on short utterances. Universal-3 Pro Streaming handles what other models can't.

Features compared across AssemblyAI Universal-3 Pro Streaming, Deepgram Nova-3, OpenAI GPT-4o Transcribe, Microsoft Azure, and ElevenLabs Scribe V2:

Entity accuracy (credit card numbers, emails, etc.)

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

AssemblyAI Universal-3 Pro Streaming: Industry leading
Deepgram Nova-3: Low accuracy
OpenAI GPT-4o Transcribe: Low accuracy
Microsoft Azure: Low accuracy
ElevenLabs Scribe V2: Low accuracy

Speaker diarization performance

Industry Leading
Unreliable
Unreliable
Unreliable
Unlimited concurrency, no rate limits

Dynamic keyterms prompting (turn-by-turn)

Static only

Real-time prompting

Usage-based pricing, no contracts

Commitments and overages

Contracts at scale

LiveKit / Pipecat / Twilio native support

Partial

Real-time accuracy where voice agents actually operate

Universal-3 Pro Streaming improves over Universal-Streaming, delivering accuracy in conditions voice agents actually face: telephony, accented speech, high-turn-taking conversations, and noisy call center environments.

Missed Entity Rate: Universal-3 Pro vs. Universal-Streaming

Lower is better  ·  % of entities not correctly transcribed

Category        Universal-3 Pro Streaming    Universal-Streaming    Improvement (points)
Medical         14.78%                       19.61%                 +4.83
Temporal        8.30%                        9.91%                  +1.61
Name            23.57%                       25.49%                 +1.92
Location        9.22%                        12.99%                 +3.77
Occupation      8.74%                        10.13%                 +1.39
Organization    17.06%                       21.41%                 +4.35
Money           43.56%                       30.44%                 -13.12
Email           59.64%                       89.09%                 +29.45
Phone           34.79%                       37.11%                 +2.32
URL             49.03%                       72.33%                 +23.30
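
The difference shown for each category is the Universal-Streaming rate minus the Universal-3 Pro Streaming rate, in percentage points. As a worked check, the Python sketch below recomputes that difference from the published rates and adds the relative reduction, which the page does not report. The function names are illustrative.

```python
# (Universal-3 Pro Streaming, Universal-Streaming) missed entity rates in percent,
# copied from the comparison above.
MISSED_ENTITY_RATES = {
    "Medical": (14.78, 19.61),
    "Temporal": (8.30, 9.91),
    "Name": (23.57, 25.49),
    "Location": (9.22, 12.99),
    "Occupation": (8.74, 10.13),
    "Organization": (17.06, 21.41),
    "Money": (43.56, 30.44),
    "Email": (59.64, 89.09),
    "Phone": (34.79, 37.11),
    "URL": (49.03, 72.33),
}


def point_improvement(new_rate, old_rate):
    """Percentage-point drop in missed entity rate (positive means better)."""
    return round(old_rate - new_rate, 2)


def relative_reduction(new_rate, old_rate):
    """Drop in missed entity rate as a percentage of the old rate."""
    return round(100 * (old_rate - new_rate) / old_rate, 2)


for category, (new_rate, old_rate) in MISSED_ENTITY_RATES.items():
    points = point_improvement(new_rate, old_rate)
    relative = relative_reduction(new_rate, old_rate)
    print(f"{category:<12} {points:+7.2f} points  {relative:+7.2f}% relative")
```

A negative value, as for Money, means the newer model misses more entities in that category than the older one.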

Entity Recognition on actual customer data

Names, dates, policy numbers, credit card numbers — the entities that drive outcomes are the ones most models get wrong. Universal-3 Pro Streaming delivers the lowest missed entity rates on real-world audio.

Missed Entity Rate by Category — All Providers

Lower is better  ·  Universal-3-Pro Streaming highlighted

Provider and model            Email     Medical   Location  Name
Amazon Transcribe             85.03%    21.69%    10.84%    28.61%
AssemblyAI Universal-2        89.09%    19.61%    12.99%    25.49%
AssemblyAI Universal-3-Pro    59.64%    14.78%    9.22%     23.57%
Deepgram Nova-2               62.94%    19.27%    14.90%    25.05%
Deepgram Nova-3               60.66%    17.08%    14.43%    25.77%
ElevenLabs Scribe-2           56.77%    11.39%    17.30%    15.26%
Microsoft Azure Batch         79.19%    25.52%    19.97%    29.12%
OpenAI GPT-4o Transcribe      88.58%    16.61%    12.37%    28.94%
Speechmatics Enhanced         89.59%    23.94%    20.58%    32.89%

Word Error Rate (%) 

Lower is better  ·  English, all domains

Provider and model              Word Error Rate
AssemblyAI Whisper Streaming    31.51%
AssemblyAI Universal-2          8.76%
AssemblyAI Universal-3-Pro      8.14%
Deepgram Nova-3                 10.12%
Gladia                          8.86%
OpenAI GPT-4o Transcribe        33.14%
Soniox                          11.69%
Voxtral Realtime API            7.05%
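
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model output (substitutions, deletions, and insertions) divided by the number of reference words. The Python sketch below implements that standard definition. The text normalization used for the benchmark above is not stated on this page, so this version only lowercases and splits on whitespace, which is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


wer = word_error_rate(
    "hi this is kelly byrne-donoghue",
    "hi this is kelly byrne donahue",
)
print(f"WER: {wer:.0%}")
```

In this example a single misrecognized name costs one substitution plus one insertion against five reference words, which shows how heavily proper nouns can weigh on short utterances.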

Built for production voice agents

Every feature engineered for the demands of real voice agent infrastructure.

Industry-leading entity accuracy

Best-in-class recognition of credit card numbers, emails, URLs, passwords, and account numbers — the structured data voice agents act on.

Unlimited concurrency, no rate limits

Scale from a single call to millions without hitting limits or renegotiating contracts. Truly pay-as-you-go — no commitments required.

Real-time speaker diarization

Identify and separate speakers mid-conversation. Enable as a per-session toggle — no extra configuration needed.

Dynamic key term prompting

Boost up to 1,000 domain-specific terms, updated turn-by-turn mid-conversation. Unlike static alternatives, ours adapt in real time.
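
When key terms are rebuilt on every turn, it helps to clean the list before sending it. The following Python sketch trims, de-duplicates, and enforces the 1,000-term limit quoted above. The field name `keyterms_prompt` comes from the example earlier on this page; the helper `prepare_keyterms` and the choice to raise an error when the limit is exceeded are assumptions, not documented API behavior.

```python
MAX_KEYTERMS = 1000  # limit quoted on this page


def prepare_keyterms(terms, limit=MAX_KEYTERMS):
    """Return a payload fragment with trimmed, de-duplicated key terms."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = term.strip()
        key = normalized.casefold()  # case-insensitive duplicate check
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} key terms exceed the limit of {limit}")
    return {"keyterms_prompt": cleaned}


payload = prepare_keyterms(["Granola", " granola ", "", "Kelly Byrne-Donoghue"])
print(payload)
```

Calling the helper again with a new list at the start of each turn would give the turn-by-turn behavior described above, assuming the service accepts updated terms mid-session.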

One-line integrations

Native support for LiveKit, Pipecat, Twilio, and Daily. Go from sign-up to a production voice agent in under 15 minutes.

Real-time Prompting
Beta

Guide transcription behavior with natural language in streaming mode. Start with our prompt templates — experiment and share what works.

Sub-200ms end-to-end latency

Open community models

We've built the best voice AI inference infrastructure in the world — and we're opening it to community models, starting with Whisper Streaming.

Global language coverage

Full prompting with keyterms, diarization, and audio tagging in English, Spanish, German, French, Portuguese, and Italian.

More on Universal-3 Pro Streaming

What's next

We’ll be releasing new updates and improvements to Universal-3 Pro Streaming over the coming weeks.

Read the blog

Playground

Access our production-ready Voice AI models for speech recognition, speaker detection, audio summarization, and more—all in our no-code playground.

Try our Playground

Start Building

Explore our comprehensive prompt engineering guide with use case templates, best practices, and an AI-powered prompt generator to optimize accuracy for your application.

Read the docs

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.