INTRODUCING UNIVERSAL-3 Pro

The first of its kind promptable
Speech Language Model

Introducing Universal-3 Pro, a first of its kind promptable speech language model. We’re so confident you’ll be blown away by its performance that we're giving it to you for free for all of February.

Start building for free Learn more

Full control with prompting

Give the model context about names, terminology, topics, and format before processing, and it delivers transcriptions that understand your specific content.

Clinical evaluation history:

00:00

01:46

"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:

00:00

01:46

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"

Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:

00:00

01:46

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Keyterms

00:00

01:23

"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

Without keyterms prompting

"Hi, this is Kelly Byrne-Donoghue"

Caputuring speaker roles:

00:00

01:23

"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}

With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?
‍
Speaker B: Oh yeah, yeah.
‍
Speaker A: Good.
‍
Speaker B: Every evening.
‍
Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?
‍
Speaker [Patient]: Oh yeah, yeah.
‍
Speaker [Nurse]: Good.
‍
Speaker [Patient]: Every evening.
‍
Speaker [Nurse]: And no side effects with it?

Spanish and english audio:

00:00

01:23

"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").

Without codeswitching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With codeswitching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.

Source

Clinical evaluation history:

00:00

01:59

"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without prompting

With context aware prompting

Source

Non-speech audio event:

00:00

01:59

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"

Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Source

Speech with disfluencies:

00:00

01:59

"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"

Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Source

Proper noun spelling:

00:00

01:59

"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

Without keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Source

Caputuring speaker roles:

00:00

01:59

"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}

With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?
‍
Speaker B: Oh yeah, yeah.
‍
Speaker A: Good.
‍
Speaker B: Every evening.
‍
Speaker A: And no side effects with it?

With speaker labels prompting

Source

Spanish and english audio:

00:00

01:59

"language_detection": True
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").

Without codeswitching

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With codeswitching

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.

A powerful new way to build with voice

Get transcripts that capture what matters for your application. Provide context to preserve disfluencies, audio events, and speech patterns that standard transcription misses.

Get domain-specific accuracy without custom models

Describe your audio in plain language and get specialized outputs across medical, legal, customer intelligence, and business applications, all from one model.

Get the details right the first time: "This is a diabetes management conversation" gives you pharmaceutical-grade accuracy right away
Improve accuracy automatically: Include 1,000 domain terms and see up to 45% fewer errors on specialized vocabulary
Handle production reality: Describe accent patterns, audio quality, or background noise and the model adapts

Capture the nuance, not just the transcript

Tag meaningful audio events and filter system artifacts with simple instructions so you can capture the acoustic information that matters for your use case.

Maximize analysis quality: Tag [beep] and [hold music] so your sentiment analysis extracts insights from actual conversations
Capture context that matters: Include [laughter], [silence], or custom domain events that affect how conversations are interpreted
Support multiple use cases: Toggle between verbatim legal records and clean summaries through prompting to serve different customer needs

Label speakers by role, not just by voice

Track every speaker turn with role-based identification and attribute even the shortest interjections to get the full picture.

Build role-aware products from the start: Understand your medical scribes and support intelligence conversation dynamics from day one
Capture the real view of a conversation: Handle frequent interjections and single-word responses, capturing complete dialogue the way it actually happened
Eliminate post-processing logic: Deliver value immediately by including speaker roles directly in transcripts through prompting

Switch between verbatim and clean transcription

Capture legally-defensible records and readable summaries simultaneously without building post-processing logic

Ship compliant products faster: Verbatim mode preserves fillers, repetitions, false starts, and stutters for transcript you can trust
Deliver better user experiences: Clean mode removes conversational patterns for polished, readable output—meeting notes and summaries your users actually want to read
Context determines format: Legal proceedings need verbatim hesitation, business summaries need clarity. Deliver both from one model

Leading the industry with promptable Voice AI

Universal-3 Pro delivers the highest accuracy on real-world audio at 35-50% lower cost than competing solutions.

Universal-3 Pro

ElevenLabs

OpenAI

Speechmatics

Microsoft

Amazon

Deepgram

92%

93%

94%

95%

Metric	AssemblyAI Universal-3 Pro	Elevenlabs Scribe V2	OpenAI GPT-4o-Transcribe	Speechmatics Enhanced	Microsoft Batch Transcription	Amazon Transcribe	Deepgram Nova-3
Word accuracy rate	94.07%	93.48%	93.13%	92.79%	92.40%	92.40%	92.10

Optimized for real world conversations

Tested on diverse data sets, Universal-3 Pro delivers low missed entity rates on real world audio.

Universal-3 Pro

ElevenLabs

OpenAI

Speechmatics

Microsoft

Amazon

Deepgram

*Truncated at 25% for visualization

Date and time

Locations

Medical Terms

10%

15%

20%

25%

Missed entity rate by entity type (Lower is better)	AssemblyAI Universal-3 Pro	ElevenLabs Scribe V2	OpenAI GPT-4o-Transcribe	Speechmatics Enhanced	Microsoft Batch Transcription	Amazon Transcribe	Deepgram Nova-3
Date and time	7.50%	11.941%	12.29%	17.33%	10.48%	20.76%	18.69%
Locations	8.26%	13.58%	12.15%	19.03%	15.91%	10.49%	13.94%
Medical Terms	13.61%	11.39%	16.50%	23.87%	24.93%	13.94%	16.95%

Built on Voice AI infrastructure that scales with you

Universal-3 Pro is part of AssemblyAI's complete Voice AI platform, featuring everything you need to build, deploy, and scale Voice Applications

Predictable usage-based pricing

Pay only for what you use with transparent per-second billing. No minimum commitments, annual contracts, or surprise infrastructure costs.

Complete Voice AI
Platform

Access constantly improving models plus the complete toolchain for every voice use case, from transcription to speech-to-speech, all through our Voice AI Cloud.

Proven reliability and security

Deploy with confidence on infrastructure that processes millions of hours daily. Built for production scale with 99.9% uptime and SOC 2 compliance.

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading of Voice AI apps.

Try our API for free Contact sales

INTRODUCING UNIVERSAL-3 Pro

The first of its kind promptable
Speech Language Model

Full control with prompting

A powerful new way to build with voice

Get domain-specific accuracy without custom models

Capture the nuance, not just the transcript

Label speakers by role, not just by voice

Switch between verbatim and clean transcription

Leading the industry with promptable Voice AI

Optimized for real world conversations

Built on Voice AI infrastructure that scales with you

Predictable usage-based pricing

Complete Voice AI
Platform

Proven reliability and security

More on Universal-3 Pro

What's next

Playground

Start Building

Unlock the value of voice data

INTRODUCING UNIVERSAL-3 Pro

The first of its kind promptable Speech Language Model

Full control with prompting

A powerful new way to build with voice

Get domain-specific accuracy without custom models

Capture the nuance, not just the transcript

Label speakers by role, not just by voice

Switch between verbatim and clean transcription

Leading the industry with promptable Voice AI

Optimized for real world conversations

Built on Voice AI infrastructure that scales with you

Predictable usage-based pricing

Complete Voice AI Platform

Proven reliability and security

More on Universal-3 Pro

What's next

Playground

Start Building

Unlock the value of voice data

The first of its kind promptable
Speech Language Model

Complete Voice AI
Platform