INTRODUCING UNIVERSAL-3 PRO

The first-of-its-kind promptable Speech Language Model

Introducing Universal-3 Pro, a first-of-its-kind promptable Speech Language Model. We're so confident you'll be blown away by its performance that we're giving it to you for free for all of February.

Full control with prompting

Give the model context about names, terminology, topics, and format before processing, and it delivers transcriptions that understand your specific content.
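
In practice, the prompt rides along with the transcription request itself. Below is a minimal sketch in Python against AssemblyAI's REST transcript endpoint; the API key, audio URL, and "universal-3-pro" model identifier are placeholders, and the "prompt" text mirrors the clinical example that follows.

import time
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder
BASE_URL = "https://api.assemblyai.com/v2"
HEADERS = {"authorization": API_KEY}

# Submit the audio with a natural-language prompt describing the content.
request_body = {
    "audio_url": "https://example.com/clinical-history.mp3",  # placeholder audio
    "speech_model": "universal-3-pro",                         # assumed identifier
    "prompt": (
        "Produce a transcript for a clinical history evaluation. "
        "It's important to capture medication and dosage accurately. "
        "Every disfluency is meaningful data."
    ),
}
job = requests.post(f"{BASE_URL}/transcript", json=request_body, headers=HEADERS).json()

# Poll until processing finishes, then read the transcript text.
while True:
    result = requests.get(f"{BASE_URL}/transcript/{job['id']}", headers=HEADERS).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text"))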

Clinical evaluation history:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context-aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Proper noun spelling with keyterms:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donoghue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}
With traditional speaker labels

Speaker A: 5Mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5Mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?
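
Because roles arrive inline, downstream code can recover structured turns with a short regular expression. A sketch assuming the "Speaker [Role]:" layout shown above:

import re

# One speaker turn per line, in the "Speaker [Role]: utterance" layout above.
TURN_PATTERN = re.compile(r"Speaker \[(?P<role>[^\]]+)\]:\s*(?P<text>.*)")

transcript = """Speaker [Nurse]: 5Mg. And do you take it regularly?
Speaker [Patient]: Oh yeah, yeah.
Speaker [Nurse]: Good."""

turns = [
    {"role": match["role"], "text": match["text"]}
    for match in map(TURN_PATTERN.match, transcript.splitlines())
    if match
]
print(turns[0])  # {'role': 'Nurse', 'text': '5Mg. And do you take it regularly?'}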

Spanish and English audio:
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").
Without code-switching prompting

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching prompting

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
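
As a request, the bilingual case combines the language detection flag with the code-switching prompt shown above. A minimal sketch with placeholder values (the model identifier is assumed):

request_body = {
    "audio_url": "https://example.com/bilingual-interview.mp3",  # placeholder audio
    "speech_model": "universal-3-pro",                            # assumed identifier
    "language_detection": True,
    "prompt": "Preserve natural code-switching between English and Spanish. "
              "Retain spoken language as-is.",
}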

A powerful new way to build with voice

Get transcripts that capture what matters for your application. Provide context to preserve disfluencies, audio events, and speech patterns that standard transcription misses.

Get domain-specific accuracy without custom models

Describe your audio in plain language and get specialized outputs across medical, legal, customer intelligence, and business applications, all from one model (a request sketch follows the list below).

  • Get the details right the first time: "This is a diabetes management conversation" gives you pharmaceutical-grade accuracy right away
  • Improve accuracy automatically: Include 1,000 domain terms and see up to 45% fewer errors on specialized vocabulary
  • Handle production reality: Describe accent patterns, audio quality, or background noise and the model adapts
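
As a rough illustration of how these ideas combine in a single request: a plain-language description of the audio plus a domain term list loaded from a file. The file name, audio URL, and model identifier are placeholders; the "prompt" and "keyterms_prompt" fields follow the examples shown earlier on this page.

# Load a domain vocabulary (one term per line) into the keyterms list.
with open("diabetes_terms.txt") as term_file:            # placeholder term list
    domain_terms = [line.strip() for line in term_file if line.strip()]

request_body = {
    "audio_url": "https://example.com/clinic-visit.mp3",  # placeholder audio
    "speech_model": "universal-3-pro",                     # assumed identifier
    "prompt": (
        "This is a diabetes management conversation recorded over the phone, "
        "with some background noise and a range of regional accents."
    ),
    "keyterms_prompt": domain_terms,
}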

Capture the nuance, not just the transcript

Tag meaningful audio events and filter system artifacts with simple instructions so you can capture the acoustic information that matters for your use case (a filtering sketch follows the list below).

  • Maximize analysis quality: Tag [beep] and [hold music] so your sentiment analysis extracts insights from actual conversations
  • Capture context that matters: Include [laughter], [silence], or custom domain events that affect how conversations are interpreted
  • Support multiple use cases: Toggle between verbatim legal records and clean summaries through prompting to serve different customer needs
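
Once events arrive as bracketed tags, a few lines of post-processing can separate system artifacts from the speech your analytics should score. A sketch assuming tags like [beep] and [hold music], as in the examples above:

import re

# Bracketed tags such as [beep] or [hold music], as requested via the prompt.
TAG_PATTERN = re.compile(r"\[([^\]]+)\]")

def split_events(transcript: str) -> tuple[str, list[str]]:
    """Return (speech-only text, list of tagged audio events)."""
    events = TAG_PATTERN.findall(transcript)
    speech_only = " ".join(TAG_PATTERN.sub(" ", transcript).split())
    return speech_only, events

text = "please record your message. [beep] Hi, this is Kelly Byrne-Donoghue. [hold music]"
speech, events = split_events(text)
print(speech)  # please record your message. Hi, this is Kelly Byrne-Donoghue.
print(events)  # ['beep', 'hold music']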

Label speakers by role, not just by voice

Track every speaker turn with role-based identification and attribute even the shortest interjections to get the full picture.

  • Build role-aware products from the start: Understand conversation dynamics in your medical scribe and support intelligence applications from day one
  • Capture the real view of a conversation: Handle frequent interjections and single-word responses, capturing complete dialogue the way it actually happened
  • Eliminate post-processing logic: Deliver value immediately by including speaker roles directly in transcripts through prompting

Switch between verbatim and clean transcription

Capture legally defensible records and readable summaries simultaneously without building post-processing logic (a sketch follows the list below).

  • Ship compliant products faster: Verbatim mode preserves fillers, repetitions, false starts, and stutters for transcripts you can trust
  • Deliver better user experiences: Clean mode removes conversational patterns for polished, readable output—meeting notes and summaries your users actually want to read
  • Context determines format: Legal proceedings need verbatim hesitations; business summaries need clarity. Deliver both from one model
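
Here is a sketch of what that toggle can look like in practice: two prompt presets and one request builder, with no post-processing step. The prompt wording follows the examples earlier on this page, and the model identifier is assumed.

# Two output styles from the same model, selected purely by prompt.
PROMPTS = {
    "verbatim": (
        "Produce a verbatim transcript. Every disfluency is meaningful data. "
        "Include fillers (um, uh, er), repetitions, restarts, stutters, and "
        "informal speech (gonna, wanna, gotta)."
    ),
    "clean": (
        "Produce a clean, readable transcript for meeting notes. Remove fillers, "
        "repetitions, and false starts while keeping the speaker's meaning intact."
    ),
}

def build_request(audio_url: str, style: str) -> dict:
    """Return a transcription request body for the chosen output style."""
    return {
        "audio_url": audio_url,
        "speech_model": "universal-3-pro",  # assumed identifier
        "prompt": PROMPTS[style],
    }

legal_record = build_request("https://example.com/deposition.mp3", "verbatim")
meeting_notes = build_request("https://example.com/standup.mp3", "clean")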

Leading the industry with promptable Voice AI

Universal-3 Pro delivers the highest accuracy on real-world audio at 35-50% lower cost than competing solutions.

Word accuracy rate (higher is better):

  • AssemblyAI Universal-3 Pro: 94.07%
  • ElevenLabs Scribe V2: 93.48%
  • OpenAI GPT-4o-Transcribe: 93.13%
  • Speechmatics Enhanced: 92.79%
  • Microsoft Batch Transcription: 92.40%
  • Amazon Transcribe: 92.40%
  • Deepgram Nova-3: 92.10%

Optimized for real-world conversations

Tested on diverse datasets, Universal-3 Pro delivers low missed-entity rates on real-world audio.

Missed entity rate by entity type (lower is better):

                                  Date & time   Locations   Medical terms
  AssemblyAI Universal-3 Pro         7.50%        8.26%        13.61%
  ElevenLabs Scribe V2              11.941%      13.58%        11.39%
  OpenAI GPT-4o-Transcribe          12.29%       12.15%        16.50%
  Speechmatics Enhanced             17.33%       19.03%        23.87%
  Microsoft Batch Transcription     10.48%       15.91%        24.93%
  Amazon Transcribe                 20.76%       10.49%        13.94%
  Deepgram Nova-3                   18.69%       13.94%        16.95%

Built on Voice AI infrastructure that scales with you

Universal-3 Pro is part of AssemblyAI's complete Voice AI platform, featuring everything you need to build, deploy, and scale voice applications.

Predictable usage-based pricing

Pay only for what you use with transparent per-second billing. No minimum commitments, annual contracts, or surprise infrastructure costs.

Complete Voice AI Platform

Access constantly improving models plus the complete toolchain for every voice use case, from transcription to speech-to-speech, all through our Voice AI Cloud.

Proven reliability and security

Deploy with confidence on infrastructure that processes millions of hours daily. Built for production scale with 99.9% uptime and SOC 2 compliance.

More on Universal-3 Pro

What's next

We’ll be releasing new updates and improvements to Universal-3 Pro over the coming weeks, including more languages, better instruction following, and real-time support.

Read the blog

Playground

Access our production-ready Voice AI models for speech recognition, speaker detection, audio summarization, and more—all in our no-code playground.

Try our Playground

Start Building

Explore our comprehensive prompt engineering guide with use case templates, best practices, and an AI-powered prompt generator to optimize accuracy for your application.

Read the docs

Unlock the value of voice data

Build what's next on the platform powering thousands of the industry's leading Voice AI apps.