Introducing Universal-3 Pro: A new class of speech language model optimized for Voice AI
There's signal locked in every voice conversation, be it customer calls, medical scribing, AI notetaking, or voice agents. Almost every vertical is building with voice data. We know because we're powering most of them.
Today we're releasing Universal-3 Pro, a first-of-its-kind promptable speech language model. And we're so confident you'll be blown away by its performance that you can use the model free for all of February to see for yourself.
Start building for free with Universal-3 Pro today. →
Universal-3 Pro is the first production-quality speech model that adapts its behavior based on the instructions you provide. Every capability in Universal-3 Pro (audio tagging, disfluency capture, speaker labeling) works through prompting. Describe your audio in plain language and the model adjusts its transcription accordingly:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and
dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm,
like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but,
no-not), and informal speech (gonna, wanna, gotta)"
The model preserves natural speech patterns while correctly transcribing "glycoside," which is critical when phonetically similar terms affect patient safety.
Describe what you're transcribing in plain language: "This is a customer support call about technical troubleshooting" or "This is a board meeting discussing Q4 financial results." The model interprets these instructions and adjusts accordingly. No parameter configuration required.
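For instance, here's a minimal sketch of passing such an instruction through the Python SDK, following the same pattern as the full example at the end of this post (the audio URL is a placeholder):

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# A plain-language prompt steers how the model transcribes the audio
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal"],
    prompt="This is a customer support call about technical troubleshooting."
)
transcript = aai.Transcriber().transcribe(
    "https://example.com/support_call.mp3",  # placeholder audio URL
    config=config
)
print(transcript.text)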
A powerful new way to build with voice:
The old way
Transcribe, then run complex pipelines of regex and LLM calls to extract what you need. Company names get mangled. Jargon becomes gibberish. You build correction layers on correction layers.
Worse: by the time you're fixing errors, you've lost the acoustic information (tone, hesitation, emphasis) that would have helped you get it right.
The new way
Universal-3 Pro accepts prompts before transcription. Give it context (names, terminology, topics, format) and it uses that while processing audio, not after.
You get accurate output at the source. The transcript is correct the first time.
The same model, prompted differently, gives you speaker ID, disfluency detection, or audio event classification. Instead of separate tools, you get different outputs from one model that understands the full audio. Prompt for verbatim transcription, speaker labels, or domain-specific output with medical context. One model, one integration.
Keyterm prompting: 45% accuracy improvement on domain-specific terms
Include up to 1,000 words of domain vocabulary through prompting. The model learns error patterns from your examples and applies corrections across related terms, including those not explicitly listed.
"keyterms_prompt": ["Kelly Byrne-Donoghue"]

Testing shows up to 45% accuracy improvements when prompting is used effectively.
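As a sketch, the keyterms_prompt field shown above can be sent with the rest of the request, assuming the SDK exposes it as a config parameter (the extra names and the audio URL here are illustrative placeholders):

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Domain vocabulary biases recognition toward these spellings;
# corrections generalize to related terms not explicitly listed
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal"],
    keyterms_prompt=["Kelly Byrne-Donoghue", "glycoside"]
)
transcript = aai.Transcriber().transcribe(
    "https://example.com/expert_call.mp3",  # placeholder audio URL
    config=config
)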
Audio tagging: Control what gets captured beyond words
Control whether the model captures non-speech audio and which events matter for your use case. Universal-3 Pro gives you explicit control, preventing garbage transcription from phone system artifacts and reducing false transcriptions.
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data.
Include: Tag sounds: [beep]"
Trained on 50+ audio event tags, including [laughter], [silence], [beep], [hold music], [noise], [inaudible], and more, with the ability to prompt for any number of custom tags for domain-specific audio events.
Traditional ASR either hallucinates non-speech audio as random words or strips all acoustic context. Through prompting, you control exactly what gets tagged, preserving meaningful audio events that affect how calls are interpreted.
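A minimal sketch of prompting for specific tags ([beep] and [hold music] come from the trained set above; [dog barking] stands in for an illustrative custom tag):

import assemblyai as aai

# Request only the non-speech events that matter for this use case
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal"],
    prompt="Produce a transcript suitable for conversational analysis. Tag sounds: [beep], [hold music], [dog barking]"
)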
Disfluency control: Prompt for verbatim or clean transcription
Control conversational speech patterns to capture legally defensible verbatim records or clean, readable transcripts, all without building custom post-processing logic:

Use this prompt structure to capture natural speech patterns:
Transcribe verbatim:
- Fillers: yes (um, uh, like, you know)
- Repetitions: yes (I I, the the)
- False starts: yes (I was- I went)
- Stutters: yes (th-that, b-but)
This capability matters for applications where exact phrasing affects meaning. In legal transcription, "I agree" versus "I, uh, I guess I agree" carries different evidentiary weight. In sales coaching, hesitation patterns reveal where representatives struggle with objection handling. In medical documentation, precise language affects patient consent records.
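A sketch of wiring that prompt structure into a request (same placeholder-URL convention as the earlier examples):

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Multi-line prompt requesting a fully verbatim transcript
verbatim_prompt = """Transcribe verbatim:
- Fillers: yes (um, uh, like, you know)
- Repetitions: yes (I I, the the)
- False starts: yes (I was- I went)
- Stutters: yes (th-that, b-but)"""

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal"],
    prompt=verbatim_prompt
)
transcript = aai.Transcriber().transcribe(
    "https://example.com/deposition.mp3",  # placeholder audio URL
    config=config
)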
Promptable speaker diarization: Track every speaker turn, even the shortest interjections
Prompting speaker labels with Universal-3 Pro works especially well in cases where there are frequent interjections, such as quick acknowledgments or single-word responses.
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their
respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format:
[Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a
headache."
}
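Because the role labels arrive inline in the transcript text, a few lines of post-processing split it into speaker turns. A sketch, assuming the exact [Speaker:ROLE] format requested above:

import re

def split_turns(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (role, utterance) pairs based on [Speaker:ROLE] markers."""
    parts = re.split(r"\[Speaker:([^\]]+)\]", text)
    # re.split yields [preamble, role1, utterance1, role2, utterance2, ...]
    return [(parts[i].strip(), parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

print(split_turns("[Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell."))
# [('NURSE', 'Hello there. How can I help you today?'), ('PATIENT', "I'm feeling unwell.")]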

Code-switching: Support the way your customers actually speak
Universal-3 Pro supports six languages natively: English, Spanish, German, French, Portuguese, and Italian, with built-in code-switching that handles natural language mixing within conversations.
Common code-switching patterns:
- "Hola, can you help me find información about my account balance?"
- "Je voudrais un appointment for next week, s'il vous plaît"
Set "language_detection": True, as shown in the code example below, to automatically detect the spoken language from the audio.
data = { "audio_url": "https://assemblyaiassets.com/audios/code_switching.mp3", "language_detection":
True, "speech_models": ["universal-3-pro", "universal"]
"prompt": "You are transcribing a recording of a natural conversation between bilingual Spanish-English
speakers. These speakers are fluent in both languages and switch between them naturally, sometimes multiple
times within a single sentence. This phenomenon is called code-switching and is common among bilingual
communities. Your task is to transcribe every word exactly as spoken, preserving the language of each word.
Do not translate. Do not normalize. Do not correct perceived errors. Spanish words should be spelled with
proper Spanish orthography including accent marks. English words should remain in English even when
surrounded by Spanish. Capture all filler words, hesitations, and repetitions in both languages."
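A sketch of submitting that request body over the REST API, assuming the standard v2 transcript endpoint and polling flow (production code would add error handling):

import time
import requests

headers = {"authorization": "YOUR_API_KEY"}
base = "https://api.assemblyai.com/v2/transcript"

# Create the transcription job with the data dict defined above
job = requests.post(base, json=data, headers=headers).json()

# Poll until the job finishes
while True:
    result = requests.get(f"{base}/{job['id']}", headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text"))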

Intelligent language routing: Comprehensive global coverage
For comprehensive global coverage, use the speech_models parameter to access all 99 languages AssemblyAI supports. Simply list multiple speech models in priority order, and the system automatically routes your audio based on language support, eliminating the need for custom language detection systems.
config=aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal"]
)

The system attempts to use the models in priority order, falling back to the next model when needed. For example, with ["universal-3-pro", "universal"], the system will try to use Universal-3 Pro for languages it supports (English, Spanish, Portuguese, French, German, and Italian) and automatically fall back to Universal for all other languages. This ensures you get the best-performing transcription where available while maintaining the widest language coverage.
Why not a multimodal foundation model?
Multimodal foundation models like Gemini can accept audio directly and follow instructions. Why build something specialized?
General-purpose models are designed to be good at many things, which is useful for prototyping, but less so when your business depends on processing thousands of hours of calls daily. These models are generalists built to handle text, images, video, and audio across countless tasks. This creates reliability issues in production: they might try to respond to your audio instead of transcribing it, especially with conversational content. You have to carefully structure outputs as JSON and handle cases where the model doesn't follow your format. When they encounter unclear audio, they often generate plausible-sounding words that weren't actually spoken.
Universal-3 Pro trains exclusively on speech tasks: transcription, speaker identification, and audio understanding. You get instruction-following capability combined with deterministic ASR reliability. The output is always a transcript. Your prompts control how that transcript is generated, not whether you get a transcript at all. This exclusive focus significantly reduces hallucinations, preventing fabricated content from entering your transcripts when accuracy affects compliance, legal records, or clinical documentation.

There's also cost and production reliability to consider. Multimodal models are expensive at scale, and consistent latency, native speaker identification, and handling acoustic edge cases aren't capabilities you get by adding audio as another modality to a foundation model.
For voice-first products, the specialized approach wins.
Best price performance on real-world data
Universal-3 Pro delivers the lowest word error rate on real-world data at a fraction of the cost of competing solutions. The testing methodology uses diverse audio, including call center recordings with background noise, medical consultations with domain terminology, business meetings with multiple speakers, and accented speech across supported languages. This approach measures performance under real-world conditions that production applications actually encounter, not laboratory recordings.
At $0.21/hour, Universal-3 Pro provides best-in-class accuracy at 35-50% lower cost than competing solutions, with no rate limits or upfront commitments. Volume discounts ensure cost-effectiveness as you scale.


How prompting improves upon baseline accuracy
Universal-3 Pro delivers production-ready accuracy out of the box. With targeted prompting, accuracy improves even further.
Our baseline benchmarks show Universal-3 Pro unprompted performance across multiple data sets. When we optimize prompts for specific data sets (providing domain context, relevant terminology, or instructions about audio characteristics), we see measurable word error rate improvements. The extent of improvement varies by use case: prompts tailored for English-Spanish code-switching improve accuracy on bilingual customer service calls, while prompts emphasizing pharmaceutical terminology reduce errors in medical transcription.
This means you can start with strong baseline accuracy and refine results through prompting, rather than requiring custom model training or extensive post-processing pipelines. The time investment in prompt engineering (typically minutes to hours) replaces what would otherwise require days or weeks of custom model development.
We recommend testing with your specific audio in our Playground to validate performance for your use case before implementation.
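As a sketch of that validation loop, you can score a prompted run against an unprompted baseline with the open-source jiwer package (the file names, reference text, and sample prompt are placeholders; we assume the config accepts prompt=None for the baseline):

import assemblyai as aai
import jiwer  # pip install jiwer

aai.settings.api_key = "YOUR_API_KEY"
reference = open("reference_transcript.txt").read()  # human-verified ground truth

def wer_for(prompt):
    """Transcribe the same audio with or without a prompt and score it against the reference."""
    config = aai.TranscriptionConfig(speech_models=["universal-3-pro", "universal"], prompt=prompt)
    transcript = aai.Transcriber().transcribe("sample_call.mp3", config=config)
    return jiwer.wer(reference, transcript.text)

print("baseline WER:", wer_for(None))
print("prompted WER:", wer_for("Customer support call about billing. Company name: Acme Corp."))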
Features and pricing
Universal-3 Pro delivers superior capabilities at better price-performance than competing solutions.

Proven accuracy gains in production
Teams building on Universal-3 Pro report immediate accuracy improvements on production audio. Junior, a voice intelligence platform for M&A teams, saw measurable gains when testing Universal-3 Pro. For due diligence workflows where consultants and investors rely on expert call transcripts to inform billion-dollar decisions, accuracy improvements directly translate to better insights and reduced risk.

Jiminny, a conversation intelligence platform, highlighted improvements in handling business-critical details. When sales teams use conversation intelligence to coach reps and analyze deal cycles, accurate capture of customer names and company-specific terminology is essential for extracting actionable insights.
These improvements happen out of the box, with additional gains available through targeted prompting.
Why AssemblyAI
Voice AI infrastructure is all we do. We don't build end-user products or split focus across modalities. We're focused entirely on making it faster to build, ship, and scale products with voice data.
We've worked with hundreds of teams in this space. We’ve seen the prompts, the edge cases, and the failure modes. When you build on Universal-3 Pro, you're building on that accumulated knowledge. As we improve the model, those improvements flow to everyone.
Simple pricing, no lock-in. Volume discounts without upfront commits or rate limit negotiations. If we're not the best option for you, it's easy to switch.
Start building with Universal-3 Pro today. →
Get started in 3 ways
- Playground: Test Universal-3 Pro with your audio files. No code required
- Documentation: Review our Prompt Engineering Guide with industry-specific templates
- API: Use your existing API key. Simply update the speech_models parameter
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(
    "https://assembly.ai/doctor.mov",
    config=aai.TranscriptionConfig(
        speech_models=["universal-3-pro", "universal-2"],
        language_detection=True,
        prompt="Transcribe this audio. Context: a medical consultation discussing medications and symptoms."
    )
)
print(transcript.text)

No rate limits, upfront commits, or vendor lock-in. Standard volume discounts apply.
Coming Soon
Additional capabilities launching in the coming months:
- Streaming Medical Mode (Beta): Real-time transcription optimized for clinical workflows
- Speech-to-Speech API (Beta): End-to-end voice transformation without text intermediaries
- Expanded language support: Additional languages beyond the current six
Start building with Universal-3 Pro today. →
Free February offer: Free usage valid for the month of February 2026 only, limited to 5,000 hours of Universal-3 Pro transcription. Valid credit card required to access offer. Usage beyond 5,000 hours will be billed at standard list rates. Volume-based discounts available for Universal-3 Pro—contact sales for details.