Introducing Universal-3 Pro
Learn how to transcribe pre-recorded audio using Universal-3 Pro.
Overview
Universal-3 Pro is our most powerful Voice AI model yet, designed to capture the “hard stuff” that traditional ASR models struggle with.
Key Universal-3 Pro Capabilities:
- Keyterm Prompting: Improve recognition of domain-specific terminology, rare words, and proper nouns
- Prompting: Guide transcription style, formatting, and output characteristics
Out of the box, the model outperforms every ASR model on the market on accuracy, especially for entities and rare words. With prompting, you can get fully customized transcription output that approaches human-level quality.
Example prompts and behavior
Prompts let you customize how your transcriptions appear, from polished, readable output to fully verbatim capture. Here are three examples:
Sample prompt 1: Simple transcription
- Human-readable output without disfluencies or speech patterns
- Strong overall WER but higher error rate on rare words
- Best for general transcription prioritizing readability
Sample prompt 2: Enhanced accuracy
- Adds contextual correctness for rare words/entities
- Includes basic disfluencies and speech patterns
- Best for domain-specific content (medical, legal, technical)
Sample prompt 3: Full verbatim with entity accuracy
- Captures all speech patterns, cross-talk, and background noise
- Maximizes contextual and multilingual accuracy
- Best for verbatim needs across varied transcript types
To fine-tune to your use case, see the Prompting section.
Not sure where to start? Try our Prompt Generator.
Quick start
This example shows how to transcribe a pre-recorded audio file with the Universal-3 Pro model and print the transcript text to your terminal.
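As a minimal sketch of the request, the payload below uses the speech_models parameter and model names described in this document; the endpoint URL and the audio_url field name are placeholder assumptions, not confirmed API details.

```python
import json

# Placeholder endpoint: the real API base URL is not shown in this document.
API_ENDPOINT = "https://api.example.com/v2/transcript"

def build_transcript_request(audio_url: str) -> dict:
    """Build a transcription request payload for Universal-3 Pro.

    `speech_models` lists models in priority order: Universal-3 Pro first,
    with Universal-2 as the fallback for the remaining languages.
    """
    return {
        "audio_url": audio_url,  # assumed field name for the pre-recorded file
        "speech_models": ["universal-3-pro", "universal-2"],
    }

payload = build_transcript_request("https://example.com/meeting.mp3")
print(json.dumps(payload, indent=2))
```

In practice you would POST this payload with your API key and poll the returned transcript until processing completes.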
Language support
Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian. To access all 99 languages, use "speech_models": ["universal-3-pro", "universal-2"] as shown in the code example. Read more here.
Keyterms prompting
Keyterms prompting lets you provide up to 1,000 words or phrases (up to 6 words per phrase) via the keyterms_prompt parameter. The model uses these to improve transcription accuracy for the terms themselves, as well as related variations and contextually similar phrases.
Here is an example showing how you can use keyterms prompting to improve transcription accuracy for a name with distinctive spelling and formatting.
Without keyterms prompting:
With keyterms prompting:
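A hedged sketch of attaching keyterms, with a helper (our own, not part of any SDK) that enforces the documented limits before adding the keyterms_prompt parameter:

```python
def add_keyterms(payload: dict, keyterms: list[str]) -> dict:
    """Attach keyterms_prompt, enforcing the documented limits:
    at most 1,000 entries, each at most 6 words."""
    if len(keyterms) > 1000:
        raise ValueError("keyterms_prompt accepts at most 1,000 words or phrases")
    for term in keyterms:
        if len(term.split()) > 6:
            raise ValueError(f"phrase exceeds 6 words: {term!r}")
    return {**payload, "keyterms_prompt": keyterms}

request = add_keyterms(
    {"speech_models": ["universal-3-pro"]},
    ["Anktiva", "Glycoside", "clinical history evaluation"],
)
```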
Prompting
Universal-3 Pro delivers great accuracy out of the box. To fine-tune transcription results to your use case, provide a prompt with up to 1,500 words of context in plain language. This helps the model consistently recognize domain-specific terminology, apply your preferred formatting conventions, handle code switching between languages, and better interpret ambiguous speech.
Verbatim transcription and disfluencies
Capture natural speech patterns exactly as spoken, including "um", "uh", false starts, repetitions, and stutters. Add examples of the verbatim elements you want transcribed to the prompt parameter to guide the model.
Without prompt:
With prompt, the model better captures filler words like “uh” and false starts like “we, we, we’re friends”.
Example prompts:
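As a hedged illustration (the wording is assumed, not an official example), a verbatim-style prompt could look like:

```python
# Illustrative prompt wording only.
verbatim_prompt = (
    "Transcribe verbatim. Keep filler words such as 'um' and 'uh', "
    "false starts like 'we, we, we're friends', repetitions, and stutters."
)

request = {
    "speech_models": ["universal-3-pro"],
    "prompt": verbatim_prompt,
}
```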
Output style and formatting
Control punctuation, capitalization, and readability without changing words.
Without prompt:
With prompt, the model accurately captures the speaker’s emotional state through punctuation, adding exclamation marks during moments of yelling and emphasis.
Example prompts:
Context-aware clues
Provide context the model can use for jargon, names, and domain expectations known about the audio ahead of time.
Without prompt:
With prompt, adding ‘clinical history evaluation’ as a context clue corrects spelling of ‘Glicoside’ to ‘Glycoside’.
Example prompts:
Entity accuracy and spelling
Improve transcript accuracy for proper nouns, brands, technical terms, and domain vocabulary with prompting.
Without prompt:
With prompt, the model corrects the misrecognition of “Anktiva,” which would otherwise be transcribed as “Entiva”.
Example prompts:
Caution: Over-instructing the model to follow examples can cause it to hallucinate those examples when similar audio is encountered.
Speaker attribution
Prompted speaker diarization identifies who said what in a conversation. It works especially well in cases where there are frequent interjections, such as quick acknowledgments or single-word responses, or when working with limited spoken audio, such as short-duration files.
Without prompt:
With prompt:
Without prompting, it may appear that speaker B said everything. But with prompting, the model correctly identifies this as 5 separate speaker turns, capturing utterances as short as a single word, like “good”.
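To make the five-turn example concrete, here is a small parser for a diarized transcript. The "Speaker: text" line format is an assumed rendering of diarized output, not the API's actual response schema:

```python
def speaker_turns(transcript: str) -> list[tuple[str, str]]:
    """Parse 'Speaker: text' lines into (speaker, utterance) turns."""
    turns = []
    for line in transcript.strip().splitlines():
        speaker, _, text = line.partition(":")
        turns.append((speaker.strip(), text.strip()))
    return turns

sample = """\
A: How have you been?
B: Good.
A: Glad to hear it.
B: And you?
A: Busy, but good."""

turns = speaker_turns(sample)  # 5 turns, including the single-word "Good."
```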
Audio event tags
Audio tags capture non-speech events like music, laughter, pauses, applause, background noise, and other sounds in your audio. Include examples of audio tags you want to transcribe in the prompt parameter to guide the model.
Without prompt:
With prompting, non-speech events like beeps are called out in the transcript.
Here are some examples of audio tags you can prompt for: [music], [laughter], [applause], [noise], [pause], [inaudible], [sigh], [gasp], [cheering], [sound], [screaming], [bell], [beep], [sound effect], [buzzer], and more.
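Downstream of transcription, bracketed tags like these are easy to extract. A small sketch (our own helper, assuming tags appear inline in the transcript text):

```python
import re

# Matches bracketed audio event tags such as [beep] or [sound effect].
AUDIO_TAG = re.compile(r"\[([a-z ]+)\]")

def extract_audio_tags(transcript: str) -> list[str]:
    """Return audio event tags found in a transcript string."""
    return AUDIO_TAG.findall(transcript)

tags = extract_audio_tags("Please hold [beep] thank you [sound effect] goodbye")
```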
Code-switching and multilingual
Handle audio where speakers switch between languages.
Without prompt:
With prompt, the model is able to preserve the speaker’s natural code switching between English and French, transcribing each language as spoken.
Example prompts:
Requires language_detection: true on your request. If a single language code is specified, the model will try to transcribe only that language.
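A hedged sketch of a code-switching request; the helper and prompt wording are illustrative, but language_detection and prompt are the parameters named in this document:

```python
def multilingual_request(prompt: str) -> dict:
    """Build a request for code-switched audio; language_detection
    must be enabled, per the note above."""
    return {
        "speech_models": ["universal-3-pro"],
        "language_detection": True,
        "prompt": prompt,
    }

request = multilingual_request(
    "The speaker alternates between English and French; "
    "transcribe each language as it is spoken."
)
```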
Numbers and measurements
Control how numbers, percentages, and measurements are formatted.
Without prompt:
With prompt:
Example prompts:
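As a hedged illustration (assumed wording, not the original examples), a number-formatting prompt could look like:

```python
# Illustrative prompt wording only.
numbers_prompt = (
    "Write numbers as digits, percentages with the % sign, and measurements "
    "with abbreviated units, e.g. 'twenty five percent' as '25%' and "
    "'five kilometers' as '5 km'."
)

request = {"speech_models": ["universal-3-pro"], "prompt": numbers_prompt}
```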
Difficult audio handling
Provide guidance for unclear audio, overlapping speech, and interruptions.
Without prompt:
With prompt:
Example prompts:
Temperature parameter
Control the amount of randomness injected into the model’s response using the temperature parameter.
The temperature parameter accepts values from 0.0 to 1.0, with a default value of 0.0.
Choosing the right temperature value
The temperature parameter controls how deterministic or exploratory the model’s decoding is. Lower values (e.g., 0.0) make the model fully deterministic, which can be useful for strict reproducibility. Slightly higher values (e.g., 0.1) introduce a small amount of exploration.
Low non-zero temperatures often produce better transcription accuracy (lower WER)—in some cases up to ~5% relative improvement—by allowing the model to recover from early decoding mistakes, while still remaining highly stable. Higher values (e.g., > 0.3) increase randomness and may reduce accuracy.
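A small sketch (our own helper, not an SDK function) that enforces the documented temperature range before adding it to a request:

```python
def set_temperature(payload: dict, temperature: float = 0.0) -> dict:
    """Set decoding temperature, enforcing the documented 0.0-1.0 range
    (default 0.0; small values like 0.1 add slight exploration)."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    return {**payload, "temperature": temperature}

request = set_temperature({"speech_models": ["universal-3-pro"]}, 0.1)
```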
99 languages coverage
With the speech_models parameter, you can list multiple speech models in priority order, allowing our system to automatically route your audio based on language support.
Model routing behavior: The system attempts to use the models in priority order, falling back to the next model when needed. For example, with ["universal-3-pro", "universal-2"], the system will try to use universal-3-pro for languages it supports (English, Spanish, Portuguese, French, German, and Italian), and automatically fall back to Universal-2 for all other languages. This ensures you get the best-performing transcription where available while maintaining the widest language coverage.
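The routing behavior described above can be sketched as a simple priority lookup. This is a client-side illustration of the rule, not the service's actual routing code, and the ISO language codes are our shorthand for the languages named above:

```python
def route_model(speech_models: list[str], language: str,
                support: dict) -> str:
    """Return the first model in priority order that supports `language`.
    A support value of None means the model covers all languages."""
    for model in speech_models:
        languages = support.get(model)
        if languages is None or language in languages:
            return model
    raise ValueError(f"no listed model supports language {language!r}")

SUPPORT = {
    "universal-3-pro": {"en", "es", "pt", "fr", "de", "it"},
    "universal-2": None,  # stands in for full 99-language coverage
}

route_model(["universal-3-pro", "universal-2"], "en", SUPPORT)  # "universal-3-pro"
route_model(["universal-3-pro", "universal-2"], "ja", SUPPORT)  # "universal-2"
```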
Best practices for prompt engineering
Check out this guide to learn even more about how to craft effective prompts for Universal-3 Pro speech transcription, which includes an AI prompt generator tool.