Universal-3 Pro (Async)

Set up and configure Universal-3 Pro (Async) for pre-recorded audio transcription.

Universal-3 Pro is our most powerful Voice AI model, designed to capture the “hard stuff” that traditional ASR models struggle with. It delivers state-of-the-art accuracy for entities, rare words, and domain-specific terminology out of the box, with prompting support for fully customized transcription output.

Quickstart

Get started with Universal-3 Pro using the code below. This example transcribes a pre-recorded audio file and prints the transcript text to your terminal.

1. Install the required library:

   pip install requests

2. Create a new file main.py and paste the code below. Replace <YOUR_API_KEY> with your API key.

3. Run the script with python main.py.

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

data = {
    "audio_url": "https://assembly.ai/sports_injuries.mp3",
    "language_detection": True,
    "speech_models": ["universal-3-pro", "universal-2"]
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code}, Response: {response.text}")
    response.raise_for_status()

transcript_response = response.json()
transcript_id = transcript_response["id"]
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        print(transcript["text"])
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

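
The quickstart polls every 3 seconds indefinitely. For longer files you may prefer a bounded wait with exponential backoff. Here is a minimal sketch; the helper names, backoff values, and timeout are illustrative choices, not part of the API:

```python
import time
import requests

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

def backoff_delays(base=3.0, cap=30.0, attempts=20):
    # Exponential backoff schedule, capped so the wait between polls
    # never grows beyond `cap` seconds.
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def wait_for_transcript(transcript_id, timeout=900):
    """Poll until the transcript completes or errors, or the timeout expires."""
    endpoint = f"{base_url}/v2/transcript/{transcript_id}"
    deadline = time.time() + timeout
    for delay in backoff_delays():
        transcript = requests.get(endpoint, headers=headers).json()
        if transcript["status"] == "completed":
            return transcript
        if transcript["status"] == "error":
            raise RuntimeError(f"Transcription failed: {transcript['error']}")
        if time.time() + delay > deadline:
            break
        time.sleep(delay)
    raise TimeoutError(f"Transcript {transcript_id} did not finish within {timeout}s")
```
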
Language support

Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian. To access all 99 languages, use "speech_models": ["universal-3-pro", "universal-2"] as shown in the code example. Read more here.

Key capabilities

Out of the box, the model outperforms all ASR models on the market on accuracy, especially for entities and rare words. With prompting, you can get an entirely customized transcription output that rivals human-level transcription.

  • Keyterm Prompting: Improve recognition of domain-specific terminology, rare words, and proper nouns
  • Prompting: Guide transcription style, formatting, and output characteristics

What prompts can do:

  • Verbatim transcription and disfluencies: Include um, uh, false starts, repetitions, stutters
  • Output style and formatting: Control punctuation, capitalization, number formatting
  • Context-aware clues: Help with jargon, names, and domain expectations
  • Entity accuracy and spelling: Improve accuracy for proper nouns, brands, technical terms
  • Speaker attribution: Mark speaker turns and add labels
  • Audio event tags: Mark laughter, music, applause, background sounds
  • Native code switching: Handle multilingual audio in the same transcript
  • Regional dialect recognition: Accurately transcribe regional dialects like Quebecois French, Brazilian Portuguese, Spanglish, and more. See supported dialects
  • Numbers and measurements: Control how numbers, percentages, and measurements are formatted
  • Labeling crosstalk: Label overlapping speech, interruptions, and crosstalk segments

To fine-tune to your use case, see the Prompting section. Not sure where to start? Try our Prompt Generator.

Start with no prompt

We strongly recommend testing with no prompt first. When you omit the prompt parameter, Universal-3 Pro automatically applies a built-in default prompt that is already optimized for accuracy across a wide range of audio types — including verbatim transcription, multilingual code-switching, and challenging audio conditions. For most use cases, the default prompt delivers excellent results out of the box.

If you're going to build a prompt, don't start from scratch: begin with one of the recommended prompts and tweak it for your use case. Please read the Prompting Guide (Async) if you'd like to build your prompt yourself.

Remember, prompts are primarily instructional, so adding a large amount of context may not make a significant impact on accuracy and could reduce instruction-following coherence. Feel free to layer in additional instructions that you see in the Prompting Guide (Async).

Benchmarking

A note on evaluating modern speech-to-text

Across the industry, we’re seeing that as models improve, they sometimes capture words or phrases that human transcribers originally missed. In WER evaluations, this shows up as insertions, even when the model is technically correct. We’ve also seen substitutions impact scores in cases where formatting differs (e.g., “alright” vs. “all right”), despite no meaningful accuracy difference.

To help address this, we’re actively developing documentation, blog content, and benchmarking tools focused on best practices for evaluating modern speech-to-text systems. We’ll continue sharing these resources as they’re released.

This is increasingly becoming an industry-wide benchmarking challenge as models begin to match, or exceed, human transcription quality in certain scenarios.

For more details on evaluating transcription accuracy, including tips on using semantic WER and handling substitution artifacts, see the Evaluating transcription accuracy section in the Prompting Guide.

Keyterms prompting

Keyterms prompting allows you to provide up to 1,000 words or phrases (maximum 6 words per phrase) using the keyterms_prompt parameter to improve transcription accuracy for those terms and related variations or contextually similar phrases.
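
Because keyterms_prompt has hard limits, it can help to validate your list client-side before sending the request. The following check is a sketch based on the limits above; the helper itself is illustrative, and the API performs its own validation:

```python
def validate_keyterms(keyterms):
    """Check a keyterms list against the documented limits:
    at most 1,000 entries, each at most 6 words."""
    if len(keyterms) > 1000:
        raise ValueError(f"Too many keyterms: {len(keyterms)} (max 1,000)")
    too_long = [k for k in keyterms if len(k.split()) > 6]
    if too_long:
        raise ValueError(f"Phrases longer than 6 words: {too_long}")
    return keyterms
```
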

Prompt and Keyterms Prompt

The prompt and keyterms_prompt parameters cannot be used in the same request. Please choose either one or the other based on your use case. When you use keyterms_prompt, your boosted words are appended to the default prompt automatically. See How keyterms prompting works behind the scenes below.

Here is an example showing how you can use keyterms prompting to improve transcription accuracy for a name with distinctive spelling and formatting.

Without keyterms prompting:

Hi, this is Kelly Byrne Donahue

With keyterms prompting:

Hi, this is Kelly Byrne-Donoghue

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

data = {
    "audio_url": "https://assemblyaiassets.com/audios/keyterms_prompting.wav",
    "language_detection": True,
    "speech_models": ["universal-3-pro", "universal-2"],
    "keyterms_prompt": ["Kelly Byrne-Donoghue"]
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code}, Response: {response.text}")
    response.raise_for_status()

transcript_response = response.json()
transcript_id = transcript_response["id"]
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        print(transcript["text"])
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

How keyterms prompting works behind the scenes

Behind the scenes, the keyterms_prompt parameter works by appending your keyterms to the system prompt. The model receives the combined <prompt> + <keyterms> as a single input, which is how keyterms are boosted during transcription.

You are free to replicate this behavior manually by appending keyterms directly to the end of your prompt parameter. For example:

Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio. Make sure to boost the words Kelly Byrne-Donoghue, AssemblyAI, cardiology in the audio.
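
As a sketch, building that combined string yourself could look like the following. The wording mirrors the example above; the exact formatting the API uses internally may differ:

```python
DEFAULT_PROMPT = (
    "Always: Transcribe speech with your best guess based on context "
    "in all possible scenarios where speech is present in the audio."
)

def prompt_with_keyterms(keyterms, base_prompt=DEFAULT_PROMPT):
    # Append keyterms to the base prompt, mirroring the documented
    # <prompt> + <keyterms> combined input.
    if not keyterms:
        return base_prompt
    return (
        f"{base_prompt} Make sure to boost the words "
        f"{', '.join(keyterms)} in the audio."
    )
```
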

Your mileage may vary when manually appending keyterms to the prompt, and manual appending may not produce the same results as the dedicated keyterms_prompt parameter. We recommend using keyterms_prompt instead; it handles the formatting and appending for you and is optimized for keyterm boosting.

Prompting

For a comprehensive guide on crafting effective prompts, including best practices, prompt capabilities, and example prompts, see the Prompting guide.

Universal-3 Pro delivers great accuracy out of the box. To fine-tune transcription results to your use case, provide a prompt with up to 1,500 words of context in plain language. This helps the model consistently recognize domain-specific terminology, apply your preferred formatting conventions, handle code switching between languages, and better interpret ambiguous speech.
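
If you assemble prompts programmatically, a quick word-count check against the 1,500-word limit can catch oversized prompts before a request is sent. This helper is illustrative; how the API treats over-limit prompts is not specified here:

```python
def check_prompt_length(prompt, max_words=1500):
    """Raise if the prompt exceeds the documented 1,500-word limit."""
    n = len(prompt.split())
    if n > max_words:
        raise ValueError(f"Prompt is {n} words; the limit is {max_words}")
    return n
```
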

Prompt and Keyterms Prompt

The prompt and keyterms_prompt parameters cannot be used in the same request. Please choose either one or the other based on your use case. When you use keyterms_prompt, your boosted words are appended to the default prompt automatically. See How keyterms prompting works behind the scenes below.

Default prompt

When no prompt is provided, Universal-3 Pro automatically applies the following default prompt:

Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

You can override the default prompt by providing your own prompt value. See the Prompting guide for detailed examples covering verbatim transcription, audio event tags, speaker attribution, output formatting, and more.

Temperature parameter

Control the amount of randomness injected into the model’s response using the temperature parameter.

The temperature parameter accepts values from 0.0 to 1.0, with a default value of 0.0.

Choosing the right temperature value

The temperature parameter controls how deterministic or exploratory the model’s decoding is. Lower values (e.g., 0.0) make the model fully deterministic, which can be useful for strict reproducibility. Slightly higher values (e.g., 0.1) introduce a small amount of exploration.

Low non-zero temperatures often produce better transcription accuracy (lower WER), in some cases up to ~5% relative improvement, by allowing the model to recover from early decoding mistakes while still remaining highly stable. Higher values (e.g., above 0.3) increase randomness and may reduce accuracy.

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

data = {
    "audio_url": "https://assemblyaiassets.com/audios/nlp_prompting.mp3",
    "language_detection": True,
    "speech_models": ["universal-3-pro", "universal-2"],
    "prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)",
    "temperature": 0.1
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code}, Response: {response.text}")
    response.raise_for_status()

transcript_response = response.json()
transcript_id = transcript_response["id"]
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        print(transcript["text"])
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

Best practices for prompt engineering

See this guide to learn even more about how to craft effective prompts for Universal-3 Pro speech transcription, which includes an AI prompt generator tool.

Support for 99 languages

With the speech_models parameter, you can list multiple speech models in priority order, allowing our system to automatically route your audio based on language support.

Model routing behavior: The system attempts to use the models in priority order, falling back to the next model when needed. For example, with ["universal-3-pro", "universal-2"], the system will use universal-3-pro for the languages it supports (English, Spanish, Portuguese, French, German, and Italian) and automatically fall back to Universal-2 for all other languages. This gives you the best-performing transcription where available while maintaining the widest language coverage.
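
When you pass a priority list, you may want to record which model actually handled each file, assuming the completed transcript reports it. The speech_model field name below is an assumption; check the API reference for the exact response schema:

```python
data = {
    "audio_url": "https://assembly.ai/sports_injuries.mp3",
    "language_detection": True,
    # Priority order: Universal-3 Pro where supported, Universal-2 otherwise.
    "speech_models": ["universal-3-pro", "universal-2"],
}

def routed_model(transcript):
    # Assumes the completed transcript JSON includes a "speech_model" field;
    # falls back to "unknown" if it does not.
    return transcript.get("speech_model", "unknown")
```
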