Models

AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.

Pre-recorded models

Universal-3 Pro

Highest accuracy, fastest model
Supports 6 languages
Advanced prompting capabilities
Keyterms prompting up to 1,000 words
Native code switching

Universal-2

High accuracy, low latency
Support across 99 languages
Keyterms prompting up to 200 words
Code switching

We recommend Universal-3 Pro for pre-recorded audio transcription. It delivers the highest accuracy and fastest transcription out of the box, with optional prompting for when you need more control. For the broadest language coverage (99 languages), use ["universal-3-pro", "universal-2"] to automatically fall back to Universal-2 for unsupported languages.
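
As a rough illustration of the fallback list above, the sketch below builds a request body that lists Universal-3 Pro first and Universal-2 second. The field names `speech_models`, `audio_url`, and `language_code`, and both helper names, are assumptions made for this example; check the API reference for the exact parameter names. `expected_model` is only a local prediction based on the six language codes listed on this page, not the service's actual routing logic, and it does not handle region-suffixed codes.

```python
# Language codes this page lists for Universal-3 Pro.
UNIVERSAL_3_PRO_LANGUAGES = {"en", "es", "de", "fr", "pt", "it"}

# Preference order taken from the recommendation above.
FALLBACK_MODELS = ["universal-3-pro", "universal-2"]


def build_transcript_request(audio_url: str, language_code: str) -> dict:
    """Build a request body that prefers Universal-3 Pro, then Universal-2.

    The key names used here are assumptions for illustration only.
    """
    return {
        "audio_url": audio_url,
        "language_code": language_code,
        "speech_models": list(FALLBACK_MODELS),  # assumed field name
    }


def expected_model(language_code: str) -> str:
    """Predict locally which model in the fallback list would be used."""
    if language_code.lower() in UNIVERSAL_3_PRO_LANGUAGES:
        return "universal-3-pro"
    return "universal-2"


if __name__ == "__main__":
    request = build_transcript_request("https://example.com/call.mp3", "ja")
    print(request["speech_models"], expected_model("ja"))
```

Keeping the list in one constant means the preference order is defined once and copied into each request.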

Streaming models

Universal-3 Pro Streaming

Highest accuracy for voice agents
Fastest word emissions
Advanced prompting capabilities
Keyterms prompting up to 100 words
6 languages: en, es, pt, de, fr, it

Universal-Streaming Multilingual

Good balance of speed and cost-effectiveness
Multilingual real-time transcription
Keyterms prompting up to 100 words
6 languages: en, es, pt, de, fr, it

Universal-Streaming English

Good balance of speed and cost-effectiveness
English transcription
Keyterms prompting up to 100 words
Intelligent endpointing

Whisper Streaming

Open-source Whisper with AssemblyAI infrastructure
99+ languages at an accessible price point
Automatic language detection
Unlimited scale

We recommend Universal-3 Pro Streaming for streaming transcription. It provides the highest accuracy with sub-300ms latency, native multilingual code switching, and advanced prompting support.
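
The keyterms limits listed for each model above can be checked client-side before a request is sent. The sketch below is one way to do that. It assumes the limit counts whitespace-separated words summed across all keyterms, which this page does not state explicitly, and the helper names are hypothetical. Whisper Streaming is left out because no keyterms limit is listed for it.

```python
# Keyterms prompting limits (in words) as listed on this page.
KEYTERM_WORD_LIMITS = {
    "Universal-3 Pro": 1000,
    "Universal-2": 200,
    "Universal-3 Pro Streaming": 100,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms: list[str]) -> int:
    """Count whitespace-separated words across all keyterms (assumed counting rule)."""
    return sum(len(term.split()) for term in keyterms)


def remaining_keyterm_budget(model_name: str, keyterms: list[str]) -> int:
    """Return how many more words fit under the model's limit, or raise ValueError."""
    limit = KEYTERM_WORD_LIMITS.get(model_name)
    if limit is None:
        raise ValueError(f"No keyterms limit is listed for {model_name!r}")
    used = count_keyterm_words(keyterms)
    if used > limit:
        raise ValueError(f"{used} words exceeds the {limit}-word limit for {model_name}")
    return limit - used


if __name__ == "__main__":
    terms = ["atorvastatin", "Epstein Barr virus"]
    print(remaining_keyterm_budget("Universal-2", terms))
```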

Add-on models

Add-on models enhance transcription accuracy for specialized domains. They work alongside your chosen speech model and are billed separately.

Medical Mode

Improved accuracy for medical terminology
Medications, procedures, conditions, and dosages
Works with pre-recorded and streaming models
4 languages: en, es, de, fr

Medical Mode

Medical Mode (domain: "medical-v1") is an add-on that enhances transcription accuracy for medical terminology — including medication names, procedures, conditions, and dosages. It is optimized for medical entity recognition to correct terms that other models frequently get wrong.

Supported models:

Pre-recorded: Universal-3 Pro, Universal-2
Streaming: Universal-3 Pro Streaming, Universal-Streaming English, Universal-Streaming Multilingual

Supported languages: English, Spanish, German, French

Medical Mode is billed as a separate add-on. See the pricing page for details.

Learn more: Medical Mode for pre-recorded audio | Medical Mode for streaming
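
The compatibility rules above can be expressed as a small pre-flight check. This is a minimal sketch: the function name is hypothetical, the models are keyed by the display names used on this page rather than by API identifiers, and the returned dictionary reuses only the `domain: "medical-v1"` setting quoted above. How that setting is attached to a real request is not shown here.

```python
# Models and languages this page lists as supporting Medical Mode.
MEDICAL_MODE_MODELS = {
    "Universal-3 Pro",
    "Universal-2",
    "Universal-3 Pro Streaming",
    "Universal-Streaming English",
    "Universal-Streaming Multilingual",
}
MEDICAL_MODE_LANGUAGES = {"en", "es", "de", "fr"}


def medical_mode_settings(model_name: str, language_code: str) -> dict:
    """Return the Medical Mode setting if the model and language support it."""
    if model_name not in MEDICAL_MODE_MODELS:
        raise ValueError(f"Medical Mode is not listed for {model_name!r}")
    if language_code not in MEDICAL_MODE_LANGUAGES:
        raise ValueError(f"Medical Mode is not listed for language {language_code!r}")
    return {"domain": "medical-v1"}


if __name__ == "__main__":
    print(medical_mode_settings("Universal-2", "de"))
```

Failing early with a `ValueError` avoids sending a request that combines Medical Mode with a model or language outside the supported lists.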

Choosing the right model

Pre-recorded

Universal-3 Pro

Universal-3 Pro is our most powerful Voice AI model, designed to capture the “hard stuff” that traditional ASR models struggle with. It delivers state-of-the-art accuracy for entities, rare words, and domain-specific terminology out of the box, with code switching and optional prompting for more control. It’s also our fastest model, so you get the best accuracy without sacrificing speed.

Best for:

Highest-accuracy transcription where quality is the top priority
Post-call analytics and conversation intelligence
Meeting notetakers
Medical transcription
Recruiting and interviews — high-quality diarization + entity accuracy
Domain-specific accuracy via keyterm prompting (up to 1,000 words) — entities, proper nouns, rare terms
Code-switching across EN/ES/DE/FR/PT/IT

Supported languages

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Regional dialects

Universal-3 Pro also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.

Try Universal-3 Pro here

Universal-2

Universal-2 offers accurate, cost-effective transcription across 99 languages with low latency. It supports code switching and optional keyterms prompting for domain-specific vocabulary (up to 200 words). Universal-2 is the go-to choice when you need reliable transcription across diverse languages.

Best for:

High accuracy at lower cost with broad language support
High-volume, price-sensitive batch transcription
Support for 99 languages
Recommended fallback when a requested language isn’t supported by Universal-3 Pro

Supported languages

Global English (en)
Australian English (en_au)
British English (en_uk)
US English (en_us)
Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)
Dutch (nl)
Hindi (hi)
Japanese (ja)
Chinese (zh)
Finnish (fi)
Korean (ko)
Polish (pl)
Russian (ru)
Turkish (tr)
Ukrainian (uk)
Vietnamese (vi)
Afrikaans (af)
Albanian (sq)
Amharic (am)
Arabic (ar)
Armenian (hy)
Assamese (as)
Azerbaijani (az)
Bashkir (ba)
Basque (eu)
Belarusian (be)
Bengali (bn)
Bosnian (bs)
Breton (br)
Bulgarian (bg)
Burmese (my)
Catalan (ca)
Croatian (hr)
Czech (cs)
Danish (da)
Estonian (et)
Faroese (fo)
Galician (gl)
Georgian (ka)
Greek (el)
Gujarati (gu)
Haitian (ht)
Hausa (ha)
Hawaiian (haw)
Hebrew (he)
Hungarian (hu)
Icelandic (is)
Indonesian (id)
Javanese (jw)
Kannada (kn)
Kazakh (kk)
Khmer (km)
Lao (lo)
Latin (la)
Latvian (lv)
Lingala (ln)
Lithuanian (lt)
Luxembourgish (lb)
Macedonian (mk)
Malagasy (mg)
Malay (ms)
Malayalam (ml)
Maltese (mt)
Maori (mi)
Marathi (mr)
Mongolian (mn)
Nepali (ne)
Norwegian (no)
Norwegian Nynorsk (nn)
Occitan (oc)
Panjabi (pa)
Pashto (ps)
Persian (fa)
Romanian (ro)
Sanskrit (sa)
Serbian (sr)
Shona (sn)
Sindhi (sd)
Sinhala (si)
Slovak (sk)
Slovenian (sl)
Somali (so)
Sundanese (su)
Swahili (sw)
Swedish (sv)
Swiss German (de_ch)
Tagalog (tl)
Tajik (tg)
Tamil (ta)
Tatar (tt)
Telugu (te)
Thai (th)
Tibetan (bo)
Turkmen (tk)
Urdu (ur)
Uzbek (uz)
Welsh (cy)
Yiddish (yi)
Yoruba (yo)

Try Universal-2 here

Streaming

Universal-3 Pro Streaming

The most accurate model with the fastest word emissions for voice agents that demand the highest quality. Best-in-class accuracy with advanced prompting capabilities, including both keyterms prompting and native prompting. Supports English, Spanish, German, French, Portuguese, and Italian.

Best for:

Real-time voice agents
Applications requiring premium accuracy
Customer service voice agents needing elite entity accuracy
IVR replacement / binary response detection in short utterances
Agent assist and sales intelligence needing real-time speaker diarization and mid-session dynamic prompting
Multilingual voice agents — native EN/ES/DE/FR/PT/IT code-switching
Compliance and verbatim recording — disfluency control via prompting

Supported languages

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Regional dialects

Universal-3 Pro Streaming also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.

Learn more about Universal-3 Pro Streaming

Universal-Streaming Multilingual

A multilingual transcription model offering a good balance of speed and cost-effectiveness. Supports English, Spanish, German, French, Portuguese, and Italian. Features intelligent endpointing and keyterms prompting support for up to 100 words.

Best for:

Cost-effective real-time transcription across languages
Cost-sensitive multilingual streaming across EN/ES/DE/FR/PT/IT

Supported languages

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Learn more about Universal-Streaming Multilingual

Universal-Streaming English

An English transcription model offering a good balance of speed and cost-effectiveness. Features ~300ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting support for up to 100 words.

Best for:

Cost-effective real-time transcription for English
English-only real-time apps — fastest and cheapest streaming option for English

Supported languages

English (en)

Learn more about Universal-Streaming English

Whisper Streaming

An open-source Whisper model enhanced with AssemblyAI’s reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point with automatic language detection and non-speech tags.

Best for:

Multilingual applications and open-source flexibility
Customers who prefer open-source models
Cost-sensitive multilingual transcription

Supported languages

Global English (en)
Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)
Dutch (nl)
Hindi (hi)
Japanese (ja)
Chinese (zh)
Finnish (fi)
Korean (ko)
Polish (pl)
Russian (ru)
Turkish (tr)
Ukrainian (uk)
Vietnamese (vi)
Afrikaans (af)
Albanian (sq)
Amharic (am)
Arabic (ar)
Armenian (hy)
Assamese (as)
Azerbaijani (az)
Bashkir (ba)
Basque (eu)
Belarusian (be)
Bengali (bn)
Bosnian (bs)
Breton (br)
Bulgarian (bg)
Burmese (my)
Cantonese (yue)
Catalan (ca)
Croatian (hr)
Czech (cs)
Danish (da)
Estonian (et)
Faroese (fo)
Galician (gl)
Georgian (ka)
Greek (el)
Gujarati (gu)
Haitian (ht)
Hausa (ha)
Hawaiian (haw)
Hebrew (he)
Hungarian (hu)
Icelandic (is)
Indonesian (id)
Javanese (jw)
Kannada (kn)
Kazakh (kk)
Khmer (km)
Lao (lo)
Latin (la)
Latvian (lv)
Lingala (ln)
Lithuanian (lt)
Luxembourgish (lb)
Macedonian (mk)
Malagasy (mg)
Malay (ms)
Malayalam (ml)
Maltese (mt)
Maori (mi)
Marathi (mr)
Mongolian (mn)
Nepali (ne)
Norwegian (no)
Norwegian Nynorsk (nn)
Occitan (oc)
Panjabi (pa)
Pashto (ps)
Persian (fa)
Romanian (ro)
Sanskrit (sa)
Serbian (sr)
Shona (sn)
Sindhi (sd)
Sinhala (si)
Slovak (sk)
Slovenian (sl)
Somali (so)
Sundanese (su)
Swahili (sw)
Swedish (sv)
Tagalog (tl)
Tajik (tg)
Tamil (ta)
Tatar (tt)
Telugu (te)
Thai (th)
Tibetan (bo)
Turkmen (tk)
Urdu (ur)
Uzbek (uz)
Welsh (cy)
Yiddish (yi)
Yoruba (yo)

Learn more about Whisper Streaming

To learn how to specify a model, see the guides for pre-recorded audio and for streaming audio.

Pricing

For detailed pricing information, visit our pricing page.

Pre-recorded

Model	Price per Hour	Volume discounts
Universal-3 Pro	$0.21/hr	Available
Universal-2	$0.15/hr	Available

Streaming

Streaming is billed per hour of session duration — the total time your WebSocket connection stays open — not per hour of audio sent. See Streaming Speech-to-Text billing for details.

Model	Price per Hour (session duration)	Volume discounts
Universal-3 Pro Streaming	$0.45/hr	Available
Universal-Streaming Multilingual	$0.15/hr	Available
Universal-Streaming English	$0.15/hr	Available
Whisper Streaming	$0.30/hr	Available
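
Because streaming is billed on session duration, a connection left open costs the same whether or not audio is flowing. The sketch below works through that arithmetic using the list prices in the table above. The function name is hypothetical, and the calculation ignores volume discounts and any rounding or minimum-increment rules, which this page does not describe.

```python
# List prices per hour of session duration, from the table above.
STREAMING_PRICE_PER_HOUR = {
    "Universal-3 Pro Streaming": 0.45,
    "Universal-Streaming Multilingual": 0.15,
    "Universal-Streaming English": 0.15,
    "Whisper Streaming": 0.30,
}


def estimate_streaming_cost(model_name: str, session_seconds: float) -> float:
    """Estimate cost in USD from how long the WebSocket session stays open."""
    if session_seconds < 0:
        raise ValueError("session_seconds must be non-negative")
    hourly_rate = STREAMING_PRICE_PER_HOUR[model_name]
    return hourly_rate * session_seconds / 3600


if __name__ == "__main__":
    # A session open for 30 minutes is billed as 0.5 hours,
    # even if only 10 minutes of audio were actually sent.
    cost = estimate_streaming_cost("Universal-3 Pro Streaming", 30 * 60)
    print(f"${cost:.3f}")
```

In practice this means closing idle connections promptly is the main lever for controlling streaming cost.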

For volume discounts, please reach out to sales@assemblyai.com.

Next steps

Explore Speech Understanding features like summarization, sentiment analysis, and more
Learn about prompting: Universal-3 Pro prompting guide | Universal-3 Pro Streaming prompting guide
