Models
AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.
Pre-recorded models
Universal-3 Pro
- Highest accuracy, fastest model
- Supports 6 languages
- Advanced prompting capabilities
- Keyterms prompting up to 1,000 words
- Native code switching
Universal-2
- High accuracy, low latency
- Support across 99 languages
- Keyterms prompting up to 200 words
- Code switching
We recommend Universal-3 Pro for pre-recorded audio transcription. It delivers the highest accuracy and fastest transcription out of the box, with optional prompting for when you need more control. For the broadest language coverage (99 languages), use ["universal-3-pro", "universal-2"] to automatically fall back to Universal-2 for unsupported languages.
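The fallback list can be sketched as a small request-building helper. This is a minimal illustration, not the official client: the `audio_url` and `speech_models` field names and the `build_transcript_request` and `expected_model` helpers are assumptions introduced here, and the actual fallback decision is made by the service, not by this code.

```python
# Languages Universal-3 Pro supports, per the pre-recorded model list above.
UNIVERSAL_3_PRO_LANGUAGES = {"en", "es", "de", "fr", "pt", "it"}

# Ordered model preference taken from the recommendation above.
FALLBACK_MODELS = ["universal-3-pro", "universal-2"]


def build_transcript_request(audio_url: str) -> dict:
    """Build a request body that prefers Universal-3 Pro, then Universal-2.

    The "audio_url" and "speech_models" keys are assumed field names.
    """
    return {"audio_url": audio_url, "speech_models": list(FALLBACK_MODELS)}


def expected_model(language_code: str) -> str:
    """Illustrate which model the fallback list should resolve to."""
    if language_code.lower() in UNIVERSAL_3_PRO_LANGUAGES:
        return "universal-3-pro"
    return "universal-2"
```

For example, `expected_model("fr")` returns `"universal-3-pro"`, while `expected_model("ja")` returns `"universal-2"` because Japanese is outside the six Universal-3 Pro languages.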
Streaming models
Universal-3 Pro Streaming
- Highest accuracy for voice agents
- Fastest word emissions
- Advanced prompting capabilities
- Keyterms prompting up to 100 words
- 6 languages: en, es, pt, de, fr, it
Universal-Streaming Multilingual
- Good balance of speed and cost-effectiveness
- Multilingual real-time transcription
- Keyterms prompting up to 100 words
- 6 languages: en, es, pt, de, fr, it
Universal-Streaming English
- Good balance of speed and cost-effectiveness
- English transcription
- Keyterms prompting up to 100 words
- Intelligent endpointing
Whisper Streaming
- Open-source Whisper with AssemblyAI infrastructure
- 99+ languages at an accessible price point
- Automatic language detection
- Unlimited scale
We recommend Universal-3 Pro Streaming for streaming transcription. It provides the highest accuracy with sub-300ms latency, native multilingual code switching, and advanced prompting support.
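The keyterm limits listed for each model can be checked on the client before a request is sent. The sketch below is a hypothetical helper, not part of any SDK; it assumes the limit applies to the total number of whitespace-separated words across all keyterms, which this page does not spell out.

```python
# Keyterm word limits per model, as listed on this page.
KEYTERM_WORD_LIMITS = {
    "Universal-3 Pro": 1000,
    "Universal-2": 200,
    "Universal-3 Pro Streaming": 100,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms: list[str]) -> int:
    """Count whitespace-separated words across all keyterms (assumed counting rule)."""
    return sum(len(term.split()) for term in keyterms)


def check_keyterms(model: str, keyterms: list[str]) -> int:
    """Return the total word count, or raise ValueError if it exceeds the model's limit."""
    limit = KEYTERM_WORD_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"No keyterm limit listed for model: {model}")
    total = count_keyterm_words(keyterms)
    if total > limit:
        raise ValueError(f"{total} keyterm words exceeds the {limit}-word limit for {model}")
    return total
```

Whisper Streaming is left out of the table because no keyterm limit is listed for it.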
Add-on models
Add-on models enhance transcription accuracy for specialized domains. They work alongside your chosen speech model and are billed separately.
Medical Mode
Medical Mode (domain: "medical-v1") is an add-on that enhances transcription accuracy for medical terminology — including medication names, procedures, conditions, and dosages. It is optimized for medical entity recognition to correct terms that other models frequently get wrong.
Supported models:
- Pre-recorded: Universal-3 Pro, Universal-2
- Streaming: Universal-3 Pro Streaming, Universal-Streaming English, Universal-Streaming Multilingual
Supported languages: English, Spanish, German, French
Medical Mode is billed as a separate add-on. See the pricing page for details.
Learn more: Medical Mode for pre-recorded audio | Medical Mode for streaming
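As a rough sketch, the Medical Mode constraints above could be validated before the `domain` setting is added to a request. The `with_medical_mode` helper and the shape of the request dictionary are assumptions for illustration; only the `domain: "medical-v1"` value and the supported models and languages come from this page.

```python
MEDICAL_DOMAIN = "medical-v1"

# Supported models and languages for Medical Mode, as listed above.
MEDICAL_MODELS = {
    "Universal-3 Pro",
    "Universal-2",
    "Universal-3 Pro Streaming",
    "Universal-Streaming English",
    "Universal-Streaming Multilingual",
}
MEDICAL_LANGUAGES = {"en", "es", "de", "fr"}  # English, Spanish, German, French


def with_medical_mode(request: dict, model: str, language_code: str) -> dict:
    """Return a copy of the request with Medical Mode enabled, after basic checks."""
    if model not in MEDICAL_MODELS:
        raise ValueError(f"Medical Mode is not listed as supported for {model}")
    if language_code.lower() not in MEDICAL_LANGUAGES:
        raise ValueError(f"Medical Mode is not listed as supported for language {language_code}")
    # The original request is left unmodified.
    return {**request, "domain": MEDICAL_DOMAIN}
```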
Choosing the right model
Pre-recorded
Universal-3 Pro
Universal-3 Pro is our most powerful Voice AI model, designed to capture the “hard stuff” that traditional ASR models struggle with. It delivers state-of-the-art accuracy for entities, rare words, and domain-specific terminology out of the box, with code switching and optional prompting for more control. It’s also our fastest model, so you get the best accuracy without sacrificing speed.
Best for:
- Highest-accuracy transcription where quality matters more than speed
- Post-call analytics and conversation intelligence
- Meeting notetakers
- Medical transcription
- Recruiting and interviews — high-quality diarization + entity accuracy
- Domain-specific accuracy via keyterm prompting (up to 1,000 words) — entities, proper nouns, rare terms
- Code-switching across EN/ES/DE/FR/PT/IT
Supported languages
en, es, de, fr, pt, it
Regional dialects
Universal-3 Pro also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.
Universal-2
Universal-2 offers accurate, cost-effective transcription across 99 languages with low latency. It supports code switching and optional keyterms prompting for domain-specific vocabulary (up to 200 words). Universal-2 is the go-to choice when you need reliable transcription across diverse languages.
Best for:
- High accuracy at lower cost with broad language support
- High-volume, price-sensitive batch transcription
- Support for 99 languages
- Recommended fallback when a requested language isn’t supported by Universal-3 Pro
Supported languages
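en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, de_ch, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Streaming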
Universal-3 Pro Streaming
The most accurate model with the fastest word emissions for voice agents that demand the highest quality. Best-in-class accuracy with advanced prompting capabilities, including both keyterms prompting and native prompting. Supports English, Spanish, German, French, Portuguese, and Italian.
Best for:
- Real-time voice agents
- Applications requiring premium accuracy
- Customer service voice agents needing elite entity accuracy
- IVR replacement / binary response detection in short utterances
- Agent assist and sales intelligence needing real-time speaker diarization and mid-session dynamic prompting
- Multilingual voice agents — native EN/ES/DE/FR/PT/IT code-switching
- Compliance and verbatim recording — disfluency control via prompting
Supported languages
en, es, de, fr, pt, it
Regional dialects
Universal-3 Pro Streaming also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.
Learn more about Universal-3 Pro Streaming
Universal-Streaming Multilingual
A multilingual transcription model offering a good balance of speed and cost-effectiveness. Supports English, Spanish, German, French, Portuguese, and Italian. Features intelligent endpointing and keyterms prompting support for up to 100 words.
Best for:
- Cost-effective real-time multilingual transcription across EN/ES/DE/FR/PT/IT
Supported languages
en, es, de, fr, pt, it
Learn more about Universal-Streaming Multilingual
Universal-Streaming English
An English transcription model offering a good balance of speed and cost-effectiveness. Features ~300ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting support for up to 100 words.
Best for:
- Cost-effective English-only real-time apps: the fastest and cheapest streaming option for English
Supported languages
en
Learn more about Universal-Streaming English
Whisper Streaming
An open-source Whisper model enhanced with AssemblyAI’s reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point with automatic language detection and non-speech tags.
Best for:
- Multilingual applications and open-source flexibility
- Customers who prefer open-source models
- Cost-sensitive multilingual transcription
Supported languages
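en, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, yue, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Learn more about Whisper Streaming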
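The streaming guidance above can be condensed into a simple selection rule. The sketch below is one possible reading of that guidance, not an official decision procedure; the `choose_streaming_model` function and its `cost_sensitive` flag are hypothetical, and it does not verify that a language is actually in the Whisper Streaming list.

```python
# Languages shared by Universal-3 Pro Streaming and Universal-Streaming Multilingual.
SIX_STREAMING_LANGUAGES = {"en", "es", "de", "fr", "pt", "it"}


def choose_streaming_model(language_code: str, cost_sensitive: bool = False) -> str:
    """Pick a streaming model name from the language and a cost preference."""
    lang = language_code.lower()
    if lang not in SIX_STREAMING_LANGUAGES:
        # Only Whisper Streaming covers languages beyond the six listed above.
        return "Whisper Streaming"
    if not cost_sensitive:
        # Recommended default: highest accuracy for voice agents.
        return "Universal-3 Pro Streaming"
    if lang == "en":
        return "Universal-Streaming English"
    return "Universal-Streaming Multilingual"
```

For instance, `choose_streaming_model("ja")` returns `"Whisper Streaming"`, and `choose_streaming_model("es", cost_sensitive=True)` returns `"Universal-Streaming Multilingual"`.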
Pricing
For detailed pricing information, visit our pricing page.
Streaming
Streaming is billed per hour of session duration — the total time your WebSocket connection stays open — not per hour of audio sent. See Streaming Speech-to-Text billing for details.
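To make the session-duration rule concrete, the following sketch computes billable time from when a connection opens and closes. It is an illustration only: the `billable_hours` helper is hypothetical, the timestamps are made up, and any rounding or minimum-charge rules are not covered on this page and are ignored here.

```python
from datetime import datetime, timedelta


def billable_hours(opened_at: datetime, closed_at: datetime) -> float:
    """Hours the WebSocket connection stayed open, regardless of audio sent."""
    return (closed_at - opened_at).total_seconds() / 3600


# A session left open for 90 minutes while only 30 minutes of audio was streamed.
opened = datetime(2025, 1, 1, 9, 0, 0)
closed = opened + timedelta(minutes=90)
audio_sent = timedelta(minutes=30)

print(billable_hours(opened, closed))      # 1.5 (session duration is what counts)
print(audio_sent.total_seconds() / 3600)   # 0.5 (audio duration is not what is billed)
```

Closing idle connections promptly keeps the billed session duration close to the amount of audio actually streamed.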
For volume discounts, please reach out to sales@assemblyai.com.
Next steps
- Explore Speech Understanding features like summarization, sentiment analysis, and more
- Learn about prompting: Universal-3 Pro prompting guide | Universal-3 Pro Streaming prompting guide