AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.
We recommend Universal-3 Pro for pre-recorded audio transcription. It delivers the highest accuracy and fastest transcription out of the box, with optional prompting for when you need more control. For the broadest language coverage (99 languages), use ["universal-3-pro", "universal-2"] to automatically fall back to Universal-2 for unsupported languages.
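As a rough illustration of the fallback list, the sketch below builds the request body for a pre-recorded transcription job. It assumes the list is sent in a `speech_models` field of a JSON body posted to `https://api.assemblyai.com/v2/transcript`, with the API key in an `authorization` header. These names are assumptions, so check the API reference before relying on them. The audio URL is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint and field names; verify against the API reference.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, speech_models=("universal-3-pro", "universal-2")):
    """Return the JSON body for a transcription job with model fallback.

    Models are listed in priority order: the first entry is preferred and
    later entries act as fallbacks for languages it does not support.
    """
    if not speech_models:
        raise ValueError("at least one speech model is required")
    return {"audio_url": audio_url, "speech_models": list(speech_models)}


def submit_transcript(api_key, audio_url):
    """Post the job over HTTPS and return the decoded JSON response."""
    body = json.dumps(build_transcript_request(audio_url)).encode("utf-8")
    request = urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the body only; no network call is made here.
    print(json.dumps(build_transcript_request("https://example.com/audio.mp3"), indent=2))
```

Keeping the payload builder separate from the network call makes the model order easy to inspect and test without an API key.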
We recommend Universal-3 Pro Streaming for streaming transcription. It provides the highest accuracy with sub-300ms latency, native multilingual code switching, and advanced prompting support.
Add-on models enhance transcription accuracy for specialized domains. They work alongside your chosen speech model and are billed separately.
Medical Mode (domain: "medical-v1") is an add-on that enhances transcription accuracy for medical terminology — including medication names, procedures, conditions, and dosages. It is optimized for medical entity recognition to correct terms that other models frequently get wrong.
Supported models:
Supported languages: English, Spanish, German, French
Medical Mode is billed as a separate add-on. See the pricing page for details.
Learn more: Medical Mode for pre-recorded audio | Medical Mode for streaming
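One way to enable Medical Mode is sketched below. It assumes `domain` is a top-level field of the transcription request body and that the language is passed as `language_code`. Both placements are assumptions, and only the `"medical-v1"` value comes from the text above. The helper `with_medical_mode` is hypothetical, and it checks the language only, because the supported-model list is not reproduced here.

```python
# ISO 639-1 codes for the four languages listed for Medical Mode:
# English, Spanish, German, French.
MEDICAL_MODE_LANGUAGES = {"en", "es", "de", "fr"}


def with_medical_mode(request_body, language_code):
    """Return a copy of request_body with the Medical Mode add-on enabled.

    Raises ValueError when the language is not one of the four listed above.
    """
    if language_code not in MEDICAL_MODE_LANGUAGES:
        raise ValueError(f"Medical Mode does not list {language_code!r} as supported")
    enabled = dict(request_body)  # copy so the caller's dict is not mutated
    enabled["language_code"] = language_code  # assumed field name
    enabled["domain"] = "medical-v1"  # value quoted in the text above
    return enabled


if __name__ == "__main__":
    print(with_medical_mode({"audio_url": "https://example.com/visit.mp3"}, "en"))
```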
Universal-3 Pro is our most powerful Voice AI model, designed to capture the “hard stuff” that traditional ASR models struggle with. It delivers state-of-the-art accuracy for entities, rare words, and domain-specific terminology out of the box, with code switching and optional prompting for more control. It’s also our fastest model, so you get the best accuracy without sacrificing speed.
Best for:
Supported languages: en, es, de, fr, pt, it
Universal-3 Pro also supports regional dialects and local speech variants out of the box, with no special configuration needed. See the full list of supported dialects.
Universal-2 offers accurate, cost-effective transcription across 99 languages with low latency. It supports code switching and optional keyterms prompting for domain-specific vocabulary (up to 200 words). Universal-2 is the go-to choice when you need reliable transcription across diverse languages.
Best for:
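Supported languages: en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, de_ch, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Universal-3 Pro Streaming is the most accurate model, with the fastest word emissions, for voice agents that demand the highest quality. It offers best-in-class accuracy with advanced prompting capabilities, including both keyterms prompting and native prompting. It supports English, Spanish, German, French, Portuguese, and Italian.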
Best for:
Supported languages: en, es, de, fr, pt, it
Universal-3 Pro Streaming also supports regional dialects and local speech variants out of the box, with no special configuration needed. See the full list of supported dialects.
Learn more about Universal-3 Pro Streaming
Universal-Streaming Multilingual is a multilingual transcription model that offers a good balance of speed and cost-effectiveness. It supports English, Spanish, German, French, Portuguese, and Italian, and features intelligent endpointing and keyterms prompting support for up to 100 words.
Best for:
Supported languages: en, es, de, fr, pt, it
Learn more about Universal-Streaming Multilingual
Universal-Streaming English is an English transcription model that offers a good balance of speed and cost-effectiveness. It features ~300ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting support for up to 100 words.
Best for:
Supported languages: en
Learn more about Universal-Streaming English
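The keyterms limits quoted on this page (200 words for Universal-2, 100 words for each Universal-Streaming model) can be checked client-side before a request is sent. The sketch below assumes the limit counts whitespace-separated words across all key terms. If the limit turns out to count terms instead, replace the word count with `len(keyterms)`. The dictionary keys are display names from this page, not API identifiers, and `check_keyterms` is a hypothetical helper.

```python
# Limits as quoted on this page, keyed by model display name.
KEYTERM_WORD_LIMITS = {
    "Universal-2": 200,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms):
    """Count whitespace-separated words across all key terms."""
    return sum(len(term.split()) for term in keyterms)


def check_keyterms(model, keyterms):
    """Return the number of words used, or raise if the model's limit is exceeded."""
    limit = KEYTERM_WORD_LIMITS[model]
    used = count_keyterm_words(keyterms)
    if used > limit:
        raise ValueError(f"{model} allows {limit} keyterm words, got {used}")
    return used


if __name__ == "__main__":
    print(check_keyterms("Universal-2", ["atrial fibrillation", "metoprolol"]))
```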
Whisper Streaming is an open-source Whisper model enhanced with AssemblyAI’s reliable infrastructure and unlimited scale. It supports 99+ languages at an accessible price point, with automatic language detection and non-speech tags.
Best for:
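Supported languages: en, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, yue, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Learn more about Whisper Streaming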
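The streaming language lists above can be turned into a small lookup for choosing a model. The following is a minimal sketch, not an official selection rule. It returns display names from this page rather than API identifiers, `prefer_accuracy` is a hypothetical flag standing in for the accuracy versus cost trade-off, and languages outside the six-language set fall through to Whisper Streaming without being checked against its full list.

```python
# Languages listed for Universal-3 Pro Streaming and Universal-Streaming Multilingual.
SIX_LANGUAGE_SET = {"en", "es", "de", "fr", "pt", "it"}


def pick_streaming_model(language_code, prefer_accuracy=True):
    """Suggest a streaming model (by display name) for a language code."""
    if language_code in SIX_LANGUAGE_SET:
        if prefer_accuracy:
            return "Universal-3 Pro Streaming"
        if language_code == "en":
            return "Universal-Streaming English"
        return "Universal-Streaming Multilingual"
    # Not validated here: confirm the code appears in the Whisper Streaming list.
    return "Whisper Streaming"


if __name__ == "__main__":
    for code in ("en", "de", "ja"):
        print(code, "->", pick_streaming_model(code))
```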
For detailed pricing information, visit our pricing page.
Streaming is billed per hour of session duration — the total time your WebSocket connection stays open — not per hour of audio sent. See Streaming Speech-to-Text billing for details.
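To make the distinction concrete, the sketch below computes billable time from when a session opens and closes, ignoring how much audio was actually sent. The hourly rate is a made-up placeholder, not a real price, and `billable_hours` and `session_cost` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def billable_hours(opened_at, closed_at):
    """Hours billed for a session: the time the WebSocket stayed open."""
    if closed_at < opened_at:
        raise ValueError("session cannot close before it opens")
    return (closed_at - opened_at).total_seconds() / 3600


def session_cost(opened_at, closed_at, hourly_rate):
    """Cost of one session at a given hourly rate (rate supplied by the caller)."""
    return billable_hours(opened_at, closed_at) * hourly_rate


if __name__ == "__main__":
    opened = datetime(2025, 1, 1, 9, 0, 0)
    closed = opened + timedelta(minutes=30)  # connection open for 30 minutes
    # Even if only 12 minutes of audio were sent, all 30 minutes are billable.
    print(billable_hours(opened, closed))
    print(session_cost(opened, closed, hourly_rate=0.20))  # placeholder rate
```

In practice this means closing idle connections promptly, since silence on an open connection still counts toward session duration.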
For volume discounts, please reach out to sales@assemblyai.com.