Models
AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.
- Universal: Best for out-of-the-box transcription of pre-recorded audio, with multilingual support, excellent accuracy, and low latency
- Slam-1 (Beta): Highest accuracy for transcribing English pre-recorded audio, with fine-tuning support and customization via prompting
- Streaming: Streaming audio transcription optimized for voice agents and real-time applications
Choosing the right model
Universal
- Best for: Production-ready transcription out of the box
- Key benefits:
  - Excellent accuracy-to-latency ratio
  - Multi-language support
  - No configuration needed
  - Ideal for conversational intelligence
Breakdown of Universal language support by word error rate (WER)
High accuracy (≤10% WER)
English, Spanish, French, German, Indonesian, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Turkish, Ukrainian, Catalan
Good accuracy (>10% to ≤25% WER)
Arabic, Azerbaijani, Bulgarian, Bosnian, Mandarin Chinese, Czech, Danish, Greek, Estonian, Finnish, Filipino, Galician, Hindi, Croatian, Hungarian, Korean, Macedonian, Malay, Norwegian Bokmål, Romanian, Slovak, Swedish, Thai, Urdu, Vietnamese
Moderate accuracy (>25% to ≤50% WER)
Afrikaans, Belarusian, Welsh, Persian (Farsi), Hebrew, Armenian, Icelandic, Kazakh, Lithuanian, Latvian, Māori, Marathi, Slovenian, Swahili, Tamil
Fair accuracy (>50% WER)
Albanian, Amharic, Assamese, Bashkir, Basque, Bengali, Breton, Burmese, Faroese, Georgian, Gujarati, Haitian, Hawaiian, Hausa, Javanese, Kannada, Khmer, Lao, Latin, Lingala, Luxembourgish, Malagasy, Malayalam, Maltese, Mongolian, Nepali, Norwegian Nynorsk, Occitan, Punjabi, Pashto, Sanskrit, Serbian, Sindhi, Sinhala, Shona, Somali, Sundanese, Tajik, Tatar, Telugu, Tibetan, Turkmen, Uzbek, Yiddish, Yoruba
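The tiers above are defined by WER thresholds, so it can help to see how a measured WER maps onto them. The sketch below is illustrative only: this page does not say how the WER figures were measured, so the function assumes the standard definition (word-level edit distance divided by the number of reference words) and plain whitespace tokenization with no text normalization. The names `word_error_rate` and `accuracy_tier` are hypothetical helpers, not part of any AssemblyAI SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER fraction (0.10 means 10%) to the tier names listed above."""
    if wer < 0:
        raise ValueError("WER cannot be negative")
    if wer <= 0.10:
        return "High"
    if wer <= 0.25:
        return "Good"
    if wer <= 0.50:
        return "Moderate"
    return "Fair"
```

For example, `accuracy_tier(word_error_rate(reference_text, transcript_text))` gives a rough tier for a transcript of your own audio. Results on your data may differ from the published tiers, since WER depends heavily on audio quality, domain, and how text is normalized before comparison.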
Slam-1 (Beta)
- Best for: English content requiring highest accuracy
- Key benefits:
  - Superior accuracy for English content
  - Fine-tuning support
  - Ideal for domain-specific terminology
Streaming
- Best for: Voice agents and real-time voice applications
- Key benefits:
  - ~300 ms immutable transcripts
  - Continuous speech recognition
  - Intelligent endpointing
  - Ideal for voice agents and interactive applications
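The three model descriptions above can be condensed into a simple decision rule. The following is a minimal sketch of that rule, not an official selection algorithm: the function `choose_model` and its parameters are hypothetical, and the returned strings are the display names used on this page rather than API identifiers.

```python
def choose_model(
    realtime: bool,
    english_only: bool = False,
    needs_highest_accuracy: bool = False,
) -> str:
    """Suggest a model family based on the guidance on this page.

    Assumption: these three questions are enough to separate the models;
    real projects may also weigh cost and latency budgets.
    """
    # Live audio (voice agents, interactive apps) points to Streaming.
    if realtime:
        return "Streaming"
    # Pre-recorded English audio that needs the highest accuracy,
    # fine-tuning, or domain-specific terminology points to Slam-1.
    if english_only and needs_highest_accuracy:
        return "Slam-1 (Beta)"
    # Everything else, including multilingual audio, defaults to Universal.
    return "Universal"
```

Under these assumptions, a multilingual call-center archive would map to Universal, English medical dictation with specialized vocabulary to Slam-1 (Beta), and a live voice agent to Streaming.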
Pricing
For detailed pricing information, visit our pricing page.
For volume discounts, please reach out to sales@assemblyai.com.
Next steps
- For pre-recorded audio, see how to select your model
- For real-time transcription, check out our streaming documentation