Pricing built for innovation, not contract negotiation

Examine the performance of our Speech AI models across key metrics including accuracy, word error rate, and more.

Pre-recorded Speech-to-Text API

Build Voice AI on the most accurate Speech-to-Text with language detection, formatting, filler words, keyterms prompting, custom spelling, word-level timestamps, and more.

Models	Pay as you go	Custom
Universal-3 Pro Our most accurate speech-to-text model, leading the market in multilingual accuracy on WER, entities, rare words, alphanumerics, and messy speech in real-world audio. Currently supports English, Spanish, German, French, Italian, and Portuguese with more languages coming soon.	$0.21/hr	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Universal-2 Our highly accurate speech-to-text model trained on over 12.5 million hours of audio data. Supports 99 languages. Exceptional accuracy at a lower price.	$0.15/hr

Add-on features	Universal-3 Pro	Universal-2
Keyterms Prompting Provide up to 1000 words or phrases (maximum 6 words per phrase) to improve transcription accuracy.	$0.05/hr	Included
Prompting Beta Control transcription behavior with plain language instructions: provide context, tag audio events, and more.	$0.05/hr	$0.05/hr
Speaker Diarization Detect multiple speakers in audio files and segment the transcript into utterances, showing what each speaker said.	$0.02/hr	$0.02/hr
Medical Mode New Optimize transcription for medical terminology and healthcare conversations with significantly improved accuracy.	$0.15/hr	$0.15/hr
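
As a rough illustration of how the rates above combine, the sketch below adds the base model rate and the selected add-on rates, then multiplies by audio hours. It assumes add-on prices are simply additive per audio hour, which the tables suggest but do not state outright. The names `BASE_RATES`, `ADD_ON_RATES`, and `estimate_cost` are hypothetical and not part of any AssemblyAI SDK.

```python
from decimal import Decimal

# Rates copied from the pre-recorded pricing tables (USD per audio hour).
BASE_RATES = {
    "universal-3-pro": Decimal("0.21"),
    "universal-2": Decimal("0.15"),
}
ADD_ON_RATES = {
    "universal-3-pro": {
        "keyterms": Decimal("0.05"),
        "prompting": Decimal("0.05"),
        "diarization": Decimal("0.02"),
        "medical": Decimal("0.15"),
    },
    "universal-2": {
        "keyterms": Decimal("0"),  # listed as "Included"
        "prompting": Decimal("0.05"),
        "diarization": Decimal("0.02"),
        "medical": Decimal("0.15"),
    },
}


def estimate_cost(model, audio_hours, add_ons=()):
    """Estimate cost in USD, assuming add-on rates are additive per hour."""
    hourly = BASE_RATES[model] + sum(
        (ADD_ON_RATES[model][name] for name in add_ons), Decimal("0")
    )
    return hourly * Decimal(str(audio_hours))


# 10 hours on Universal-3 Pro with keyterms prompting and diarization:
# (0.21 + 0.05 + 0.02) * 10
print(estimate_cost("universal-3-pro", 10, ["keyterms", "diarization"]))  # 2.80
```

Decimal arithmetic is used here so that small per-hour rates do not pick up floating-point rounding noise.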

Ready to start building with voice data?

Get started today with $50 in free credits. No credit card required.

Get your API key

Streaming Speech-to-Text API

Transcribe live audio and video in real time with ultra-low latency and high accuracy. Leverage auto punctuation and casing, next-gen end-of-turn detection, and ITM/formatting.

Models	Pay as you go	Custom
Universal-3 Pro Streaming New The most accurate model for voice agents that demand the highest quality. Best-in-class accuracy with advanced prompting capabilities. Supports English, Spanish, German, French, Portuguese, and Italian.	$0.45/hr	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Universal-Streaming The fastest model for real-time English transcription. Optimized for speed and cost-effectiveness for English-only applications.	$0.15/hr
Universal-Streaming Multilingual Multilingual transcription at the speed and cost of Universal-Streaming. Supports English, Spanish, German, French, Portuguese, and Italian.	$0.15/hr
Whisper-Streaming Open-source Whisper model enhanced with AssemblyAI's reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point.	$0.30/hr

Add-on features	Universal-3 Pro Streaming	Universal-Streaming
Keyterms Prompting Provide up to 100 words or phrases (maximum 6 words per phrase) to improve transcription accuracy.	Included	$0.04/hr
Speaker Diarization Detect multiple speakers in audio files and segment the transcript into utterances, showing what each speaker said.	$0.12/hr	$0.12/hr
Prompting Beta Control transcription behavior with plain language instructions: provide context, tag audio events, and more.	$0.05/hr	Not supported
Medical Mode New Optimize transcription for medical terminology and healthcare conversations with significantly improved accuracy.	$0.15/hr	$0.15/hr

Speech Understanding

AI models that extract meaning from your transcripts. Identify speakers by name, detect sentiment, surface topics, generate summaries, and more.

Models	Pay as you go	Custom
Speaker Identification Identify speakers by their actual names or roles, transforming generic labels like “Speaker A” or “Speaker B” into meaningful identifiers that you provide.	$0.02/hr	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Translation Automatically convert your transcribed audio content from one language to another, enabling you to reach global audiences without manual translation work.	$0.06/hr
Custom Formatting Automatically standardize and format specific types of information in your transcripts, ensuring consistency across dates, phone numbers, emails, and other data types.	$0.03/hr
Entity Detection Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.	$0.08/hr
Sentiment Analysis Detect the sentiment of each sentence of speech spoken in your audio files.	$0.02/hr
Auto Chapters Automatically generate a summary over time for audio and video files.	$0.08/hr
Key Phrases Accurately identify significant words and phrases in your transcription, enabling you to extract the most pertinent concepts or highlights from your audio/video file.	$0.01/hr
Topic Detection Label the topics that are spoken in your audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting.	$0.15/hr
Summarization Automatically summarize audio/video data in your products at scale. Customize the summary types to best fit your use case.	$0.03/hr

Guardrails

Guardrails ensures only high-quality, safe, and compliant content flows through your applications.

Models	Pay as you go	Custom
Profanity Filtering Automatically filter out profanity from your transcripts.	$0.01/hr	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
PII Audio Redaction Identify and remove Personally Identifiable Information (PII), such as phone numbers and social security numbers, from the audio file before it is returned to you.	$0.05/hr
PII Text Redaction Identify and remove Personally Identifiable Information (PII), such as phone numbers and social security numbers, from the transcription text before it is returned to you.	$0.08/hr
Content Moderation Detect sensitive content in your audio and video files, such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.	$0.15/hr

Building for healthcare or finance?

Our compliance-ready plans include HIPAA BAAs, SOC 2 Type II audit reports, and dedicated data processing agreements.

Talk to our team

LLM Gateway

Apply powerful language models directly to your audio data through a single API. Ask questions, generate insights, and build custom workflows all without managing LLM infrastructure.

Models	Input (per 1M tokens)	Output (per 1M tokens)	Custom
GPT-5.2	$1.75 / 1M	$14.00 / 1M	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
GPT-5.1	$1.25 / 1M	$10.00 / 1M
GPT-5	$1.25 / 1M	$10.00 / 1M
GPT-5-Mini	$0.25 / 1M	$2.00 / 1M
GPT-5 Nano	$0.05 / 1M	$0.40 / 1M
GPT 4.1	$2.00 / 1M	$8.00 / 1M
gpt-oss-20b	$0.07 / 1M	$0.30 / 1M
gpt-oss-120b	$0.15 / 1M	$0.60 / 1M
ChatGPT-4o	$5.00 / 1M	$15.00 / 1M

Models	Input (per 1M tokens)	Output (per 1M tokens)	Custom
Claude 4.6 Sonnet	$3.00 / 1M	$15.00 / 1M	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Claude 4.5 Sonnet	$3.00 / 1M	$15.00 / 1M
Claude 4.5 Haiku	$1.00 / 1M	$5.00 / 1M
Claude 4 Sonnet	$3.00 / 1M	$15.00 / 1M
Claude 4.6 Opus	$5.00 / 1M	$25.00 / 1M
Claude 4.5 Opus	$5.00 / 1M	$25.00 / 1M
Claude 4 Opus	$15.00 / 1M	$75.00 / 1M
Claude 3.5 Haiku	$0.80 / 1M	$4.00 / 1M
Claude 3 Haiku	$0.25 / 1M	$1.25 / 1M

Models	Input (per 1M tokens)	Output (per 1M tokens)	Custom
Gemini 3 Pro	$2.00 / 1M	$12.00 / 1M	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Gemini 3 Flash	$0.50 / 1M	$3.00 / 1M
Gemini 2.5 Flash	$0.30 / 1M	$2.50 / 1M
Gemini 2.5 Flash Lite	$0.10 / 1M	$0.40 / 1M
Gemini 2.5 Pro	$1.25 / 1M	$10.00 / 1M

Models	Input (per 1M tokens)	Output (per 1M tokens)	Custom
Qwen3 Next 80B A3B	$0.15 / 1M	$1.20 / 1M	Get custom rate limits, enhanced concurrency, and enterprise-grade flexibility tailored to your AI workloads Contact us
Qwen3 32B	$0.15 / 1M	$0.60 / 1M
Kimi K2.5	$0.60 / 1M	$3.00 / 1M

Security and Privacy

AssemblyAI uses enterprise-grade security practices to keep your data safe. We approach security by design and default, and continuously ensure AssemblyAI is secure for you and your team.

GDPR

PCI DSS

SOC 2 Type 2

EU Data Residency

ISO 27001

HIPAA Compliance

Frequently Asked Questions

What are the differences between Speech-to-Text models?

Universal-3 Pro is our most advanced speech language model, designed specifically for speech tasks. It uses a prompt-based architecture for deeper contextual understanding and allows domain-specific customization, with no retraining needed.

Universal-2 is a high-accuracy model supporting 99 languages, built for general-purpose use cases. It offers strong out-of-the-box performance and supports features like speaker diarization and real-time streaming. Perfect for legal, medical, and other specialized use cases.

Universal-Streaming is an ultra-fast, ultra-accurate streaming speech-to-text model designed for voice agents.

Can I sign up for free?

Yes! With the free offer, you get $50 in credits to use towards AssemblyAI’s Speech-to-Text APIs. To add more credits, simply add a credit card to your account.

Do you offer volume discounts?

Absolutely! If you plan to send large volumes of audio and video content through our API, please reach out to us to see if you qualify for a volume discount.

How does Universal-Streaming concurrency work?

We don't limit how many streams you can run simultaneously - only how quickly you can start new ones, giving you unlimited scale while ensuring reliable performance.

Free users can start 5 new streams per minute, while pay-as-you-go accounts start with 100 new streams per minute. Whenever you are using 70% or more of your current limit, your new-session rate limit automatically increases by 10% every 60 seconds. This means that within 5 minutes of sustained usage, you can scale from 100 to 146 new streams per minute (for a total of 610 concurrent streams), with no ceiling as your usage grows.
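
The 146 and 610 figures can be reproduced with a short sketch of the ramp-up. It assumes the limit grows by 10% once per minute under sustained usage, that fractional limits round down, and that every stream started stays open. These are assumptions made for illustration, not documented behavior, and `new_stream_rates` is a hypothetical helper.

```python
from fractions import Fraction


def new_stream_rates(base_rate, minutes, growth=Fraction(11, 10)):
    """Per-minute new-stream limits, compounding 10% per minute (rounded down)."""
    rates = []
    rate = Fraction(base_rate)
    for _ in range(minutes):
        rates.append(int(rate))  # assumption: fractional limits round down
        rate *= growth
    return rates


rates = new_stream_rates(100, 5)
print(rates)       # [100, 110, 121, 133, 146]
print(sum(rates))  # 610 streams started over 5 minutes, all still open
```

Exact fractions are used instead of floats so the compounding does not drift across rounding boundaries.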

These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.

Need higher limits? Contact our sales team for custom limits that match your deployment timeline.

How does Universal-Streaming session-based pricing work?

We charge based on total session duration - the entire time your connection stays open, whether audio is flowing or not. This gives you complete transparency and control: you pay for exactly what you're using, with no hidden costs for idle streams. You can choose to keep streams open continuously for instant response or open them strategically as needed to minimize costs, scaling up and down without prepaid commitments based on how your voice application actually works.
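
To make the trade-off concrete, here is a minimal sketch that prices a session by how long the connection stays open. It assumes the hourly rate is prorated per second, which this answer does not state explicitly for streaming, and `session_cost` is a hypothetical helper rather than an AssemblyAI API.

```python
from decimal import Decimal


def session_cost(open_seconds, hourly_rate):
    """Cost in USD for a session open for `open_seconds`, prorated per second."""
    return Decimal(open_seconds) * Decimal(hourly_rate) / Decimal(3600)


RATE = "0.15"  # Universal-Streaming pay-as-you-go rate, USD per hour

# One connection held open for an 8-hour shift, regardless of audio sent.
always_open = session_cost(8 * 3600, RATE)

# Forty 3-minute sessions opened only when a caller is on the line.
on_demand = 40 * session_cost(3 * 60, RATE)

print(always_open)  # 1.2
print(on_demand)    # 0.3000
```

Under these assumptions, keeping one stream open all day costs four times as much as opening streams only when needed, in exchange for instant response.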

How long does it take for audio and video files to process?

Most audio files sent to AssemblyAI's API can be processed in less than 60 seconds. For example, you can process a 30-minute pre-recorded audio file in 23 seconds with the Universal speech-to-text model.

How does billing work?

Great question. Once you add a credit card and deposit funds into your account, those funds are drawn down as you use the API.

How is multichannel billed?

When multichannel is enabled, each channel is transcribed and billed separately. To calculate your total cost, multiply your recording's duration by the hourly transcription rate (billed per second), then multiply that result by the number of channels.

For example, if you sent a 5-minute recording with three channels, you would be billed for the 5 minutes of audio multiplied by the standard rate, with that total multiplied by three channels. This is equivalent to being billed for 15 minutes of audio.
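
The same calculation can be written as a small sketch. The function name `multichannel_cost` is hypothetical, and the $0.15/hr rate is used only as an example value.

```python
from decimal import Decimal


def multichannel_cost(duration_seconds, channels, hourly_rate):
    """Return (billed_seconds, cost_usd) when each channel is billed separately."""
    billed_seconds = duration_seconds * channels
    cost = Decimal(billed_seconds) * Decimal(hourly_rate) / Decimal(3600)
    return billed_seconds, cost


# A 5-minute recording with three channels at an example rate of $0.15/hr.
billed_seconds, cost = multichannel_cost(5 * 60, 3, "0.15")
print(billed_seconds // 60)  # 15 minutes billed
print(cost)                  # 0.0375
```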

Can I purchase or use AssemblyAI through the AWS Marketplace?

Yes. You can get started with AssemblyAI on the AWS Marketplace, or ask your AWS account team about how to leverage AssemblyAI to revolutionize the way your company understands its customers.

How can I talk to someone?

Feel free to email us at support@assemblyai.com, or click the chat button in the bottom right corner of your browser to chat live with our API Support team!

What languages do you support?

We support over 99 languages and counting, including Global English (English and all of its accents).

What is a token?

In the context of a Large Language Model (LLM), a “token” is the smallest unit of text processed by the model. 100 tokens map to roughly 75 words.
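
As a rough sketch, the ratio above can be combined with the LLM Gateway table to estimate what a request might cost. It assumes "/ 1M" in the table means per one million tokens and treats the 100-to-75 ratio as exact, so the result is only an approximation. `words_to_tokens` and `gateway_cost` are hypothetical helpers.

```python
from decimal import Decimal


def words_to_tokens(words):
    """Approximate token count using 100 tokens per 75 words, rounded up."""
    return -(-words * 100 // 75)  # ceiling division on integers


def gateway_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD, with both rates given in USD per 1M tokens."""
    total = (
        Decimal(input_tokens) * Decimal(input_rate)
        + Decimal(output_tokens) * Decimal(output_rate)
    )
    return total / Decimal(1_000_000)


# A 7,500-word transcript summarized into 300 words at the GPT-5.1 rates.
input_tokens = words_to_tokens(7500)   # 10000
output_tokens = words_to_tokens(300)   # 400
print(gateway_cost(input_tokens, output_tokens, "1.25", "10.00"))  # 0.0165
```

Real token counts depend on the model's tokenizer and on any prompt text sent alongside the transcript, so actual usage may differ from this estimate.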

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.

Try our API for free Contact sales