Simple transparent pricing

Free

Start building with $50 of free credits

For developers looking to prototype with Speech AI

Access to industry-leading Speech-to-Text and Audio Intelligence models
Transcribe up to 185 hours of pre-recorded audio for free
Transcribe up to 333 hours of streaming audio for free
Up to 5 new streams per minute
Get tips and support as you build from developer docs and community resources

Pay as you go

Custom

The most flexible plan for scaling AI in production

For teams and organizations building AI products at scale

Flexible, zero-obligation pricing that scales to millions of hours
Unlimited concurrent streams
Customize rate limits - scale to any workload
Dedicated technical support with response time under one hour
Customized SLAs and SLOs
BAA for HIPAA Compliance
Compliance with EU Data Residency standards
Self-hosted deployments (On-prem, VPC) (coming soon!)
Early access to new models and model improvements
Available through AWS Marketplace

Pre-recorded Speech-to-Text

Build on top of the most accurate Speech-to-Text model on the market with >93% accuracy

Models	Free	Pay as you go	Custom
Slam-1 Highest accuracy for English with fine-tuning support and customization via prompting	Free up to $50	$0.27 /hr	Lower rates based on volume
Universal Fast, lightweight Speech AI at an accessible price point. Best for out-of-the-box transcription with excellent accuracy and low latency	Free up to $50	$0.27 /hr	Lower rates based on volume
Features
Speaker Diarization Automatically detect the number of speakers in your audio file and associate each word in the transcription text with its speaker.
Automatic Language Detection Automatically detect if the dominant language of the spoken audio is supported by our API and route it to the appropriate model for transcription.
Profanity Filtering Automatically detect and replace profanity in the transcription text.
Custom Vocabulary Only available in Nano. Boost accuracy for vocabulary that is unique or custom to your specific use case or product.
Keyterm Prompting Only available in our Slam-1 model. Provide up to 1000 domain-specific words or phrases that may appear in your audio.
Multichannel Transcribe audio files with multiple speakers separately.
Filler Word Filtering Optionally include disfluencies in the transcripts of your audio files.
Custom Spelling Specify how you would like certain words to be spelled or formatted in the transcription text.
Word Timestamps Word-by-word timestamps across the entire transcript text.
Auto Punctuation and Casing Automatically add punctuation and proper-noun casing to the transcription text.
ITN/Formatting Automatically convert spoken form text into its proper written format to increase transcript readability.
Confidence Scores Get a confidence score for each word in the transcript.
Word Search Search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information.
Export SRT/VTT Captions Export completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos.
Export Paragraphs/Sentences Retrieve transcripts that are automatically segmented into paragraphs or sentences, for a more reader-friendly experience.

Streaming Speech-to-Text

Transcribe live audio and video in real time with ultra-low latency and high accuracy

Model	Free	Pay as you go	Custom
Universal-Streaming Ultra-fast, ultra-accurate, built-in turn detection, and unlimited concurrency	Free up to $50	$0.15 /hr	Lower rates based on volume
Features
Auto Punctuation and Casing Automatically add punctuation and proper-noun casing to the transcription text.
End of Turn Detection Combines acoustic and semantic features with traditional silence detection for faster, more accurate end-of-turn detection with reliable fallback.
ITN/Formatting Automatically convert spoken form text into its proper written format to increase transcript readability.

Speech Understanding

Gain maximum value from voice data with audio intelligence models, and leverage LLM capabilities with LeMUR

LeMUR Models	Free	Pay as you go	Custom
Claude 3.7 Sonnet The newest and most advanced model featuring enhanced reasoning capabilities. Strong at complex reasoning tasks.		$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)	$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)
Claude 3.5 Sonnet A mid-tier upgrade balancing power and performance.		$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)	$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)
Claude 3.5 Haiku The fastest model in the family, optimized for quick responses while maintaining good reasoning.		$0.0008 / 1k tokens (Input) $0.004 / 1k tokens (Output)	$0.0008 / 1k tokens (Input) $0.004 / 1k tokens (Output)
Claude 3 Opus Claude 3 Opus is good at handling complex analysis, longer tasks with many steps, and higher-order math and coding tasks.		$0.015 / 1k tokens (Input) $0.075 / 1k tokens (Output)	$0.015 / 1k tokens (Input) $0.075 / 1k tokens (Output)
Claude 3 Haiku Claude 3 Haiku is the fastest model that can execute lightweight actions.		$0.00025 / 1k tokens (Input) $0.00125 / 1k tokens (Output)	$0.00025 / 1k tokens (Input) $0.00125 / 1k tokens (Output)
Claude 3 Sonnet Claude 3 Sonnet is a legacy model with a balanced combination of performance and speed for efficient, high-throughput tasks.		$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)	$0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output)
Audio Intelligence Features
Entity Detection Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.	Included in free credits	$0.08 /hr	Lower rates based on volume
Topic Detection Label the topics that are spoken in your audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting.	Included in free credits	$0.15 /hr	Lower rates based on volume
Key Phrases Accurately identify significant words and phrases in your transcription, enabling you to extract the most pertinent concepts or highlights from your audio/video file.	Included in free credits	$0.01 /hr	Lower rates based on volume
PII Audio Redaction	Included in free credits	$0.05 /hr	Lower rates based on volume
PII Redaction Identify and remove Personally Identifiable Information, such as phone numbers and social security numbers, from the transcription text before it is returned to you.	Included in free credits	$0.08 /hr	Lower rates based on volume
Sentiment Analysis With Sentiment Analysis, AssemblyAI can detect the sentiment of each sentence of speech spoken in your audio files.	Included in free credits	$0.02 /hr	Lower rates based on volume
Content Moderation Detect sensitive content in your audio and video files - such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.	Included in free credits	$0.15 /hr	Lower rates based on volume
Auto Chapters Automatically generate a summary over time for audio and video files.	Included in free credits	$0.08 /hr	Lower rates based on volume
Summarization Leverage our AI-powered Summarization models to automatically summarize audio/video data in your products at scale. Customize the summary types to best fit your use case.	Included in free credits	$0.03 /hr	Lower rates based on volume

Rate Limits

	Free	Pay as you go	Custom
Hours of pre-recorded audio	Up to 185 hours	Unlimited	Unlimited
Hours of streaming audio	Up to 333 hours	Unlimited	Unlimited
Pre-recorded concurrency	5 files	Starts at 200 files	Unlimited
Streaming concurrency	5 new streams per minute	Starts at 128 streams per minute	Unlimited
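
The free-tier hour caps above appear to follow from dividing the $50 credit by the pay-as-you-go hourly rates listed on this page. A minimal Python sketch of that arithmetic, assuming credits are simply spent at the listed rate (the name `free_hours` is illustrative and not part of any AssemblyAI SDK):

```python
from decimal import Decimal

FREE_CREDIT_USD = Decimal("50")

# Pay-as-you-go rates copied from the tables on this page (USD per hour).
RATES_PER_HOUR = {
    "pre-recorded": Decimal("0.27"),
    "streaming": Decimal("0.15"),
}

def free_hours(rate_per_hour: Decimal, credit: Decimal = FREE_CREDIT_USD) -> int:
    """Whole hours of audio the credit covers at the given hourly rate."""
    return int(credit // rate_per_hour)

for name, rate in RATES_PER_HOUR.items():
    print(f"{name}: {free_hours(rate)} hours")
```

Adding Audio Intelligence features or LeMUR usage draws on the same credit, so the hours available in practice would be lower.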

Security and Privacy

	Free	Pay as you go	Custom
GDPR
PCI-DSS
SOC 2 Type 1/Type 2
EU Data Residency
ISO 27001
BAA for HIPAA Compliance

Frequently Asked Questions

What are the differences between Speech-to-Text models?

Universal is a high-accuracy English model built for general-purpose use cases. It offers strong out-of-the-box performance and supports features like speaker diarization and real-time streaming.

Slam-1 is our most advanced speech language model, designed specifically for speech tasks. It uses a prompt-based architecture for deeper contextual understanding and allows domain-specific customization with no retraining needed, which makes it well suited to legal, medical, and other specialized use cases.

Universal-Streaming is an ultra-fast, ultra-accurate streaming speech-to-text model designed for voice agents.

Can I sign up for free?

Yes! With the free offer, you get $50 in credits to use towards AssemblyAI’s Speech-to-Text APIs. To add more credits, simply add a credit card to your account.

Do you offer volume discounts?

Absolutely! If you plan to send large volumes of audio and video content through our API, please reach out to our sales team to see if you qualify for a volume discount.

How does Universal-Streaming concurrency work?

We don't limit how many streams you can run simultaneously - only how quickly you can start new ones, giving you unlimited scale while ensuring reliable performance.

Free users can start 5 new streams per minute, while pay-as-you-go accounts start with a limit of 128 new streams per minute that automatically grows by 10% each minute you're at capacity. This means within 5 minutes of sustained usage, you can scale from 128 to 776+ new streams per minute, with an unlimited ceiling as your usage grows.

These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.

Need higher limits? Contact our sales team for custom limits that match your deployment timeline.

How does Universal-Streaming session-based pricing work?

We charge based on total session duration - the entire time your connection stays open, whether audio is flowing or not. This gives you complete transparency and control: you pay for exactly what you're using, with no hidden costs for idle streams. You can choose to keep streams open continuously for instant response or open them strategically as needed to minimize costs, scaling up and down without prepaid commitments based on how your voice application actually works.
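
As a rough illustration of session-based billing, the sketch below prices a session purely on how long the connection stayed open, using the $0.15 /hr pay-as-you-go rate listed on this page. It assumes per-second granularity with no rounding or minimum charge, which this page does not confirm for streaming; `session_cost` is a hypothetical helper, not an AssemblyAI API.

```python
from decimal import Decimal

# Pay-as-you-go Universal-Streaming rate listed on this page (USD per hour).
STREAMING_RATE_PER_HOUR = Decimal("0.15")

def session_cost(open_seconds: int,
                 rate_per_hour: Decimal = STREAMING_RATE_PER_HOUR) -> Decimal:
    """Cost in USD of one session, based only on how long it stayed open."""
    # Assumption: billing is proportional to open time, with no rounding.
    return rate_per_hour * open_seconds / Decimal(3600)

# A connection held open for 10 minutes costs the same whether it carried
# 10 minutes of speech or 2 minutes of speech and 8 minutes of silence.
print(session_cost(10 * 60))
```

Under these assumptions, closing idle connections is the only lever for reducing cost, which is the trade-off against keeping streams open for instant response.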

How long does it take for audio and video files to process?

Most audio files sent to AssemblyAI's API can be processed in less than 60 seconds. For example, you can process a 30-minute pre-recorded audio file in 23 seconds with the Universal speech-to-text model.

How does billing work?

Once you add a credit card and deposit funds into your account, usage charges are deducted from your balance as you use the API.

How is multichannel billed?

When multichannel is enabled, each channel is transcribed and billed separately. The total cost is your recording's duration multiplied by the hourly transcription rate (billed per second), multiplied by the number of channels.

For example, if you send a 5-minute recording with three channels, you are billed for 5 minutes of audio at the standard rate, multiplied by three channels. This is equivalent to being billed for 15 minutes of audio.
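
The multichannel calculation can be sketched in Python as follows. The function name `multichannel_cost` is illustrative only, and the $0.27 /hr figure is the pay-as-you-go pre-recorded rate listed on this page; your actual rate may differ.

```python
from decimal import Decimal

def multichannel_cost(duration_seconds: int, channels: int,
                      rate_per_hour: Decimal) -> Decimal:
    """Cost in USD when every channel is billed separately, per second."""
    billed_seconds = duration_seconds * channels
    return rate_per_hour * billed_seconds / Decimal(3600)

# The example from the text: a 5-minute recording with three channels.
cost = multichannel_cost(5 * 60, 3, Decimal("0.27"))

# Billing three channels of 5 minutes matches one channel of 15 minutes.
assert cost == multichannel_cost(15 * 60, 1, Decimal("0.27"))
print(cost)
```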

Can I purchase or use AssemblyAI through the AWS Marketplace?

Yes. You can get started with AssemblyAI on the AWS Marketplace, or ask your AWS account team about how to leverage AssemblyAI to revolutionize the way your company understands its customers.

How can I talk to someone?

Feel free to email us at support@assemblyai.com, or click the chat button in the bottom right corner of your browser to chat live with our API Support team!

What languages do you support?

We support over 99 languages and counting, including Global English (English and all of its accents).

What is a token?

In the context of a Large Language Model (LLM), a “token” is the smallest unit of text processed by the model. As a rough guide, 100 tokens correspond to about 75 words.
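
To get a ballpark LeMUR cost from word counts, the 100-tokens-per-75-words rule of thumb can be combined with the per-1k-token rates in the table above. This is only an estimate: real token counts depend on the model's tokenizer, and the helper names below are hypothetical rather than part of any SDK.

```python
from decimal import Decimal

WORDS_PER_100_TOKENS = 75  # rule of thumb from this page, not an exact ratio

def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a number of words, rounded up."""
    return -((-word_count * 100) // WORDS_PER_100_TOKENS)

def estimate_lemur_cost(input_words: int, output_words: int,
                        input_rate_per_1k: Decimal,
                        output_rate_per_1k: Decimal) -> Decimal:
    """Estimated cost in USD from word counts and per-1k-token rates."""
    input_cost = estimate_tokens(input_words) * input_rate_per_1k
    output_cost = estimate_tokens(output_words) * output_rate_per_1k
    return (input_cost + output_cost) / Decimal(1000)

# Claude 3.5 Haiku rates from the table above: $0.0008 input, $0.004 output.
print(estimate_lemur_cost(7500, 150, Decimal("0.0008"), Decimal("0.004")))
```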

Turn voice data into unparalleled product experiences

Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.
