Simple, transparent pricing
Free
Start building with $50 of free credits
- Access to industry-leading Speech-to-Text and Audio Intelligence models
- Transcribe up to 185 hours of pre-recorded audio for free
- Transcribe up to 333 hours of streaming audio for free
- Up to 5 new streams per minute
- Get tips and support as you build from developer docs and community resources
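The free-tier hour figures above follow from dividing the $50 credit by the per-hour rates listed further down this page. A minimal sketch of that arithmetic, assuming a flat hourly rate with no other charges applied (the function name is illustrative, not part of any SDK):

```python
# Rates are taken from the pricing tables on this page (USD per audio hour).
FREE_CREDITS_USD = 50.00
PRE_RECORDED_RATE = 0.27  # Slam-1 / Universal
STREAMING_RATE = 0.15     # Universal-Streaming


def hours_covered(credits_usd: float, rate_per_hour: float) -> int:
    """Whole hours of audio a credit balance covers at a flat hourly rate."""
    return int(credits_usd / rate_per_hour)


print(hours_covered(FREE_CREDITS_USD, PRE_RECORDED_RATE))  # 185
print(hours_covered(FREE_CREDITS_USD, STREAMING_RATE))     # 333
```

Enabling paid add-on features would draw on the same credits, so the hours covered in practice may be lower.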
Pay as you go
Start as low as $0.15/hr for Streaming Speech-to-Text
- Unlimited access to Speech-to-Text, Audio Intelligence, and LeMUR
- Start at 128 concurrent streams with a 10% ramp per minute
- Technical support via live chat and email
- Pre-recorded concurrency starting at 200 files
Custom
The most flexible plan for scaling AI in production
- Flexible, zero-obligation pricing that scales to millions of hours
- Unlimited concurrent streams
- Customize rate limits - scale to any workload
- Dedicated technical support with response time under one hour
- Customized SLAs and SLOs
- BAA for HIPAA Compliance
- Compliance with EU Data Residency standards
- Self-hosted deployments (On-prem, VPC) (coming soon!)
- Early access to new models and model improvements
- Available through AWS Marketplace
Pre-recorded Speech-to-Text
Build on top of the most accurate Speech-to-Text model on the market with >93% accuracy
Models | Free | Pay as you go | Custom |
---|---|---|---|
Slam-1 Highest accuracy for English with fine-tuning support and customization via prompting | Free up to $50 | $0.27 /hr | Lower rates based on volume |
Universal Fast, lightweight Speech AI at an accessible price point. Best for out-of-the-box transcription with excellent accuracy and low latency. | Free up to $50 | $0.27 /hr | Lower rates based on volume |
Features | |||
Speaker Diarization Automatically detect the number of speakers in your audio file, and each word in the transcription text can be associated with its speaker. | |||
Automatic Language Detection Automatically detect if the dominant language of the spoken audio is supported by our API and route it to the appropriate model for transcription. | |||
Profanity Filtering Automatically detect and replace profanity in the transcription text. | |||
Custom Vocabulary Only available in Nano. Boost accuracy for vocabulary that is unique or custom to your specific use case or product. | |||
Keyterm Prompting Only available in our Slam-1 model. Provide up to 1000 domain-specific words or phrases that may appear in your audio. | |||
Multichannel Transcribe audio files with multiple speakers separately. | |||
Filler Word Filtering Optionally include disfluencies in the transcripts of your audio files. | |||
Custom Spelling Specify how you would like certain words to be spelled or formatted in the transcription text. | |||
Word Timestamps Word-by-word timestamps across the entire transcript text. | |||
Auto Punctuation and Casing Automatically add punctuation and casing (including proper nouns) to the transcription text. |
Streaming Speech-to-Text
Transcribe live audio and video in real time with ultra-low latency and high accuracy
Model | Free | Pay as you go | Custom |
---|---|---|---|
Universal-Streaming Ultra-fast, ultra-accurate, built-in turn detection, and unlimited concurrency | Free up to $50 | $0.15 /hr | Lower rates based on volume |
Features | |||
Auto Punctuation and Casing Automatically add punctuation and casing (including proper nouns) to the transcription text. | |||
End of Turn Detection Combines acoustic and semantic features with traditional silence detection for faster, more accurate end-of-turn detection with reliable fallback. | |||
ITN/Formatting Automatically convert spoken form text into its proper written format to increase transcript readability. |
Speech Understanding
Gain maximum value from voice data with audio intelligence models, and leverage LLM capabilities with LeMUR
LeMUR Models | Free | Pay as you go | Custom |
---|---|---|---|
Claude 3.7 Sonnet The newest and most advanced model featuring enhanced reasoning capabilities. Strong at complex reasoning tasks. | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | |
Claude 3.5 Sonnet A mid-tier upgrade balancing power and performance. | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | |
Claude 3.5 Haiku The fastest model in the family, optimized for quick responses while maintaining good reasoning. | $0.0008 / 1k tokens (Input) $0.004 / 1k tokens (Output) | $0.0008 / 1k tokens (Input) $0.004 / 1k tokens (Output) | |
Claude 3 Opus Claude 3 Opus is good at handling complex analysis, longer tasks with many steps, and higher-order math and coding tasks. | $0.015 / 1k tokens (Input) $0.075 / 1k tokens (Output) | $0.015 / 1k tokens (Input) $0.075 / 1k tokens (Output) | |
Claude 3 Haiku Claude 3 Haiku is the fastest model that can execute lightweight actions. | $0.00025 / 1k tokens (Input) $0.00125 / 1k tokens (Output) | $0.00025 / 1k tokens (Input) $0.00125 / 1k tokens (Output) | |
Claude 3 Sonnet Claude 3 Sonnet is a legacy model with a balanced combination of performance and speed for efficient, high-throughput tasks. | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | $0.003 / 1k tokens (Input) $0.015 / 1k tokens (Output) | |
Audio Intelligence Features | |||
Entity Detection Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations. | Included in free credits | $0.08 /hr | Lower rates based on volume |
Topic Detection Label the topics that are spoken in your audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting. | Included in free credits | $0.15 /hr | Lower rates based on volume |
Key Phrases Accurately identify significant words and phrases in your transcription, enabling you to extract the most pertinent concepts or highlights from your audio/video file. | Included in free credits | $0.01 /hr | Lower rates based on volume |
PII Audio Redaction | Included in free credits | $0.05 /hr | Lower rates based on volume |
PII Redaction Identify and remove Personally Identifiable Information, such as phone numbers and social security numbers, from the transcription text before it is returned to you. | Included in free credits | $0.08 /hr | Lower rates based on volume |
Sentiment Analysis With Sentiment Analysis, AssemblyAI can detect the sentiment of each sentence of speech spoken in your audio files. | Included in free credits | $0.02 /hr | Lower rates based on volume |
Content Moderation Detect sensitive content in your audio and video files - such as hate speech, violence, sensitive social issues, alcohol, drugs, and more. | Included in free credits | $0.15 /hr | Lower rates based on volume |
Auto Chapters Automatically generate a summary over time for audio and video files. | Included in free credits | $0.08 /hr | Lower rates based on volume |
Summarization Leverage our AI-powered Summarization models to automatically summarize audio/video data in your products at scale. Customize the summary types to best fit your use case. | Included in free credits | $0.03 /hr | Lower rates based on volume |
Frequently Asked Questions
Universal is a high-accuracy English model built for general-purpose use cases. It offers strong out-of-the-box performance and supports features like speaker diarization and real-time streaming. Slam-1 is our most advanced speech language model, designed specifically for speech tasks. It uses a prompt-based architecture for deeper contextual understanding and allows domain-specific customization—no retraining needed. Perfect for legal, medical, and other specialized use cases. Universal-Streaming is an ultra-fast, ultra-accurate streaming speech-to-text model designed for voice agents.
Yes! With the free offer, you get $50 in credits to use towards AssemblyAI’s Speech-to-Text APIs. To add more credits, simply add a credit card to your account.
Absolutely! If you plan to send large volumes of audio and video content through our API, please reach out to us to see if you qualify for a volume discount.
We don't limit how many streams you can run simultaneously - only how quickly you can start new ones, giving you unlimited scale while ensuring reliable performance.
Free users can start 5 new streams per minute, while pay-as-you-go accounts start with a limit of 128 new streams per minute that automatically grows by 10% each minute you're at capacity. This means within 5 minutes of sustained usage, you can scale from 128 to 776+ new streams per minute, with an unlimited ceiling as your usage grows.
These limits are designed to never interfere with legitimate applications - normal scaling patterns automatically get more capacity before hitting any walls, while protecting against runaway scripts or abuse. Your baseline limit is guaranteed and never decreases, so you can scale smoothly from dozens to thousands of simultaneous streams without artificial barriers or surprise fees.
Need higher limits? Contact our sales team for custom limits that match your deployment timeline.
We charge based on total session duration - the entire time your connection stays open, whether audio is flowing or not. This gives you complete transparency and control: you pay for exactly what you're using, with no hidden costs for idle streams. You can choose to keep streams open continuously for instant response or open them strategically as needed to minimize costs, scaling up and down without prepaid commitments based on how your voice application actually works.
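To make the trade-off between always-open and on-demand streams concrete, here is a rough sketch of session-duration billing. It assumes the $0.15/hr Universal-Streaming rate from this page and linear proration by the second; the proration granularity for streaming is an assumption, and `session_cost` is a hypothetical helper, not an API call.

```python
STREAMING_RATE_PER_HOUR = 0.15  # Universal-Streaming pay-as-you-go rate from this page


def session_cost(open_seconds: float,
                 rate_per_hour: float = STREAMING_RATE_PER_HOUR) -> float:
    """Cost of one session, based on how long the connection stays open.

    Assumes linear proration; idle time counts the same as time with audio.
    """
    return open_seconds / 3600 * rate_per_hour


# One connection held open for an 8-hour shift, whether or not audio flows.
always_on = session_cost(8 * 3600)

# Twenty 3-minute sessions opened only when a caller is on the line.
on_demand = 20 * session_cost(3 * 60)

print(f"always on: ${always_on:.2f}, on demand: ${on_demand:.2f}")
```

Under these assumptions the always-open connection costs about $1.20 and the on-demand pattern about $0.15 for the same hour of actual audio.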
Most audio files sent to AssemblyAI's API can be processed in less than 60 seconds. For example, you can process a 30-minute pre-recorded audio file in 23 seconds with the Universal speech-to-text model.
Great question. Once you add a credit card and deposit funds into your account, your balance is drawn down as you use the API.
When multichannel is enabled, each channel will be transcribed and billed separately. The total cost is calculated by taking the hourly transcription rate (billed per second) and multiplying it by the number of channels. To calculate your total cost, simply multiply your recording's duration by the hourly rate, then multiply that result by the number of channels.
For example, if you sent a 5-minute recording with three channels, you would be billed for the 5 minutes of audio multiplied by the standard rate, with that total multiplied by three channels. This is equivalent to being billed for 15 minutes of audio.
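The multichannel calculation described above can be sketched as follows. The $0.27/hr default is the pre-recorded pay-as-you-go rate from this page; the function name and return shape are illustrative assumptions.

```python
def multichannel_cost(duration_seconds: int, channels: int,
                      rate_per_hour: float = 0.27) -> tuple[int, float]:
    """Return (billed_seconds, cost_usd) for a multichannel recording.

    Each channel is transcribed and billed separately, so the billed
    duration is the recording length multiplied by the channel count.
    """
    billed_seconds = duration_seconds * channels
    cost_usd = billed_seconds / 3600 * rate_per_hour
    return billed_seconds, cost_usd


# The example from the text: a 5-minute recording with three channels.
billed, cost = multichannel_cost(5 * 60, channels=3)
print(billed // 60, "minutes billed")  # 15 minutes billed
print(f"${cost:.4f}")                  # $0.0675
```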
You can also get started with AssemblyAI on the AWS Marketplace—or ask your AWS account team about how to leverage AssemblyAI to revolutionize the way your company understands its customers.
Feel free to email us at support@assemblyai.com, or click the chat button in the bottom right corner of your browser to chat live with our API Support team!
We support over 99 languages and counting, including Global English (English and all of its accents).
In the context of a Large Language Model (LLM), a “token” is the smallest unit of text processed by the model. As a rule of thumb, 100 tokens map to roughly 75 words.
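Combining the words-to-tokens rule of thumb with the per-1k-token LeMUR prices listed on this page gives a quick cost estimate. This is only a sketch: real token counts depend on the model's tokenizer, the price dictionary copies two rows from the table above, and the helper names are hypothetical.

```python
# USD per 1k tokens (input, output), copied from the LeMUR table on this page.
LEMUR_PRICES = {
    "Claude 3.5 Sonnet": (0.003, 0.015),
    "Claude 3.5 Haiku": (0.0008, 0.004),
}


def words_to_tokens(words: int) -> int:
    """Rule of thumb from the text: 100 tokens is roughly 75 words."""
    return round(words * 100 / 75)


def estimate_lemur_cost(model: str, input_words: int, output_words: int) -> float:
    """Approximate request cost in USD from word counts."""
    input_price, output_price = LEMUR_PRICES[model]
    input_tokens = words_to_tokens(input_words)
    output_tokens = words_to_tokens(output_words)
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


# A 7,500-word transcript (about 10,000 tokens) summarized in about 750 words.
print(f"${estimate_lemur_cost('Claude 3.5 Haiku', 7500, 750):.4f}")   # $0.0120
print(f"${estimate_lemur_cost('Claude 3.5 Sonnet', 7500, 750):.4f}")  # $0.0450
```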
Turn voice data into unparalleled product experiences
Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.
