Speech-to-text API pricing guide: Per-minute, per-hour and feature costs explained
Speech to text API pricing explained with per-minute, per-hour, feature, and hidden cost comparisons to help you choose the right provider for your needs.



Speech-to-text API pricing extends far beyond simple per-minute rates.
Providers use different billing methods, feature bundles, and accuracy tiers that can dramatically change your final costs. Understanding these pricing mechanics helps you compare options accurately and avoid surprise charges when your usage scales.
Choosing the wrong pricing model can cost you 30-40% more than necessary, especially for applications processing many short audio clips. This guide breaks down how speech-to-text APIs calculate costs, compares major providers across key features, and shows you how to calculate total cost of ownership for your specific use case. You'll learn to evaluate providers based on your accuracy requirements, feature needs, and compliance constraints rather than headline rates alone.
How speech-to-text APIs charge for usage
Several variables determine your bill. The most important is whether a provider uses bundled or unbundled pricing: some charge separately for each feature while others include everything in one rate.
Knowing these billing mechanics lets you compare providers on equal terms. Without that knowledge, you might pick a low advertised rate only to discover extra charges once your usage scales.
Per-second vs per-minute vs block-based billing
The billing unit determines how much you pay for short audio clips. Per-second billing charges for exact usage. Per-minute billing rounds up to the next full minute. Block-based billing rounds up to 15-second increments.
Here's what an 11-second audio clip costs with different billing methods:
- Per-second billing: You pay for exactly 11 seconds
- Per-minute billing: You pay for 60 seconds (445% overhead)
- 15-second blocks: You pay for 15 seconds (36% overhead)
This overhead compounds quickly.
A contact center processing thousands of brief customer interactions could pay 30-40% more with block-based billing compared to per-second pricing.
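The rounding arithmetic above can be sketched in a few lines of Python. The names `billed_seconds`, `overhead_pct`, and `BILLING_UNITS` are illustrative rather than any provider's API, and the sketch assumes clip durations are positive whole seconds.

```python
def billed_seconds(duration_s: int, unit_s: int) -> int:
    """Round a clip duration up to the next whole billing unit."""
    if duration_s <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding surprises.
    units = (duration_s + unit_s - 1) // unit_s
    return units * unit_s


def overhead_pct(duration_s: int, unit_s: int) -> float:
    """Extra billed time as a percentage of the actual audio duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    billed = billed_seconds(duration_s, unit_s)
    return (billed - duration_s) / duration_s * 100


# Billing units in seconds: exact usage, 15-second blocks, full minutes.
BILLING_UNITS = {"per-second": 1, "15-second blocks": 15, "per-minute": 60}

for label, unit in BILLING_UNITS.items():
    print(f"{label}: {billed_seconds(11, unit)}s billed, "
          f"{overhead_pct(11, unit):.0f}% overhead")
```

Running the same functions over a sample of your real clip lengths, rather than a single 11-second example, gives a more realistic picture of rounding overhead for your workload.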
Streaming vs batch processing costs
Real-time streaming transcription costs significantly more than batch processing. This premium reflects the infrastructure needed for sub-300ms latency. AssemblyAI's Universal-3 Pro model costs around $0.21 per hour for batch processing but $0.45 per hour for streaming.
Voice agents and real-time applications have no choice—they must use streaming rates. Batch processing works for podcasts or meeting notes where you can wait minutes for results. The choice isn't just about cost but whether your application can tolerate any delay.
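To put the streaming premium in monthly terms, here is a minimal sketch using the two hourly rates quoted above. The 1,000-hour volume and the `monthly_cost` helper are assumptions for illustration only; check current provider pricing before relying on the rates.

```python
from decimal import Decimal

# Hourly rates quoted in this guide (USD); verify against current pricing.
BATCH_RATE = Decimal("0.21")
STREAMING_RATE = Decimal("0.45")


def monthly_cost(hours: int, rate_per_hour: Decimal) -> Decimal:
    """Audio hours multiplied by the hourly rate."""
    return hours * rate_per_hour


hours = 1_000  # hypothetical monthly volume
batch = monthly_cost(hours, BATCH_RATE)
streaming = monthly_cost(hours, STREAMING_RATE)
premium = streaming - batch

print(f"Batch: ${batch}, streaming: ${streaming}, premium: ${premium}")
```

At these rates streaming costs a little over twice as much as batch for the same volume, so it is worth confirming that the application genuinely needs real-time results.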
Standard vs premium model pricing
Every provider offers multiple model tiers with different accuracy levels and prices. Standard models handle general transcription adequately. Premium models excel at proper nouns, alphanumerics, and specialized terminology.
The premium often pays for itself if you display transcripts to users or need them for compliance. But if you're feeding transcripts to an LLM that tolerates errors, standard models might suffice.
Medical transcription shows the biggest pricing variations—from $0.21 per hour with specialized models to $4-5 per hour with generic premium options.
Bundled vs unbundled feature pricing
Two pricing philosophies dominate the market. Providers like AssemblyAI and Deepgram price features as add-ons. Providers like Gladia bundle features into a single rate.
Unbundled example:
- Base transcription: $0.15/hour
- Speaker diarization: +$0.02/hour
- Sentiment analysis: +$0.01/hour
- PII redaction: +$0.02/hour
- Total: $0.20/hour
Bundled example:
- All features included: $0.35/hour flat rate
Teams using only transcription save money with unbundled models. Teams needing multiple features might find bundled pricing simpler and potentially cheaper.
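The comparison can be made mechanical. This sketch uses only the example figures above; the dictionary keys and function names are hypothetical, and real provider rates will differ.

```python
from decimal import Decimal

# Example figures from the lists above (USD per hour).
UNBUNDLED_BASE = Decimal("0.15")
UNBUNDLED_ADDONS = {
    "speaker_diarization": Decimal("0.02"),
    "sentiment_analysis": Decimal("0.01"),
    "pii_redaction": Decimal("0.02"),
}
BUNDLED_FLAT = Decimal("0.35")


def unbundled_rate(features) -> Decimal:
    """Base rate plus the add-on price for each requested feature."""
    return UNBUNDLED_BASE + sum(
        (UNBUNDLED_ADDONS[name] for name in features), Decimal("0")
    )


def cheaper_option(features) -> str:
    """Return which pricing style costs less per hour for these features."""
    return "unbundled" if unbundled_rate(features) < BUNDLED_FLAT else "bundled"


all_features = list(UNBUNDLED_ADDONS)
print("Transcription only:", unbundled_rate([]))
print("All add-ons:", unbundled_rate(all_features))
print("Cheaper with all add-ons:", cheaper_option(all_features))
```

With these illustrative numbers the unbundled total stays below the flat rate even with every add-on enabled. The comparison only tips toward bundled pricing when add-on prices are higher than in this example, which is why it pays to plug in each provider's actual rates.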
Speech-to-text API pricing comparison
Comparing providers requires looking beyond headline rates to understand what's included and what costs extra. Data privacy requirements can eliminate many options entirely—EU data residency or zero-retention policies narrow your choices significantly.
Several patterns emerge when comparing providers:
- Cloud platform overhead: Google and AWS require additional infrastructure services that add cost
- Feature bundling complexity: Direct price comparison becomes difficult when features are bundled differently
- Volume discounts: Most providers require contacting sales, with discount tiers typically starting at 50,000+ hours monthly
The cheapest advertised rate rarely reflects your actual cost once you add required features and infrastructure.
Calculating total cost of ownership
The cheapest provider isn't always the most cost-effective choice. Your total cost depends on accuracy requirements, feature usage, and infrastructure overhead. A developer building internal tools has different needs than a team displaying transcripts to customers.
Cost scenarios by use case
Different applications demand different accuracy levels and features, dramatically affecting total costs. Here's how requirements drive provider selection:
Startup MVP building a podcast search app: Processing 1,000 hours monthly for basic search functionality. Accuracy matters less since search can handle some errors.
- Recommended: OpenAI Whisper or AssemblyAI Universal-2 for cost efficiency
- Avoid: Premium models with features you don't need
Enterprise contact center with compliance requirements: Processing 10,000 hours monthly where transcripts are reviewed and stored for regulatory purposes.
- Recommended: AssemblyAI Universal-3 Pro for accuracy and included diarization
- Avoid: Cloud platforms that add infrastructure complexity and costs
Healthcare practice transcribing patient consultations: Processing 5,000 hours monthly requiring medical terminology accuracy and regulatory compliance.
- Recommended: AssemblyAI Medical Mode for specialized terminology and Business Associate Agreement
- Avoid: Generic models that struggle with medical terms
High-volume batch processing for LLM training: Processing 50,000+ hours monthly where transcripts feed AI models but aren't displayed to users.
- Recommended: OpenAI Whisper if accuracy trade-offs are acceptable
- Consider: Volume discounts from major providers
Hidden costs beyond per-minute pricing
Cloud platform providers add infrastructure overhead that doesn't appear in headline pricing.
Google Cloud Speech-to-Text pipelines typically rely on Cloud Storage, Cloud Functions, and Pub/Sub messaging, and incur data egress fees. Amazon Transcribe pipelines typically need S3 storage, Lambda functions, and SQS queues.
These services typically add costs:
- Cloud Storage: $20-50 monthly
- Compute functions: $30-80 monthly
- Message queues: $10-30 monthly
- Data egress: $40-100 monthly
API-first providers eliminate these dependencies—you send audio and receive transcripts without managing infrastructure.
Engineering time represents another hidden cost. Setting up production pipelines on cloud platforms takes 20-40 hours for unfamiliar teams. Data residency compliance adds more complexity if you need EU hosting or zero-retention guarantees.
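Adding up the ranges above shows how quickly the overhead accumulates. In this sketch the monthly ranges and the 20-40 hour setup estimate come from the text; the engineering hourly rate is a placeholder assumption you should replace with your own figure.

```python
# Monthly ranges in USD (low, high), taken from the list above.
INFRA_RANGES = {
    "cloud_storage": (20, 50),
    "compute_functions": (30, 80),
    "message_queues": (10, 30),
    "data_egress": (40, 100),
}

# Hypothetical loaded engineering cost per hour; not from the source.
ASSUMED_HOURLY_RATE = 100


def monthly_infra_range(ranges):
    """Sum the low and high ends of every service range."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def setup_cost_range(hourly_rate, hours=(20, 40)):
    """One-time pipeline setup cost for the 20-40 hour estimate."""
    return hours[0] * hourly_rate, hours[1] * hourly_rate


infra_low, infra_high = monthly_infra_range(INFRA_RANGES)
setup_low, setup_high = setup_cost_range(ASSUMED_HOURLY_RATE)
print(f"Recurring infrastructure: ${infra_low}-{infra_high} per month")
print(f"One-time setup at ${ASSUMED_HOURLY_RATE}/hour: ${setup_low}-{setup_high}")
```

The recurring range works out to $100-260 per month on top of transcription fees, before counting the one-time engineering effort.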
How accuracy affects quality assurance costs
Accuracy differences translate directly to labor costs when humans review transcripts. This matters most for displayed transcripts, compliance records, or systems that rely on semantic accuracy.
Consider a support team reviewing transcripts where error rates differ by 3 percentage points. Higher error rates mean more corrections, more review time, and higher labor costs. Some customers report staying with premium providers despite lower-cost alternatives because the reduced correction overhead alone justified the price difference.
The key insight: cost per hour is falling, but cost per correction determines your true expense.
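A rough model makes the point concrete. Every number in this sketch is a placeholder assumption, not measured data: the words per audio hour, the seconds a reviewer spends per fix, the reviewer wage, and the two error rates (chosen only because they differ by 3 percentage points). It also assumes every error is corrected by hand.

```python
def review_cost_per_audio_hour(
    wer_pct: float,
    words_per_hour: int = 9_000,         # assumed: about 150 spoken words per minute
    seconds_per_fix: float = 10,         # assumed reviewer time per correction
    reviewer_hourly_wage: float = 20.0,  # assumed labor cost in USD
) -> float:
    """Estimated labor cost to correct one hour of transcribed audio."""
    errors = words_per_hour * wer_pct / 100
    review_hours = errors * seconds_per_fix / 3600
    return review_hours * reviewer_hourly_wage


# Two hypothetical providers whose error rates differ by 3 points.
lower_error = review_cost_per_audio_hour(5)
higher_error = review_cost_per_audio_hour(8)
print(f"Review cost at 5% errors: ${lower_error:.2f} per audio hour")
print(f"Review cost at 8% errors: ${higher_error:.2f} per audio hour")
print(f"Difference: ${higher_error - lower_error:.2f} per audio hour")
```

Under these assumptions the gap in review labor is far larger than any difference in per-hour transcription rates, which is the pattern to test against your own review workflow.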
Choosing the right pricing model for your needs
Selecting the right provider requires matching your specific requirements to available options. Data privacy requirements can eliminate choices before you even consider pricing.
Provider selection criteria:
- AssemblyAI: Best when accuracy matters, transcripts are displayed, or you need healthcare compliance with cost efficiency
- OpenAI Whisper: Ideal for pure LLM input where accuracy trade-offs are acceptable
- Gladia: Good for teams wanting bundled features with EU data residency
- Cloud platforms: Consider only if already using that cloud ecosystem extensively
Common evaluation mistakes:
- Rate shopping: Choosing based on lowest per-minute rate without calculating total cost
- Free tier assumptions: Underestimating how quickly production volume exhausts free-tier limits
- Accuracy ignorance: Ignoring accuracy until production when fixing errors becomes expensive
- Feature blindness: Not accounting for required features in cost calculations
- Compliance afterthoughts: Discovering too late that data residency requirements eliminate your chosen provider
The right choice balances cost, accuracy, features, and operational complexity for your specific use case.
Final words
Speech-to-text API pricing has evolved beyond simple per-minute rates to complex models involving feature bundles, accuracy tiers, and infrastructure overhead. The cheapest advertised rate might become the most expensive choice once you factor in correction costs, required features, and engineering time. Your evaluation should focus on total cost of ownership for your specific use case rather than headline comparisons.
AssemblyAI's transparent pricing model includes speaker diarization at no extra cost, offers Medical Mode for healthcare applications requiring specialized terminology, and provides an API-first architecture that eliminates infrastructure dependencies. The platform's focus on accuracy reduces post-processing overhead while Business Associate Agreements enable healthcare compliance—delivering predictable costs without surprise charges common with cloud platform providers.
Frequently asked questions
What does base speech-to-text pricing typically include?
Base pricing covers audio-to-text conversion with basic punctuation and capitalization. Most providers charge extra for features like speaker identification, sentiment analysis, or medical terminology recognition.
How much do speaker diarization features cost?
Speaker diarization costs vary significantly by provider—from included with AssemblyAI to $0.36 per hour with Google Cloud. Bundled providers like Gladia include it in their flat rate while unbundled providers add surcharges.
Can I estimate my monthly costs before signing up?
Yes, calculate your expected monthly audio hours, multiply by the provider's rate, add required feature costs, and include infrastructure overhead for cloud platforms. Test with real audio during free trials to verify the provider meets your accuracy requirements.
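The estimate described above reduces to one formula: hours times the combined hourly rate, plus any fixed infrastructure overhead. The function and parameter names below are illustrative, and the sample inputs are hypothetical.

```python
from decimal import Decimal


def estimate_monthly_cost(
    audio_hours: int,
    base_rate: Decimal,
    feature_rates=(),
    infra_overhead: Decimal = Decimal("0"),
) -> Decimal:
    """Hours x (base rate + feature add-ons) + fixed monthly overhead."""
    per_hour = base_rate + sum(feature_rates, Decimal("0"))
    return audio_hours * per_hour + infra_overhead


# Hypothetical inputs: 1,000 hours, a $0.15 base rate, one $0.02 add-on.
api_first = estimate_monthly_cost(1_000, Decimal("0.15"), [Decimal("0.02")])
# Same usage with a hypothetical $100 of monthly cloud infrastructure.
cloud = estimate_monthly_cost(
    1_000, Decimal("0.15"), [Decimal("0.02")], Decimal("100")
)
print(f"API-first estimate: ${api_first}")
print(f"Cloud platform estimate: ${cloud}")
```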
What happens if I exceed my free tier allowance?
Most providers automatically switch to paid billing when you exceed limits. Some pause processing until you add payment details. Set up usage monitoring and billing alerts before reaching limits to avoid service interruptions.
When should I choose the cheapest provider available?
Choose cost-optimized providers when transcripts aren't displayed to users, downstream AI processing can tolerate errors, and cost is your primary constraint. Avoid them for customer-facing applications, compliance use cases, or specialized terminology requirements.
Do volume discounts require long-term contracts?
Volume discount structures vary—some providers offer automatic pay-as-you-go tiers while others require annual commitments for meaningful savings. Ask specifically whether discounts apply to committed usage versus actual consumption and how they reset over time.
How do I handle EU data residency requirements?
EU data residency eliminates many providers from consideration. Gladia offers native EU hosting while major cloud providers support regional data processing. Verify compliance requirements early since retrofitting data residency is complex and expensive.