July 15, 2026

Speech-to-text API pricing guide: Per-minute, per-hour and feature costs explained

Speech-to-text API pricing explained, with per-minute, per-hour, feature, and hidden cost comparisons to help you choose the right provider for your needs.

Kelsey Foster

Growth

Speech-to-Text


Speech-to-text API pricing extends far beyond simple per-minute rates.

Providers use different billing methods, feature bundles, and accuracy tiers that can dramatically change your final costs. Understanding these pricing mechanics helps you compare options accurately and avoid surprise charges when your usage scales.

Choosing the wrong pricing model can cost you 30-40% more than necessary, especially for applications processing many short audio clips. This guide breaks down how speech-to-text APIs calculate costs, compares major providers across key features (with current rates as of July 2026), and shows you how to calculate total cost of ownership for your specific use case. You'll learn to evaluate providers based on your accuracy requirements, feature needs, and compliance constraints rather than headline rates alone.

How speech-to-text APIs charge for usage

Speech-to-text APIs calculate your bill from several variables: the billing unit, the processing mode, the model tier, and how features are priced. The most important factor is whether a provider uses bundled or unbundled pricing: some charge separately for each feature, while others include everything in one rate.

Understanding these billing mechanics helps you compare providers accurately. Without this knowledge, you might choose based on a low advertised rate only to discover hidden charges when your usage scales.

Per-second vs per-minute vs block-based billing

The billing unit determines how much you pay for short audio clips. Per-second billing charges for exact usage. Per-minute billing rounds each request up to the next full minute. Block-based billing rounds up to the next fixed increment, such as 15 seconds.

Here's what an 11-second audio clip costs with different billing methods:

Per-second billing: You pay for exactly 11 seconds
Per-minute billing: You pay for 60 seconds (445% overhead)
15-second blocks: You pay for 15 seconds (36% overhead)

This overhead compounds quickly.

A contact center processing thousands of brief customer interactions of around this length could pay 30-40% more with 15-second block billing than with per-second pricing. AssemblyAI bills per second with no minimums, so you pay for what you actually send.
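
The rounding arithmetic behind the 11-second example can be reproduced with a short sketch. The function names are illustrative, and rounding up to whole billing units is an assumption about how each method works rather than any provider's documented behavior.

```python
import math

def billed_seconds(duration_s: float, unit_s: int) -> int:
    """Round a clip's duration up to the next multiple of the billing unit."""
    return math.ceil(duration_s / unit_s) * unit_s

def overhead_pct(duration_s: float, unit_s: int) -> float:
    """Extra billed time as a percentage of the actual audio duration."""
    return (billed_seconds(duration_s, unit_s) - duration_s) / duration_s * 100

clip = 11  # seconds, the example clip from the text
for name, unit in [("per-second", 1), ("15-second blocks", 15), ("per-minute", 60)]:
    print(f"{name}: billed {billed_seconds(clip, unit)}s, "
          f"overhead {overhead_pct(clip, unit):.0f}%")
```

Running it over your own distribution of clip lengths, rather than a single 11-second clip, gives a more realistic view of how much rounding adds to a bill.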

Streaming, sync, and batch processing costs

Real-time streaming transcription costs more than batch processing. This premium reflects the infrastructure needed for sub-second latency. AssemblyAI's Universal-3.5 Pro costs $0.21 per hour for batch (async) processing, while its streaming counterpart, Universal-3.5 Pro Realtime, is $0.45 per hour base.

There's now a third option for short clips. AssemblyAI launched its Sync API on July 14, 2026: one HTTP POST returns a finished Universal-3.5 Pro transcript in roughly 134 ms at the median, with no polling, WebSocket, or job to track. It's priced at $0.45/hr, the same as Realtime, because you're paying for low latency on short-form audio like dictation and voice commands.

Voice agents and real-time applications have no choice: they must pay streaming rates. Batch processing works for podcasts or meeting notes, where you can wait minutes for results. The choice isn't just about cost; it's about whether your application can tolerate any delay. If you're weighing the trade-offs, our guide to batch vs real-time transcription breaks it down.
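
If only part of a workload needs low-latency results, the two rates can be blended. The sketch below uses the $0.21 and $0.45 per-hour figures quoted above; the 1,000-hour volume, the split between modes, and the function name are hypothetical.

```python
# USD per hour, taken from the rates quoted above.
BATCH_RATE = 0.21        # Universal-3.5 Pro, async
LOW_LATENCY_RATE = 0.45  # Universal-3.5 Pro Realtime or Sync

def blended_monthly_cost(total_hours: float, low_latency_share: float) -> float:
    """Monthly cost when only part of the audio needs low-latency results."""
    if not 0.0 <= low_latency_share <= 1.0:
        raise ValueError("low_latency_share must be between 0 and 1")
    low_latency_hours = total_hours * low_latency_share
    batch_hours = total_hours - low_latency_hours
    return round(batch_hours * BATCH_RATE + low_latency_hours * LOW_LATENCY_RATE, 2)

hours = 1_000  # hypothetical monthly volume
for share in (0.0, 0.2, 1.0):
    print(f"{share:.0%} low-latency: ${blended_monthly_cost(hours, share):,.2f}")
```

Under these assumptions, routing everything through the low-latency path costs a little more than twice as much as pure batch, so it may be worth separating audio that can wait from audio that cannot.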

Standard vs premium model pricing

Every provider offers multiple model tiers with different accuracy levels and prices. Standard models handle general transcription adequately. Premium models excel at proper nouns, alphanumerics, and specialized terminology.

The premium often pays for itself if you display transcripts to users or need them for compliance. But if you're feeding transcripts to an LLM that tolerates errors, standard models might suffice.

Medical transcription shows the biggest pricing variations — from $0.36 per hour with AssemblyAI's Universal-3.5 Pro plus Medical Mode to $4-5 per hour with generic premium options.

Bundled vs unbundled feature pricing

Two pricing philosophies dominate the market. Providers like AssemblyAI and Deepgram price features as add-ons. Providers like Gladia bundle features into a single rate.

Unbundled example (AssemblyAI async):

Base transcription (Universal-2): $0.15/hour
Speaker diarization: +$0.02/hour
Keyterms prompting: +$0.05/hour
Total: $0.22/hour

Bundled example:

All features included: $0.35/hour flat rate

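Teams using only transcription save money with unbundled models. Teams needing multiple features might find bundled pricing simpler, and it becomes cheaper once the add-ons they need cost more than the gap between the flat rate and the unbundled base rate ($0.20/hour in this example).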
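
The comparison above can be sketched in a few lines. The rates are the example figures from this section; the dictionary keys and function names are illustrative labels, not API parameters.

```python
UNBUNDLED_BASE = 0.15                              # USD/hr, Universal-2 base rate
ADD_ONS = {"diarization": 0.02, "keyterms": 0.05}  # USD/hr each
BUNDLED_FLAT = 0.35                                # USD/hr, all features included

def unbundled_rate(features):
    """Base rate plus the add-ons actually used."""
    return round(UNBUNDLED_BASE + sum(ADD_ONS[f] for f in features), 2)

def cheaper_option(features):
    """Which pricing model costs less for this feature set."""
    return "unbundled" if unbundled_rate(features) < BUNDLED_FLAT else "bundled"

# Add-on spend at which the flat rate starts to win.
break_even = round(BUNDLED_FLAT - UNBUNDLED_BASE, 2)

needed = ["diarization", "keyterms"]
print(unbundled_rate(needed), cheaper_option(needed), break_even)
```

With these example numbers, the unbundled total stays below the flat rate even with both add-ons enabled, so the bundled plan only pays off for heavier feature use.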

Speech-to-text API pricing comparison (as of July 2026)

Comparing providers requires looking beyond headline rates to understand what's included and what costs extra. Data privacy requirements can eliminate many options entirely — EU data residency or zero-retention policies narrow your choices significantly.

Provider	Streaming rate	Batch rate	Free tier	Speaker diarization	Feature model	BAA available	EU data residency
AssemblyAI Universal-3.5 Pro	$0.45/hr	$0.21/hr	$50 credits	+$0.02/hr (async)	Unbundled	Yes	Yes
Deepgram Nova-3	~$0.46/hr	~$0.26/hr	$200 credits	+$0.02/hr	Unbundled	Yes	No
Google Cloud Speech	~$0.96/hr	~$0.48/hr	60 min/month	+$0.36/hr	Unbundled	Yes	Yes
AWS Transcribe	~$1.44/hr	~$1.44/hr	60 min/month	+$0.14/hr	Unbundled	Yes	Yes
Gladia	~$0.61/hr	~$0.37/hr	10 min trial	Included	Bundled	No	Yes
OpenAI Whisper	N/A	~$0.36/hr	None	Not available	Minimal	No	No

Competitor rates are estimates and subject to change; AssemblyAI's own rates are exact. Several patterns emerge from this comparison:

Cloud platform overhead: Google and AWS require additional infrastructure services that add cost
Feature bundling complexity: Direct price comparison becomes difficult when features are bundled differently
Volume discounts: Most require sales contact, with tiers typically at 50,000+ hours monthly

The cheapest advertised rate rarely reflects your actual cost once you add required features and infrastructure.
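
As a rough illustration of how one required feature changes the ranking, the sketch below adds the diarization surcharge to each batch rate from the table. Competitor figures are the estimates listed above, and the data structure is an assumption made for the example.

```python
# (batch rate, diarization add-on) in USD per hour, copied from the table above.
# None means diarization is not offered; 0.0 means it is included in the base rate.
PROVIDERS = {
    "AssemblyAI Universal-3.5 Pro": (0.21, 0.02),
    "Deepgram Nova-3": (0.26, 0.02),
    "Google Cloud Speech": (0.48, 0.36),
    "AWS Transcribe": (1.44, 0.14),
    "Gladia": (0.37, 0.0),
    "OpenAI Whisper": (0.36, None),
}

def effective_rates(providers):
    """Batch rate plus diarization, skipping providers that do not offer it."""
    rates = {
        name: round(base + addon, 2)
        for name, (base, addon) in providers.items()
        if addon is not None
    }
    # Cheapest effective rate first.
    return dict(sorted(rates.items(), key=lambda item: item[1]))

ranked = effective_rates(PROVIDERS)
for name, rate in ranked.items():
    print(f"{name}: ${rate:.2f}/hr")
```

The same pattern extends to any other feature you need: add its surcharge per provider and re-sort before comparing.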

Optimize Pricing for Your Workload

Discuss volume discounts, compliance needs, and feature bundles that fit your use case. Our team can help model total costs beyond headline rates.


AssemblyAI speech-to-text pricing at a glance (as of July 2026)

Here's the full AssemblyAI lineup in one place. Everything is pay-as-you-go, billed per second, with no minimums and unlimited concurrency.

Product / model	Type	Price	Notes
Universal-3.5 Pro	Async (pre-recorded)	$0.21/hr	Flagship, 18-language code-switching
Universal-3.5 Pro (Sync)	Sync (short clips, HTTP POST)	$0.45/hr	~134 ms p50, no polling or WebSocket
Universal-3.5 Pro Realtime	Streaming	$0.45/hr base	Powers the Voice Agent API
Universal-2	Async	$0.15/hr	99+ language coverage
Universal-Streaming (EN / Multilingual)	Streaming	$0.15/hr	—
Whisper-Streaming	Streaming	$0.30/hr	99+ language fallback
Voice Agent API	STT + LLM + TTS (one WebSocket)	$4.50/hr flat	~1s end-to-end latency
Medical Mode	Add-on	+$0.15/hr	Async flagship + Medical = $0.36/hr

Async add-ons: keyterms prompting +$0.05/hr · standard speaker diarization +$0.02/hr · experimental diarization +$0.065/hr · Auto Chapters (Universal-2) +$0.08/hr · Summarization (Universal-2) +$0.03/hr.

Realtime add-ons: diarization with revision +$0.12/hr · prompting +$0.05/hr · voice isolation +$0.10/hr.
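
Because billing is per second, the cost of a single request is the hourly rate prorated by duration. The sketch below assumes simple proration with no rounding; the dictionary keys are made-up labels for the rows in the table, not model identifiers from the API.

```python
# USD per hour, from the table above. Keys are illustrative labels only.
BASE_RATES = {
    "universal-3.5-pro-async": 0.21,
    "universal-3.5-pro-sync": 0.45,
    "universal-3.5-pro-realtime": 0.45,
    "universal-2-async": 0.15,
    "voice-agent-api": 4.50,
}
ASYNC_ADD_ONS = {"keyterms": 0.05, "diarization": 0.02, "medical": 0.15}

def cost_usd(seconds: float, model: str, add_ons=()) -> float:
    """Prorate the hourly rate (base plus add-ons) over the audio duration."""
    rate = BASE_RATES[model] + sum(ASYNC_ADD_ONS[a] for a in add_ons)
    return seconds / 3600 * rate

# A 30-minute consultation on the async flagship with Medical Mode.
print(f"${cost_usd(1800, 'universal-3.5-pro-async', ['medical']):.4f}")
# A 5-minute Voice Agent API call.
print(f"${cost_usd(300, 'voice-agent-api'):.4f}")
# An 11-second dictation clip through the Sync API.
print(f"${cost_usd(11, 'universal-3.5-pro-sync'):.6f}")
```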

For the current, authoritative figures, check the pricing page.

Calculating total cost of ownership

The cheapest provider isn't always the most cost-effective choice. Your total cost depends on accuracy requirements, feature usage, and infrastructure overhead. A developer building internal tools has different needs than a team displaying transcripts to customers.

Cost scenarios by use case

Different applications demand different accuracy levels and features, dramatically affecting total costs. Here's how requirements drive provider selection:

Startup MVP building a podcast search app: Processing 1,000 hours monthly for basic search functionality. Accuracy matters less since search can handle some errors.

Recommended: OpenAI Whisper or AssemblyAI Universal-2 (at $0.15/hr) for cost efficiency
Avoid: premium models with features you don't need

Enterprise contact center with compliance requirements: Processing 10,000 hours monthly where transcripts are reviewed and stored for regulatory purposes.

Recommended: AssemblyAI Universal-3.5 Pro for accuracy, with low-cost speaker diarization (+$0.02/hr on async)
Avoid: cloud platforms that add infrastructure complexity and cost

Healthcare practice transcribing patient consultations: Processing 5,000 hours monthly requiring medical terminology accuracy and a Business Associate Agreement.

Recommended: AssemblyAI Medical Mode for specialized terminology — a +$0.15/hr add-on (async flagship + Medical = $0.36/hr) — with a BAA available
Avoid: generic models that struggle with medical terms

High-volume batch processing for LLM training: Processing 50,000+ hours monthly where transcripts feed AI models but aren't displayed to users.

Recommended: OpenAI Whisper or Universal-2 if accuracy trade-offs are acceptable
Consider: volume discounts from major providers
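
To put rough monthly numbers on these scenarios, the sketch below multiplies each volume by the matching AssemblyAI rate from this guide. It ignores volume discounts and infrastructure overhead, and the choice of Universal-2 for the LLM training scenario is one of the two options named above, picked here only for illustration.

```python
SCENARIOS = [
    # (label, hours per month, USD per hour)
    ("Podcast search MVP (Universal-2)", 1_000, 0.15),
    ("Contact center (Universal-3.5 Pro + diarization)", 10_000, 0.21 + 0.02),
    ("Healthcare (Universal-3.5 Pro + Medical Mode)", 5_000, 0.21 + 0.15),
    ("LLM training data (Universal-2)", 50_000, 0.15),
]

def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """List-price monthly cost, before any negotiated discount."""
    return round(hours * rate_per_hour, 2)

for label, hours, rate in SCENARIOS:
    print(f"{label}: ${monthly_cost(hours, rate):,.2f}/month")
```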

Hidden costs beyond per-minute pricing

Cloud platform providers add infrastructure overhead that doesn't appear in headline pricing. Google Cloud Speech-to-Text requires Cloud Storage, Cloud Functions, Pub/Sub messaging, and data egress fees. AWS Transcribe needs S3 storage, Lambda functions, and SQS queues.

These services typically add the following monthly costs:

Cloud storage: $20-50 monthly
Compute functions: $30-80 monthly
Message queues: $10-30 monthly
Data egress: $40-100 monthly

API-first providers eliminate these dependencies — you send audio and receive transcripts without managing infrastructure. Engineering time represents another hidden cost. Setting up production pipelines on cloud platforms takes 20-40 hours for unfamiliar teams. Data residency compliance adds more complexity if you need EU hosting or zero-retention guarantees.
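
One way to see the effect of this overhead is to fold it into an effective hourly rate. The sketch below sums the monthly ranges listed above and applies them to a hypothetical 1,000-hour workload at the approximate Google Cloud batch rate from the comparison table. Treating the overhead as a fixed monthly amount is a simplifying assumption; in practice storage and egress grow with volume.

```python
# Monthly infrastructure ranges in USD (low, high), from the list above.
INFRA_MONTHLY_USD = {
    "cloud storage": (20, 50),
    "compute functions": (30, 80),
    "message queues": (10, 30),
    "data egress": (40, 100),
}

def infra_range(items):
    """Total low and high monthly overhead across all services."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high

def effective_rate(base_rate: float, hours: float, infra_usd: float) -> float:
    """Hourly rate once monthly overhead is spread over the audio volume."""
    return round(base_rate + infra_usd / hours, 2)

low, high = infra_range(INFRA_MONTHLY_USD)
hours = 1_000  # hypothetical monthly volume
base = 0.48    # approximate batch rate from the comparison table

print(f"Overhead: ${low}-${high}/month")
print(f"Effective rate: ${effective_rate(base, hours, low):.2f}"
      f"-${effective_rate(base, hours, high):.2f}/hr")
```

The smaller the monthly volume, the larger the share of the bill this overhead represents.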

How accuracy affects quality assurance costs

Accuracy differences translate directly to labor costs when humans review transcripts. This matters most for displayed transcripts, compliance records, or systems that rely on semantic accuracy. Consider a support team reviewing transcripts where error rates differ by 3 percentage points. Higher error rates mean more corrections, more review time, and higher labor costs.

The key insight: cost per hour is falling, but cost per correction determines your true expense.
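
A back-of-the-envelope model makes the labor effect concrete. Every input below is an assumption chosen for illustration: the 5% and 8% error rates (a 3-point gap), the speaking pace, the time per fix, the reviewer wage, and the premise that reviewers correct every error.

```python
def review_cost_per_audio_hour(
    error_rate_pct: float,
    words_per_audio_hour: int = 9_000,  # assumption: about 150 spoken words per minute
    seconds_per_fix: float = 10.0,      # assumption: time to spot and correct one error
    reviewer_wage_usd: float = 30.0,    # assumption: hourly labor cost
) -> float:
    """Labor cost of correcting one hour of transcribed audio."""
    errors = words_per_audio_hour * error_rate_pct / 100
    review_hours = errors * seconds_per_fix / 3600
    return review_hours * reviewer_wage_usd

# Two hypothetical providers whose error rates differ by 3 percentage points.
gap = review_cost_per_audio_hour(8) - review_cost_per_audio_hour(5)
print(f"Extra review cost: ${gap:.2f} per audio hour")
```

Under these assumptions the correction gap is far larger than the difference between any two per-hour transcription rates in this guide, so it is worth substituting your own measured figures before comparing providers on price alone.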

Evaluate Accuracy and Features Live

Try diarization, redaction, and transcription on real audio to see which options you actually need—and how they impact costs.


Choosing the right pricing model for your needs

Selecting the right provider requires matching your specific requirements to available options. Data privacy requirements can eliminate choices before you even consider pricing.

Provider selection criteria:

AssemblyAI: best when accuracy matters, transcripts are displayed to users, or you need cost-efficient healthcare transcription covered by a BAA
OpenAI Whisper: ideal for pure LLM input where accuracy trade-offs are acceptable
Gladia: good for teams wanting bundled features with EU data residency
Cloud platforms: consider only if already using that cloud ecosystem extensively

Common evaluation mistakes:

Rate shopping: choosing based on the lowest per-minute rate without calculating total cost
Free tier assumptions: assuming free-tier limits will stretch further than they do once production volume kicks in
Accuracy ignorance: ignoring accuracy until production, when fixing errors becomes expensive
Feature blindness: not accounting for required features in cost calculations
Compliance afterthoughts: discovering late that data residency requirements rule out your chosen provider

The right choice balances cost, accuracy, features, and operational complexity for your specific use case. AssemblyAI positions itself as a Voice AI infrastructure platform, so the per-hour rate is only one layer of the value — accurate speech-to-text, Speech Understanding models, the LLM Gateway, domain modes like Medical, and the Voice Agent API all sit on the same platform.

One team that made this trade-off is Veed, the browser-based video editor:

"Assembly allowed our team to focus on what they are best at: Building a collaborative, browser-based video editor and distributing that product at speed and at velocity to our user base." — Sabba Keynejad and Tim Mamedov, Veed

Companies like Uber and ClickUp build on the same platform.

Final words

Speech-to-text API pricing has evolved beyond simple per-minute rates to complex models involving feature bundles, accuracy tiers, and infrastructure overhead. The cheapest advertised rate might become the most expensive choice once you factor in correction costs, required features, and engineering time. Your evaluation should focus on total cost of ownership for your specific use case rather than headline comparisons.

AssemblyAI keeps speaker diarization to a low +$0.02/hr add-on on async, offers Medical Mode for healthcare applications requiring specialized terminology, makes Business Associate Agreements available for healthcare workloads, and provides an API-first architecture that eliminates infrastructure dependencies. The result is predictable costs without the surprise charges common with cloud platform providers.

Model Your Own Speech-to-Text Costs

Get predictable, per-second pricing with no minimums and no surprise infrastructure charges. Get started with $50 in free credits.


Frequently asked questions

How much does AssemblyAI speech-to-text cost?

As of July 2026, AssemblyAI's async Universal-3.5 Pro is $0.21/hr and Universal-2 is $0.15/hr. Streaming (Universal-3.5 Pro Realtime) and the Sync API are both $0.45/hr, and the Voice Agent API is a flat $4.50/hr covering speech-to-text, LLM, and text-to-speech. Add-ons include speaker diarization (+$0.02/hr async), keyterms prompting (+$0.05/hr), and Medical Mode (+$0.15/hr). Billing is per second with no minimums.

What does base speech-to-text pricing typically include?

Base pricing covers audio-to-text conversion with basic punctuation and capitalization. Most providers charge extra for features like speaker identification, sentiment analysis, or medical terminology recognition.

How much do speaker diarization features cost?

Speaker diarization costs vary significantly by provider — from a +$0.02/hr add-on on AssemblyAI async to +$0.36 per hour with Google Cloud. Bundled providers like Gladia include it in their flat rate while unbundled providers add surcharges.

Can I estimate my monthly costs before signing up?

Yes. Calculate your expected monthly audio hours, multiply by the provider's rate, add required feature costs, and include infrastructure overhead for cloud platforms. Test with real audio during free trials to verify the provider meets your accuracy requirements.

How do I handle EU data residency requirements?

EU data residency eliminates many providers from consideration. AssemblyAI offers EU data residency at the same price, and Gladia offers native EU hosting, while major cloud providers support regional data processing. Verify compliance requirements early since retrofitting data residency is complex and expensive.

