Batch transcription at scale: turnaround, throughput, and concurrency
Transcribing one file is a tutorial. Transcribing a million is an operations problem. When you're running call archives, a media back-catalog, or a daily flood of recordings through a speech-to-text API, the questions that decide whether your pipeline holds up aren't about a single transcript — they're about throughput, turnaround at volume, concurrency limits, and what the bill looks like when the numbers get big.
This is the part of batch transcription that rarely makes the benchmark pages. Here's how to think about running pre-recorded (async) transcription at scale, and how to keep it fast and predictable when the volume is real. (If you're weighing AssemblyAI against a specific provider on raw accuracy and speed, the head-to-head AssemblyAI vs. Deepgram comparison is the better place to start — this piece is about operating a batch pipeline once you've chosen one.)
Throughput is a different metric than latency
The first mental shift: at scale you care about throughput, not per-file latency. Latency is "how fast does one file come back." Throughput is "how many hours of audio can I clear per hour of wall-clock time." For a live call, latency is everything. For a 200,000-file backlog, throughput is everything — a pipeline that returns each file a few seconds slower but runs thousands in parallel finishes the whole job far sooner.
This matters because the model that wins a single-file speed test isn't necessarily the one that clears your backlog fastest. What clears a backlog fastest is high concurrency, and that's an infrastructure property, not a model property.
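To make the throughput argument concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model in which files run in fixed-size "waves" and every file takes the same time to return. The function name and all the numbers are illustrative assumptions, not measurements of any provider.

```python
import math


def backlog_wall_clock_hours(num_files: int, per_file_seconds: float, concurrency: int) -> float:
    """Estimate wall-clock hours to clear a backlog.

    Simplified model: files run in waves of `concurrency` at a time, and
    every file takes the same `per_file_seconds` to come back.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    waves = math.ceil(num_files / concurrency)
    return waves * per_file_seconds / 3600


# Illustrative numbers only: a model that is faster per file but capped at
# 50 requests in flight, versus one that is slower per file with 5,000 in flight.
capped = backlog_wall_clock_hours(200_000, per_file_seconds=30, concurrency=50)
parallel = backlog_wall_clock_hours(200_000, per_file_seconds=40, concurrency=5_000)

print(f"faster per file, 50 in flight:    {capped:.1f} h")
print(f"slower per file, 5,000 in flight: {parallel:.1f} h")
```

Real pipelines have variable file lengths and upload overhead, so treat this as a way to reason about orders of magnitude rather than a forecast.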
Concurrency: the limit that actually bites
The constraint most teams hit at scale isn't price or accuracy — it's a concurrency ceiling. Many APIs cap how many requests you can have in flight at once (often a few dozen on standard plans). The moment your submission rate exceeds that cap, files queue, and your effective throughput is throttled no matter how fast the underlying model is.
AssemblyAI offers unlimited concurrency on standard pay-as-you-go, and the platform processes millions of hours of audio without outages. Practically, that means you can submit your entire backlog at once and let it process in parallel rather than metering submissions to stay under a limit. For a batch operation, removing the concurrency ceiling is often a bigger throughput win than any per-file speed difference — it's the difference between "done overnight" and "done next week."
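Submitting a backlog in parallel can be sketched with Python's standard thread pool. In this sketch `transcribe_file` is a hypothetical placeholder for whatever SDK or HTTP call your pipeline actually makes (it is not a real API function), and `max_workers` is a client-side thread count you choose, not a provider limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def transcribe_file(path: str) -> str:
    """Hypothetical placeholder: swap in your real submit-and-wait call."""
    return f"transcript of {path}"


def run_backlog(
    paths: Iterable[str],
    transcribe: Callable[[str], str] = transcribe_file,
    max_workers: int = 64,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Fan out every file at once; collect results and failures separately."""
    results: Dict[str, str] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front instead of metering submissions.
        futures = {pool.submit(transcribe, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep one bad file from sinking the batch
                failures[path] = exc
    return results, failures


results, failures = run_backlog(["call_001.wav", "call_002.wav", "call_003.wav"])
print(len(results), "succeeded,", len(failures), "failed")
```

Threads fit here because the work is I/O-bound waiting on the API. Recording failures per file, rather than raising on the first error, lets you retry only the files that need it.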
The async turnaround trade-off, stated honestly
There's a real trade-off worth naming. Universal-3 Pro is a large, LLM-based model — that's where its accuracy lead comes from — and a larger model does more computation per minute of audio than a small one. On a single file, a smaller model may return first. If your pipeline has a hard per-file turnaround SLA measured in seconds, test that explicitly.
For most batch workloads, though, this is the wrong thing to optimize. Batch is rarely latency-critical the way a live call is; you're processing recordings after the fact, and aggregate throughput is what determines when the job finishes. When you do need a long file back quickly, you don't have to accept naive single-file processing time — which brings us to the highest-leverage technique.
Chunk and parallelize for fast turnaround
To pull turnaround down on a long file, split it into chunks, transcribe them concurrently, and stitch the results back together. Because you're processing many short segments at once instead of one long stream end to end, total turnaround drops sharply — and with unlimited concurrency, the parallelism is free to use.
This is a proven pattern, not a theory. A clinical-documentation team combined chunking with the LLM Gateway and measured a 64.6% reduction in end-to-end turnaround — down to roughly 13.5 seconds on their test files — while holding 97.9% transcription accuracy and 99.4% diarization accuracy, at an orchestration cost of about $0.001 per file. The one thing to manage: overlap chunk boundaries slightly so a word or speaker turn isn't split at the seam, then reconcile on stitch.
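The overlap-and-reconcile step can be sketched as follows. This is one simple strategy under stated assumptions, not the method the team above used: chunks overlap by a fixed number of seconds, each chunk's words carry timestamps already converted to absolute file time, and a word is kept only from the chunk that "owns" its midpoint, with ownership switching at the middle of each overlap. The names `plan_chunks` and `stitch` are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_s, end_s) in absolute file time


def plan_chunks(duration_s: float, chunk_s: float, overlap_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) windows covering the file, each overlapping the next."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += chunk_s - overlap_s
    return chunks


def stitch(chunks: List[Tuple[float, float]], words_per_chunk: List[List[Word]]) -> List[Word]:
    """Merge per-chunk words, keeping each word from exactly one chunk."""
    stitched: List[Word] = []
    last = len(chunks) - 1
    for i, words in enumerate(words_per_chunk):
        # Ownership boundary = midpoint of the overlap shared with each neighbor.
        lower = float("-inf") if i == 0 else (chunks[i][0] + chunks[i - 1][1]) / 2
        upper = float("inf") if i == last else (chunks[i + 1][0] + chunks[i][1]) / 2
        for text, start, end in words:
            if lower <= (start + end) / 2 < upper:
                stitched.append((text, start, end))
    return stitched


chunks = plan_chunks(duration_s=70, chunk_s=40, overlap_s=10)
words = [
    [("claims", 33.0, 33.8), ("adjuster", 36.0, 36.9)],                          # chunk 0
    [("claims", 33.0, 33.8), ("adjuster", 36.0, 36.9), ("called", 37.0, 37.5)],  # chunk 1
]
print([w[0] for w in stitch(chunks, words)])
```

A limitation of this sketch: if two chunks report slightly different timestamps for the same word right at the seam, the word could be kept twice or dropped, so a production reconcile step would also compare the text in the overlap.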
Cost at volume
At scale, the per-hour rate compounds, so it's worth being precise. Universal-3 Pro is $0.21/hr and Universal-2 is $0.15/hr, both billed per second of audio with no minimums and no concurrency surcharge. Add-ons are per hour on top: keyterms prompting (+$0.05/hr), standard diarization (+$0.02/hr). Because billing is per-second, you pay for exactly the audio you send — no rounding up per request, which quietly matters when you're processing millions of short files. The full breakdown is on the pricing page.
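The arithmetic above can be sketched as a small estimator. The rates are the ones quoted in this article; the dictionary keys are labels invented for this sketch (not API identifiers), durations are assumed to be whole seconds, and the `round_up_to_s` parameter exists only to illustrate a hypothetical per-request rounding scheme for comparison.

```python
import math
from decimal import Decimal
from typing import Iterable, Sequence

# Rates per hour of audio, as quoted in this article.
BASE_RATE_PER_HOUR = {
    "universal-3-pro": Decimal("0.21"),
    "universal-2": Decimal("0.15"),
}
ADDON_RATE_PER_HOUR = {
    "keyterms": Decimal("0.05"),
    "diarization": Decimal("0.02"),
}


def job_cost(
    durations_s: Iterable[int],
    model: str,
    addons: Sequence[str] = (),
    round_up_to_s: int = 1,
) -> Decimal:
    """Estimate the cost of a batch in dollars.

    round_up_to_s=1 models per-second billing. A larger value (e.g. 60)
    models a hypothetical scheme that rounds each request up to that increment.
    """
    rate = BASE_RATE_PER_HOUR[model] + sum(
        (ADDON_RATE_PER_HOUR[a] for a in addons), Decimal("0")
    )
    billed_seconds = sum(
        math.ceil(d / round_up_to_s) * round_up_to_s for d in durations_s
    )
    return rate * Decimal(billed_seconds) / Decimal(3600)


# Illustrative workload: 100,000 short files of 20 seconds each.
durations = [20] * 100_000
per_second = job_cost(durations, "universal-3-pro", addons=("diarization",))
per_minute = job_cost(durations, "universal-3-pro", addons=("diarization",), round_up_to_s=60)

print(f"billed per second:                  ${per_second:.2f}")
print(f"hypothetical round-up to 60 s each: ${per_minute:.2f}")
```

`Decimal` is used instead of floats so that summing rates like 0.21 and 0.02 does not pick up binary rounding error in a figure you may hand to finance.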
One real-world example: an insurance-analytics team processing on the order of hundreds of calls a day moved to AssemblyAI for two reasons that mattered together: faster async turnaround at their volume and domain accuracy on insurance terminology. Neither the throughput nor the transcript quality alone decided it.
A batch-at-scale checklist
Pulling it together: choose async, not streaming, for any recorded-audio pipeline. Lean on unlimited concurrency — submit broadly rather than metering to a cap. Chunk-and-parallelize the long files that have turnaround targets. Load domain vocabulary as keyterms so volume doesn't mean volume-of-errors. Price the job per second at the model rate plus add-ons. And keep the head-to-head accuracy/speed evaluation separate from the operations question — pick the provider on the comparison, then build the pipeline on the principles here.
Scale doesn't have to mean fragility. Treated as a throughput problem — concurrency, parallelism, per-second cost — high-volume transcription becomes boringly predictable, which is exactly what you want from infrastructure. The fastest way to size your real turnaround and cost is to push a representative slice of your actual volume through the pipeline and measure it.
Frequently asked questions
What matters most for transcription at scale: speed or throughput?
Throughput. Per-file latency is how fast one file returns; throughput is how many hours of audio you clear per hour of wall-clock time, and that's what determines when a large batch finishes. High concurrency — processing many files in parallel — clears a backlog faster than a model that's marginally quicker on any single file.
Does AssemblyAI limit how many files I can transcribe at once?
No. AssemblyAI offers unlimited concurrency on standard pay-as-you-go plans and processes millions of hours of audio without outages, so you can submit a full backlog in parallel rather than metering submissions under a cap. Removing the concurrency ceiling is often a larger throughput gain at scale than any per-file speed difference.
How do I make a long recording transcribe faster?
Split it into chunks, transcribe them concurrently, and stitch the results back together. One clinical team cut end-to-end turnaround by 64.6% with this chunk-and-parallelize pattern while holding 97.9% accuracy. Overlap chunk boundaries slightly so words and speaker turns aren't split at the seams.
How much does high-volume transcription cost?
Universal-3 Pro is $0.21/hr and Universal-2 is $0.15/hr, billed per second of audio with no minimums and no concurrency surcharge. Add-ons like keyterms (+$0.05/hr) and standard diarization (+$0.02/hr) are per hour on top. Per-second billing means you don't pay rounded-up time per request, which adds up across millions of short files.
Is a smaller, faster model better for batch transcription?
Not usually. A smaller model may return an individual file faster, but batch jobs are rarely latency-critical, and a larger, more accurate model like Universal-3 Pro paired with unlimited concurrency clears aggregate volume quickly while producing transcripts you don't have to re-clean. Optimize for throughput and accuracy unless you have a hard per-file turnaround SLA.
Should I use streaming or async transcription for a large backlog?
Use async (pre-recorded). Async is billed per second of audio, supports unlimited concurrency for parallel processing, and gives the model the full file for higher accuracy. Streaming is billed on connection time and designed for live, real-time use, which makes it the wrong tool for processing a recorded backlog.