Running Bulk Transcription and Load Tests at Scale
This guide applies to two closely related workloads:
- Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
- Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.
The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.
Key recommendations
Default concurrency for paid accounts is 200 concurrent jobs. If you need a higher limit, reach out to support@assemblyai.com — AssemblyAI offers custom concurrency limits at no additional cost.
Before you begin
Prerequisites
- Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1, which will stall a bulk run and invalidate load-test results. Bulk runs and load tests both incur standard transcription usage charges — fund your account for the full expected volume before you submit any requests.
- Concurrency limit. Check your current limit on the Rate Limits page of your dashboard and size your target submission rate against it — see Size your target rate.
- Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your `audio_url`. Each `/v2/upload` call counts against your HTTP rate limit and adds latency proportional to file size — at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to `/v2/upload` first to get a hosted URL, and factor those uploads into your rate-limit planning.
- Pre-signed URL expiration. Set URL TTLs long enough to outlast your expected queue time plus turnaround time. One hour is a safe default for small-to-moderate runs; use two hours or more for runs that approach your concurrency limit or use less-common languages. URLs that expire while a job is queued or processing surface as `4xx` errors when AssemblyAI tries to retrieve the audio.
- Completion tracking. Configure webhooks (with a polling fallback) or a polling-only strategy before you start — you’ll need a way to detect completion and record per-request timestamps.
Workload configuration
Match these to your expected production traffic:
- Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
- Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
- Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, summarization, and so on — each has different processing characteristics, and every feature you enable adds to TaT. Audit your request body and turn off what you don’t need; a faster run is also a cheaper one.
- Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
- Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
- Cost estimate. Sanity-check the expected spend before you submit: total audio hours × your per-hour rate. Bulk runs are billed the same as any other transcription, so a large backfill can produce a surprisingly large invoice if you haven’t projected it.
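The cost projection is simple enough to script as part of your pre-run checklist. The per-hour rate below is a hypothetical placeholder, not AssemblyAI's actual pricing — substitute your own rate.

```python
def estimate_cost(total_audio_hours: float, rate_per_hour: float) -> float:
    """Project spend for a bulk run: total audio hours x per-hour rate."""
    return total_audio_hours * rate_per_hour

# Example: a 10,000-file backfill averaging 12 minutes per file,
# at a hypothetical $0.37/hour rate.
hours = 10_000 * 12 / 60                      # 2,000 audio hours
print(f"${estimate_cost(hours, 0.37):,.2f}")  # $740.00
```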
Coordinate with our team
Reach out to support@assemblyai.com or your account manager before you submit any requests if any of these apply:
- You plan to exceed 200 requests per minute.
- You’re running a large one-time upload (tens of thousands of files or more).
- You want support available during the run — for example, if you’re running outside US business hours.
- You’re using the EU region, regardless of size.
When you reach out, include:
- Expected request volume and ramp schedule, broken into 15-second windows
- Audio file durations and language breakdown
- Speech models and features you’ll enable
- Whether audio is single-channel or multichannel
- Preferred run window (see When to run)
AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.
Pilot first
Before a full bulk run or load test, submit a pilot batch of 50–200 files using the exact configuration you plan to use at scale — same model, same features, same language, same webhook receiver, same error-handling logic. A pilot verifies that:
- The transcripts look right, and the model and feature set match what the downstream consumer expects.
- Your webhook receiver is reachable, verifying signatures, and writing results durably — or your polling loop is keeping up without hitting rate limits.
- Your retry logic handles `5xx` correctly and your dead-letter path captures `4xx` without silently dropping files.
- Your ramp and concurrency controls behave as intended.
The most expensive bulk-run failures almost always come from discovering a configuration mistake — a wrong model, a feature flag left off, a webhook handler that drops results silently — after the whole batch has been billed. A pilot catches these while they’re cheap to fix, and gives you a realistic mean TaT to plug into Size your target rate.
When to run
- Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since throughput is highest during these periods.
- Large runs (200+ requests/minute): coordinate with our team before starting. Our team will pick a window and pre-scale for you.
- EU region: coordinate regardless of size.
Ramp up gradually
The most common mistake — for both bulk uploads and load tests — is submitting all requests at once. A gradual ramp gives the pipeline time to scale ahead of your traffic, which is what produces the lowest and most consistent turnaround times. Submitting a large spike upfront typically results in higher TaT as capacity catches up.
Recommended schedule (validated for Universal-3 Pro with speaker labels):
- Divide your ramp into 15-second windows.
- Start at 25 requests per window.
- Grow by ~8–9% per window until you reach your target sustained rate.
- Don’t pause mid-ramp. Stopping and restarting means ramping from the starting rate again, and you’ll see higher latency when traffic resumes.
- Do not exceed your account’s concurrency limit during the ramp.
If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.
Example: ramp to 100 requests/window (400/minute)
To compute a ramp for any target without referring to the table, use rate_n = ceil(25 × 1.085ⁿ) (capped at your target rate), where n is the window index starting at 0.
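That formula translates directly into a schedule generator. This sketch yields each window's request count and stops once the cap is reached; the defaults are the values from the recommended schedule above.

```python
import math

def ramp_schedule(target_rate: int, start: int = 25, growth: float = 1.085):
    """Yield per-window request counts: rate_n = ceil(start * growth**n),
    capped at target_rate."""
    n = 0
    while True:
        rate = min(math.ceil(start * growth ** n), target_rate)
        yield rate
        if rate >= target_rate:
            return
        n += 1

# Ramp to 100 requests per 15-second window: 18 windows (~4.5 minutes).
schedule = list(ramp_schedule(100))
print(schedule[:5], "...", schedule[-1])  # [25, 28, 30, 32, 35] ... 100
```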
After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.
Size your target rate
Your concurrency limit caps how many jobs can be in progress at once, so your sustained submission rate needs to fit inside it. Pick a target rate that keeps the typical number of in-flight jobs comfortably below your limit:
- Estimate mean turnaround time from your pilot run or the published benchmarks.
- Multiply your target submission rate (requests per second) by that mean TaT to approximate the number of jobs that will be in flight at steady state.
- Keep that number under ~80% of your concurrency limit. If it’s higher, lower the target rate, shorten mean TaT (fewer features, shorter audio, a faster model), or request a higher concurrency limit.
The 20% headroom absorbs normal variation in audio duration, warm-up effects, and webhook-receive latency. Runs sized exactly to the limit will see TaT climb as the in-flight count bumps against the cap.
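The sizing rule above is Little's law: expected in-flight jobs ≈ arrival rate × mean time in system. A small helper makes the check explicit; the concurrency limit and TaT in the example are illustrative.

```python
def in_flight(rate_rps: float, mean_tat_s: float) -> float:
    """Expected jobs in progress at steady state (Little's law)."""
    return rate_rps * mean_tat_s

def max_rate_for_limit(concurrency_limit: int, mean_tat_s: float,
                       headroom: float = 0.8) -> float:
    """Highest sustained submission rate (req/s) that keeps expected
    in-flight jobs under headroom * concurrency_limit."""
    return headroom * concurrency_limit / mean_tat_s

# Example: a 200-job concurrency limit with a 45 s mean TaT from a pilot.
print(round(max_rate_for_limit(200, 45), 2))  # 3.56 req/s sustained
print(in_flight(2.0, 45))                     # 90.0 jobs in flight at 2 req/s
```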
Measure and verify
For every run — bulk or load test — track completion and record per-request metadata.
- Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
- Run a polling fallback alongside webhooks. Webhooks can drop for many reasons — receiver downtime, signature mismatches, transient network failures. For every submitted job, record an expected completion deadline (around 2× mean TaT from your pilot) and `GET /v2/transcript/{id}` for any job whose webhook hasn’t arrived by then. The fallback protects you from silent data loss without materially increasing your rate-limit usage.
- Polling-only every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it when webhooks aren’t available or when you need precise TaT during a load test.
- Retry failures with exponential backoff for `5xx` responses — see Implement retry server error logic. Investigate `4xx` responses; they indicate a client-side issue that retrying won’t fix.
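One way to implement that split — back off and retry on `5xx`, dead-letter on `4xx` — is sketched below. The status-code routing is the point; the specific delay parameters are illustrative, and `do_request` / `dead_letter` are hypothetical callables standing in for your HTTP client and dead-letter writer.

```python
import random
import time

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff with full jitter: delay_n in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(max_retries)]

def submit_with_retry(do_request, dead_letter, max_retries: int = 5,
                      base: float = 1.0) -> int:
    """do_request() returns an HTTP status code; retry 5xx, dead-letter 4xx."""
    for delay in backoff_delays(max_retries, base) + [None]:
        status = do_request()
        if status < 400:
            return status
        if status < 500:          # 4xx: client-side problem, retrying won't fix it
            dead_letter(status)
            return status
        if delay is not None:     # 5xx: transient, back off before the next attempt
            time.sleep(delay)
    return status                 # exhausted retries on 5xx
```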
Record these fields per request:
- `submit_ts` — timestamp when `POST /v2/transcript` was sent
- `complete_ts` — timestamp when completion was detected
- `audio_duration` — length of the audio file, in seconds
- `model` — speech model used
- `features` — features enabled (e.g. `speaker_labels`, `auto_highlights`, `sentiment_analysis`)
- `status` — `completed` or `error`
- `id` — transcript ID, for debugging with support

Turnaround time = `complete_ts − submit_ts`.
Normalize TaT by audio duration to get the real-time factor: RTF = turnaround_time ÷ audio_duration. An RTF of 0.5 means the API processed the file in half its audio duration. RTF is the headline metric for comparing runs across regions, models, and audio-duration buckets — raw TaT varies too much with audio length to mean anything on its own.
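Both metrics fall directly out of the recorded fields:

```python
def turnaround_time(submit_ts: float, complete_ts: float) -> float:
    """Seconds between submission and detected completion."""
    return complete_ts - submit_ts

def rtf(tat_s: float, audio_duration_s: float) -> float:
    """Real-time factor: values below 1 mean faster than real time."""
    return tat_s / audio_duration_s

# Example: a 10-minute (600 s) file that completed 90 s after submission.
tat = turnaround_time(1_700_000_000.0, 1_700_000_090.0)
print(tat, rtf(tat, 600))  # 90.0 0.15
```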
Polling without exceeding the rate limit
HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a 403 error.
If webhooks aren’t an option, stay within that budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count:
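That budget arithmetic can be scripted so it scales to any run. The 20,000-per-5-minutes cap comes from the section above; the function returns the bare-minimum interval, and the guide's "every 15 seconds" recommendation adds headroom on top of it.

```python
RATE_LIMIT_BUDGET = 20_000  # requests per 5-minute window, all endpoints
WINDOW_S = 300

def min_poll_interval(submit_rps: float, in_flight_jobs: int,
                      budget: int = RATE_LIMIT_BUDGET,
                      window_s: int = WINDOW_S) -> float:
    """Smallest per-job polling interval (seconds) that keeps
    submissions + polls inside the rate-limit budget."""
    submit_requests = submit_rps * window_s
    remaining = budget - submit_requests
    if remaining <= 0:
        raise ValueError("submissions alone exceed the rate-limit budget")
    polls_per_job = remaining / in_flight_jobs  # polls each job may make per window
    return window_s / polls_per_job

# The worked example: 33 req/s submissions, ~330 jobs in flight.
# Bare minimum is ~9.8 s; the guide rounds up to 15 s for headroom.
print(round(min_poll_interval(33, 330), 1))  # 9.8
```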
When many jobs share the same polling interval they tend to cluster at the same second boundaries, spiking your rate-limit usage and occasionally returning 403 errors. Stagger each job’s polling by ±25% of the interval (for example, interval × random.uniform(0.75, 1.25)) so GETs spread evenly across the window.
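A jittered sleep along those lines, using the ±25% spread from above:

```python
import random

def jittered_interval(base_interval_s: float, spread: float = 0.25) -> float:
    """Polling interval with +/-25% jitter so per-job GETs don't cluster
    at the same second boundaries."""
    return base_interval_s * random.uniform(1 - spread, 1 + spread)

# A 15 s base interval always lands between 11.25 s and 18.75 s.
samples = [jittered_interval(15.0) for _ in range(1000)]
print(min(samples) >= 11.25, max(samples) <= 18.75)  # True True
```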
If you’re running a bulk job
Monitor for these signals during the run:
- Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no errors arrive.
- Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.
If you’re running a load test
- Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
- Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
- Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
- Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
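A sustain-phase report covering the points above — duration buckets plus percentiles rather than averages — can be built with the standard library. The bucket edges match the list above; the toy data is only to show the output shape.

```python
from statistics import quantiles

BUCKETS = [(0, 300), (300, 900), (900, 1800), (1800, 3600)]  # seconds

def bucket_label(audio_duration_s: float) -> str:
    """Map an audio duration to its reporting bucket."""
    for lo, hi in BUCKETS:
        if lo <= audio_duration_s < hi:
            return f"{lo // 60}-{hi // 60} min"
    return "60+ min"

def percentile_report(tats_s):
    """p50/p75/p90/p95/p99 and max from a list of turnaround times."""
    q = quantiles(tats_s, n=100, method="inclusive")
    return {"p50": q[49], "p75": q[74], "p90": q[89],
            "p95": q[94], "p99": q[98], "max": max(tats_s)}

# Toy data: 100 sustain-phase TaT samples of 1..100 seconds.
print(percentile_report(list(range(1, 101))))
```

Group recorded rows by `bucket_label(audio_duration)` first, then run `percentile_report` per bucket, so short and long files aren't averaged together.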
Diagnosing problems
Reference implementation
When managing a large rate-limit budget, calling the API directly gives you more control than the SDKs’ internal polling behavior. If you prefer to use an SDK, see Transcribe multiple files simultaneously.
The following script ramps submissions to approximate the recommended schedule, retries transient server errors, writes unrecoverable failures to a dead-letter log, and persists submitted file → `transcript_id` pairs so the run is resumable after a crash. The table above is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust `max_rate` to your target sustained rate.
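A minimal sketch of such a script, using only the standard library and the `POST /v2/transcript` endpoint with an API key read from an `AAI_API_KEY` environment variable. The log file names, ramp defaults, and backoff delays are illustrative — tune them against the table and your own pilot before a real run.

```python
import json
import math
import os
import time
import urllib.error
import urllib.request

API = "https://api.assemblyai.com/v2/transcript"
SUBMIT_LOG = "submitted.jsonl"      # resumable: audio_url -> transcript_id
DEAD_LETTER = "dead_letter.jsonl"   # unrecoverable 4xx failures

def window_rate(n: int, max_rate: int, start: int = 25,
                growth: float = 1.085) -> int:
    """Requests to send in 15-second window n (ramp formula from above)."""
    return min(math.ceil(start * growth ** n), max_rate)

def already_submitted() -> set:
    """Load previously submitted URLs so a crashed run can resume."""
    done = set()
    if os.path.exists(SUBMIT_LOG):
        with open(SUBMIT_LOG) as f:
            for line in f:
                done.add(json.loads(line)["audio_url"])
    return done

def submit(audio_url: str, api_key: str, retries: int = 5):
    """POST one job; retry 5xx with backoff, dead-letter 4xx."""
    body = json.dumps({"audio_url": audio_url}).encode()
    for attempt in range(retries):
        req = urllib.request.Request(API, data=body, headers={
            "authorization": api_key, "content-type": "application/json"})
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)["id"]
        except urllib.error.HTTPError as e:
            if 400 <= e.code < 500:            # client error: don't retry
                with open(DEAD_LETTER, "a") as f:
                    f.write(json.dumps({"audio_url": audio_url,
                                        "status": e.code}) + "\n")
                return None
            time.sleep(min(30, 2 ** attempt))  # 5xx: back off and retry
    return None

def run(audio_urls, api_key: str, max_rate: int = 100) -> None:
    pending = [u for u in audio_urls if u not in already_submitted()]
    n = 0
    while pending:
        window_start = time.monotonic()
        rate = window_rate(n, max_rate)
        batch, pending = pending[:rate], pending[rate:]
        for url in batch:
            tid = submit(url, api_key)
            if tid:
                with open(SUBMIT_LOG, "a") as f:
                    f.write(json.dumps({"audio_url": url, "transcript_id": tid,
                                        "submit_ts": time.time()}) + "\n")
        n += 1
        time.sleep(max(0.0, 15 - (time.monotonic() - window_start)))

if __name__ == "__main__":
    key = os.environ.get("AAI_API_KEY")
    if key:
        run(["https://example.com/audio-001.mp3"], key)  # your pre-signed URLs
    else:
        print("Set AAI_API_KEY and pass your pre-signed URLs to run().")
```

Completion tracking (webhooks plus the polling fallback) is deliberately left out; wire it to the same `SUBMIT_LOG` so every `transcript_id` gets a matching `complete_ts`.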
Quick checklist
Before the run
- Account balance is sufficient for the full run
- Models, features, languages, and audio durations match production
- Audio source is a pre-signed URL with a TTL longer than queue time + TaT
- Pilot batch (50–200 files) completed end-to-end with the same configuration
- Target rate sized against your concurrency limit — see Size your target rate
- 15-second ramp designed (25 requests/window starting rate, ~8–9% growth)
- Webhook endpoint configured, with a polling fallback for missed callbacks
- Resumable submission log + dead-letter log in place
- Coordinated with AssemblyAI if your run exceeds 200 requests/minute, is a large bulk upload, uses the EU region, or you want support available during the run
During the run
- Logging `submit_ts`, `complete_ts`, `audio_duration`, `model`, `features`, and `status` per request
- Monitoring the healthy-run signals; ready to match against Diagnosing problems if anything drifts
Load tests only
- Plan to sustain target rate for at least 5–10 minutes
- Separate ramp-phase and sustain-phase metrics
- Reporting RTF alongside TaT percentiles
- Test window chosen per When to run