Bulk transcription and load testing
This guide applies to two closely related workloads:
- Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
- Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.
The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.
Jump to: Before you begin · Coordinate · When to run · Ramp · Traffic limits · Measure and verify · Diagnosing problems · Reference script · Checklist
Before you begin
Prerequisites
- Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1 and invalidates your results. Bulk runs and load tests both incur standard transcription usage charges — fund your account for the full expected volume before you submit any requests.
- Concurrency limit. The default for paid accounts is 200 concurrent jobs. Check your current limit on the Rate Limits page of your dashboard. If you need a higher limit, contact support — AssemblyAI offers custom concurrency limits at no additional cost.
- Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your `audio_url`. Each `/v2/upload` call counts against your HTTP rate limit and adds latency proportional to file size — at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to `/v2/upload` first to get a hosted URL, and factor those uploads into your rate-limit planning.
- Completion tracking. Configure webhooks or a polling strategy before you start — you’ll need a way to detect completion and record per-request timestamps.
Workload configuration
Match these to your expected production traffic:
- Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
- Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
- Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, etc. — each has different processing characteristics.
- Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
- Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
Coordinate with our team
Reach out to support@assemblyai.com or your account manager before you submit any requests if either of these applies:
- You plan to exceed 200 requests per minute.
- You’re running a large one-time upload (tens of thousands of files or more).
AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.
When you reach out, include:
- Expected request volume and ramp schedule, broken into 15-second windows
- Audio file durations and language breakdown
- Speech models and features you’ll enable
- Whether audio is single-channel or multichannel
- Preferred run window (see When to run)
When to run
- Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since throughput is highest during these periods.
- Large runs (200+ requests/minute): don’t start during business hours without coordination. A large uncoordinated spike during peak traffic can degrade performance for other customers and produce inaccurate results. Our team will pick a window and pre-scale for you.
- EU region: coordinate regardless of size.
Ramp up gradually
The most common mistake, for both bulk uploads and load tests, is submitting all requests at once. A gradual ramp gives pipeline components time to scale and produces results that match steady-state production performance.
Recommended schedule (validated for Universal-3 Pro with speaker labels):
- Divide your ramp into 15-second windows.
- Start at 25 requests per window.
- Grow by ~8–9% per window until you reach your target sustained rate.
- Keep traffic continuous between windows — gaps produce less representative results and can induce unexpected throttling.
- Do not exceed your account’s concurrency limit during the ramp.
If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.
Example: ramp to 100 requests/window (400/minute)
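As a sketch, the window-by-window rates for a ramp like this can be generated programmatically. This assumes the 25-request starting rate and a growth factor of 8.5% (within the recommended 8–9% range); exact per-window numbers may differ slightly from a hand-tuned schedule.

```python
def ramp_schedule(start=25, growth=0.085, target=100):
    """Requests per 15-second window: start at `start`, grow ~8.5% per
    window, and cap at `target` (the sustained rate)."""
    rates = [start]
    while rates[-1] < target:
        rates.append(min(round(rates[-1] * (1 + growth)), target))
    return rates

schedule = ramp_schedule()
# schedule[0] is 25, schedule[-1] is 100; the ramp spans len(schedule)
# windows, i.e. len(schedule) * 15 seconds before the sustain phase begins.
```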
After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.
How traffic limits work
Three separate mechanisms affect how quickly your requests get processed:
- Traffic shaping controls how fast new submissions begin processing. Your account starts with an allowance of 200 requests per 15-second window. When you exceed the allowance, AssemblyAI delays excess requests into the next window rather than rejecting them. The allowance grows by ~10% per window as long as demand continues, and no single request is delayed by more than 300 seconds.
- HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a `429` error.
- Concurrency limits cap how many jobs run in parallel. The default is 5 concurrent jobs for free accounts and 200+ for paid accounts. AssemblyAI automatically queues requests that exceed your limit in FIFO order and begins processing them as slots open. You won’t receive an error — the requests simply wait.
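To build intuition for the traffic-shaping behavior, here is a toy simulation of the allowance mechanics described above (200 requests per 15-second window, growing ~10% per window while demand continues, with excess carried into the next window). It is an illustration of the described rules, not a model of the real pipeline, and it does not model the 300-second delay cap.

```python
def simulate_shaping(submissions, start_allowance=200, growth=0.10):
    """Given new submissions per 15-second window, return how many requests
    begin processing in each window and how many are still delayed at the end.
    Excess requests carry over rather than being rejected."""
    allowance = float(start_allowance)
    backlog = 0
    processed = []
    for new in submissions:
        demand = backlog + new
        done = min(demand, int(allowance))
        processed.append(done)
        backlog = demand - done
        if demand > 0:
            allowance *= 1 + growth  # allowance grows while demand continues
    return processed, backlog

# Submitting 300 per window against a 200 allowance: early windows are
# throttled, and the allowance catches up over time.
processed, delayed = simulate_shaping([300, 300, 300, 300])
```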
Measure and verify
For every run — bulk or load test — track completion and record per-request metadata.
- Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
- Polling every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it only when webhooks aren’t available or when you need precise TaT during a load test.
- Retry failures with exponential backoff for 5xx responses — see Implement retry server error logic. Investigate 4xx responses; they indicate a client-side issue that retrying won’t fix.
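A minimal sketch of the retry policy above, where `send` is a placeholder for your actual submission call (it should return the HTTP status and response body). Only 5xx responses are retried, with exponentially growing delays; 4xx responses raise immediately since retrying won’t fix them.

```python
import time

def submit_with_retry(send, max_retries=5, base_delay=1.0):
    """Call `send()` until it succeeds; retry 5xx with exponential backoff,
    fail fast on 4xx."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status < 400:
            return body
        if status < 500:
            # Client error: fix the request instead of retrying.
            raise RuntimeError(f"client error {status}")
        if attempt == max_retries:
            raise RuntimeError(f"server error {status} after {max_retries} retries")
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```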
Record these fields per request:
- `submit_ts` — timestamp when `POST /v2/transcript` was sent
- `complete_ts` — timestamp when completion was detected
- `audio_duration` — length of the audio file, in seconds
- `model` — speech model used
- `features` — features enabled (e.g. `speaker_labels`, `auto_highlights`, `sentiment_analysis`)
- `status` — `completed` or `error`
- `id` — transcript ID, for debugging with support
Turnaround time = `complete_ts − submit_ts`.
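One way to capture these fields is a small record type; this is a sketch (field names follow the list above, and the example values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    id: str                  # transcript ID, for debugging with support
    submit_ts: float         # epoch seconds when POST /v2/transcript was sent
    complete_ts: float       # epoch seconds when completion was detected
    audio_duration: float    # audio length in seconds
    model: str               # speech model used
    features: list[str]      # features enabled
    status: str              # "completed" or "error"

    @property
    def turnaround(self) -> float:
        """Turnaround time = complete_ts - submit_ts."""
        return self.complete_ts - self.submit_ts
```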
Polling without exceeding the rate limit
If webhooks aren’t an option, stay within the 20,000 requests per 5 minutes budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count.
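The arithmetic behind that illustration can be scaled to your own numbers with a small helper (a sketch, not an official calculator):

```python
def five_minute_budget(submit_rps, in_flight, poll_interval_s, limit=20_000):
    """Estimate API calls consumed in a 5-minute (300-second) window:
    submissions plus one poll per in-flight job per polling interval.
    Returns (calls_used, fits_within_limit)."""
    submissions = submit_rps * 300
    polls = in_flight * (300 / poll_interval_s)
    used = submissions + polls
    return used, used <= limit

# The example from the text: 33 req/s submitted, ~330 jobs in flight,
# polling every 15 seconds.
used, ok = five_minute_budget(submit_rps=33, in_flight=330, poll_interval_s=15)
```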
If you’re running a bulk job
Monitor for these signals during the run:
- Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no `429` errors or throttle emails arrive.
- Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.
If you’re running a load test
- Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
- Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
- Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
- If p95 is more than 2× p50 during sustain, inspect the tail for very long or multichannel files.
- Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
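The percentile-by-duration-bucket analysis above can be sketched with the standard library. This assumes records are `(audio_duration_s, turnaround_s)` pairs and uses the bucket edges from the list (0–5, 5–15, 15–30, 30–60 minutes, expressed in seconds):

```python
import statistics

def bucket_percentiles(records,
                       buckets=((0, 300), (300, 900), (900, 1800), (1800, 3600))):
    """Group (audio_duration_s, turnaround_s) pairs into duration buckets
    and report p50, p95, and max turnaround per bucket."""
    out = {}
    for lo, hi in buckets:
        tats = sorted(t for d, t in records if lo <= d < hi)
        if len(tats) < 2:
            continue  # not enough samples for percentiles
        q = statistics.quantiles(tats, n=100, method="inclusive")
        out[(lo, hi)] = {"p50": q[49], "p95": q[94], "max": tats[-1]}
    return out
```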
Diagnosing problems
Reference implementation
When managing a large rate-limit budget, calling the API directly gives you more control than the SDKs’ internal polling behavior. If you prefer to use an SDK, see Transcribe multiple files simultaneously.
The following script ramps submissions to approximate the recommended schedule. The table above is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust `max_rate` to your target sustained rate.
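A minimal standard-library sketch of such a ramp script. The API key and URL list are placeholders you must supply, and the `send` parameter exists so the scheduling logic can be exercised without network access; a production version would add retry logic and completion tracking as described above.

```python
import json
import time
import urllib.request

API_KEY = "your-api-key-here"  # placeholder: set to your real key
ENDPOINT = "https://api.assemblyai.com/v2/transcript"

def submit(audio_url):
    """POST one transcript request and return the transcript ID."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps({"audio_url": audio_url}).encode(),
        headers={"authorization": API_KEY, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def ramp(urls, max_rate=100, start=25, growth=0.085, window_s=15, send=submit):
    """Submit `urls` in 15-second windows, starting at `start` requests per
    window and growing ~8.5% per window up to `max_rate`. Returns the IDs."""
    rate, it, ids = float(start), iter(urls), []
    while True:
        window_start = time.monotonic()
        batch = [u for _, u in zip(range(int(rate)), it)]
        if not batch:
            return ids  # all URLs submitted
        for url in batch:
            ids.append(send(url))
        rate = min(rate * (1 + growth), max_rate)
        # Keep traffic continuous: sleep out the remainder of the window.
        elapsed = time.monotonic() - window_start
        if elapsed < window_s:
            time.sleep(window_s - elapsed)
```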
Quick checklist
Both workloads
- Coordinated with AssemblyAI if your run exceeds 200 requests/minute or is a large bulk upload
- Audio source is a pre-signed URL where possible
- Account balance is sufficient for the full run
- Models, features, languages, and audio durations match production
- 15-second ramp designed (25 requests/window starting rate, ~8–9% growth)
- Submission rate stays within your concurrency limit
- Webhook endpoint or polling strategy configured
- Logging `submit_ts`, `complete_ts`, `audio_duration`, `model`, `features`, and `status` per request
Load tests only
- Plan to sustain target rate for at least 5–10 minutes
- Separate ramp-phase and sustain-phase metrics
- Test window chosen per When to run