Bulk transcription and load testing
This guide applies to two closely related workloads:
- Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
- Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.
The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.
Jump to: Before you begin · Coordinate · When to run · Ramp · Traffic limits · Measure and verify · Diagnosing problems · Reference script · Checklist
Before you begin
Prerequisites
- Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1 and invalidates your results. Bulk runs and load tests both incur standard transcription usage charges — fund your account for the full expected volume before you submit any requests.
- Concurrency limit. The default for paid accounts is 200 concurrent jobs. Check your current limit on the Rate Limits page of your dashboard. If you need a higher limit, contact support — AssemblyAI offers custom concurrency limits at no additional cost.
- Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your `audio_url`. Each `/v2/upload` call counts against your HTTP rate limit and adds latency proportional to file size — at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to `/v2/upload` first to get a hosted URL, and factor those uploads into your rate-limit planning.
- Completion tracking. Configure webhooks or a polling strategy before you start — you’ll need a way to detect completion and record per-request timestamps.
Workload configuration
Match these to your expected production traffic:
- Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
- Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
- Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, etc. — each has different processing characteristics.
- Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
- Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
Coordinate with our team
Reach out to support@assemblyai.com or your account manager before you submit any requests if either of these applies:
- You plan to exceed 200 requests per minute.
- You’re running a large one-time upload (tens of thousands of files or more).
AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.
When you reach out, include:
- Expected request volume and ramp schedule, broken into 15-second windows
- Audio file durations and language breakdown
- Speech models and features you’ll enable
- Whether audio is single-channel or multichannel
- Preferred run window (see When to run)
When to run
- Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since throughput is highest during these periods.
- Large runs (200+ requests/minute): don’t start during business hours without coordination. A large uncoordinated spike during peak traffic can degrade performance for other customers and produce inaccurate results. Our team will pick a window and pre-scale for you.
- EU region: coordinate regardless of size.
Ramp up gradually
The most common mistake, for both bulk uploads and load tests, is submitting all requests at once. A gradual ramp gives pipeline components time to scale and produces results that match steady-state production performance.
Recommended schedule (validated for Universal-3 Pro with speaker labels):
- Divide your ramp into 15-second windows.
- Start at 25 requests per window.
- Grow by ~8–9% per window until you reach your target sustained rate.
- Keep traffic continuous between windows — gaps produce less representative results and can induce unexpected throttling.
- Do not exceed your account’s concurrency limit during the ramp.
If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.
Example: ramp to 100 requests/window (400/minute)
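As a sketch, the window-by-window rates for a ramp like this can be generated programmatically. This assumes the 25-request starting rate and a growth factor of 8.5% (within the recommended 8–9% range); exact per-window numbers may differ slightly from a hand-tuned schedule.

```python
def ramp_schedule(start=25, growth=0.085, target=100):
    """Requests per 15-second window: start at `start`, grow ~8.5% per
    window, and cap at `target` (the sustained rate)."""
    rates = [start]
    while rates[-1] < target:
        rates.append(min(round(rates[-1] * (1 + growth)), target))
    return rates

schedule = ramp_schedule()
# schedule[0] is 25, schedule[-1] is 100; the ramp spans len(schedule)
# windows, i.e. len(schedule) * 15 seconds before the sustain phase begins.
```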
After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.
How traffic limits work
Three separate mechanisms affect how quickly your requests get processed:
- Traffic shaping controls how fast new submissions begin processing. Your account starts with an allowance of 200 requests per 15-second window. When you exceed the allowance, AssemblyAI delays excess requests into the next window rather than rejecting them. The allowance grows by ~10% per window as long as demand continues, and no single request is delayed by more than 300 seconds.
- HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a `429` error.
- Concurrency limits cap how many jobs run in parallel. The default is 5 concurrent jobs for free accounts and 200+ for paid accounts. AssemblyAI automatically queues requests that exceed your limit in FIFO order and begins processing them as slots open. You won’t receive an error — the requests simply wait.
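To build intuition for the traffic-shaping behavior, here is a toy simulation of the allowance mechanics described above (200 requests per 15-second window, growing ~10% per window while demand continues, with excess carried into the next window). It is an illustration of the described rules, not a model of the real pipeline, and it does not model the 300-second delay cap.

```python
def simulate_shaping(submissions, start_allowance=200, growth=0.10):
    """Given new submissions per 15-second window, return how many requests
    begin processing in each window and how many are still delayed at the end.
    Excess requests carry over rather than being rejected."""
    allowance = float(start_allowance)
    backlog = 0
    processed = []
    for new in submissions:
        demand = backlog + new
        done = min(demand, int(allowance))
        processed.append(done)
        backlog = demand - done
        if demand > 0:
            allowance *= 1 + growth  # allowance grows while demand continues
    return processed, backlog

# Submitting 300 per window against a 200 allowance: early windows are
# throttled, and the allowance catches up over time.
processed, delayed = simulate_shaping([300, 300, 300, 300])
```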
Measure and verify
For every run — bulk or load test — track completion and record per-request metadata.
- Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
- Polling every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it only when webhooks aren’t available or when you need precise TaT during a load test.
- Retry failures with exponential backoff for 5xx responses — see Implement retry server error logic. Investigate 4xx responses; they indicate a client-side issue that retrying won’t fix.
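A minimal sketch of the retry policy above, where `send` is a placeholder for your actual submission call (it should return the HTTP status and response body). Only 5xx responses are retried, with exponentially growing delays; 4xx responses raise immediately since retrying won’t fix them.

```python
import time

def submit_with_retry(send, max_retries=5, base_delay=1.0):
    """Call `send()` until it succeeds; retry 5xx with exponential backoff,
    fail fast on 4xx."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status < 400:
            return body
        if status < 500:
            # Client error: fix the request instead of retrying.
            raise RuntimeError(f"client error {status}")
        if attempt == max_retries:
            raise RuntimeError(f"server error {status} after {max_retries} retries")
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```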
Record these fields per request:
- `submit_ts` — timestamp when `POST /v2/transcript` was sent
- `complete_ts` — timestamp when completion was detected
- `audio_duration` — length of the audio file, in seconds
- `model` — speech model used
- `features` — features enabled (e.g. `speaker_labels`, `auto_highlights`, `sentiment_analysis`)
- `status` — `completed` or `error`
- `id` — transcript ID, for debugging with support
Turnaround time = `complete_ts − submit_ts`.
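One way to capture these fields is a small record type; this is a sketch (field names follow the list above, and the example values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    id: str                  # transcript ID, for debugging with support
    submit_ts: float         # epoch seconds when POST /v2/transcript was sent
    complete_ts: float       # epoch seconds when completion was detected
    audio_duration: float    # audio length in seconds
    model: str               # speech model used
    features: list[str]      # features enabled
    status: str              # "completed" or "error"

    @property
    def turnaround(self) -> float:
        """Turnaround time = complete_ts - submit_ts."""
        return self.complete_ts - self.submit_ts
```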
Polling without exceeding the rate limit
If webhooks aren’t an option, stay within the 20,000 requests per 5 minutes budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count.
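The arithmetic behind that illustration can be scaled to your own numbers with a small helper (a sketch, not an official calculator):

```python
def five_minute_budget(submit_rps, in_flight, poll_interval_s, limit=20_000):
    """Estimate API calls consumed in a 5-minute (300-second) window:
    submissions plus one poll per in-flight job per polling interval.
    Returns (calls_used, fits_within_limit)."""
    submissions = submit_rps * 300
    polls = in_flight * (300 / poll_interval_s)
    used = submissions + polls
    return used, used <= limit

# The example from the text: 33 req/s submitted, ~330 jobs in flight,
# polling every 15 seconds.
used, ok = five_minute_budget(submit_rps=33, in_flight=330, poll_interval_s=15)
```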
If you’re running a bulk job
Monitor for these signals during the run:
- Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no `429` errors or throttle emails arrive.
- Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.
If you’re running a load test
- Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
- Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
- Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
- If p95 is more than 2× p50 during sustain, inspect the tail for very long or multichannel files.
- Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
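The percentile-by-duration-bucket analysis above can be sketched with the standard library. This assumes records are `(audio_duration_s, turnaround_s)` pairs and uses the bucket edges from the list (0–5, 5–15, 15–30, 30–60 minutes, expressed in seconds):

```python
import statistics

def bucket_percentiles(records,
                       buckets=((0, 300), (300, 900), (900, 1800), (1800, 3600))):
    """Group (audio_duration_s, turnaround_s) pairs into duration buckets
    and report p50, p95, and max turnaround per bucket."""
    out = {}
    for lo, hi in buckets:
        tats = sorted(t for d, t in records if lo <= d < hi)
        if len(tats) < 2:
            continue  # not enough samples for percentiles
        q = statistics.quantiles(tats, n=100, method="inclusive")
        out[(lo, hi)] = {"p50": q[49], "p95": q[94], "max": tats[-1]}
    return out
```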
Diagnosing problems
Reference implementation
When managing a large rate-limit budget, calling the API directly gives you more control than the SDKs’ internal polling behavior. If you prefer to use an SDK, see Transcribe multiple files simultaneously.
The following script ramps submissions to approximate the recommended schedule. The table above is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust `max_rate` to your target sustained rate.
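A minimal standard-library sketch of such a ramp script. The API key and URL list are placeholders you must supply, and the `send` parameter exists so the scheduling logic can be exercised without network access; a production version would add retry logic and completion tracking as described above.

```python
import json
import time
import urllib.request

API_KEY = "your-api-key-here"  # placeholder: set to your real key
ENDPOINT = "https://api.assemblyai.com/v2/transcript"

def submit(audio_url):
    """POST one transcript request and return the transcript ID."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps({"audio_url": audio_url}).encode(),
        headers={"authorization": API_KEY, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def ramp(urls, max_rate=100, start=25, growth=0.085, window_s=15, send=submit):
    """Submit `urls` in 15-second windows, starting at `start` requests per
    window and growing ~8.5% per window up to `max_rate`. Returns the IDs."""
    rate, it, ids = float(start), iter(urls), []
    while True:
        window_start = time.monotonic()
        batch = [u for _, u in zip(range(int(rate)), it)]
        if not batch:
            return ids  # all URLs submitted
        for url in batch:
            ids.append(send(url))
        rate = min(rate * (1 + growth), max_rate)
        # Keep traffic continuous: sleep out the remainder of the window.
        elapsed = time.monotonic() - window_start
        if elapsed < window_s:
            time.sleep(window_s - elapsed)
```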
Quick checklist
Both workloads
- Coordinated with AssemblyAI if your run exceeds 200 requests/minute or is a large bulk upload
- Audio source is a pre-signed URL where possible
- Account balance is sufficient for the full run
- Models, features, languages, and audio durations match production
- 15-second ramp designed (25 requests/window starting rate, ~8–9% growth)
- Submission rate stays within your concurrency limit
- Webhook endpoint or polling strategy configured
- Logging `submit_ts`, `complete_ts`, `audio_duration`, `model`, `features`, and `status` per request
Load tests only
- Plan to sustain target rate for at least 5–10 minutes
- Separate ramp-phase and sustain-phase metrics
- Test window chosen per When to run