Running Bulk Transcription and Load Tests at Scale

This guide applies to two closely related workloads:

  • Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
  • Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.

The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.

Key recommendations

Category   | Recommendation
Ramp       | Submit in 15-second windows. Start at 25 requests/window and grow ~8–9% per window until you reach your target sustained rate.
Measure    | Use webhooks instead of polling. Record submit_ts, complete_ts, audio_duration, model, features, and status per request.
Coordinate | Recommended for runs above 200 requests/minute; required for large bulk uploads (tens of thousands of files) and for any EU workload.

Default concurrency for paid accounts is 200 concurrent jobs. If you need a higher limit, reach out to support@assemblyai.com — AssemblyAI offers custom concurrency limits at no additional cost.

Before you begin

Prerequisites

  • Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1 — stalling a bulk run and invalidating load-test results. Bulk runs and load tests both incur standard transcription usage charges — fund your account for the full expected volume before you submit any requests.
  • Concurrency limit. Check your current limit on the Rate Limits page of your dashboard and size your target submission rate against it — see Size your target rate.
  • Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your audio_url. Each /v2/upload call counts against your HTTP rate limit and adds latency proportional to file size — at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to /v2/upload first to get a hosted URL, and factor those uploads into your rate-limit planning.
  • Pre-signed URL expiration. Set URL TTLs long enough to outlast your expected queue time plus turnaround time. One hour is a safe default for small-to-moderate runs; use two hours or more for runs that approach your concurrency limit or use less-common languages. URLs that expire while a job is queued or processing surface as 4xx errors when AssemblyAI tries to retrieve the audio.
  • Completion tracking. Configure webhooks (with a polling fallback) or a polling-only strategy before you start — you’ll need a way to detect completion and record per-request timestamps.
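If your audio lives in S3, the TTL sizing above can be scripted. A minimal sketch, assuming boto3 and a hypothetical bucket/key layout — the 2× margin and one-hour floor are illustrative defaults drawn from the guidance above, not API requirements:

```python
def url_ttl_seconds(queue_s, tat_s, margin=2.0):
    """TTL sized to outlast expected queue time + TaT, with a safety
    margin, floored at the one-hour default for small-to-moderate runs."""
    return max(3600, int(margin * (queue_s + tat_s)))


def presigned_audio_url(bucket, key, queue_s, tat_s):
    """Generate a pre-signed GET URL whose expiry fits the run."""
    import boto3  # deferred so the TTL helper is usable without AWS deps

    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=url_ttl_seconds(queue_s, tat_s),
    )
```

For runs that approach your concurrency limit, feed a pessimistic queue-time estimate into `url_ttl_seconds` rather than raising the margin blindly.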

Workload configuration

Match these to your expected production traffic:

  • Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
  • Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
  • Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, summarization, and so on — each has different processing characteristics, and every feature you enable adds to TaT. Audit your request body and turn off what you don’t need; a faster run is also a cheaper one.
  • Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
  • Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
  • Cost estimate. Sanity-check the expected spend before you submit: total audio hours × your per-hour rate. Bulk runs are billed the same as any other transcription, so a large backfill can produce a surprisingly large invoice if you haven’t projected it.
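As a worked example of the estimate above — the per-hour rate here is hypothetical; substitute your own contracted rate:

```python
def estimated_cost(total_audio_hours, rate_per_hour):
    """Back-of-envelope spend projection: audio hours x per-hour rate."""
    return total_audio_hours * rate_per_hour


# 10,000 files averaging 12 minutes each, at a hypothetical $0.37/hour:
hours = 10_000 * 12 / 60  # 2,000 audio hours
print(f"${estimated_cost(hours, 0.37):,.2f}")  # prints $740.00
```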

Coordinate with our team

Reach out to support@assemblyai.com or your account manager before you submit any requests if any of these apply:

  • You plan to exceed 200 requests per minute.
  • You’re running a large one-time upload (tens of thousands of files or more).
  • You want support available during the run — for example, if you’re running outside US business hours.
  • You’re using the EU region, regardless of size.

When you reach out, include:

  • Expected request volume and ramp schedule, broken into 15-second windows
  • Audio file durations and language breakdown
  • Speech models and features you’ll enable
  • Whether audio is single-channel or multichannel
  • Preferred run window (see When to run)

AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.

Pilot first

Before a full bulk run or load test, submit a pilot batch of 50–200 files using the exact configuration you plan to use at scale — same model, same features, same language, same webhook receiver, same error-handling logic. A pilot verifies that:

  • The transcripts look right, and the model and feature set match what the downstream consumer expects.
  • Your webhook receiver is reachable, verifying signatures, and writing results durably — or your polling loop is keeping up without hitting rate limits.
  • Your retry logic handles 5xx correctly and your dead-letter path captures 4xx without silently dropping files.
  • Your ramp and concurrency controls behave as intended.

The most expensive bulk-run failures almost always come from discovering a configuration mistake — a wrong model, a feature flag left off, a webhook handler that drops results silently — after the whole batch has been billed. A pilot catches these while they’re cheap to fix, and gives you a realistic mean TaT to plug into Size your target rate.

When to run

  • Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since the pipeline is serving — and scaled for — its highest traffic during these periods.
  • Large runs (200+ requests/minute): coordinate with our team before starting. Our team will pick a window and pre-scale for you.
  • EU region: coordinate regardless of size.

Ramp up gradually

The most common mistake — for both bulk uploads and load tests — is submitting all requests at once. A gradual ramp gives the pipeline time to scale ahead of your traffic, which is what produces the lowest and most consistent turnaround times. Submitting a large spike upfront typically results in higher TaT as capacity catches up.

Recommended schedule (validated for Universal-3 Pro with speaker labels):

  • Divide your ramp into 15-second windows.
  • Start at 25 requests per window.
  • Grow by ~8–9% per window until you reach your target sustained rate.
  • Don’t pause mid-ramp. Stopping and restarting means ramping from the starting rate again, and you’ll see higher latency when traffic resumes.
  • Do not exceed your account’s concurrency limit during the ramp.

If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.

Example: ramp to 100 requests/window (400/minute)

Time window | Requests | Cumulative
0:00–0:15   | 25       | 25
0:15–0:30   | 27       | 52
0:30–0:45   | 29       | 81
0:45–1:00   | 31       | 112
1:00–1:15   | 33       | 145
1:15–1:30   | 35       | 180
1:30–1:45   | 38       | 218
1:45–2:00   | 41       | 259
2:00–2:15   | 44       | 303
2:15–2:30   | 47       | 350
2:30–2:45   | 51       | 401
2:45–3:00   | 55       | 456
3:00–3:15   | 59       | 515
3:15–3:30   | 64       | 579
3:30–3:45   | 69       | 648
3:45–4:00   | 75       | 723
4:00–4:15   | 81       | 804
4:15–4:30   | 88       | 892
4:30–4:45   | 95       | 987
4:45–5:00   | 100      | 1,087

To compute a ramp for any target without referring to the table, use rate_n = ceil(25 × 1.085ⁿ) (capped at your target rate), where n is the window index starting at 0.
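The formula translates directly into a small schedule generator. Note that the table above is hand-tuned, so these computed values may differ from it by a request or two per window:

```python
import math


def ramp_schedule(target_rate, start=25, growth=1.085):
    """Per-window request counts: rate_n = ceil(start * growth**n),
    capped at target_rate. Ends at the first window that hits the target."""
    rates = []
    n = 0
    while True:
        rate = min(math.ceil(start * growth ** n), target_rate)
        rates.append(rate)
        if rate >= target_rate:
            return rates
        n += 1


schedule = ramp_schedule(100)
print(len(schedule), "windows:", schedule)
```

Multiply `len(schedule)` by 15 seconds to get the ramp duration before the sustain phase begins.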

After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.

Size your target rate

Your concurrency limit caps how many jobs can be in progress at once, so your sustained submission rate needs to fit inside it. Pick a target rate that keeps the typical number of in-flight jobs comfortably below your limit:

  1. Estimate mean turnaround time from your pilot run or the published benchmarks.
  2. Multiply your target submission rate (requests per second) by that mean TaT to approximate the number of jobs that will be in flight at steady state.
  3. Keep that number under ~80% of your concurrency limit. If it’s higher, lower the target rate, shorten mean TaT (fewer features, shorter audio, a faster model), or request a higher concurrency limit.

The 20% headroom absorbs normal variation in audio duration, warm-up effects, and webhook-receive latency. Runs sized exactly to the limit will see TaT climb as the in-flight count bumps against the cap.
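The sizing steps above are an application of Little's law (in-flight jobs ≈ arrival rate × mean TaT). A minimal sketch with hypothetical pilot numbers:

```python
def max_sustained_rate(concurrency_limit, mean_tat_s, headroom=0.8):
    """Largest submission rate (req/s) that keeps steady-state in-flight
    jobs under `headroom` x the concurrency limit (Little's law: L = lambda * W)."""
    return headroom * concurrency_limit / mean_tat_s


# Hypothetical inputs: a 200-job concurrency limit, 30 s mean TaT from a pilot.
rate = max_sustained_rate(200, 30)  # ~5.3 req/s, i.e. ~80 requests per 15 s window
```

If the resulting rate is below what you need, the levers are the same as in the steps above: shorten mean TaT or request a higher concurrency limit.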

Measure and verify

For every run — bulk or load test — track completion and record per-request metadata.

  • Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
  • Run a polling fallback alongside webhooks. Webhooks can drop for many reasons — receiver downtime, signature mismatches, transient network failures. For every submitted job, record an expected completion deadline (around 2× mean TaT from your pilot) and GET /v2/transcript/{id} for any job whose webhook hasn’t arrived by then. The fallback protects you from silent data loss without materially increasing your rate-limit usage.
  • Polling-only every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it when webhooks aren’t available or when you need precise TaT during a load test.
  • Retry failures with exponential backoff for 5xx responses — see Implement retry server error logic. Investigate 4xx responses; they indicate a client-side issue that retrying won’t fix.
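The webhook-fallback strategy above can be sketched as a periodic sweep over pending jobs. This is an illustrative sketch, not AssemblyAI SDK code — the 2× mean-TaT deadline follows the guidance above, and the endpoint is the documented GET /v2/transcript/{id}:

```python
import time

BASE_URL = "https://api.assemblyai.com"
HEADERS = {"authorization": "YOUR_API_KEY"}


def overdue_ids(pending, mean_tat_s, now):
    """IDs whose completion webhook is overdue: submitted more than
    ~2x the pilot's mean TaT ago and still unresolved."""
    deadline = 2 * mean_tat_s
    return [tid for tid, ts in pending.items() if now - ts > deadline]


def fallback_poll(pending, mean_tat_s):
    """Sweep `pending` (transcript_id -> submit_ts) and GET any overdue
    job directly, so a dropped webhook can't silently lose a result."""
    import requests  # deferred so overdue_ids stays usable without the dependency

    results = {}
    for tid in overdue_ids(pending, mean_tat_s, time.time()):
        resp = requests.get(
            f"{BASE_URL}/v2/transcript/{tid}", headers=HEADERS, timeout=30
        )
        resp.raise_for_status()
        body = resp.json()
        if body["status"] in ("completed", "error"):
            results[tid] = body  # terminal; drop from `pending` after recording
    return results
```

Run the sweep on a timer (for example, once per minute) and remember that each fallback GET counts against the same HTTP rate-limit budget as everything else.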

Record these fields per request:

  • submit_ts — timestamp when POST /v2/transcript was sent
  • complete_ts — timestamp when completion was detected
  • audio_duration — length of the audio file, in seconds
  • model — speech model used
  • features — features enabled (e.g. speaker_labels, auto_highlights, sentiment_analysis)
  • status — completed or error
  • id — transcript ID, for debugging with support

Turnaround time = complete_ts − submit_ts.

Normalize TaT by audio duration to get the real-time factor: RTF = turnaround_time ÷ audio_duration. An RTF of 0.5 means the API processed the file in half its audio duration. RTF is the headline metric for comparing runs across regions, models, and audio-duration buckets — raw TaT varies too much with audio length to mean anything on its own.
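Computing both metrics from the recorded fields is a two-liner:

```python
def turnaround_and_rtf(record):
    """Derive TaT (seconds) and real-time factor from the per-request fields."""
    tat = record["complete_ts"] - record["submit_ts"]
    rtf = tat / record["audio_duration"]
    return tat, rtf


# A 10-minute file that completed 3 minutes after submission:
tat, rtf = turnaround_and_rtf(
    {"submit_ts": 1000.0, "complete_ts": 1180.0, "audio_duration": 600.0}
)
# tat = 180.0 s, rtf = 0.3
```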

Polling without exceeding the rate limit

HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a 403 error.

If webhooks aren’t an option, stay within that budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count:

Polling interval | GETs/s at 330 in-flight | Total req/s (at 33 POST/s) | Within limit?
Every 3s         | ~110                    | ~143                       | No
Every 5s         | ~66                     | ~99                        | No
Every 10s        | ~33                     | ~66                        | Yes
Every 15s        | ~22                     | ~55                        | Yes

When many jobs share the same polling interval they tend to cluster at the same second boundaries, spiking your rate-limit usage and occasionally returning 403 errors. Stagger each job’s polling by ±25% of the interval (for example, interval × random.uniform(0.75, 1.25)) so GETs spread evenly across the window.
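The suggested stagger in code:

```python
import random


def next_poll_delay(interval_s):
    """Stagger each job's polls by +/-25% of the nominal interval so GETs
    don't cluster at the same second boundaries across many in-flight jobs."""
    return interval_s * random.uniform(0.75, 1.25)
```

Call this before each sleep in your polling loop rather than once per job, so the offsets keep drifting apart over the life of the run.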

If you’re running a bulk job

Monitor for these signals during the run:

  • Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no errors arrive.
  • Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.

If you’re running a load test

  • Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
  • Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
  • Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
  • Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
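A sketch of the bucketed report described above, using nearest-rank percentiles over the recorded per-request fields (bucket boundaries follow the durations listed; extend the percentile tuple to p75/p90/p99/max as needed):

```python
import math

BUCKETS = [(0, 300), (300, 900), (900, 1800), (1800, 3600)]  # audio seconds


def percentile(sorted_vals, p):
    """Nearest-rank percentile on a pre-sorted, non-empty list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]


def bucket_report(records, ps=(50, 95)):
    """TaT percentiles per audio-duration bucket."""
    report = {}
    for lo, hi in BUCKETS:
        tats = sorted(
            r["complete_ts"] - r["submit_ts"]
            for r in records
            if lo <= r["audio_duration"] < hi
        )
        if tats:
            report[f"{lo // 60}-{hi // 60} min"] = {
                f"p{p}": percentile(tats, p) for p in ps
            }
    return report
```

Run it separately over ramp-phase and sustain-phase records so the two sets of numbers never mix.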

Diagnosing problems

Signal | Meaning | Response
403 HTTP error | You’ve exceeded the 20,000 requests per 5 minutes HTTP rate limit (polling counts) | Increase polling interval or switch to webhooks; slow your submission rate
4xx submission error (other than 403) | Client-side issue (bad request, auth, invalid audio URL, etc.) | Inspect the response body and fix the request; retrying won’t help
5xx submission error | Transient server-side issue | Retry with exponential backoff — see the retry guide
TaT rises with no errors | Submission rate is outpacing available capacity | Slow the ramp or extend its duration; verify you’re within your concurrency limit

Reference implementation

When managing a large rate-limit budget, calling the API directly gives you more control than the SDKs’ internal polling behavior. If you prefer to use an SDK, see Transcribe multiple files simultaneously.

The following script ramps submissions to approximate the recommended schedule, retries transient server errors, writes unrecoverable failures to a dead-letter log, and persists submitted file → transcript_id pairs so the run is resumable after a crash. The table above is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust max_rate to your target sustained rate.

import json
import logging
import math
import random
import time
from collections import deque
from pathlib import Path

import requests

logger = logging.getLogger(__name__)

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.assemblyai.com"
HEADERS = {"authorization": API_KEY, "content-type": "application/json"}

STATE_PATH = Path("bulk_state.jsonl")  # Append-only submission log, for resume
FAILURES_PATH = Path("bulk_failures.jsonl")  # Dead-letter log for 4xx and unrecoverable 5xx


def append_jsonl(path, entry):
    with path.open("a") as f:
        f.write(json.dumps(entry) + "\n")


def already_submitted(path):
    """Return the set of file URLs successfully submitted in previous runs."""
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["file"] for line in f if line.strip()}


def submit_file(file_url, max_retries=3):
    """POST one file. Retries 5xx and connection errors with exponential backoff + jitter."""
    data = {
        # Match your production configuration: model, features, language, etc.
        "audio_url": file_url,
        "speech_models": ["universal-3-pro"],
    }
    for attempt in range(max_retries + 1):
        try:
            resp = requests.post(
                f"{BASE_URL}/v2/transcript",
                headers=HEADERS,
                json=data,
                timeout=30,
            )
        except requests.RequestException:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt + random.random())
            continue

        if 500 <= resp.status_code < 600 and attempt < max_retries:
            time.sleep(2 ** attempt + random.random())
            continue
        resp.raise_for_status()
        return resp.json()["id"]


def submit_all(files, max_rate):
    """
    Submit files in 15-second windows, starting at 25 requests per window
    and growing ~8–9% per window until reaching max_rate.

    Files already recorded in STATE_PATH are skipped on restart. Failed
    submissions are written to FAILURES_PATH for inspection and manual replay.
    """
    done = already_submitted(STATE_PATH)
    remaining = deque(f for f in files if f not in done)

    rate = 25       # Starting requests per window
    window = 15     # Seconds per window
    growth = 1.085  # Per-window growth factor
    window_num = 0

    while remaining:
        window_num += 1
        batch_size = min(rate, len(remaining))

        for _ in range(batch_size):
            file = remaining.popleft()
            try:
                transcript_id = submit_file(file)
                append_jsonl(STATE_PATH, {"file": file, "id": transcript_id})
            except requests.HTTPError as exc:
                status = exc.response.status_code if exc.response is not None else None
                logger.exception("Submission failed (status=%s): %s", status, file)
                append_jsonl(
                    FAILURES_PATH,
                    {"file": file, "http_status": status, "error": str(exc)},
                )
            except Exception as exc:
                logger.exception("Submission failed: %s", file)
                append_jsonl(FAILURES_PATH, {"file": file, "error": str(exc)})

        logger.info(
            "Window %d | Rate: %d/window | Submitted: %d | Remaining: %d",
            window_num, rate, batch_size, len(remaining),
        )

        rate = min(math.ceil(rate * growth), max_rate)

        if remaining:
            time.sleep(window)


# Usage:
# file_urls = ["https://example.com/audio1.mp3", ...]
# submit_all(file_urls, max_rate=100)  # Target: 100 requests per 15-second window
#
# Pick max_rate using the guidance in "Size your target rate".
# Use webhooks (with a polling fallback) to track completion; account for GETs
# in your rate-limit budget.

Quick checklist

Before the run

  • Account balance is sufficient for the full run
  • Models, features, languages, and audio durations match production
  • Audio source is a pre-signed URL with a TTL longer than queue time + TaT
  • Pilot batch (50–200 files) completed end-to-end with the same configuration
  • Target rate sized against your concurrency limit — see Size your target rate
  • 15-second ramp designed (25 requests/window starting rate, ~8–9% growth)
  • Webhook endpoint configured, with a polling fallback for missed callbacks
  • Resumable submission log + dead-letter log in place
  • Coordinated with AssemblyAI if your run exceeds 200 requests/minute, is a large bulk upload, uses the EU region, or you want support available during the run

During the run

  • Logging submit_ts, complete_ts, audio_duration, model, features, and status per request
  • Monitoring the healthy-run signals; ready to match against Diagnosing problems if anything drifts

Load tests only

  • Plan to sustain target rate for at least 5–10 minutes
  • Separate ramp-phase and sustain-phase metrics
  • Reporting RTF alongside TaT percentiles
  • Test window chosen per When to run