Large-scale audio transcription: Handling hours of content efficiently
Large-scale audio transcription converts thousands of audio files into accurate, searchable text quickly. Process hours of content efficiently with batch tools.



Processing thousands of audio files simultaneously requires a fundamentally different approach than transcribing files one at a time. This tutorial shows you how to build an asynchronous batch transcription system that handles massive audio libraries, from entire podcast catalogs to years of meeting recordings, in parallel rather than sequentially.
You'll implement this system using Python and the AssemblyAI SDK, covering concurrent job submission, status polling, and multi-format export. Because every file is submitted concurrently, total completion time is determined by your longest individual file rather than the sum of all files. By the end, you'll have a production-ready system that efficiently transforms hours of audio into searchable transcripts, speaker-labeled text, and subtitle files.
What architecture handles hours of audio efficiently
Large-scale audio transcription means processing thousands of audio files at once instead of one at a time. This lets you transcribe entire podcast libraries or years of meeting recordings simultaneously rather than waiting for each file to finish before starting the next.
The key difference is asynchronous processing versus synchronous processing. Synchronous processing is like washing dishes one by one—you finish washing one dish completely before starting the next. Asynchronous processing is like loading a dishwasher—you put all the dishes in at once and they all get cleaned simultaneously.
Here's what makes async batch transcription work:
- Concurrent job submission: Upload and start processing thousands of files at the same time
- Status monitoring: Check which jobs are done without stopping the ones still running
- Result collection: Gather completed transcripts as they finish, not in any particular order
- Error handling: Retry failed jobs without affecting successful ones
You have two ways to track job progress: polling and webhooks. Polling means you periodically check job status yourself. Webhooks mean the transcription service notifies you when jobs complete.
Most batch processing scenarios work better with polling because you control when to check status and don't need to set up external endpoints.
How to implement async batch transcription in Python
You'll build this batch transcription system in five steps. Each step handles a specific part of the process, from setup to final export.
Set up the AssemblyAI Python SDK and authenticate
Install the AssemblyAI SDK using pip:
pip install assemblyai
Set your API key as an environment variable for security:
export ASSEMBLYAI_API_KEY="your-api-key-here"
Create your Python script and set up the client:
import assemblyai as aai
import os
import json
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
# Set up authentication
aai.settings.api_key = os.environ.get("ASSEMBLYAI_API_KEY")
transcriber = aai.Transcriber()
Never put your API key directly in your code. Environment variables keep your credentials secure even if someone sees your source code.
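Since a missing key only surfaces as an authentication error on the first API call, a quick guard up front saves debugging time. A minimal sketch (not required by the SDK):

# Fail fast if the key was never exported
if not aai.settings.api_key:
    raise RuntimeError("ASSEMBLYAI_API_KEY is not set")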
Prepare a manifest of S3/GCS URLs for batch jobs
A manifest is a list of all the audio files you want to transcribe. Think of it like a grocery list—it tells the system exactly what to process.
Your audio files need to be stored online (like Amazon S3 or Google Cloud Storage) so the transcription service can access them. Create a CSV file listing your files:
def create_manifest_from_csv(csv_file):
    manifest = []
    with open(csv_file, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            manifest.append({
                'audio_url': row['audio_url'],
                'file_id': row['file_id'],
                # "or 0" guards against an empty duration_estimate cell
                'duration': float(row.get('duration_estimate') or 0)
            })
    return manifest
Your CSV should look like this:
audio_url,file_id,duration_estimate
https://storage.example.com/meeting1.mp3,meeting_001,3600
https://storage.example.com/meeting2.mp3,meeting_002,2400
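If your files sit in a private S3 bucket, the transcription service can't fetch plain object URLs; presigned URLs solve this. A minimal sketch using boto3 (the bucket name and keys are placeholders):

import boto3

def presign_s3_urls(bucket, keys, expires_in=86400):
    # Presigned GET URLs grant temporary read access to private objects
    s3 = boto3.client('s3')
    return [
        s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expires_in  # seconds; set longer than your batch runtime
        )
        for key in keys
    ]

# Hypothetical usage:
# urls = presign_s3_urls('my-audio-bucket', ['meeting1.mp3', 'meeting2.mp3'])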
For more complex setups, use JSON instead:
def create_manifest_from_json(json_file):
    with open(json_file, 'r') as file:
        return json.load(file)

The JSON file itself looks like this:

[
  {
    "audio_url": "https://storage.example.com/podcast1.mp3",
    "file_id": "episode_001",
    "duration": 3600
  }
]
Submit async transcription jobs concurrently
Now you'll submit all your transcription jobs at once. This uses Python's ThreadPoolExecutor to handle multiple submissions simultaneously:
def submit_transcription_job(audio_item, config):
    try:
        transcript = transcriber.submit(
            audio_item['audio_url'],
            config=config
        )
        return {
            'file_id': audio_item['file_id'],
            'job_id': transcript.id,
            'status': transcript.status,
            'audio_url': audio_item['audio_url']
        }
    except Exception as e:
        return {
            'file_id': audio_item['file_id'],
            'error': str(e),
            'audio_url': audio_item['audio_url']
        }

def submit_batch_jobs(manifest, max_workers=50):
    # Configure what features you want
    config = aai.TranscriptionConfig(
        speaker_labels=True,  # Identify different speakers
        punctuate=True,
        format_text=True
    )
    submitted_jobs = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(submit_transcription_job, item, config)
                   for item in manifest]
        for future in as_completed(futures):
            result = future.result()
            submitted_jobs.append(result)
            if 'error' in result:
                print(f"Failed: {result['file_id']}")
            else:
                print(f"Submitted: {result['file_id']}")
    return submitted_jobs
The max_workers=50 setting controls how many jobs you submit at once. Start with 50 and adjust based on your internet connection and system performance.
Key configuration options:
- speaker_labels=True: Identifies who's speaking when
- punctuate=True: Adds proper punctuation
- format_text=True: Capitalizes sentences correctly
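Putting the first three steps together, assuming your manifest lives in a file named files.csv like the example above:

manifest = create_manifest_from_csv("files.csv")
submitted_jobs = submit_batch_jobs(manifest, max_workers=50)
print(f"Submitted {len(submitted_jobs)} jobs")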
Poll job statuses and collect transcripts
After submitting jobs, you need to check when they're done and collect the results. This polling function checks all your jobs periodically:
def poll_and_collect_results(submitted_jobs, poll_interval=30):
    completed_transcripts = []
    pending_jobs = [job for job in submitted_jobs if 'error' not in job]
    while pending_jobs:
        print(f"Checking {len(pending_jobs)} jobs...")
        still_pending = []
        for job in pending_jobs:
            try:
                # Fetch the latest state of this transcript by its job ID
                transcript = aai.Transcript.get_by_id(job['job_id'])
                if transcript.status == aai.TranscriptStatus.completed:
                    result = {
                        'file_id': job['file_id'],
                        'job_id': job['job_id'],
                        'text': transcript.text,
                        # Convert SDK objects to plain dicts so results stay JSON-serializable
                        'speakers': [
                            {'speaker': u.speaker, 'start': u.start, 'end': u.end, 'text': u.text}
                            for u in (transcript.utterances or [])
                        ],
                        'words': [
                            {'text': w.text, 'start': w.start, 'end': w.end, 'confidence': w.confidence}
                            for w in (transcript.words or [])
                        ],
                        'confidence': transcript.confidence
                    }
                    completed_transcripts.append(result)
                    print(f"Completed: {job['file_id']}")
                elif transcript.status == aai.TranscriptStatus.error:
                    print(f"Failed: {job['file_id']} - {transcript.error}")
                else:
                    still_pending.append(job)
            except Exception as e:
                print(f"Error checking {job['file_id']}: {e}")
                still_pending.append(job)
        pending_jobs = still_pending
        if pending_jobs:
            time.sleep(poll_interval)
    return completed_transcripts
This function keeps checking until all jobs are done. It waits 30 seconds between checks by default—you can make this shorter for faster updates or longer to reduce API calls.
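For very large batches, a fixed interval wastes API calls early on, when nothing has finished yet. A common refinement (a sketch, not an SDK feature) is to stretch the interval over time:

def next_poll_interval(current, factor=1.5, max_interval=300):
    # Poll frequently at first, then back off to reduce API calls
    return min(current * factor, max_interval)

# Inside the polling loop, replace the fixed sleep with:
# poll_interval = next_poll_interval(poll_interval)
# time.sleep(poll_interval)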
Export JSON, JSONL, SRT, VTT with speaker diarization and timestamps
Once transcription finishes, export your results in the format you need. Different formats serve different purposes: JSON preserves every detail for downstream processing, plain text is easy to search, speaker-labeled text supports review, and SRT or VTT files drive video subtitles.
Here's how to export in multiple formats:
def export_transcripts(completed_transcripts, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for transcript in completed_transcripts:
        file_id = transcript['file_id']
        # Export as JSON (includes everything)
        json_path = os.path.join(output_dir, f"{file_id}.json")
        with open(json_path, 'w') as f:
            json.dump(transcript, f, indent=2)
        # Export as plain text
        txt_path = os.path.join(output_dir, f"{file_id}.txt")
        with open(txt_path, 'w') as f:
            f.write(transcript['text'])
        # Export with speaker labels (speakers were stored as plain dicts above)
        if transcript.get('speakers'):
            speakers_path = os.path.join(output_dir, f"{file_id}_speakers.txt")
            with open(speakers_path, 'w') as f:
                for utterance in transcript['speakers']:
                    f.write(f"Speaker {utterance['speaker']}: {utterance['text']}\n")
For subtitle files (SRT format), you need to convert word-level timestamps into subtitle segments:
def export_as_srt(words, output_path):
    segments = []
    current_segment = []
    segment_start = None
    for word in words:
        if segment_start is None:
            segment_start = word['start'] / 1000  # timestamps are in milliseconds
        current_segment.append(word['text'])
        # Create 3-second segments
        if (word['end'] / 1000) - segment_start >= 3.0:
            segments.append({
                'start': segment_start,
                'end': word['end'] / 1000,
                'text': ' '.join(current_segment)
            })
            current_segment = []
            segment_start = None
    # Flush any words left over after the last full segment
    if current_segment:
        segments.append({
            'start': segment_start,
            'end': words[-1]['end'] / 1000,
            'text': ' '.join(current_segment)
        })
    # Write SRT file
    with open(output_path, 'w') as f:
        for i, segment in enumerate(segments, 1):
            start = format_timestamp(segment['start'])
            end = format_timestamp(segment['end'])
            f.write(f"{i}\n{start} --> {end}\n{segment['text']}\n\n")
def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
How to plan throughput and cost for large batches
Planning your batch processing means understanding how long it takes and how much it costs. The good news: processing time doesn't increase much with more files.
When you process files concurrently, your total time equals roughly the longest individual file. A thousand 1-hour files finish in about the same time as a single 1-hour file—around 15-20 minutes.
Here's how to estimate costs and timing:
def estimate_batch_cost(manifest, price_per_minute=0.01):
    # Manifest durations are in seconds; convert the total to minutes
    total_minutes = sum(item.get('duration', 0) for item in manifest) / 60
    estimated_cost = total_minutes * price_per_minute
    return {
        'total_files': len(manifest),
        'total_minutes': round(total_minutes),
        'estimated_cost': round(estimated_cost, 2)
    }
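For the two-file CSV from earlier (3,600 s + 2,400 s of audio, or 100 minutes total), the estimate works out as follows; the 0.01 default is a placeholder rate, so substitute your actual per-minute price:

estimate = estimate_batch_cost(manifest)
print(estimate)
# {'total_files': 2, 'total_minutes': 100, 'estimated_cost': 1.0}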
Cost optimization tips:
- Only enable features you need: Speaker diarization and word timestamps add processing time
- Group similar content: Process similar audio types together for consistency
- Retry failed jobs: Build in automatic retry logic for network issues (see the sketch below)
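A simple retry pass built on the functions above; it resubmits anything that errored during submission, up to a capped number of attempts:

def retry_failed_jobs(manifest, submitted_jobs, max_retries=3):
    # File IDs that errored when first submitted
    failed_ids = {job['file_id'] for job in submitted_jobs if 'error' in job}
    retries = 0
    while failed_ids and retries < max_retries:
        retry_manifest = [item for item in manifest if item['file_id'] in failed_ids]
        print(f"Retrying {len(retry_manifest)} jobs (attempt {retries + 1})")
        results = submit_batch_jobs(retry_manifest)
        submitted_jobs.extend(job for job in results if 'error' not in job)
        failed_ids = {job['file_id'] for job in results if 'error' in job}
        retries += 1
    return submitted_jobs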
Final Words
This async batch transcription approach transforms massive audio libraries into searchable text in hours instead of weeks. You submit thousands of jobs simultaneously, poll for completion, and export results in whatever format your project needs.
AssemblyAI's unlimited concurrency removes the bottlenecks that slow down other transcription services, while industry-leading accuracy handles challenging audio conditions like background noise and diverse accents. The straightforward pricing model makes large-scale projects predictable—you pay per minute of audio with no hidden fees for parallel processing.
FAQ
Can I process more than 10,000 audio files at once with this approach?
Yes, this batch processing method scales to tens of thousands of files since you're not limited by concurrent job restrictions.
What happens if some audio files fail to transcribe during batch processing?
Failed jobs return error messages in your results, and you can automatically retry them or process them separately without affecting successful transcriptions.
Which audio file formats work best for large-scale batch transcription?
MP3, WAV, and M4A files provide the most reliable processing, with MP3 offering the best balance of quality and file size for batch uploads.
How accurate is speaker identification when processing thousands of files with different audio quality?
Speaker diarization accuracy depends on audio quality and number of speakers, typically working well for 2-6 speakers in clear audio but may struggle with overlapping speech or poor quality recordings.
Does enabling word-level timestamps significantly slow down large batch processing?
Word-level timestamps add minimal processing time to batch jobs, usually increasing completion time by only a few minutes per hour of audio.