Build & Learn
October 22, 2025

Video transcription made simple: From segments to timestamps

AI video transcription converts your videos to accurate, timestamped text in minutes. Transcribe files to TXT, SRT, or VTT for search, captions, and editing.

Kelsey Foster
Growth

This tutorial shows you how to build a complete video transcription system that converts spoken words into timestamped text using Python and AssemblyAI's API. You'll create a solution that processes video files and generates multiple output formats—plain text for documentation, SRT files for video editing, and VTT files for web players. The system handles everything from single video files to production-scale batch processing with speaker identification.

You'll use AssemblyAI's Python SDK for async transcription, Python's built-in file handling for format conversion, and optional concurrent processing libraries for scaling to multiple videos. The implementation covers segment extraction, timestamp formatting, and export functions that produce industry-standard caption files compatible with major video editing platforms and streaming services.

What is AI video transcription and why timestamps matter

AI video transcription converts the spoken words in your video files into text with precise timestamps, a capability serving a video transcription market valued at $30.42 billion in 2024. Each piece of text includes the exact start and end times when those words were spoken in your video, so you get searchable text plus the timing data needed to create captions, jump to specific moments, or build interactive video experiences.

The timestamp precision is what separates basic transcription from video-ready output. Without accurate timing, you can't sync captions with speech or create clickable transcripts that jump to specific video moments.

Here's what different timestamp formats enable:

| Format | Use Case | Timestamp Precision | Compatibility |
| --- | --- | --- | --- |
| TXT | Search indexing, documentation | None or paragraph-level | Universal text editors |
| SRT | Video editing, social platforms | Millisecond (00:00:00,000) | Premiere, Final Cut, YouTube |
| VTT | Web video players | Millisecond (00:00:00.000) | HTML5 video, streaming platforms |
| JSON | Custom applications | Millisecond with metadata | APIs, databases, analytics |

The segments structure makes your video content actionable. You can search for specific phrases and jump directly to that moment, generate subtitles that appear at the right time, or identify key topics for video chapters.
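For example, with the segment list built in Step 3 below (dictionaries with millisecond start and end values and a text field), a few lines of Python turn the transcript into a jump-to-moment index. The phrase "pricing" is just an illustration:

def find_phrase(segments, phrase):
   """Return (start_seconds, text) for every segment that mentions the phrase"""
   phrase = phrase.lower()
   return [(s["start"] / 1000, s["text"]) for s in segments if phrase in s["text"].lower()]

# Print every moment where "pricing" comes up
for start, text in find_phrase(segments, "pricing"):
   print(f"{start:.1f}s: {text}")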

Transcribe a video to text with Python

You'll use AssemblyAI's Python SDK to transcribe video files asynchronously. This approach handles large files efficiently and provides detailed progress tracking while you build a complete solution that processes videos and generates multiple output formats.

The workflow has five steps: install the SDK, submit videos for transcription, retrieve timestamped segments, export to different formats, and add speaker identification when needed.

Step 1. Install and set up the Python SDK

Start by installing the AssemblyAI Python package. You'll need Python 3.8 or higher and an API key from your AssemblyAI dashboard.

# Install the SDK
pip install assemblyai

# Create a new Python file: video_transcriber.py
import assemblyai as aai
import json
from pathlib import Path

# Configure your API key
aai.settings.api_key = "your-api-key-here"

# Initialize the transcriber client
transcriber = aai.Transcriber()

Store your API key as an environment variable for security:

# Install python-dotenv
pip install python-dotenv

# Updated imports
import os
from dotenv import load_dotenv
import assemblyai as aai

# Load environment variables
load_dotenv()
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

Create a .env file in your project directory and add your API key:

ASSEMBLYAI_API_KEY=your-actual-api-key-here

Step 2. Submit the video for async transcription

The transcriber accepts both local files and URLs to videos hosted online. For local files, the SDK handles uploading automatically, while cloud-hosted videos process faster since there's no upload step.

def transcribe_video(video_path, language_code="en"):
   """
   Submit a video for transcription with automatic language detection
   
   Args:
       video_path: Path to local video file or URL
       language_code: ISO language code or None for auto-detection
   
   Returns:
       Transcript object with segments and metadata
   """
   
   # Configure transcription settings (auto-detect language when none is given)
   config = aai.TranscriptionConfig(
       language_code=language_code,
       language_detection=language_code is None,
       punctuate=True,
       format_text=True
   )
   
   # Submit for async transcription
   transcript = transcriber.transcribe(video_path, config=config)
   
   # The SDK automatically polls for completion
   if transcript.status == aai.TranscriptStatus.error:
       raise Exception(f"Transcription failed: {transcript.error}")
   
   return transcript

# Example usage with a local file
video_file = "meeting_recording.mp4"
transcript = transcribe_video(video_file)

print(f"Transcription ID: {transcript.id}")
print(f"Duration: {transcript.audio_duration} seconds")
print(f"Confidence: {transcript.confidence}")

The async approach handles videos of any length efficiently. The SDK manages polling intervals and connection handling, so you don't need to implement retry logic yourself.
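If you'd rather not block while a job runs, recent versions of the SDK also expose a submit-then-fetch pattern. Here's a minimal sketch under the assumption that Transcriber.submit() and Transcript.get_by_id() are available in your installed version:

# Queue the job without waiting for it to finish
pending = transcriber.submit("meeting_recording.mp4")

# ...later, fetch the transcript by ID and check its status
completed = aai.Transcript.get_by_id(pending.id)
if completed.status == aai.TranscriptStatus.completed:
   print(completed.text)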

Get your API key to start transcribing

You just submitted async jobs with the Python SDK. Create an account to get an API key and run the code on your own videos.

Get free API key

Step 3. Retrieve segments with timestamps

The transcript object contains a words array with detailed timing for each word. You'll use the SDK's built-in methods to group these into meaningful segments for caption generation.

def get_timestamped_segments(transcript):
   """
   Extract segments with precise timestamps using SDK methods
   
   Returns:
       List of segments with start, end, and text fields
   """
   # Use built-in SDK methods for accurate segmentation
   sentences = transcript.get_sentences()
   
   segments = []
   for sentence in sentences:
       segment = {
           "start": sentence.start,
           "end": sentence.end,
           "text": sentence.text,
           "confidence": sentence.confidence
       }
       segments.append(segment)
   
   return segments

# For larger segments, use paragraphs instead
def get_paragraph_segments(transcript):
   """
   Get paragraph-level segments for longer captions
   """
   paragraphs = transcript.get_paragraphs()
   
   segments = []
   for paragraph in paragraphs:
       segment = {
           "start": paragraph.start,
           "end": paragraph.end,
           "text": paragraph.text,
           "confidence": paragraph.confidence
       }
       segments.append(segment)
   
   return segments

# Extract segments from your transcript
segments = get_timestamped_segments(transcript)

# Display first few segments
for i, segment in enumerate(segments[:3]):
   start_time = segment["start"] / 1000  # Convert to seconds
   end_time = segment["end"] / 1000
   print(f"[{start_time:.2f}s - {end_time:.2f}s]: {segment['text']}")

These millisecond-precision timestamps ensure your captions sync perfectly with the audio. The confidence scores help you identify sections that might need manual review.
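For instance, a small helper can surface the segments worth a second look; the 0.7 threshold here is an arbitrary illustration, not a recommended value:

def flag_low_confidence(segments, threshold=0.7):
   """Return segments whose confidence falls below the threshold"""
   return [s for s in segments if s["confidence"] < threshold]

for s in flag_low_confidence(segments):
   print(f"Review {s['start'] / 1000:.2f}s - {s['end'] / 1000:.2f}s: {s['text']}")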

Step 4. Export TXT, SRT, and VTT

Transform the segments into standard caption formats using the SDK's built-in export methods. Each format has specific timestamp requirements and syntax rules that video players expect.

def export_subtitles(transcript, base_filename="output"):
   """
   Generate subtitle files using SDK's built-in methods
   """
   
   # Export SRT format for video editing software
   srt_content = transcript.export_subtitles_srt()
   with open(f"{base_filename}.srt", "w", encoding="utf-8") as f:
       f.write(srt_content)
   print(f"SRT file saved: {base_filename}.srt")
   
   # Export VTT format for web players
   vtt_content = transcript.export_subtitles_vtt()
   with open(f"{base_filename}.vtt", "w", encoding="utf-8") as f:
       f.write(vtt_content)
   print(f"VTT file saved: {base_filename}.vtt")
   
   # Export plain text transcript
   with open(f"{base_filename}.txt", "w", encoding="utf-8") as f:
       f.write(transcript.text)
   print(f"Text file saved: {base_filename}.txt")

# Generate all output formats with one function call
export_subtitles(transcript, "my_video")

# You can also customize subtitle length if needed
def export_custom_subtitles(transcript, max_chars=50):
   """
   Export subtitles with custom character limits
   """
   srt_content = transcript.export_subtitles_srt(chars_per_caption=max_chars)
   vtt_content = transcript.export_subtitles_vtt(chars_per_caption=max_chars)
   
   return srt_content, vtt_content

The key differences in timestamp formatting matter for compatibility:

  • SRT format: Uses comma-separated milliseconds (00:00:00,000)
  • VTT format: Uses decimal seconds (00:00:00.000)
  • Plain text: Contains no timing information
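If you ever write your own caption exporter instead of using the SDK methods above, the separator is the only real difference between the SRT and VTT timecodes. A minimal sketch:

def ms_to_srt(ms):
   """SRT timecode: comma before the milliseconds (00:00:00,000)"""
   hours, rem = divmod(ms, 3_600_000)
   minutes, rem = divmod(rem, 60_000)
   seconds, millis = divmod(rem, 1_000)
   return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def ms_to_vtt(ms):
   """VTT timecode: decimal point instead of a comma (00:00:00.000)"""
   return ms_to_srt(ms).replace(",", ".")

print(ms_to_srt(83_456))  # 00:01:23,456
print(ms_to_vtt(83_456))  # 00:01:23.456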

Step 5. Optional speaker diarization

For videos with multiple speakers, enable speaker diarization to identify who's talking. This adds speaker labels to each segment, which is essential for interviews, meetings, and panel discussions.

def transcribe_with_speakers(video_path):
   """
   Transcribe video with speaker identification
   """
   config = aai.TranscriptionConfig(
       speaker_labels=True,
       punctuate=True,
       format_text=True
   )
   
   transcript = transcriber.transcribe(video_path, config=config)
   
   # Process utterances with speaker labels
   segments_with_speakers = []
   
   for utterance in transcript.utterances:
       segment = {
           "speaker": utterance.speaker,
           "start": utterance.start,
           "end": utterance.end,
           "text": utterance.text,
           "confidence": utterance.confidence
       }
       segments_with_speakers.append(segment)
   
   return segments_with_speakers

# Generate SRT with speaker labels
def export_srt_with_speakers(segments, output_file="output_speakers.srt"):
   def format_srt_time(milliseconds):
       seconds = milliseconds / 1000
       hours = int(seconds // 3600)
       minutes = int((seconds % 3600) // 60)
       secs = int(seconds % 60)
       millis = int((milliseconds % 1000))
       return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
   
   with open(output_file, "w", encoding="utf-8") as f:
       for i, segment in enumerate(segments, 1):
           f.write(f"{i}\n")
           start = format_srt_time(segment["start"])
           end = format_srt_time(segment["end"])
           f.write(f"{start} --> {end}\n")
           f.write(f"[Speaker {segment['speaker']}]: {segment['text'].strip()}\n\n")

Speaker diarization adds minimal processing time but significantly improves transcript readability for multi-speaker content. You'll get labels like "Speaker A" and "Speaker B" automatically assigned to different voices.

Choose the right output format

Your choice of transcript format depends on where the content will be displayed and what features you need. Different video workflows require specific formats that support their unique requirements.

TXT for documentation and search: Plain text transcripts work best for documentation, meeting notes, and content indexing. They're searchable, easy to edit, and compatible with any text processing system. Use TXT when you need the content without timing information.

SRT for video editing: SubRip (.srt) files are the industry standard for video editing software. Adobe Premiere, Final Cut Pro, and DaVinci Resolve all import SRT files natively. Social platforms like YouTube, Facebook, and LinkedIn also accept SRT for caption uploads.

VTT for web players: WebVTT (.vtt) files are designed for HTML5 video elements and streaming platforms. They support advanced styling options and metadata that SRT doesn't offer.

The key distinctions for each format:

  • TXT: No timestamps, universal compatibility, best for text analysis
  • SRT: Comma-separated milliseconds, works with all video editors
  • VTT: Decimal seconds, supports web styling and cue metadata
  • JSON: Raw segment data, perfect for custom applications
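The table above lists JSON as an option, but the tutorial doesn't export it yet. Here's a minimal sketch that serializes the Step 3 segment dictionaries (the file name is illustrative):

import json

def export_to_json(segments, output_file="output.json"):
   """Write timestamped segments as JSON for custom applications"""
   with open(output_file, "w", encoding="utf-8") as f:
       json.dump({"segments": segments}, f, indent=2, ensure_ascii=False)

export_to_json(segments, "my_video.json")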

Accuracy and performance for video transcription

Transcription accuracy directly impacts the usability of your timestamps and captions. Poor accuracy leads to misaligned captions, incorrect segment boundaries, and frustrated viewers who can't follow along with your content.

Three key factors determine transcription quality:

  • Audio quality: Clear audio with minimal background noise produces the most accurate timestamps
  • Language detection: Correct language identification ensures proper word boundaries and punctuation
  • Proper noun handling: Names, brands, and technical terms often cause segment splitting errors

Audio quality affects timestamp precision more than you might expect. Compressed audio from social media downloads or low-bitrate recordings can cause timestamp drift where captions gradually fall out of sync with the actual speech.

Language detection ensures the AI model uses the right pronunciation and grammar rules. Auto-detection works well for monolingual content, but you should specify the language code for mixed-language videos or technical content with industry jargon.
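In the SDK this is a single configuration switch; a short sketch showing both options for contrast:

# Let the model identify the language automatically
config = aai.TranscriptionConfig(language_detection=True)

# ...or pin it explicitly for jargon-heavy or mixed-language content
config = aai.TranscriptionConfig(language_code="es")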

Proper noun recognition maintains segment integrity even with challenging terminology. AssemblyAI's Universal models excel at recognizing company names, product references, and technical terms without breaking them across multiple segments. This prevents awkward caption breaks in the middle of important names or phrases.

Modern speech-to-text systems go beyond simple word recognition, using advanced AI and machine learning algorithms to achieve accuracy rates often exceeding 95%. The AI understands context, maintains consistency across similar-sounding words, and properly formats numbers and dates for better readability.

Scale to production with async queues

Production applications need to handle multiple videos efficiently without blocking your application, a demand reflected in the rapid growth of the AI-powered meeting assistants market from $2.68 billion in 2024 to a projected $24.6 billion by 2034. The async pattern scales naturally: you can submit multiple videos simultaneously and process results as they complete. AssemblyAI supports scalable concurrent transcriptions with limits based on your account type. Free accounts are limited to 5 concurrent jobs, while paid accounts start at 200 and can request higher limits by contacting support.

import asyncio
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import assemblyai as aai

class VideoTranscriptionQueue:
   def __init__(self, max_concurrent=5):
       self.transcriber = aai.Transcriber()
       self.max_concurrent = max_concurrent
       self.executor = ThreadPoolExecutor(max_workers=max_concurrent)
   
   def process_video(self, video_info):
       """
       Process a single video and store results
       """
       video_path = video_info["path"]
       output_dir = video_info["output_dir"]
       
       # Create output directory
       Path(output_dir).mkdir(parents=True, exist_ok=True)
       
       # Transcribe video
       config = aai.TranscriptionConfig(
           speaker_labels=video_info.get("speakers", False),
           language_code=video_info.get("language", "en")
       )
       
       transcript = self.transcriber.transcribe(video_path, config=config)
       
       # Save transcript ID for later retrieval
       metadata = {
           "video": video_path,
           "transcript_id": transcript.id,
           "duration": transcript.audio_duration,
           "confidence": transcript.confidence
       }
       
       with open(f"{output_dir}/metadata.json", "w") as f:
           json.dump(metadata, f, indent=2)
       
       # Generate output files using the export helper from Step 4
       export_subtitles(transcript, f"{output_dir}/transcript")
       
       return metadata
   
   async def process_batch(self, video_list):
       """
       Process multiple videos concurrently
       """
       loop = asyncio.get_event_loop()
       
       tasks = []
       for video_info in video_list:
           task = loop.run_in_executor(
               self.executor,
               self.process_video,
               video_info
           )
           tasks.append(task)
       
       results = await asyncio.gather(*tasks)
       return results

# Example batch processing
async def main():
   queue = VideoTranscriptionQueue(max_concurrent=5)
   
   videos = [
       {"path": "video1.mp4", "output_dir": "output/video1", "speakers": True},
       {"path": "video2.mp4", "output_dir": "output/video2", "language": "es"},
       {"path": "video3.mp4", "output_dir": "output/video3"}
   ]
   
   results = await queue.process_batch(videos)
   
   for result in results:
       print(f"Processed: {result['video']}")
       print(f"Duration: {result['duration']}s")
       print(f"Confidence: {result['confidence']}\n")

# Run the batch processor
asyncio.run(main())

This queue pattern gives you per-video metadata for progress tracking, and it can be extended to handle failures gracefully, as the sketch below shows. You can scale based on your account limits and needs, with free accounts supporting up to 5 concurrent jobs and paid accounts starting at 200 concurrent transcriptions.
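One way to add that failure handling is a variant of process_batch that collects per-video exceptions instead of letting the first error cancel the whole run; the function name here is illustrative:

async def process_batch_safe(queue, video_list):
   """
   Process a batch with a VideoTranscriptionQueue, collecting errors per video
   """
   loop = asyncio.get_event_loop()
   tasks = [
       loop.run_in_executor(queue.executor, queue.process_video, video_info)
       for video_info in video_list
   ]
   
   # return_exceptions=True keeps one failed video from cancelling the rest
   results = await asyncio.gather(*tasks, return_exceptions=True)
   
   succeeded = [r for r in results if not isinstance(r, Exception)]
   failed = [r for r in results if isinstance(r, Exception)]
   print(f"{len(succeeded)} videos processed, {len(failed)} failed")
   return succeeded, failed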

The scaling approach you choose depends on your volume:

  • Sequential processing: Simple scripts for 1-10 videos daily
  • Concurrent processing: Queue systems for 10-100 videos daily
  • Distributed processing: Job queues with workers for 100+ videos daily

Final words

This Python implementation transforms raw video into structured, timestamped text that powers modern video workflows. You've built a complete solution that handles everything from single files to production-scale batch processing, generating the exact output formats your applications need.

AssemblyAI's Universal models provide the speech understanding accuracy that makes video transcription reliable—from proper noun recognition to speaker diarization. This foundation lets you build accessible video experiences, streamline editing workflows, and unlock the searchability of your video content at any scale.

Deploy transcription with millisecond-accurate timestamps

Scale from single files to batch processing with the Python SDK and scalable concurrency. Create your account to build production-ready pipelines with AssemblyAI.

Start free

FAQ

Which specific video file formats can I upload for transcription?

Most common video formats work directly, including MP4, MOV, MKV, WEBM, AVI, and M4V. The API extracts audio automatically, so you don't need to convert videos to audio files first.

How do I create SRT files with the exact timestamp format video editors expect?

Use the SDK's export_subtitles_srt() method from Step 4, which produces SRT's timecode format (HH:MM:SS,mmm) automatically. If you need to build the file yourself, the format_srt_time helper in Step 5 shows the exact millisecond conversion that video editors like Premiere and Final Cut require.

Can I identify which specific person is speaking in multi-speaker videos?

Enable the speaker_labels parameter in your transcription config to get automatic speaker identification. Each utterance includes a speaker label (A, B, C, etc.) that you can include in your captions, though the system assigns generic labels rather than identifying specific individuals by name.

How many videos can I process simultaneously without hitting rate limits?

You can submit multiple concurrent transcription jobs based on your account type. Free accounts are limited to 5 concurrent jobs, while paid accounts start at 200 concurrent transcriptions and can request higher limits by contacting support.

Which languages produce the most accurate timestamped transcripts?

English provides the highest accuracy for timestamp precision, but the API supports 99 languages with automatic detection. Languages with clear word boundaries like Spanish and French typically produce more accurate segment timing than tonal languages.

How do I automatically save transcription outputs to cloud storage after processing?

Modify the process_video function to upload files directly to S3, Google Cloud Storage, or Azure after generating the SRT, VTT, and TXT files. Store the transcript ID and cloud URLs in a database for efficient retrieval.
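As a rough sketch, assuming boto3 and an existing bucket (the bucket and prefix names are placeholders; Google Cloud Storage and Azure have equivalent client libraries):

import boto3
from pathlib import Path

def upload_outputs_to_s3(output_dir, bucket, prefix):
   """Upload every generated file in the output directory to S3"""
   s3 = boto3.client("s3")
   for path in Path(output_dir).glob("*"):
       s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")
       print(f"Uploaded s3://{bucket}/{prefix}/{path.name}")

upload_outputs_to_s3("output/video1", "my-caption-bucket", "video1")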
