Build & Learn
October 22, 2025

Video transcription made simple: From segments to timestamps

AI video transcription converts your videos to accurate, timestamped text in minutes. Transcribe files to TXT, SRT, or VTT for search, captions, and editing.

Kelsey Foster
Growth

This tutorial shows you how to build a complete video transcription system that converts spoken words into timestamped text using Python and AssemblyAI's API. You'll create a solution that processes video files and generates multiple output formats—plain text for documentation, SRT files for video editing, and VTT files for web players. The system handles everything from single video files to production-scale batch processing with speaker identification.

You'll use AssemblyAI's Python SDK for async transcription, Python's built-in file handling for format conversion, and optional concurrent processing libraries for scaling to multiple videos. The implementation covers segment extraction, timestamp formatting, and export functions that produce industry-standard caption files compatible with major video editing platforms and streaming services.

What is AI video transcription and why timestamps matter

AI video transcription converts the spoken words in your video files into text with precise timestamps, a capability serving a video transcription market valued at $30.42 billion in 2024. Each piece of text includes the exact start and end times when those words were spoken in your video, so you get searchable text plus the timing data needed to create captions, jump to specific moments, or build interactive video experiences.

The timestamp precision is what separates basic transcription from video-ready output. Without accurate timing, you can't sync captions with speech or create clickable transcripts that jump to specific video moments.

Here's what different timestamp formats enable:

| Format | Use Case | Timestamp Precision | Compatibility |
| --- | --- | --- | --- |
| TXT | Search indexing, documentation | None or paragraph-level | Universal text editors |
| SRT | Video editing, social platforms | Millisecond (00:00:00,000) | Premiere, Final Cut, YouTube |
| VTT | Web video players | Millisecond (00:00:00.000) | HTML5 video, streaming platforms |
| JSON | Custom applications | Millisecond with metadata | APIs, databases, analytics |

The segments structure makes your video content actionable. You can search for specific phrases and jump directly to that moment, generate subtitles that appear at the right time, or identify key topics for video chapters.
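For example, with the segment list built in Step 3 below (dictionaries with millisecond start and end values and a text field), a few lines of Python turn the transcript into a jump-to-moment index. The phrase "pricing" is just an illustration:

def find_phrase(segments, phrase):
   """Return (start_seconds, text) for every segment that mentions the phrase"""
   phrase = phrase.lower()
   return [(s["start"] / 1000, s["text"]) for s in segments if phrase in s["text"].lower()]

# Print every moment where "pricing" comes up
for start, text in find_phrase(segments, "pricing"):
   print(f"{start:.1f}s: {text}")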

Transcribe a video to text with Python

You'll use AssemblyAI's Python SDK to transcribe video files asynchronously. This approach handles large files efficiently and provides detailed progress tracking while you build a complete solution that processes videos and generates multiple output formats.

The workflow has five steps: install the SDK, submit videos for transcription, retrieve timestamped segments, export to different formats, and add speaker identification when needed.

Step 1. Install and set up the Python SDK

Start by installing the AssemblyAI Python package. You'll need Python 3.8 or higher and an API key from your AssemblyAI dashboard.

# Install the SDK
pip install assemblyai

# Create a new Python file: video_transcriber.py
import assemblyai as aai
import json
from pathlib import Path

# Configure your API key
aai.settings.api_key = "your-api-key-here"

# Initialize the transcriber client
transcriber = aai.Transcriber()

Store your API key as an environment variable for security:

# Install python-dotenv
pip install python-dotenv

# Updated imports
import os
from dotenv import load_dotenv
import assemblyai as aai

# Load environment variables
load_dotenv()
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

Create a .env file in your project directory and add your API key:

ASSEMBLYAI_API_KEY=your-actual-api-key-here

Step 2. Submit the video for async transcription

The transcriber accepts both local files and URLs to videos hosted online. For local files, the SDK handles uploading automatically, while cloud-hosted videos process faster since there's no upload step.

def transcribe_video(video_path, language_code="en"):
   """
   Submit a video for transcription with automatic language detection
   
   Args:
       video_path: Path to local video file or URL
       language_code: ISO language code or None for auto-detection
   
   Returns:
       Transcript object with segments and metadata
   """
   
   # Configure transcription settings (auto-detect language when none is given)
   config = aai.TranscriptionConfig(
       language_code=language_code,
       language_detection=language_code is None,
       punctuate=True,
       format_text=True
   )
   
   # Submit for async transcription
   transcript = transcriber.transcribe(video_path, config=config)
   
   # The SDK automatically polls for completion
   if transcript.status == aai.TranscriptStatus.error:
       raise Exception(f"Transcription failed: {transcript.error}")
   
   return transcript

# Example usage with a local file
video_file = "meeting_recording.mp4"
transcript = transcribe_video(video_file)

print(f"Transcription ID: {transcript.id}")
print(f"Duration: {transcript.audio_duration} seconds")
print(f"Confidence: {transcript.confidence}")

The async approach handles videos of any length efficiently. The SDK manages polling intervals and connection handling, so you don't need to implement retry logic yourself.
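If you'd rather not block while a job runs, recent versions of the SDK also expose a submit-then-fetch pattern. Here's a minimal sketch under the assumption that Transcriber.submit() and Transcript.get_by_id() are available in your installed version:

# Queue the job without waiting for it to finish
pending = transcriber.submit("meeting_recording.mp4")

# ...later, fetch the transcript by ID and check its status
completed = aai.Transcript.get_by_id(pending.id)
if completed.status == aai.TranscriptStatus.completed:
   print(completed.text)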

Get your API key to start transcribing

You just submitted async jobs with the Python SDK. Create an account to get an API key and run the code on your own videos.

Get free API key

Step 3. Retrieve segments with timestamps

The transcript object contains a words array with detailed timing for each word. You'll use the SDK's built-in methods to group these into meaningful segments for caption generation.

def get_timestamped_segments(transcript):
   """
   Extract segments with precise timestamps using SDK methods
   
   Returns:
       List of segments with start, end, and text fields
   """
   # Use built-in SDK methods for accurate segmentation
   sentences = transcript.get_sentences()
   
   segments = []
   for sentence in sentences:
       segment = {
           "start": sentence.start,
           "end": sentence.end,
           "text": sentence.text,
           "confidence": sentence.confidence
       }
       segments.append(segment)
   
   return segments

# For larger segments, use paragraphs instead
def get_paragraph_segments(transcript):
   """
   Get paragraph-level segments for longer captions
   """
   paragraphs = transcript.get_paragraphs()
   
   segments = []
   for paragraph in paragraphs:
       segment = {
           "start": paragraph.start,
           "end": paragraph.end,
           "text": paragraph.text,
           "confidence": paragraph.confidence
       }
       segments.append(segment)
   
   return segments

# Extract segments from your transcript
segments = get_timestamped_segments(transcript)

# Display first few segments
for i, segment in enumerate(segments[:3]):
   start_time = segment["start"] / 1000  # Convert to seconds
   end_time = segment["end"] / 1000
   print(f"[{start_time:.2f}s - {end_time:.2f}s]: {segment['text']}")

These millisecond-precision timestamps ensure your captions sync perfectly with the audio. The confidence scores help you identify sections that might need manual review.
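For instance, a small helper can surface the segments worth a second look; the 0.7 threshold here is an arbitrary illustration, not a recommended value:

def flag_low_confidence(segments, threshold=0.7):
   """Return segments whose confidence falls below the threshold"""
   return [s for s in segments if s["confidence"] < threshold]

for s in flag_low_confidence(segments):
   print(f"Review {s['start'] / 1000:.2f}s - {s['end'] / 1000:.2f}s: {s['text']}")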

Step 4. Export TXT, SRT, and VTT

Transform the segments into standard caption formats using the SDK's built-in export methods. Each format has specific timestamp requirements and syntax rules that video players expect.

def export_subtitles(transcript, base_filename="output"):
   """
   Generate subtitle files using SDK's built-in methods
   """
   
   # Export SRT format for video editing software
   srt_content = transcript.export_subtitles_srt()
   with open(f"{base_filename}.srt", "w", encoding="utf-8") as f:
       f.write(srt_content)
   print(f"SRT file saved: {base_filename}.srt")
   
   # Export VTT format for web players
   vtt_content = transcript.export_subtitles_vtt()
   with open(f"{base_filename}.vtt", "w", encoding="utf-8") as f:
       f.write(vtt_content)
   print(f"VTT file saved: {base_filename}.vtt")
   
   # Export plain text transcript
   with open(f"{base_filename}.txt", "w", encoding="utf-8") as f:
       f.write(transcript.text)
   print(f"Text file saved: {base_filename}.txt")

# Generate all output formats with one function call
export_subtitles(transcript, "my_video")

# You can also customize subtitle length if needed
def export_custom_subtitles(transcript, max_chars=50):
   """
   Export subtitles with custom character limits
   """
   srt_content = transcript.export_subtitles_srt(chars_per_caption=max_chars)
   vtt_content = transcript.export_subtitles_vtt(chars_per_caption=max_chars)
   
   return srt_content, vtt_content

The key differences in timestamp formatting matter for compatibility:

  • SRT format: Uses comma-separated milliseconds (00:00:00,000)
  • VTT format: Uses decimal seconds (00:00:00.000)
  • Plain text: Contains no timing information
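If you ever write your own caption exporter instead of using the SDK methods above, the separator is the only real difference between the SRT and VTT timecodes. A minimal sketch:

def ms_to_srt(ms):
   """SRT timecode: comma before the milliseconds (00:00:00,000)"""
   hours, rem = divmod(ms, 3_600_000)
   minutes, rem = divmod(rem, 60_000)
   seconds, millis = divmod(rem, 1_000)
   return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def ms_to_vtt(ms):
   """VTT timecode: decimal point instead of a comma (00:00:00.000)"""
   return ms_to_srt(ms).replace(",", ".")

print(ms_to_srt(83_456))  # 00:01:23,456
print(ms_to_vtt(83_456))  # 00:01:23.456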

Step 5. Optional speaker diarization

For videos with multiple speakers, enable speaker diarization to identify who's talking. This adds speaker labels to each segment, which is essential for interviews, meetings, and panel discussions.

def transcribe_with_speakers(video_path):
   """
   Transcribe video with speaker identification
   """
   config = aai.TranscriptionConfig(
       speaker_labels=True,
       punctuate=True,
       format_text=True
   )
   
   transcript = transcriber.transcribe(video_path, config=config)
   
   # Process utterances with speaker labels
   segments_with_speakers = []
   
   for utterance in transcript.utterances:
       segment = {
           "speaker": utterance.speaker,
           "start": utterance.start,
           "end": utterance.end,
           "text": utterance.text,
           "confidence": utterance.confidence
       }
       segments_with_speakers.append(segment)
   
   return segments_with_speakers

# Generate SRT with speaker labels
def export_srt_with_speakers(segments, output_file="output_speakers.srt"):
   def format_srt_time(milliseconds):
       seconds = milliseconds / 1000
       hours = int(seconds // 3600)
       minutes = int((seconds % 3600) // 60)
       secs = int(seconds % 60)
       millis = int((milliseconds % 1000))
       return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
   
   with open(output_file, "w", encoding="utf-8") as f:
       for i, segment in enumerate(segments, 1):
           f.write(f"{i}\n")
           start = format_srt_time(segment["start"])
           end = format_srt_time(segment["end"])
           f.write(f"{start} --> {end}\n")
           f.write(f"[Speaker {segment['speaker']}]: {segment['text'].strip()}\n\n")

Speaker diarization adds minimal processing time but significantly improves transcript readability for multi-speaker content. You'll get labels like "Speaker A" and "Speaker B" automatically assigned to different voices.

Choose the right output format

Your choice of transcript format depends on where the content will be displayed and what features you need. Different video workflows require specific formats that support their unique requirements.

TXT for documentation and search: Plain text transcripts work best for documentation, meeting notes, and content indexing. They're searchable, easy to edit, and compatible with any text processing system. Use TXT when you need the content without timing information.

SRT for video editing: SubRip (.srt) files are the industry standard for video editing software. Adobe Premiere, Final Cut Pro, and DaVinci Resolve all import SRT files natively. Social platforms like YouTube, Facebook, and LinkedIn also accept SRT for caption uploads.

VTT for web players: WebVTT (.vtt) files are designed for HTML5 video elements and streaming platforms. They support advanced styling options and metadata that SRT doesn't offer.

The key distinctions for each format:

  • TXT: No timestamps, universal compatibility, best for text analysis
  • SRT: Comma-separated milliseconds, works with all video editors
  • VTT: Decimal seconds, supports web styling and cue metadata
  • JSON: Raw segment data, perfect for custom applications
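The table above lists JSON as an option, but the tutorial doesn't export it yet. Here's a minimal sketch that serializes the Step 3 segment dictionaries (the file name is illustrative):

import json

def export_to_json(segments, output_file="output.json"):
   """Write timestamped segments as JSON for custom applications"""
   with open(output_file, "w", encoding="utf-8") as f:
       json.dump({"segments": segments}, f, indent=2, ensure_ascii=False)

export_to_json(segments, "my_video.json")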

Accuracy and performance for video transcription

Transcription accuracy directly impacts the usability of your timestamps and captions. Poor accuracy leads to misaligned captions, incorrect segment boundaries, and frustrated viewers who can't follow along with your content.

Three key factors determine transcription quality:

  • Audio quality: Clear audio with minimal background noise produces the most accurate timestamps
  • Language detection: Correct language identification ensures proper word boundaries and punctuation
  • Proper noun handling: Names, brands, and technical terms often cause segment splitting errors

Audio quality affects timestamp precision more than you might expect. Compressed audio from social media downloads or low-bitrate recordings can cause timestamp drift where captions gradually fall out of sync with the actual speech.

Language detection ensures the AI model uses the right pronunciation and grammar rules. Auto-detection works well for monolingual content, but you should specify the language code for mixed-language videos or technical content with industry jargon.
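In the SDK this is a single configuration switch; a short sketch showing both options for contrast:

# Let the model identify the language automatically
config = aai.TranscriptionConfig(language_detection=True)

# ...or pin it explicitly for jargon-heavy or mixed-language content
config = aai.TranscriptionConfig(language_code="es")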

Proper noun recognition maintains segment integrity even with challenging terminology. AssemblyAI's Universal models excel at recognizing company names, product references, and technical terms without breaking them across multiple segments. This prevents awkward caption breaks in the middle of important names or phrases.

Modern speech-to-text systems go beyond simple word recognition, using advanced AI and machine learning algorithms to achieve accuracy rates often exceeding 95%. The AI understands context, maintains consistency across similar-sounding words, and properly formats numbers and dates for better readability.

Scale to production with async queues

Production applications need to handle multiple videos efficiently without blocking your application, a demand reflected in the rapid growth of the AI-powered meeting assistants market from $2.68 billion in 2024 to a projected $24.6 billion by 2034. The async pattern scales naturally: you can submit multiple videos simultaneously and process results as they complete. AssemblyAI supports scalable concurrent transcriptions with limits based on your account type. Free accounts are limited to 5 concurrent jobs, while paid accounts start at 200 and can request higher limits by contacting support.

import asyncio
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import assemblyai as aai

class VideoTranscriptionQueue:
   def __init__(self, max_concurrent=5):
       self.transcriber = aai.Transcriber()
       self.max_concurrent = max_concurrent
       self.executor = ThreadPoolExecutor(max_workers=max_concurrent)
   
   def process_video(self, video_info):
       """
       Process a single video and store results
       """
       video_path = video_info["path"]
       output_dir = video_info["output_dir"]
       
       # Create output directory
       Path(output_dir).mkdir(parents=True, exist_ok=True)
       
       # Transcribe video
       config = aai.TranscriptionConfig(
           speaker_labels=video_info.get("speakers", False),
           language_code=video_info.get("language", "en")
       )
       
       transcript = self.transcriber.transcribe(video_path, config=config)
       
       # Save transcript ID for later retrieval
       metadata = {
           "video": video_path,
           "transcript_id": transcript.id,
           "duration": transcript.audio_duration,
           "confidence": transcript.confidence
       }
       
       with open(f"{output_dir}/metadata.json", "w") as f:
           json.dump(metadata, f, indent=2)
       
       # Generate output files using the export helper from Step 4
       export_subtitles(transcript, f"{output_dir}/transcript")
       
       return metadata
   
   async def process_batch(self, video_list):
       """
       Process multiple videos concurrently
       """
       loop = asyncio.get_event_loop()
       
       tasks = []
       for video_info in video_list:
           task = loop.run_in_executor(
               self.executor,
               self.process_video,
               video_info
           )
           tasks.append(task)
       
       results = await asyncio.gather(*tasks)
       return results

# Example batch processing
async def main():
   queue = VideoTranscriptionQueue(max_concurrent=5)
   
   videos = [
       {"path": "video1.mp4", "output_dir": "output/video1", "speakers": True},
       {"path": "video2.mp4", "output_dir": "output/video2", "language": "es"},
       {"path": "video3.mp4", "output_dir": "output/video3"}
   ]
   
   results = await queue.process_batch(videos)
   
   for result in results:
       print(f"Processed: {result['video']}")
       print(f"Duration: {result['duration']}s")
       print(f"Confidence: {result['confidence']}\n")

# Run the batch processor
asyncio.run(main())

This queue pattern gives you per-video metadata for progress tracking, and it can be extended to handle failures gracefully, as the sketch below shows. You can scale based on your account limits and needs, with free accounts supporting up to 5 concurrent jobs and paid accounts starting at 200 concurrent transcriptions.
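One way to add that failure handling is a variant of process_batch that collects per-video exceptions instead of letting the first error cancel the whole run; the function name here is illustrative:

async def process_batch_safe(queue, video_list):
   """
   Process a batch with a VideoTranscriptionQueue, collecting errors per video
   """
   loop = asyncio.get_event_loop()
   tasks = [
       loop.run_in_executor(queue.executor, queue.process_video, video_info)
       for video_info in video_list
   ]
   
   # return_exceptions=True keeps one failed video from cancelling the rest
   results = await asyncio.gather(*tasks, return_exceptions=True)
   
   succeeded = [r for r in results if not isinstance(r, Exception)]
   failed = [r for r in results if isinstance(r, Exception)]
   print(f"{len(succeeded)} videos processed, {len(failed)} failed")
   return succeeded, failed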

The scaling approach you choose depends on your volume:

  • Sequential processing: Simple scripts for 1-10 videos daily
  • Concurrent processing: Queue systems for 10-100 videos daily
  • Distributed processing: Job queues with workers for 100+ videos daily

Final words

This Python implementation transforms raw video into structured, timestamped text that powers modern video workflows. You've built a complete solution that handles everything from single files to production-scale batch processing, generating the exact output formats your applications need.

AssemblyAI's Universal models provide the speech understanding accuracy that makes video transcription reliable—from proper noun recognition to speaker diarization. This foundation lets you build accessible video experiences, streamline editing workflows, and unlock the searchability of your video content at any scale.

Deploy transcription with millisecond-accurate timestamps

Scale from single files to batch processing with the Python SDK and scalable concurrency. Create your account to build production-ready pipelines with AssemblyAI.

Start free

FAQ

Which specific video file formats can I upload for transcription?

Most common video formats work directly, including MP4, MOV, MKV, WEBM, AVI, and M4V. The API extracts audio automatically, so you don't need to convert videos to audio files first.

How do I create SRT files with the exact timestamp format video editors expect?

Use the SDK's export_subtitles_srt() method from Step 4, which produces SRT's timecode format (HH:MM:SS,mmm) automatically. If you need to build the file yourself, the format_srt_time helper in Step 5 shows the exact millisecond conversion that video editors like Premiere and Final Cut require.

Can I identify which specific person is speaking in multi-speaker videos?

Enable the speaker_labels parameter in your transcription config to get automatic speaker identification. Each utterance includes a speaker label (A, B, C, etc.) that you can include in your captions, though the system assigns generic labels rather than identifying specific individuals by name.

How many videos can I process simultaneously without hitting rate limits?

You can submit multiple concurrent transcription jobs based on your account type. Free accounts are limited to 5 concurrent jobs, while paid accounts start at 200 concurrent transcriptions and can request higher limits by contacting support.

Which languages produce the most accurate timestamped transcripts?

English provides the highest accuracy for timestamp precision, but the API supports 99 languages with automatic detection. Languages with clear word boundaries like Spanish and French typically produce more accurate segment timing than tonal languages.

How do I automatically save transcription outputs to cloud storage after processing?

Modify the process_video function to upload files directly to S3, Google Cloud Storage, or Azure after generating the SRT, VTT, and TXT files. Store the transcript ID and cloud URLs in a database for efficient retrieval.
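As a rough sketch, assuming boto3 and an existing bucket (the bucket and prefix names are placeholders; Google Cloud Storage and Azure have equivalent client libraries):

import boto3
from pathlib import Path

def upload_outputs_to_s3(output_dir, bucket, prefix):
   """Upload every generated file in the output directory to S3"""
   s3 = boto3.client("s3")
   for path in Path(output_dir).glob("*"):
       s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")
       print(f"Uploaded s3://{bucket}/{prefix}/{path.name}")

upload_outputs_to_s3("output/video1", "my-caption-bucket", "video1")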
