Build & Learn
October 22, 2025

How to transcribe (STT) audio with timestamps for captions with AssemblyAI

How to transcribe audio with timestamps (STT): learn the simple steps to convert audio files into accurate, time-coded transcripts for captions or subtitles.

Kelsey Foster
Growth

This tutorial shows you how to build a complete audio transcription system that generates timestamped captions for videos. You'll create SRT and WebVTT caption files with precise word and sentence timing, plus optional speaker identification for multi-person conversations.

You'll use the AssemblyAI Python SDK to transcribe audio files with millisecond-accurate timestamps, then convert that timing data into industry-standard caption formats. The tutorial covers everything from basic setup and API authentication to advanced features like speaker diarization that automatically identifies who's speaking when. By the end, you'll have working code that transforms any audio or video file into professional caption files ready for streaming platforms, video players, or accessibility compliance.

What do I install and how do I authenticate the SDK?

You need two things to start transcribing audio with timestamps: the AssemblyAI Python SDK installed on your computer and an API key to access the transcription service. This setup takes about 5 minutes and works on any system with Python 3.8 or newer.

Install the AssemblyAI Python SDK

Open your terminal or command prompt and run this single command:

pip install assemblyai

That's it. The command downloads everything you need to start transcribing audio files with timestamps.

If you're using a virtual environment (which keeps your projects organized), activate it first:

# Create virtual environment (optional but recommended)
python -m venv transcription-env
source transcription-env/bin/activate  # On Windows: transcription-env\Scripts\activate

# Then install
pip install assemblyai

Configure your API key

Your API key is like a password that lets you use AssemblyAI's transcription service. Get yours from the AssemblyAI dashboard after signing up—it's free to start.

Store your API key safely as an environment variable:

# On Mac/Linux
export ASSEMBLYAI_API_KEY="your-api-key-here"

# On Windows (Command Prompt)
set ASSEMBLYAI_API_KEY=your-api-key-here

# On Windows (PowerShell)
$env:ASSEMBLYAI_API_KEY = "your-api-key-here"

Now set up your Python script to use the API key:

import assemblyai as aai
import os

# Tell AssemblyAI to use your API key
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

# Create the transcriber that'll do all the work
transcriber = aai.Transcriber()

Why use environment variables? Your API key stays secure and separate from your code. You can share your script without accidentally sharing your credentials.
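If you want your script to fail fast when the variable isn't set, a small guard like this (a minimal sketch) surfaces the problem immediately instead of producing a confusing authentication error later:

import os
import assemblyai as aai

# Read the key from the environment and stop early if it's missing
api_key = os.getenv("ASSEMBLYAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the ASSEMBLYAI_API_KEY environment variable before running this script.")

aai.settings.api_key = api_key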

Get your AssemblyAI API key to start transcribing

Create a free account to generate your API key and run the Python examples above without changes to your setup.

Get free API key

How do I transcribe audio with word and sentence timestamps in Python?

Transcribing with timestamps means you get the exact time when each word and sentence was spoken in your audio file. The process works in two steps: you submit your audio file for transcription, then you parse the timing data that comes back. Word-level timestamps are included in every transcript by default, so no extra configuration is needed to get them.

Submit an async transcription job with timestamps

Here's the complete code to transcribe any audio file with timestamps:

import assemblyai as aai

# Set up your API key
aai.settings.api_key = "your-api-key-here"

# Configure the transcription (word-level timestamps are returned by default)
config = aai.TranscriptionConfig()

# Create the transcriber
transcriber = aai.Transcriber()

# Submit your audio file (works with MP3, MP4, WAV, FLAC, and M4A)
transcript = transcriber.transcribe(
   "path/to/your/audio.mp3",  # Or use a URL: "https://example.com/audio.mp3"
   config=config
)

# Check if it worked
if transcript.status == aai.TranscriptStatus.error:
   print(f"Something went wrong: {transcript.error}")
else:
   print("Your transcript is ready!")
   print(f"Text: {transcript.text}")

The transcription happens on AssemblyAI's cloud servers, not on your computer. A 10-minute audio file typically takes 2-3 minutes to process.
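The transcribe() call above blocks until the job finishes. If you'd rather submit the job and keep working while it processes, the SDK also exposes an asynchronous helper; this sketch assumes transcribe_async() is available in your installed SDK version:

# Submit without blocking; transcribe_async() returns a concurrent.futures.Future
future = transcriber.transcribe_async("path/to/your/audio.mp3", config=config)

# ...do other work here, then collect the finished transcript
transcript = future.result()
print(transcript.text)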

File format support: You can use almost any audio or video file—the system automatically extracts the audio track from video files like MP4.

Parse word and sentence timestamps from the response

Once transcription finishes, you get back detailed timing information. Each timestamp shows milliseconds from the start of your audio file:

# Look at individual word timestamps
print("First 5 words with their exact timing:")
for word in transcript.words[:5]:
   start_sec = word.start / 1000  # Convert to seconds for readability
   end_sec = word.end / 1000
   print(f"'{word.text}' from {start_sec:.1f}s to {end_sec:.1f}s")

print("\nFirst 3 sentences:")
for i, sentence in enumerate(transcript.get_sentences()[:3]):
   start_sec = sentence.start / 1000
   end_sec = sentence.end / 1000
   duration = end_sec - start_sec
   
   print(f"Sentence {i+1}:")
   print(f"  Text: {sentence.text}")
   print(f"  Time: {start_sec:.1f}s to {end_sec:.1f}s ({duration:.1f}s long)")

This gives you two types of timestamps:

  • Word timestamps: Perfect for karaoke-style highlighting or precise synchronization
  • Sentence timestamps: Ideal for standard subtitles where complete thoughts appear together

The system also provides confidence scores for each word, helping you identify sections that might need review.
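For example, you can flag words the model was less certain about and review only those spots. The 0.7 threshold here is an arbitrary value for illustration:

# Flag low-confidence words for manual review (threshold is arbitrary)
LOW_CONFIDENCE = 0.7

for word in transcript.words:
    if word.confidence < LOW_CONFIDENCE:
        start_sec = word.start / 1000
        print(f"Check '{word.text}' at {start_sec:.1f}s (confidence {word.confidence:.2f})")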

How do I generate SRT and WebVTT from timestamps?

SRT and WebVTT are the two standard caption file formats that work with video players and streaming platforms. SRT is simpler and works everywhere, while WebVTT supports more advanced features like styling.

Here's what makes them different:

| Feature | SRT | WebVTT |
| --- | --- | --- |
| Time format | 00:00:00,000 (comma) | 00:00:00.000 (period) |
| File header | None | Must start with "WEBVTT" |
| Styling | Basic | CSS support |
| Platform support | Universal | HTML5 video, modern browsers |

Write SRT from sentence timestamps

SRT files follow a simple pattern: caption number, timing line, text content, blank line. Here's the complete function:

def create_srt_file(transcript, output_path="captions.srt"):
   """Turn your transcript into an SRT caption file"""
   
   def ms_to_srt_time(milliseconds):
       """Convert milliseconds to SRT format: 00:00:00,000"""
       total_seconds = milliseconds / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       ms = int((total_seconds % 1) * 1000)
       
       return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"
   
   sentences = transcript.get_sentences()
   if not sentences:
       print("No sentences found in the transcript.")
       return
   
   with open(output_path, 'w', encoding='utf-8') as f:
       for i, sentence in enumerate(sentences, 1):
           # Caption number
           f.write(f"{i}\n")
           
           # Timing line
           start = ms_to_srt_time(sentence.start)
           end = ms_to_srt_time(sentence.end)
           f.write(f"{start} --> {end}\n")
           
           # The actual caption text
           text = sentence.text.strip()
           
           # Break long lines for better readability
           if len(text) > 42:
               # Find a natural break point
               mid = len(text) // 2
               break_point = text.rfind(' ', mid - 10, mid + 10)
               if break_point == -1:  # No good break found
                   break_point = mid
               
               line1 = text[:break_point].strip()
               line2 = text[break_point:].strip()
               f.write(f"{line1}\n{line2}\n")
           else:
               f.write(f"{text}\n")
           
           # Blank line separates each caption
           f.write("\n")
   
   print(f"SRT file saved as: {output_path}")

# Use it with your transcript
create_srt_file(transcript, "my_captions.srt")

Line length matters: Most caption guidelines recommend 32-42 characters per line so text doesn't overwhelm the video.
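If you'd rather not hand-roll the line breaking, Python's built-in textwrap module can do the same job; this is a minimal sketch you could swap into either caption writer:

import textwrap

def wrap_caption(text, width=42):
    """Wrap caption text so no line exceeds the character budget."""
    return "\n".join(textwrap.wrap(text, width=width))

print(wrap_caption("This is a fairly long caption that should be split across two readable lines."))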

Write WebVTT from sentence timestamps

WebVTT files are nearly identical to SRT but with slightly different formatting:

def create_webvtt_file(transcript, output_path="captions.vtt"):
   """Create a WebVTT caption file from your transcript"""
   
   def ms_to_webvtt_time(milliseconds):
       """Convert milliseconds to WebVTT format: 00:00:00.000"""
       total_seconds = milliseconds / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       ms = int((total_seconds % 1) * 1000)
       
       return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"
   
   sentences = transcript.get_sentences()
   if not sentences:
       print("No sentences found in the transcript.")
       return
   
   with open(output_path, 'w', encoding='utf-8') as f:
       # WebVTT files must start with this header
       f.write("WEBVTT\n\n")
       
       for i, sentence in enumerate(sentences, 1):
           # Optional identifier
           f.write(f"{i}\n")
           
           # Timing (note the period instead of comma)
           start = ms_to_webvtt_time(sentence.start)
           end = ms_to_webvtt_time(sentence.end)
           f.write(f"{start} --> {end}\n")
           
           # Caption text (same line-breaking as SRT)
           text = sentence.text.strip()
           
           if len(text) > 42:
               mid = len(text) // 2
               break_point = text.rfind(' ', mid - 10, mid + 10)
               if break_point == -1:
                   break_point = mid
               
               line1 = text[:break_point].strip()
               line2 = text[break_point:].strip()
               f.write(f"{line1}\n{line2}\n")
           else:
               f.write(f"{text}\n")
           
           f.write("\n")
   
   print(f"WebVTT file saved as: {output_path}")

# Create both formats from the same transcript
create_webvtt_file(transcript, "my_captions.vtt")

Both functions work with the same transcript data—you're just changing the output format.
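If you don't need custom line-breaking rules, the SDK can also render subtitle text for you; this sketch assumes your SDK version includes the export_subtitles_srt() and export_subtitles_vtt() helpers (the chars_per_caption argument is optional):

# Let the SDK generate the subtitle text directly from the transcript
srt_text = transcript.export_subtitles_srt(chars_per_caption=42)
vtt_text = transcript.export_subtitles_vtt(chars_per_caption=42)

with open("captions.srt", "w", encoding="utf-8") as f:
    f.write(srt_text)

with open("captions.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_text)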

Test timestamps and captions in your browser

Upload a file and preview word and sentence timestamps, then validate your SRT or WebVTT output before integrating code.

Open playground

How do I add speaker labels with diarization?

Speaker diarization identifies who's talking when in your audio. This means your captions can show "Speaker A:" or "John:" before each person's words, making conversations much easier to follow.

Enable diarization by adding one parameter to your config:

# Configure transcription with speaker identification
config = aai.TranscriptionConfig(
   speaker_labels=True  # This enables speaker diarization; word timestamps are included by default
)

# Transcribe as usual
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("conversation.mp3", config=config)

# Now you get speaker information
if transcript.utterances:
   for utterance in transcript.utterances[:3]:  # First 3 speaker turns
       start_sec = utterance.start / 1000
       end_sec = utterance.end / 1000
       print(f"Speaker {utterance.speaker} ({start_sec:.1f}s-{end_sec:.1f}s):")
       print(f"  '{utterance.text}'\n")

With diarization enabled, the transcript also includes utterances: each utterance represents one speaker talking continuously, which is usually the unit you want for speaker-labeled captions.

Here's how to create captions with speaker labels:

def create_srt_with_speakers(transcript, output_path="speaker_captions.srt", speaker_names=None):
   """Create SRT captions with speaker identification"""
   
   def ms_to_srt_time(ms):
       total_seconds = ms / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       milliseconds = int((total_seconds % 1) * 1000)
       return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
   
   if not transcript.utterances:
       print("No speaker utterances found - make sure speaker_labels=True")
       return
   
   # Optional: map generic labels to real names
   if not speaker_names:
       speaker_names = {}  # Will use generic "Speaker A", "Speaker B", etc.
   
   with open(output_path, 'w', encoding='utf-8') as f:
       for i, utterance in enumerate(transcript.utterances, 1):
           f.write(f"{i}\n")
           
           # Timing
           start = ms_to_srt_time(utterance.start)
           end = ms_to_srt_time(utterance.end)
           f.write(f"{start} --> {end}\n")
           
           # Get speaker name (real name or generic label)
           speaker_id = utterance.speaker
           speaker_name = speaker_names.get(speaker_id, f"Speaker {speaker_id}")
           
           # Format the text with speaker label
           full_text = f"{speaker_name}: {utterance.text}"
           
           # Handle line breaks (speaker label might make line longer)
           if len(full_text) > 42:
               # Try to keep speaker label on first line
               speaker_part = f"{speaker_name}: "
               remaining_budget = 42 - len(speaker_part)
               
               if remaining_budget > 15:  # Enough room for some text
                   # Find break point in the utterance text
                   utterance_text = utterance.text
                   break_point = utterance_text.find(' ', remaining_budget - 5)
                   
                   if break_point > 0 and break_point < remaining_budget + 5:
                       line1 = f"{speaker_part}{utterance_text[:break_point]}"
                       line2 = utterance_text[break_point:].strip()
                       f.write(f"{line1}\n{line2}\n")
                   else:
                       # Put speaker label on its own line
                       f.write(f"{speaker_name}:\n{utterance.text}\n")
               else:
                   # Speaker name is too long for same line
                   f.write(f"{speaker_name}:\n{utterance.text}\n")
           else:
               f.write(f"{full_text}\n")
           
           f.write("\n")
   
   print(f"Speaker-labeled SRT saved as: {output_path}")

# Use with generic speaker labels
create_srt_with_speakers(transcript)

# Or map to real names if you know who's speaking
speaker_mapping = {
   "A": "Alice Johnson",
   "B": "Bob Smith",
   "C": "Charlie Brown"
}
create_srt_with_speakers(transcript, "named_speakers.srt", speaker_mapping)

Common speaker label formats you can choose:

  • Full format: "Speaker A: Hello everyone"
  • Short format: "A: Hello everyone"
  • Separate line: Put speaker name on its own line above the text
  • Real names: Replace generic labels with actual names when known

The diarization works best with clear audio where speakers don't talk over each other. It can usually handle 2-10 different speakers, though accuracy drops with more speakers or background noise.
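If you already know roughly how many people are on the recording, you can pass that as a hint when you configure the job; this sketch assumes the speakers_expected option is available in your SDK version, and the file name is just an example:

# Tell the model how many speakers to expect (a hint, not a hard limit)
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=3  # e.g. a three-person panel discussion
)

transcript = transcriber.transcribe("panel_discussion.mp3", config=config)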

Final words

You now have a complete pipeline that takes any audio file and produces professional caption files with precise timestamps. The workflow is straightforward: configure AssemblyAI's speech-to-text API with timestamp options, submit your audio for transcription, then convert the timing data into SRT or WebVTT caption formats that work with any video platform.

AssemblyAI's Voice AI models handle the complex work of speech recognition and timestamp alignment, delivering the accuracy needed for professional captioning workflows. The API processes multiple audio formats, identifies different speakers automatically, and returns millisecond-precise timing data that translates directly into caption files—giving you broadcast-quality results without the manual transcription work.

Build captioning into your app today

Use the Python SDK to create timestamped transcripts, generate SRT/WebVTT files, and add speaker labels—all with the AssemblyAI API.

Start free

FAQ

How do I convert word-level timestamps into sentence-based captions?

After transcribing your audio, you can call the `.get_sentences()` method on the completed transcript object. This will return a list of sentences, each with its own text, start time, and end time, without needing to set any special parameters during transcription.
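As a quick illustration, sentence- and paragraph-level groupings are both available on the finished transcript object (this sketch assumes get_paragraphs() is present in your SDK version):

# Sentence-level timing, ideal for captions
for sentence in transcript.get_sentences():
    print(sentence.start, sentence.end, sentence.text)

# Coarser paragraph-level groupings, useful for chapters or summaries
for paragraph in transcript.get_paragraphs():
    print(paragraph.start, paragraph.end, paragraph.text)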

What audio file formats work with timestamp transcription?

AssemblyAI accepts MP3, MP4, WAV, FLAC, M4A, and most other common audio and video formats. The system automatically extracts audio tracks from video files, so you can upload MP4 videos directly without converting them first.

How do I adjust caption timing when the timestamps don't match my video exactly?

Add or subtract a fixed offset from all timestamps before generating your caption files. Most timing mismatches come from different start points between your audio file and final video—a simple time shift usually fixes synchronization issues.
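For example, if your captions need to appear 1.5 seconds later to line up with the edited video, you can shift every timestamp before writing the caption file. The offset value below is just an illustration:

OFFSET_MS = 1500  # Positive shifts captions later; negative shifts them earlier

shifted = []
for sentence in transcript.get_sentences():
    shifted.append({
        "text": sentence.text,
        "start": max(0, sentence.start + OFFSET_MS),
        "end": max(0, sentence.end + OFFSET_MS),
    })

# Feed the shifted timing data into your SRT or WebVTT writer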

Can I get timestamps in seconds instead of milliseconds from the API?

The API returns all timestamps in milliseconds, but you can easily convert them by dividing by 1000. This gives you decimal seconds that are easier to work with if you're building custom timing displays or doing manual calculations.

How accurate is speaker diarization for identifying different voices in my audio?

Speaker diarization works best with 2-4 clearly distinct speakers in good quality audio. Accuracy decreases with background noise, similar-sounding voices, or when speakers talk over each other—but it typically identifies speaker changes correctly in most conversation scenarios.
