Build & Learn
October 22, 2025

How to transcribe (STT) audio with timestamps for captions with AssemblyAI

How to transcribe audio with timestamps (STT): learn the simple steps to convert audio files into accurate, time-coded transcripts for captions or subtitles.

Kelsey Foster
Growth

This tutorial shows you how to build a complete audio transcription system that generates timestamped captions for videos. You'll create SRT and WebVTT caption files with precise word and sentence timing, plus optional speaker identification for multi-person conversations.

You'll use the AssemblyAI Python SDK to transcribe audio files with millisecond-accurate timestamps, then convert that timing data into industry-standard caption formats. The tutorial covers everything from basic setup and API authentication to advanced features like speaker diarization that automatically identifies who's speaking when. By the end, you'll have working code that transforms any audio or video file into professional caption files ready for streaming platforms, video players, or accessibility compliance.

What do I install and how do I authenticate the SDK?

You need two things to start transcribing audio with timestamps: the AssemblyAI Python SDK installed on your computer and an API key to access the transcription service. This setup takes about 5 minutes and works on any system with Python 3.8 or newer.

Install the AssemblyAI Python SDK

Open your terminal or command prompt and run this single command:

pip install assemblyai

That's it. The command downloads everything you need to start transcribing audio files with timestamps.

If you're using a virtual environment (which keeps your projects organized), activate it first:

# Create virtual environment (optional but recommended)
python -m venv transcription-env
source transcription-env/bin/activate  # On Windows: transcription-env\Scripts\activate

# Then install
pip install assemblyai

Configure your API key

Your API key is like a password that lets you use AssemblyAI's transcription service. Get yours from the AssemblyAI dashboard after signing up—it's free to start.

Store your API key safely as an environment variable:

# On Mac/Linux
export ASSEMBLYAI_API_KEY="your-api-key-here"

# On Windows (Command Prompt)
set ASSEMBLYAI_API_KEY=your-api-key-here

# On Windows (PowerShell)
$env:ASSEMBLYAI_API_KEY = "your-api-key-here"

Now set up your Python script to use the API key:

import assemblyai as aai
import os

# Tell AssemblyAI to use your API key
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

# Create the transcriber that'll do all the work
transcriber = aai.Transcriber()

Why use environment variables? Your API key stays secure and separate from your code. You can share your script without accidentally sharing your credentials.
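If you want your script to fail fast when the variable isn't set, a small guard like this (a minimal sketch) surfaces the problem immediately instead of producing a confusing authentication error later:

import os
import assemblyai as aai

# Read the key from the environment and stop early if it's missing
api_key = os.getenv("ASSEMBLYAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the ASSEMBLYAI_API_KEY environment variable before running this script.")

aai.settings.api_key = api_key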

Get your AssemblyAI API key to start transcribing

Create a free account to generate your API key and run the Python examples above without changes to your setup.

Get free API key

How do I transcribe audio with word and sentence timestamps in Python?

Transcribing with timestamps means you get the exact time when each word and sentence was spoken in your audio file. The process works in two steps: you submit your audio file for transcription, then you parse the timing data that comes back. Word-level timestamps are included in every transcript by default, so no extra configuration is needed to get them.

Submit an async transcription job with timestamps

Here's the complete code to transcribe any audio file with timestamps:

import assemblyai as aai

# Set up your API key
aai.settings.api_key = "your-api-key-here"

# Configure the transcription (word-level timestamps are returned by default)
config = aai.TranscriptionConfig()

# Create the transcriber
transcriber = aai.Transcriber()

# Submit your audio file (works with MP3, MP4, WAV, FLAC, and M4A)
transcript = transcriber.transcribe(
   "path/to/your/audio.mp3",  # Or use a URL: "https://example.com/audio.mp3"
   config=config
)

# Check if it worked
if transcript.status == aai.TranscriptStatus.error:
   print(f"Something went wrong: {transcript.error}")
else:
   print("Your transcript is ready!")
   print(f"Text: {transcript.text}")

The transcription happens on AssemblyAI's cloud servers, not on your computer. A 10-minute audio file typically takes 2-3 minutes to process.
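The transcribe() call above blocks until the job finishes. If you'd rather submit the job and keep working while it processes, the SDK also exposes an asynchronous helper; this sketch assumes transcribe_async() is available in your installed SDK version:

# Submit without blocking; transcribe_async() returns a concurrent.futures.Future
future = transcriber.transcribe_async("path/to/your/audio.mp3", config=config)

# ...do other work here, then collect the finished transcript
transcript = future.result()
print(transcript.text)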

File format support: You can use almost any audio or video file—the system automatically extracts the audio track from video files like MP4.

Parse word and sentence timestamps from the response

Once transcription finishes, you get back detailed timing information. Each timestamp shows milliseconds from the start of your audio file:

# Look at individual word timestamps
print("First 5 words with their exact timing:")
for word in transcript.words[:5]:
   start_sec = word.start / 1000  # Convert to seconds for readability
   end_sec = word.end / 1000
   print(f"'{word.text}' from {start_sec:.1f}s to {end_sec:.1f}s")

print("\nFirst 3 sentences:")
for i, sentence in enumerate(transcript.get_sentences()[:3]):
   start_sec = sentence.start / 1000
   end_sec = sentence.end / 1000
   duration = end_sec - start_sec
   
   print(f"Sentence {i+1}:")
   print(f"  Text: {sentence.text}")
   print(f"  Time: {start_sec:.1f}s to {end_sec:.1f}s ({duration:.1f}s long)")

This gives you two types of timestamps:

  • Word timestamps: Perfect for karaoke-style highlighting or precise synchronization
  • Sentence timestamps: Ideal for standard subtitles where complete thoughts appear together

The system also provides confidence scores for each word, helping you identify sections that might need review.
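For example, you can flag words the model was less certain about and review only those spots. The 0.7 threshold here is an arbitrary value for illustration:

# Flag low-confidence words for manual review (threshold is arbitrary)
LOW_CONFIDENCE = 0.7

for word in transcript.words:
    if word.confidence < LOW_CONFIDENCE:
        start_sec = word.start / 1000
        print(f"Check '{word.text}' at {start_sec:.1f}s (confidence {word.confidence:.2f})")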

How do I generate SRT and WebVTT from timestamps?

SRT and WebVTT are the two standard caption file formats that work with video players and streaming platforms. SRT is simpler and works everywhere, while WebVTT supports more advanced features like styling.

Here's what makes them different:

| Feature | SRT | WebVTT |
| --- | --- | --- |
| Time format | 00:00:00,000 (comma) | 00:00:00.000 (period) |
| File header | None | Must start with "WEBVTT" |
| Styling | Basic | CSS support |
| Platform support | Universal | HTML5 video, modern browsers |

Write SRT from sentence timestamps

SRT files follow a simple pattern: caption number, timing line, text content, blank line. Here's the complete function:

def create_srt_file(transcript, output_path="captions.srt"):
   """Turn your transcript into an SRT caption file"""
   
   def ms_to_srt_time(milliseconds):
       """Convert milliseconds to SRT format: 00:00:00,000"""
       total_seconds = milliseconds / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       ms = int((total_seconds % 1) * 1000)
       
       return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"
   
   sentences = transcript.get_sentences()
   if not sentences:
       print("No sentences found in the transcript.")
       return
   
   with open(output_path, 'w', encoding='utf-8') as f:
       for i, sentence in enumerate(sentences, 1):
           # Caption number
           f.write(f"{i}\n")
           
           # Timing line
           start = ms_to_srt_time(sentence.start)
           end = ms_to_srt_time(sentence.end)
           f.write(f"{start} --> {end}\n")
           
           # The actual caption text
           text = sentence.text.strip()
           
           # Break long lines for better readability
           if len(text) > 42:
               # Find a natural break point
               mid = len(text) // 2
               break_point = text.rfind(' ', mid - 10, mid + 10)
               if break_point == -1:  # No good break found
                   break_point = mid
               
               line1 = text[:break_point].strip()
               line2 = text[break_point:].strip()
               f.write(f"{line1}\n{line2}\n")
           else:
               f.write(f"{text}\n")
           
           # Blank line separates each caption
           f.write("\n")
   
   print(f"SRT file saved as: {output_path}")

# Use it with your transcript
create_srt_file(transcript, "my_captions.srt")

Line length matters: Most caption guidelines recommend 32-42 characters per line so text doesn't overwhelm the video.
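If you'd rather not hand-roll the line breaking, Python's built-in textwrap module can do the same job; this is a minimal sketch you could swap into either caption writer:

import textwrap

def wrap_caption(text, width=42):
    """Wrap caption text so no line exceeds the character budget."""
    return "\n".join(textwrap.wrap(text, width=width))

print(wrap_caption("This is a fairly long caption that should be split across two readable lines."))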

Write WebVTT from sentence timestamps

WebVTT files are nearly identical to SRT but with slightly different formatting:

def create_webvtt_file(transcript, output_path="captions.vtt"):
   """Create a WebVTT caption file from your transcript"""
   
   def ms_to_webvtt_time(milliseconds):
       """Convert milliseconds to WebVTT format: 00:00:00.000"""
       total_seconds = milliseconds / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       ms = int((total_seconds % 1) * 1000)
       
       return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"
   
   sentences = transcript.get_sentences()
   if not sentences:
       print("No sentences found in the transcript.")
       return
   
   with open(output_path, 'w', encoding='utf-8') as f:
       # WebVTT files must start with this header
       f.write("WEBVTT\n\n")
       
       for i, sentence in enumerate(sentences, 1):
           # Optional identifier
           f.write(f"{i}\n")
           
           # Timing (note the period instead of comma)
           start = ms_to_webvtt_time(sentence.start)
           end = ms_to_webvtt_time(sentence.end)
           f.write(f"{start} --> {end}\n")
           
           # Caption text (same line-breaking as SRT)
           text = sentence.text.strip()
           
           if len(text) > 42:
               mid = len(text) // 2
               break_point = text.rfind(' ', mid - 10, mid + 10)
               if break_point == -1:
                   break_point = mid
               
               line1 = text[:break_point].strip()
               line2 = text[break_point:].strip()
               f.write(f"{line1}\n{line2}\n")
           else:
               f.write(f"{text}\n")
           
           f.write("\n")
   
   print(f"WebVTT file saved as: {output_path}")

# Create both formats from the same transcript
create_webvtt_file(transcript, "my_captions.vtt")

Both functions work with the same transcript data—you're just changing the output format.
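If you don't need custom line-breaking rules, the SDK can also render subtitle text for you; this sketch assumes your SDK version includes the export_subtitles_srt() and export_subtitles_vtt() helpers (the chars_per_caption argument is optional):

# Let the SDK generate the subtitle text directly from the transcript
srt_text = transcript.export_subtitles_srt(chars_per_caption=42)
vtt_text = transcript.export_subtitles_vtt(chars_per_caption=42)

with open("captions.srt", "w", encoding="utf-8") as f:
    f.write(srt_text)

with open("captions.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_text)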

Test timestamps and captions in your browser

Upload a file and preview word and sentence timestamps, then validate your SRT or WebVTT output before integrating code.

Open playground

How do I add speaker labels with diarization?

Speaker diarization identifies who's talking when in your audio. This means your captions can show "Speaker A:" or "John:" before each person's words, making conversations much easier to follow.

Enable diarization by adding one parameter to your config:

# Configure transcription with speaker identification
config = aai.TranscriptionConfig(
   speaker_labels=True  # This enables speaker diarization; word timestamps are included by default
)

# Transcribe as usual
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("conversation.mp3", config=config)

# Now you get speaker information
if transcript.utterances:
   for utterance in transcript.utterances[:3]:  # First 3 speaker turns
       start_sec = utterance.start / 1000
       end_sec = utterance.end / 1000
       print(f"Speaker {utterance.speaker} ({start_sec:.1f}s-{end_sec:.1f}s):")
       print(f"  '{utterance.text}'\n")

With diarization enabled, the transcript also includes utterances: each utterance represents one speaker talking continuously, which is usually the unit you want for speaker-labeled captions.

Here's how to create captions with speaker labels:

def create_srt_with_speakers(transcript, output_path="speaker_captions.srt", speaker_names=None):
   """Create SRT captions with speaker identification"""
   
   def ms_to_srt_time(ms):
       total_seconds = ms / 1000
       hours = int(total_seconds // 3600)
       minutes = int((total_seconds % 3600) // 60)
       seconds = int(total_seconds % 60)
       milliseconds = int((total_seconds % 1) * 1000)
       return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
   
   if not transcript.utterances:
       print("No speaker utterances found - make sure speaker_labels=True")
       return
   
   # Optional: map generic labels to real names
   if not speaker_names:
       speaker_names = {}  # Will use generic "Speaker A", "Speaker B", etc.
   
   with open(output_path, 'w', encoding='utf-8') as f:
       for i, utterance in enumerate(transcript.utterances, 1):
           f.write(f"{i}\n")
           
           # Timing
           start = ms_to_srt_time(utterance.start)
           end = ms_to_srt_time(utterance.end)
           f.write(f"{start} --> {end}\n")
           
           # Get speaker name (real name or generic label)
           speaker_id = utterance.speaker
           speaker_name = speaker_names.get(speaker_id, f"Speaker {speaker_id}")
           
           # Format the text with speaker label
           full_text = f"{speaker_name}: {utterance.text}"
           
           # Handle line breaks (speaker label might make line longer)
           if len(full_text) > 42:
               # Try to keep speaker label on first line
               speaker_part = f"{speaker_name}: "
               remaining_budget = 42 - len(speaker_part)
               
               if remaining_budget > 15:  # Enough room for some text
                   # Find break point in the utterance text
                   utterance_text = utterance.text
                   break_point = utterance_text.find(' ', remaining_budget - 5)
                   
                   if break_point > 0 and break_point < remaining_budget + 5:
                       line1 = f"{speaker_part}{utterance_text[:break_point]}"
                       line2 = utterance_text[break_point:].strip()
                       f.write(f"{line1}\n{line2}\n")
                   else:
                       # Put speaker label on its own line
                       f.write(f"{speaker_name}:\n{utterance.text}\n")
               else:
                   # Speaker name is too long for same line
                   f.write(f"{speaker_name}:\n{utterance.text}\n")
           else:
               f.write(f"{full_text}\n")
           
           f.write("\n")
   
   print(f"Speaker-labeled SRT saved as: {output_path}")

# Use with generic speaker labels
create_srt_with_speakers(transcript)

# Or map to real names if you know who's speaking
speaker_mapping = {
   "A": "Alice Johnson",
   "B": "Bob Smith",
   "C": "Charlie Brown"
}
create_srt_with_speakers(transcript, "named_speakers.srt", speaker_mapping)

Common speaker label formats you can choose:

  • Full format: "Speaker A: Hello everyone"
  • Short format: "A: Hello everyone"
  • Separate line: Put speaker name on its own line above the text
  • Real names: Replace generic labels with actual names when known

The diarization works best with clear audio where speakers don't talk over each other. It can usually handle 2-10 different speakers, though accuracy drops with more speakers or background noise.
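If you already know roughly how many people are on the recording, you can pass that as a hint when you configure the job; this sketch assumes the speakers_expected option is available in your SDK version, and the file name is just an example:

# Tell the model how many speakers to expect (a hint, not a hard limit)
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=3  # e.g. a three-person panel discussion
)

transcript = transcriber.transcribe("panel_discussion.mp3", config=config)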

Final words

You now have a complete pipeline that takes any audio file and produces professional caption files with precise timestamps. The workflow is straightforward: configure AssemblyAI's speech-to-text API with timestamp options, submit your audio for transcription, then convert the timing data into SRT or WebVTT caption formats that work with any video platform.

AssemblyAI's Voice AI models handle the complex work of speech recognition and timestamp alignment, delivering the accuracy needed for professional captioning workflows. The API processes multiple audio formats, identifies different speakers automatically, and returns millisecond-precise timing data that translates directly into caption files—giving you broadcast-quality results without the manual transcription work.

Build captioning into your app today

Use the Python SDK to create timestamped transcripts, generate SRT/WebVTT files, and add speaker labels—all with the AssemblyAI API.

Start free

FAQ

How do I convert word-level timestamps into sentence-based captions?

After transcribing your audio, you can call the `.get_sentences()` method on the completed transcript object. This will return a list of sentences, each with its own text, start time, and end time, without needing to set any special parameters during transcription.
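As a quick illustration, sentence- and paragraph-level groupings are both available on the finished transcript object (this sketch assumes get_paragraphs() is present in your SDK version):

# Sentence-level timing, ideal for captions
for sentence in transcript.get_sentences():
    print(sentence.start, sentence.end, sentence.text)

# Coarser paragraph-level groupings, useful for chapters or summaries
for paragraph in transcript.get_paragraphs():
    print(paragraph.start, paragraph.end, paragraph.text)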

What audio file formats work with timestamp transcription?

AssemblyAI accepts MP3, MP4, WAV, FLAC, M4A, and most other common audio and video formats. The system automatically extracts audio tracks from video files, so you can upload MP4 videos directly without converting them first.

How do I adjust caption timing when the timestamps don't match my video exactly?

Add or subtract a fixed offset from all timestamps before generating your caption files. Most timing mismatches come from different start points between your audio file and final video—a simple time shift usually fixes synchronization issues.
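For example, if your captions need to appear 1.5 seconds later to line up with the edited video, you can shift every timestamp before writing the caption file. The offset value below is just an illustration:

OFFSET_MS = 1500  # Positive shifts captions later; negative shifts them earlier

shifted = []
for sentence in transcript.get_sentences():
    shifted.append({
        "text": sentence.text,
        "start": max(0, sentence.start + OFFSET_MS),
        "end": max(0, sentence.end + OFFSET_MS),
    })

# Feed the shifted timing data into your SRT or WebVTT writer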

Can I get timestamps in seconds instead of milliseconds from the API?

The API returns all timestamps in milliseconds, but you can easily convert them by dividing by 1000. This gives you decimal seconds that are easier to work with if you're building custom timing displays or doing manual calculations.

How accurate is speaker diarization for identifying different voices in my audio?

Speaker diarization works best with 2-4 clearly distinct speakers in good quality audio. Accuracy decreases with background noise, similar-sounding voices, or when speakers talk over each other—but it typically identifies speaker changes correctly in most conversation scenarios.
