November 10, 2025

Python Speech-to-Text with Punctuation, Casing, and Formatting

Learn how to transcribe audio and video files into text that contains punctuation, casing and formatting using the AssemblyAI Python SDK.

Matt Makai

When you need to convert speech to text in Python, you'll find a lot of options. Many tutorials point to basic open-source libraries that are great for simple projects, but as industry analysis shows, they often fall short in production environments where accuracy, speed, and reliability are critical. Getting clean, formatted text with proper punctuation and casing is often a significant challenge that requires extra work.

This tutorial cuts through the noise. We'll show you how to get production-ready transcripts using the AssemblyAI Python SDK. You'll learn how to transcribe audio files and get back text that's already formatted with punctuation and casing, saving you significant post-processing time. We'll cover everything from setting up your environment to handling different audio sources and preparing your code for real-world applications.

Project Prerequisites

You'll need three things to follow this tutorial:

Python 3.8 or newer
The AssemblyAI Python SDK
An AssemblyAI API key, which can be copied from the AssemblyAI dashboard

All code in this blog post is available under the MIT license on GitHub, in the transcribe-english-punctuation directory of the assemblyai-examples Git repository, so you may use it freely in your own projects.

Python Development Environment Configuration

Change into the directory where you keep your Python virtual environments. Create a new virtual environment for this project using the following command:

python -m venv transcriber

Activate the virtual environment with the activation script on macOS or Linux:

source ./transcriber/bin/activate

On Windows, the virtual environment can be activated with the activate.bat script:

.\transcriber\Scripts\activate.bat

Install the AssemblyAI Python SDK package:

pip install assemblyai

Log into your AssemblyAI account, or sign up for a new account if you don't already have one. Find and click the "Copy token" button on the dashboard.

Switch back to your console and set the ASSEMBLYAI_API_KEY environment variable to the token you just copied. On Windows, you can set an environment variable through the Windows settings. On macOS or Linux, run this command, pasting the token between the quotes:

export ASSEMBLYAI_API_KEY='' # paste this value from assemblyai.com/dashboard
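
Before moving on, it can help to confirm that Python actually sees the variable. The following is a minimal sketch that uses only the standard library; get_api_key is a hypothetical helper name chosen for illustration and is not part of the AssemblyAI SDK.

```python
import os


def get_api_key(env=os.environ):
    """Return the AssemblyAI API key from the environment, or raise if it is missing."""
    key = env.get("ASSEMBLYAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ASSEMBLYAI_API_KEY is not set; export it before running the script."
        )
    return key


if __name__ == "__main__":
    try:
        key = get_api_key()
    except RuntimeError as err:
        print(err)
    else:
        # Report only the length so the key itself never ends up in logs.
        print(f"Found an API key with {len(key)} characters.")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API call.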

Audio Input Options for Python Speech-to-Text

Before we write our transcription script, it's important to know what kind of audio sources you can work with. The AssemblyAI Python SDK is flexible and supports several common scenarios that developers face.

Local audio files: You can directly upload and transcribe audio files (like MP3, WAV, M4A, and more) from your local machine.
Remote audio files: If your file is hosted online, you can provide a public URL for transcription, which is the method we'll use in our main example.
Real-time audio streams: For applications that need immediate transcription, like live captioning or voice command systems, you can use our streaming transcription model. While this tutorial focuses on file-based transcription, our documentation provides clear guidance for real-time use cases.

This flexibility means you can use the same core logic whether you're building an asynchronous file processing pipeline or an interactive voice application.
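
Since the same call accepts either a URL or a local path, you may want to validate the input before submitting it. Here is a small sketch under that assumption; classify_source is a hypothetical helper written for this post, and it only checks that the source looks like an http(s) URL or points at an existing file.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source):
    """Return "remote" for http(s) URLs and "local" for existing files."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "remote"
    if Path(source).is_file():
        return "local"
    raise ValueError(f"Not a URL or an existing file: {source!r}")
```

A check like this catches typos in file paths on your machine, before any upload or API request is made.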

Transcription Python Code

Our dependencies are installed and the environment variable is set, so next we will write the Python code to handle the transcription.

Our transcription script will be compact because the AssemblyAI Python SDK wraps the calls to the AssemblyAI API and removes most of the boilerplate. Create a new file named transcriber.py and copy in the following code, including the comments.

import assemblyai as aai

# The URL of the file we want to transcribe
URL = "https://talkpython.fm/episodes/download/356/tips-for-ml-ai-startups.mp3"

# The file to write the results to, or leave empty to print to console
OUTPUT_FILENAME = "podcast.txt"

# Instantiate the AssemblyAI Python SDK.
# It will automatically use the ASSEMBLYAI_API_KEY environment variable.
# You can also set it manually with: aai.settings.api_key = "YOUR_API_KEY"
transcriber = aai.Transcriber()

# Call the API. This will block until the API call finishes.
# Punctuation and formatting are enabled by default.
transcript = transcriber.transcribe(URL)

# Write to a file or print to console if no OUTPUT_FILENAME is set
if OUTPUT_FILENAME:
    with open(OUTPUT_FILENAME, "w") as file:
        file.write(transcript.text)
else:
    print(transcript.text)

The code imports assemblyai as aai and sets two key variables:

URL: The audio file location (remote URL or local path)
OUTPUT_FILENAME: Where to save results (optional—prints to console if empty)

We're transcribing a Talk Python episode as our example audio source.

The code is simple because AssemblyAI enables key formatting features like automatic punctuation and casing by default. You don't need to pass any special configuration to get a production-ready transcript.
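
If you would rather spell those defaults out, or switch them off to compare against raw output, the flags can be passed explicitly through a TranscriptionConfig. This is a sketch that assumes the SDK is installed and the API key is set; the transcribe_with_options helper and the two option dictionaries are names invented for this example.

```python
# Formatting flags for aai.TranscriptionConfig; both are enabled by default.
FORMATTED = {"punctuate": True, "format_text": True}
RAW = {"punctuate": False, "format_text": False}


def transcribe_with_options(url, options):
    """Transcribe `url` with explicit punctuation and formatting flags."""
    # Imported here so this module can be loaded without the SDK installed.
    import assemblyai as aai

    config = aai.TranscriptionConfig(**options)
    return aai.Transcriber().transcribe(url, config)
```

Calling transcribe_with_options(URL, RAW) and transcribe_with_options(URL, FORMATTED) on the same file is a quick way to see how much work the default formatting does for you.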

The Transcriber object handles API authentication and requests.

API key configuration:

Automatic: Reads from ASSEMBLYAI_API_KEY environment variable
Manual: Set via aai.settings.api_key = "your_key"

The transcribe() method processes your audio synchronously—blocking until complete.

For asynchronous processing, use webhooks instead.

Output handling: saves to file if OUTPUT_FILENAME is set, otherwise prints to console.
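
The output step can also be pulled into a small function that guards against a missing transcript. The sketch below assumes the text may be None when a job does not complete; write_transcript is a hypothetical helper, and it writes UTF-8 explicitly so the result does not depend on the platform's default encoding.

```python
def write_transcript(text, filename=""):
    """Write `text` to `filename`, or print it when no filename is given."""
    if text is None:
        # Assumption: a transcript that did not complete has no text.
        raise ValueError("No transcript text to write.")
    if filename:
        with open(filename, "w", encoding="utf-8") as file:
            file.write(text)
    else:
        print(text)
```

In transcriber.py this would replace the final if/else with write_transcript(transcript.text, OUTPUT_FILENAME).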

Running the Transcription Code

Ensure the transcriber.py file is saved and that your virtual environment is still activated. Navigate to the project directory in a terminal and run the transcriber.py file with the following command:

python transcriber.py

Once the script has finished executing, the transcription will either be printed to your console or written to a file (podcast.txt by default), with the following contents:

Have you been considering launching a product or even a business based on Python's AI ML Stack? We have a great guest on this episode, Dylan Fox, who is the founder of AssemblyAI and has been building his startup successfully over the past few years. He has interesting stories of hundreds of GPUs in the cloud, evolving ML models, and much more that I know you can enjoy hearing....

Error Handling and Troubleshooting

In a production environment, things can go wrong. A file URL might be invalid, an API key could be incorrect, or the audio file itself might be corrupted. Building robust error handling into your script is essential, especially since a recent survey found that over 30% of tech leaders see data privacy and security as a significant challenge when building with speech recognition.

The AssemblyAI SDK raises specific exceptions that you can catch to handle these issues gracefully. Here's how you can modify the transcription call within a try...except block:

import assemblyai as aai

# Your previous setup code here...

try:
    # No config argument is needed: punctuation and formatting are on by default.
    transcript = transcriber.transcribe(URL)
    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcription failed: {transcript.error}")
    else:
        print(transcript.text)
except aai.errors.APIError as e:
    print(f"AssemblyAI API Error: {e}")

By checking transcript.status, you can handle cases where the transcription job is accepted but fails during processing. Wrapping the call in a try...except aai.errors.APIError block helps you catch issues like authentication failures or network problems before the transcription even starts.
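
For transient failures such as a dropped connection, one common pattern is to retry the call a few times with a growing delay. The following is a generic sketch, not something the SDK provides: with_retries and its parameters are hypothetical, and which exceptions are worth retrying is left for you to decide.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Run `call()` and retry with exponential backoff on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, then 4x, and so on between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

With the script above, usage might look like transcript = with_retries(lambda: transcriber.transcribe(URL)). Narrow retry_on to the exceptions you consider temporary, since retrying an invalid API key will never succeed.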

Performance and Cost Considerations

When building with any API, it's smart to think about performance and cost. The code in this tutorial uses a synchronous, or blocking, call. This is simple and works well for quick tests or smaller files, but for large files, it will tie up your script until the transcription is complete.

For production applications, use asynchronous processing:

Webhook approach: Submit job, receive results via HTTP callback
Non-blocking: Your application continues while transcription processes
Scalable: Handle multiple files simultaneously
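
The webhook flow above can be sketched in two parts: submitting the job without waiting, and reading the callback when it arrives. This assumes the SDK's set_webhook and submit calls behave as described in its documentation, and that the callback body is JSON with transcript_id and status fields. Both helper names are hypothetical, and the webhook URL is whatever public endpoint you control.

```python
import json


def submit_with_webhook(audio_url, webhook_url):
    """Queue a transcription job and return without waiting for the result."""
    # Imported here so the parsing helper below works without the SDK installed.
    import assemblyai as aai

    config = aai.TranscriptionConfig().set_webhook(webhook_url)
    # submit() queues the job instead of blocking the way transcribe() does.
    return aai.Transcriber().submit(audio_url, config)


def parse_webhook_payload(body):
    """Extract (transcript_id, status) from the JSON body of a webhook request.

    The field names are an assumption; check them against the webhook docs.
    """
    payload = json.loads(body)
    return payload["transcript_id"], payload["status"]
```

Your web framework's request handler would pass the raw request body to parse_webhook_payload and then fetch the finished transcript by its ID once the status shows the job is complete.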

On the cost side, AssemblyAI's pay-as-you-go pricing is designed for innovation. There are no upfront commitments, making it easy for developers at companies like Veed and CallSource to start building and scale as they grow. This model avoids the significant overhead and unpredictable costs associated with deploying and maintaining open source models. As new industry research highlights, these DIY approaches require dedicated infrastructure, top-tier AI talent, and ongoing engineering resources for maintenance and updates.

Next Steps for Python Speech-to-Text Development

You've successfully transcribed an audio file and received a clean, formatted transcript. But this is just the beginning. A transcript is the foundation for deeper speech understanding.

What can you do next? With the same API, you can enable powerful features to extract more insights from your audio data:

Speaker Diarization: Identify who said what and when.
Sentiment Analysis: Detect the sentiment of each sentence in your transcript.
Summarization: Generate summaries of your audio content in different formats.
Entity Detection: Automatically identify key entities like names, dates, and locations.

These capabilities allow you to move beyond simple transcription and build sophisticated Voice AI features directly into your applications. To explore these features and see what else is possible, try our API for free.
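
As a rough illustration, the features listed above are switched on through flags on the same TranscriptionConfig object. The sketch below assumes the SDK is installed and that diarized results expose utterances with speaker and text attributes; INSIGHT_OPTIONS, format_utterances, and transcribe_with_insights are names made up for this example.

```python
from types import SimpleNamespace

# Flags for aai.TranscriptionConfig that enable the features listed above.
INSIGHT_OPTIONS = {
    "speaker_labels": True,  # speaker diarization
    "sentiment_analysis": True,
    "summarization": True,
    "entity_detection": True,
}


def format_utterances(utterances):
    """Render diarized utterances as 'Speaker A: text' lines."""
    return "\n".join(f"Speaker {u.speaker}: {u.text}" for u in utterances)


def transcribe_with_insights(url):
    """Transcribe `url` with the extra analysis features enabled."""
    # Imported here so the formatting helper works without the SDK installed.
    import assemblyai as aai

    config = aai.TranscriptionConfig(**INSIGHT_OPTIONS)
    return aai.Transcriber().transcribe(url, config)


if __name__ == "__main__":
    # Stand-in objects shaped like utterances, so the demo needs no API call.
    demo = [
        SimpleNamespace(speaker="A", text="Welcome to the show."),
        SimpleNamespace(speaker="B", text="Thanks for having me."),
    ]
    print(format_utterances(demo))
```

With a real transcript you would call format_utterances(transcript.utterances) after transcribe_with_insights returns. Enabling only the features you need keeps responses smaller.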

Frequently Asked Questions about Python Speech-to-Text Implementation

How do I transcribe a local audio file instead of a URL?

Pass the local file path to transcribe()—the SDK automatically handles upload.

# The SDK automatically handles local file uploads. No config is needed for default transcription.
transcript = transcriber.transcribe('./path/to/your/audio.mp3')

What's the best way to handle very large audio files?

Use asynchronous transcription with webhooks to avoid blocking your application during processing.

Can I perform real-time speech-to-text with Python?

Yes, use AssemblyAI's streaming model to transcribe microphone input with low latency. According to the streaming API documentation, you can receive transcription results within a few hundred milliseconds.

How does AssemblyAI compare to open source Python libraries?

Compared to open source libraries that often lack advanced features, feature-rich cloud solutions like AssemblyAI provide higher accuracy and production-ready capabilities like speaker diarization and summarization.