Python Speech Recognition in Under 25 Lines of Code


I hope you like speedrunning because we’re about to speedrun speech recognition in Python 3. For our speedrun, we’ll need:

  • An AssemblyAI API key
  • An mp3 file
  • Jupyter Notebook

Our final product is a short Python script, shown in full at the end of this post.

Before we start, let’s go over how to get an AssemblyAI API key and Jupyter Notebook.

AssemblyAI is an API for fast, automatic speech recognition. To get an AssemblyAI API key, go to the AssemblyAI website and sign up for a free account. Your API key will be displayed on your account dashboard.
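Rather than pasting the key directly into your notebook, you can keep it in a separate `configure.py` module; the full script at the end of this post imports it with `from configure import auth_key`. A minimal version is a single assignment (the value below is a placeholder, not a real key):

```python
# configure.py -- keeps the AssemblyAI API key out of the main script
auth_key = "<your AssemblyAI API key here>"
```

Keeping the key in its own file also makes it easy to exclude from version control.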

There are multiple ways to get Jupyter Notebook. Personally, I installed the Jupyter plugin for VS Code, but if you prefer to use Jupyter Notebook in its raw form, you can run:

pip install notebook

You can open your Jupyter Notebook in VSCode if you use it, or in the terminal run:

jupyter notebook

From here on out, every block of code corresponds to a block in our Jupyter Notebook. First, our imports (line 1):

import requests

Now we’re going to add our authorization key and upload our mp3 file to AssemblyAI’s hosting service. We do this so we can send the URL to be transcribed. The next 11 lines:

auth_key = '<your AssemblyAI API key here>'
headers = {"authorization": auth_key, "content-type": "application/json"}
def read_file(filename):
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(5242880)
            if not data:
                break
            yield data
upload_response = requests.post('https://api.assemblyai.com/v2/upload', headers=headers, data=read_file('<path to your file here>'))
audio_url = upload_response.json()["upload_url"]

For reference, an upload response will look like this:

{'upload_url': '<URL pointing to your uploaded file>'}

Now we’ll send our uploaded link to be transcribed. The next 3 lines:

transcript_request = {'audio_url': audio_url}
transcript_response = requests.post("https://api.assemblyai.com/v2/transcript", json=transcript_request, headers=headers)
_id = transcript_response.json()["id"]

For reference, the transcription endpoint will return a response like this:

{'id': 'dwjpm0yok-3d85-4d34-ab96-600ed6c37cda', 
'language_model': 'assemblyai_default',
'acoustic_model': 'assemblyai_default',
'status': 'queued',
'audio_url': '<URL of your uploaded file>',
'text': None,
'words': None,
'utterances': None,
'confidence': None,
'audio_duration': None,
'punctuate': True,
'format_text': True,
'dual_channel': None,
'webhook_url': None,
'webhook_status_code': None,
'speed_boost': False,
'auto_highlights_result': None,
'auto_highlights': False,
'audio_start_from': None,
'audio_end_at': None,
'word_boost': [],
'boost_param': None,
'filter_profanity': False,
'redact_pii': False,
'redact_pii_audio': False,
'redact_pii_audio_quality': None,
'redact_pii_policies': None,
'redact_pii_sub': None,
'speaker_labels': False,
'content_safety': False,
'iab_categories': False,
'content_safety_labels': {},
'iab_categories_result': {}}

We’ll need to keep track of the id to poll the endpoint to get our final .txt file with our transcribed text. These are the next 7 lines:

polling_response = requests.get("https://api.assemblyai.com/v2/transcript/" + _id, headers=headers)
if polling_response.json()['status'] != 'completed':
    print(polling_response.json()['status'])
else:
    with open(_id + '.txt', 'w') as f:
        f.write(polling_response.json()['text'])
    print('Transcript saved to ' + _id + '.txt')

And we’re done, there you go! We’ve written something in Python that will do speech recognition in 1 + 11 + 3 + 7 = 22 lines. Take note that you’ll have to run the last code block a couple of times on its own to check whether the status of our transcription is complete.
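If you’d rather not eyeball the JSON each time you re-run that block, a tiny helper makes the check explicit. This is a sketch under the assumption that the `status` field moves through values like 'queued' and 'processing' before landing on 'completed' (or 'error'), as suggested by the sample response above; `is_done` is a hypothetical helper name, not part of the API:

```python
# Hypothetical helper: True once the transcript JSON says we can write the file.
def is_done(polling_json):
    return polling_json.get('status') == 'completed'

print(is_done({'status': 'queued'}))     # False
print(is_done({'status': 'completed'}))  # True
```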

Now that we’ve briefly explored how to do speech recognition in Python, we’ll go over how to make this Jupyter Notebook into a Python script that will automatically poll the endpoint until our transcription is done.

There are only a few more lines to add. What we’ll do is take our code blocks above and combine them into one script where the name of the mp3 file we’re using will be the argument we pass it. Then we’ll add an automatic sleep timer for our script to periodically poll the transcript endpoint until the transcript is complete. We can do this in a pretty short script too, just under 40 lines of code:

import sys
from configure import auth_key
import requests
from time import sleep
# store global constants
headers = {
    "authorization": auth_key,
    "content-type": "application/json"
}
transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
upload_endpoint = "https://api.assemblyai.com/v2/upload"
# make a function to pass the mp3 to the upload endpoint in chunks
def read_file(filename):
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(5242880)
            if not data:
                break
            yield data
# upload our audio file
upload_response = requests.post(upload_endpoint,
    headers=headers, data=read_file(sys.argv[1]))
print('Audio file uploaded')
# send a request to transcribe the audio file
transcript_request = {'audio_url': upload_response.json()['upload_url']}
transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)
print('Transcription Requested')
# set up polling
polling_response = requests.get(transcript_endpoint+"/"+transcript_response.json()['id'], headers=headers)
filename = transcript_response.json()['id'] + '.txt'
# if our status isn't complete, sleep and then poll again
while polling_response.json()['status'] != 'completed':
    sleep(30)
    polling_response = requests.get(transcript_endpoint+"/"+transcript_response.json()['id'], headers=headers)
    print("File is", polling_response.json()['status'])
# once the transcript is complete, write it to a text file
with open(filename, 'w') as f:
    f.write(polling_response.json()['text'])
print('Transcript saved to', filename)

A Brief History of Speech Recognition in Python

Speech recognition started in Bell Labs in the 1950s and has become an ever more popular and important topic in recent years. With the advent of personal assistants like Siri, Alexa, and others, the importance of machines being able to process speech has become more and more clear. Today there are many ways to do speech recognition programmatically in Python. Open source libraries such as Mozilla DeepSpeech and Wav2Letter provide ways for developers to do speech-to-text without having to create complex machine learning models.

However, these open source libraries leave some things to be desired, such as accuracy, ease of use, and further insight into the transcribed text. AssemblyAI was established specifically to deal with the issues that we found difficult to deal with when creating our own speech recognition system and is designed to be fast, flexible, and powerful. AssemblyAI’s API provides not only speech recognition and transcription, but also a simple way to redact PII from the transcript, summarization, topic categorization, and much more.
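Those extra features are switched on with flags in the same transcript request we built earlier. As a sketch: the flag names below mirror fields visible in the sample transcript response above, but check AssemblyAI’s docs for the current options, and note the `audio_url` value here is only a placeholder:

```python
# Illustrative request body with optional features enabled.
transcript_request = {
    'audio_url': '<URL of your uploaded file>',
    'redact_pii': True,       # remove personally identifiable information
    'auto_highlights': True,  # pull out key phrases from the audio
    'iab_categories': True,   # topic categorization of the transcript
}
```

You would POST this dict to the transcript endpoint exactly as before; the response then carries the extra results in the corresponding fields.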

To extend what we’ve built here today to a command line tool, check out this code. To extend it even further, check out this project that will download a YouTube video and transcribe it.

AssemblyAI is a top-rated API for speech recognition. To learn more about AssemblyAI, follow @assemblyai on Twitter; to keep up with me, the writer, follow @yujian_tang.
