Python Speech Recognition in Under 25 Lines of Code

In this simple tutorial, we show you how to do Speech Recognition in Python in under 25 lines of code. Let's get started!

A Brief History of Speech Recognition in Python

Speech recognition started at Bell Labs in the 1950s and has become an ever more popular and important topic in recent years. With the advent of personal assistants like Siri and Alexa, the need for machines to process speech has become increasingly clear. Today there are many ways to do speech recognition programmatically in Python. Open source libraries such as wav2letter and Mozilla DeepSpeech let developers do speech-to-text without having to build complex machine learning models.

However, these open source libraries leave some things to be desired, such as accuracy, ease of use, and further insight into the transcribed text. AssemblyAI is a free API for Automatic Speech-to-Text that was established specifically to deal with these issues. In this tutorial, we show you how to use the AssemblyAI Speech-to-Text API to transcribe your audio and video files in Python, in under 25 lines of code.

Prerequisites

I hope you like speed running, because we’re about to speed run speech recognition in Python 3. For this tutorial, we’ll need:

  • An AssemblyAI API key
  • An mp3 file
  • Jupyter Notebook

Our final product (a short Python script) can be found here.

Before we start, let’s go over how to get an AssemblyAI API key and Jupyter Notebook. AssemblyAI is an API for fast, automatic speech recognition. To get an AssemblyAI API key, sign up for a free AssemblyAI account. Once you are signed in, your API key is shown on your account page (circled and blocked out in red in the screenshot).

There are multiple ways to get Jupyter Notebook. Personally, I installed the Jupyter plugin for VSCode, but if you prefer to use Jupyter Notebook in its raw form, you can run:

pip install notebook

You can open your Jupyter Notebook in VSCode if you use it, or in the terminal run:

jupyter notebook

Python Speech Recognition Code

From here on out, every block of code corresponds to a block in our Jupyter notebook. First, our import, which is line 1:

import requests

Now we’re going to add our authorization key and upload our mp3 file to AssemblyAI’s hosting service. We do this so we can send the url to be transcribed. The next 11 lines:

auth_key = '<your AssemblyAI API key here>'
headers = {"authorization": auth_key, "content-type": "application/json"}
def read_file(filename):
   with open(filename, 'rb') as _file:
       while True:
           data = _file.read(5242880)
           if not data:
               break
           yield data
 
upload_response = requests.post('https://api.assemblyai.com/v2/upload', headers=headers, data=read_file('<path to your file here>'))
audio_url = upload_response.json()["upload_url"]

For reference, an upload response will look like this:

{'upload_url': 'https://cdn.assemblyai.com/upload/63928cd3-152e-4024-8e28-fd7174ec0b4d'}
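
The `read_file` generator above is what lets the upload stream the file in pieces instead of loading it into memory all at once. The sketch below shows only the chunking behavior, on a small temporary file and with no network call. The `chunk_size` parameter and the 4-byte demo value are additions for illustration; the tutorial's version hard-codes 5242880 bytes (5 MiB).

```python
import os
import tempfile

CHUNK_SIZE = 5242880  # 5 MiB, the chunk size used by read_file in the tutorial


def read_file(filename, chunk_size=CHUNK_SIZE):
    """Yield the contents of filename in chunks of at most chunk_size bytes."""
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                # An empty bytes object means we have reached the end of the file.
                break
            yield data


# Write ten bytes to a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'0123456789')
    tmp_path = tmp.name

# A tiny chunk size (an assumption for the demo) makes the splitting visible.
chunks = list(read_file(tmp_path, chunk_size=4))
os.remove(tmp_path)

print(chunks)  # [b'0123', b'4567', b'89']
```

When a generator like this is passed as `data=` to `requests.post`, requests sends the body as a chunked upload, which is why large audio files do not need to fit in memory.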

Now we’ll send our uploaded link to be transcribed. The next 3 lines:

transcript_request = {'audio_url': audio_url}
transcript_response = requests.post("https://api.assemblyai.com/v2/transcript", json=transcript_request, headers=headers)
_id = transcript_response.json()["id"]

For reference, the transcription endpoint will return a response like this:

{'id': 'dwjpm0yok-3d85-4d34-ab96-600ed6c37cda', 
'language_model': 'assemblyai_default',
'acoustic_model': 'assemblyai_default',
'status': 'queued',
'audio_url': 'https://cdn.assemblyai.com/upload/63928cd3-152e-4024-8e28-fd7174ec0b4d',
'text': None,
'words': None,
'utterances': None,
'confidence': None,
'audio_duration': None,
'punctuate': True,
'format_text': True,
'dual_channel': None,
'webhook_url': None,
'webhook_status_code': None,
'speed_boost': False,
'auto_highlights_result': None,
'auto_highlights': False,
'audio_start_from': None,
'audio_end_at': None,
'word_boost': [],
'boost_param': None,
'filter_profanity': False,
'redact_pii': False,
'redact_pii_audio': False,
'redact_pii_audio_quality': None,
'redact_pii_policies': None,
'redact_pii_sub': None,
'speaker_labels': False,
'content_safety': False,
'iab_categories': False,
'content_safety_labels': {},
'iab_categories_result': {}}

We’ll need to keep track of the id to poll the endpoint to get our final .txt file with our transcribed text. These are the next 7 lines:

polling_response = requests.get("https://api.assemblyai.com/v2/transcript/" + _id, headers=headers)
if polling_response.json()['status'] != 'completed':
   print(polling_response.json())
else:
   with open(_id + '.txt', 'w') as f:
       f.write(polling_response.json()['text'])
   print('Transcript saved to', _id + '.txt')

And we’re done! We’ve written something in Python that will do speech recognition in 1 + 11 + 3 + 7 = 22 lines. Note that the last code block checks the status only once, so you’ll have to rerun it on its own a few times until the status of our transcription is completed.
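
Rerunning the polling block by hand can be replaced with a small loop. The following is a minimal sketch, not AssemblyAI's official client: `wait_for_transcript`, its parameters, and the fake responses are hypothetical names introduced here. It also assumes the API reports a failed job with a status of `'error'` and an `'error'` message field; only `'queued'` and `'completed'` appear in the tutorial itself. The status fetcher is passed in as a function, so the sketch runs without network access.

```python
from time import sleep


def wait_for_transcript(fetch_status, interval=5, max_attempts=60, pause=sleep):
    """Poll until the transcript completes, fails, or attempts run out.

    fetch_status is any zero-argument callable returning the parsed JSON dict
    from the transcript endpoint.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        status = result.get('status')
        if status == 'completed':
            return result
        if status == 'error':
            # Assumption: failed jobs carry a human-readable 'error' field.
            raise RuntimeError(result.get('error', 'transcription failed'))
        pause(interval)
    raise TimeoutError('transcript not ready after %d attempts' % max_attempts)


# Stand-in for the real GET request so the sketch runs offline.
fake_responses = iter([
    {'status': 'queued'},
    {'status': 'processing'},
    {'status': 'completed', 'text': 'hello world'},
])
result = wait_for_transcript(lambda: next(fake_responses), pause=lambda seconds: None)
print(result['text'])  # hello world
```

With the variables from the notebook, the real call could look like `wait_for_transcript(lambda: requests.get("https://api.assemblyai.com/v2/transcript/" + _id, headers=headers).json())`. Stopping on an error status and capping the number of attempts keeps the loop from running forever if a job fails.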

Extending to an automatic Python speech recognition script

Now that we’ve briefly explored how to do speech recognition in Python, we’ll go over how to make this Jupyter Notebook into a Python script that will automatically poll the endpoint until our transcription is done. When we’re done, it should look like this:

There are only a few more lines to add. What we’ll do is take our code blocks above and combine them into one script where the name of the mp3 file we’re using will be the argument we pass it. Then we’ll add an automatic sleep timer for our script to periodically poll the transcript endpoint until the transcript is complete. We can do this in a pretty short script too, only 36 lines of code:

import sys
from configure import auth_key
import requests
import pprint
from time import sleep
 
# store global constants
headers = {
   "authorization": auth_key,
   "content-type": "application/json"
}
transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
upload_endpoint = 'https://api.assemblyai.com/v2/upload'
 
# make a function to pass the mp3 to the upload endpoint
def read_file(filename):
   with open(filename, 'rb') as _file:
       while True:
           data = _file.read(5242880)
           if not data:
               break
           yield data
 
# upload our audio file
upload_response = requests.post(
   upload_endpoint,
   headers=headers, data=read_file(sys.argv[1])
)
print('Audio file uploaded')
 
# send a request to transcribe the audio file
transcript_request = {'audio_url': upload_response.json()['upload_url']}
transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)
print('Transcription Requested')
pprint.pprint(transcript_response.json())
# set up polling
polling_response = requests.get(transcript_endpoint+"/"+transcript_response.json()['id'], headers=headers)
filename = transcript_response.json()['id'] + '.txt'
# if our status isn’t complete, sleep and then poll again
while polling_response.json()['status'] != 'completed':
   sleep(30)
   polling_response = requests.get(transcript_endpoint+"/"+transcript_response.json()['id'], headers=headers)
   print("File is", polling_response.json()['status'])
with open(filename, 'w') as f:
   f.write(polling_response.json()['text'])
print('Transcript saved to', filename)
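
The script imports `auth_key` from a local `configure` module that the tutorial does not show. One possible way to supply it, sketched below, is to read the key from an environment variable so it never lands in source control. The helper `load_auth_key` and the variable name `ASSEMBLYAI_API_KEY` are hypothetical choices for this sketch, not something the tutorial or the API requires.

```python
import os


def load_auth_key(env=None, var_name='ASSEMBLYAI_API_KEY'):
    """Return the API key stored under var_name, or raise if it is missing.

    var_name is a hypothetical environment variable name chosen for this sketch.
    """
    if env is None:
        env = os.environ
    key = env.get(var_name, '').strip()
    if not key:
        raise RuntimeError('Set the %s environment variable to your API key' % var_name)
    return key


# Demonstration with a plain dict standing in for os.environ.
demo_key = load_auth_key({'ASSEMBLYAI_API_KEY': '  my-secret-key  '})
print(demo_key)  # my-secret-key
```

Under this assumption, a `configure.py` next to the script would only need the helper plus `auth_key = load_auth_key()`, and the script's `from configure import auth_key` line would work unchanged.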

To extend what we’ve built here today to a command line tool, check out this code. To extend it even further, check out this project that will download a YouTube video and transcribe it.

AssemblyAI is a top rated API for speech recognition. To learn more about AssemblyAI, follow @assemblyai on Twitter. To keep up with me, the writer, follow @yujian_tang.