I hope you like speedrunning, because we’re about to speedrun speech recognition in Python 3. For our speedrun we’ll need:
- An AssemblyAI API key
- An mp3 file
- Jupyter Notebook
Our final product (a short Python script) can be found here.
Before we start, let’s go over how to get an AssemblyAI API key and Jupyter Notebook.
AssemblyAI is an API for fast, automatic speech recognition. To get an AssemblyAI API key, go to assemblyai.com and sign up for an account to get a free API key. Your API key will be where I circled and blocked it out in red.
There are multiple ways to get Jupyter Notebook. Personally, I installed the Jupyter plugin for VS Code, but if you prefer to use Jupyter Notebook in its raw form, you can install it with pip. Once installed, you can open your notebook in VS Code, or launch it from the terminal:
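The commands might look like this (a minimal sketch, assuming Python and pip are already on your PATH; `notebook` is the pip package for classic Jupyter Notebook):

```shell
# Install Jupyter Notebook
pip install notebook

# Launch it from the terminal; a browser tab should open automatically
jupyter notebook
```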
From here on out, every block of code corresponds to a cell in our Jupyter notebook. First, our import. Line 1:
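Since the whole tutorial talks to a REST API, a single HTTP library is all we need. This sketch assumes the `requests` library, a common choice for HTTP calls in Python (install it with `pip install requests` if you don’t have it):

```python
# The only import our notebook needs: the requests HTTP library
import requests
```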
Now we’re going to add our authorization key and upload our mp3 file to AssemblyAI’s hosting service. We do this so we can send the resulting URL to be transcribed. The next 11 lines:
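A sketch of that upload cell. The endpoint path and header follow AssemblyAI’s v2 REST API; the API key is a placeholder, and `read_file` / `upload_file` are names I’ve chosen for illustration:

```python
import requests

# Placeholder key; substitute the one from your AssemblyAI dashboard
auth_key = "your-api-key-here"
headers = {"authorization": auth_key}
upload_endpoint = "https://api.assemblyai.com/v2/upload"

def read_file(filename, chunk_size=5_242_880):
    # Stream the mp3 in ~5 MB chunks so large files aren't loaded into memory at once
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def upload_file(filename):
    # POST the raw audio bytes; the response JSON contains an "upload_url"
    response = requests.post(upload_endpoint, headers=headers, data=read_file(filename))
    return response.json()

# upload_response = upload_file("my_audio.mp3")
```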
For reference, an upload response will look like this:
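The shape is roughly this (the URL shown is a placeholder, not a real upload):

```json
{
  "upload_url": "https://cdn.assemblyai.com/upload/your-file-id"
}
```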
Now we’ll send our upload URL to be transcribed. The next 3 lines:
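A sketch of that cell, wrapped in a function for clarity. The endpoint follows AssemblyAI’s v2 API; `request_transcript` is a name I’ve introduced, and `upload_url` is the value returned by the upload step:

```python
import requests

transcript_endpoint = "https://api.assemblyai.com/v2/transcript"

def request_transcript(upload_url, headers):
    # Ask AssemblyAI to transcribe the hosted audio;
    # the response JSON includes an "id" and a "status"
    response = requests.post(transcript_endpoint,
                             json={"audio_url": upload_url},
                             headers=headers)
    return response.json()
```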
For reference, the transcription endpoint will return a response like this:
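Abridged, with illustrative values (the real response carries many more fields):

```json
{
  "id": "your-transcript-id",
  "status": "queued",
  "audio_url": "https://cdn.assemblyai.com/upload/your-file-id",
  "text": null
}
```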
We’ll need to keep track of the `id` so we can poll the endpoint and get our final .txt file of transcribed text. These are the next 7 lines:
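A sketch of that polling cell. The endpoint path and the `status` / `text` field names follow AssemblyAI’s v2 API; `poll_and_save` is a name I’ve introduced for illustration:

```python
import requests

transcript_endpoint = "https://api.assemblyai.com/v2/transcript"

def poll_and_save(transcript_id, headers, out_path="transcript.txt"):
    # GET the transcript by id; once its status is "completed",
    # write the transcribed text to a .txt file
    response = requests.get(f"{transcript_endpoint}/{transcript_id}",
                            headers=headers).json()
    if response["status"] == "completed":
        with open(out_path, "w") as f:
            f.write(response["text"])
    return response["status"]
```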
And there you go, we’re done! We’ve written something in Python that will do speech recognition in 1 + 11 + 3 + 7 = 22 lines. Note that you’ll have to run the last code block a couple of times on its own to check whether the status of our transcription is complete.
Now that we’ve briefly explored how to do speech recognition in Python, we’ll go over how to make this Jupyter Notebook into a Python script that will automatically poll the endpoint until our transcription is done. When we’re done, it should look like this:
There are only a few more lines to add. We’ll take the code blocks above and combine them into one script that takes the name of the mp3 file as a command line argument. Then we’ll add a sleep timer so the script automatically polls the transcript endpoint until the transcript is complete. We can do this in a pretty short script, too: only 36 lines of code:
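A sketch of what that script might look like. The API key is a placeholder, the endpoints assume AssemblyAI’s v2 REST API, and the file name `transcribe.py` is illustrative:

```python
# transcribe.py -- usage: python transcribe.py my_audio.mp3
import sys
import time
import requests

auth_key = "your-api-key-here"  # placeholder; substitute your real key
headers = {"authorization": auth_key}
upload_endpoint = "https://api.assemblyai.com/v2/upload"
transcript_endpoint = "https://api.assemblyai.com/v2/transcript"

def read_file(filename, chunk_size=5_242_880):
    # Stream the mp3 in ~5 MB chunks
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def transcribe(filename):
    # 1) Upload the local mp3 to AssemblyAI's hosting service
    upload_url = requests.post(
        upload_endpoint, headers=headers, data=read_file(filename)
    ).json()["upload_url"]
    # 2) Submit the hosted URL for transcription
    transcript_id = requests.post(
        transcript_endpoint, headers=headers, json={"audio_url": upload_url}
    ).json()["id"]
    # 3) Poll every few seconds until the job finishes
    while True:
        result = requests.get(
            f"{transcript_endpoint}/{transcript_id}", headers=headers
        ).json()
        if result["status"] == "completed":
            return result["text"]
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(5)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Write the transcript next to the audio file, e.g. my_audio.txt
    with open(sys.argv[1].rsplit(".", 1)[0] + ".txt", "w") as f:
        f.write(transcribe(sys.argv[1]))
```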
A Brief History of Speech Recognition in Python
Speech recognition started at Bell Labs in the 1950s and has become an ever more popular and important topic in recent years. With the advent of personal assistants like Siri, Alexa, and others, the importance of machines being able to process speech has become more and more clear. Today there are many ways to do speech recognition programmatically in Python. Open source libraries such as Mozilla DeepSpeech and Wav2Letter provide ways for developers to do speech-to-text without having to create complex machine learning models.
However, these open source libraries leave some things to be desired, such as accuracy, ease of use, and further insight into the transcribed text. AssemblyAI was established specifically to deal with the issues we found difficult when creating our own speech recognition system, and it is designed to be fast, flexible, and powerful. AssemblyAI’s API provides not only speech recognition and transcription, but also simple ways to redact PII from the transcript, summarize it, categorize topics, and much more.
To extend what we’ve built here today to a command line tool, check out this code. To extend it even further, check out this project that will download a YouTube video and transcribe it.