How to Convert an MP3 File to Text with an API

In this tutorial, I’ll show you how to convert an MP3 file to text with an API. AssemblyAI makes a free, fast, simple-to-use Speech-to-Text API. Converting MP3 files to text with nothing more than an API call has only become possible in the last few years. That’s because automatic speech transcription technology has become far more accurate over that time, and is now nearly as accurate as a human.

Let’s get started!

A Quick Primer on Automatic Transcription

A few years ago, automatic transcription technology wasn't really available to regular software developers like you and me. It was reserved for huge enterprise companies like Apple and BMW. Nowadays, this technology is available to developers through simple-to-use APIs, much like Twilio or Stripe.

With the recent advances in deep learning, the accuracy of Speech-to-Text technology is quickly approaching human level. Today, in just a few lines of code, we can use an API to convert our own MP3 files to text with near-human accuracy.

In this example, we'll use AssemblyAI's API for automatic transcription. AssemblyAI’s API is not only free, fast, and simple to use, but also comes with a number of plug-and-play features. In the section below, I’ll show you how to convert an MP3 file to text using AssemblyAI’s API.


Convert an MP3 File to Text

To start converting an MP3 file to text, you’ll need an API key for AssemblyAI’s Speech-to-Text API. Once you sign up, you can find your API key in the console, at the spot circled in red in the picture below. You should store the key as an environment variable or in a separate configuration file rather than in your code.
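As a minimal sketch of the environment-variable approach, the helper below reads the key at startup and fails early if it is missing. The variable name `ASSEMBLYAI_API_KEY` and the function `load_api_key` are assumptions made for this example, not names required by AssemblyAI.

```python
import os

def load_api_key(env_var="ASSEMBLYAI_API_KEY", environ=os.environ):
    """Return the API key stored in an environment variable.

    The default variable name is a hypothetical choice for this sketch;
    use whatever name you exported the key under.
    """
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the " + env_var + " environment variable to your API key")
    return key
```

With the key exported in your shell, `auth_key = load_api_key()` can stand in for a hard-coded key in the snippets that follow.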

If you don’t already have an MP3 file to start with, I have one you can download. I chose a video on how our brains process speech, a TED-Ed talk by Gareth Gaskell. It’s not about how to convert your MP3 file to text, but it is interesting. It talks about how we, as humans (not the machines), understand language. He covers how many words people know on average, how scientists think our brains recognize language, and how we acquire new words.

Alright, on to how we actually convert this MP3 file to a text file with AssemblyAI’s speech recognition API. The entire process can be broken down into three simple steps:

1. Upload the MP3 file to AssemblyAI’s API
2. Start the transcription job
3. Get the result of the transcription job

Now to the code!

We’ll need to import our API key or define it inline, as shown below. Then we’ll define the headers to include in our API calls to AssemblyAI, which is where the API key goes. To upload our MP3 file, we send a POST request to the AssemblyAI upload endpoint with those headers. The request data comes from a generator function that reads the MP3 file as bytes and yields it in 5 MB chunks.

import requests

auth_key = '<your AssemblyAI API key here>'

# Headers sent with every request; the API key goes in "authorization"
headers = {
    "authorization": auth_key,
    "content-type": "application/json"
}

def read_file(filename):
    # Yield the file in 5 MB chunks so it is never loaded into memory all at once
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(5242880)
            if not data:
                break
            yield data

# Upload the file; the JSON response contains the URL of the uploaded audio
upload_response = requests.post('https://api.assemblyai.com/v2/upload', headers=headers, data=read_file('<path to your file here>'))
audio_url = upload_response.json()['upload_url']

In the JSON response, there will be an upload_url key that points to the file we uploaded to AssemblyAI. This file is only accessible to AssemblyAI’s servers, so you won’t be able to access this URL in your browser.
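To make the chunked read used for the upload more concrete, here is a small standalone sketch of the same generator pattern run on a throwaway 12-byte file. The name `read_in_chunks` and its `chunk_size` parameter are illustrative additions; the tutorial's `read_file` hard-codes 5242880 bytes (5 MB).

```python
import os
import tempfile

def read_in_chunks(filename, chunk_size=5242880):
    """Yield the file's bytes in pieces of at most chunk_size bytes."""
    with open(filename, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

# Demo: write 12 bytes to a temporary file and read them back 5 at a time
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdefghijkl")
chunk_sizes = [len(chunk) for chunk in read_in_chunks(tmp.name, chunk_size=5)]
os.remove(tmp.name)
print(chunk_sizes)  # [5, 5, 2]
```

When `requests.post` is given a generator as `data`, it streams the chunks to the server one at a time, so even a large MP3 file does not have to fit in memory.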

Next, we pass the upload_url to the transcription endpoint, using the same headers as in our prior request. This tells AssemblyAI to convert our MP3 file to text.

transcript_request = {'audio_url': audio_url}
endpoint = "https://api.assemblyai.com/v2/transcript"
transcript_response = requests.post(endpoint, json=transcript_request, headers=headers)
_id = transcript_response.json()['id']

Once the transcription request is accepted, we get back a JSON response containing an id. We need to save the id so that we can poll for the status of our transcription. The polling endpoint is built from the transcription endpoint by appending the id we received in the initial transcription response.

When we get a response back from the polling endpoint, we check the status of the transcript to see whether it’s completed. If it isn’t, we print the polling endpoint’s response so we can see the current status and make sure there haven’t been any errors. Once the transcript is completed, we can save the text to a text file!

endpoint = "https://api.assemblyai.com/v2/transcript/" + _id
polling_response = requests.get(endpoint, headers=headers)
if polling_response.json()['status'] != 'completed':
    # Not finished yet (or failed): print the full response to inspect the status
    print(polling_response.json())
else:
    with open(_id + '.txt', 'w') as f:
        f.write(polling_response.json()['text'])
    print('Transcript saved to ' + _id + '.txt')
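The snippet above checks the status once, so for longer files you would rerun it until the transcript is ready. One way to automate that is a simple polling loop, sketched below. The helper `wait_for_transcript`, its parameters, and the 3-second interval are assumptions for illustration; it also assumes the status field becomes 'error' (with an 'error' message) when a job fails.

```python
import time

def wait_for_transcript(fetch, interval=3.0, max_attempts=100, sleep=time.sleep):
    """Call fetch() repeatedly until the transcript is completed or has failed.

    fetch is any zero-argument callable that returns the polling endpoint's
    JSON response as a dict, for example:
        lambda: requests.get(endpoint, headers=headers).json()
    """
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "completed":
            return result["text"]
        if status == "error":
            # Assumed failure shape: status 'error' plus an 'error' message
            raise RuntimeError(result.get("error", "transcription failed"))
        sleep(interval)  # still queued or processing: wait, then ask again
    raise TimeoutError("transcript not ready after %d attempts" % max_attempts)
```

Passing the request in as a callable keeps the loop independent of the HTTP library, which also makes it easy to try out with canned responses before pointing it at the real endpoint.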

It’s as simple as that. All we have to do to convert an MP3 file to text using AssemblyAI’s Speech-to-Text API is get an API key, upload our MP3 file to the API, and make two simple API calls!

Conclusion

Nowadays, any developer can access speech recognition technology through the use of a cloud API like AssemblyAI’s speech recognition API. We demonstrated how you can use AssemblyAI’s API to convert an mp3 file into text. For more information on speech recognition technology, follow us on Twitter @assemblyai and @yujian_tang!