
Real Time Speech Recognition with Python


Overview:

In this tutorial, we’ll use AssemblyAI’s real-time transcription to transcribe audio from the microphone as we speak. Please note that this is a paid feature. We’ll use the Python PyAudio library to stream sound from our microphone and the Python websockets library to connect to AssemblyAI’s streaming websocket endpoint. First, we’ll go over how to install PyAudio and websockets. Then we’ll cover how to use PyAudio, websockets, and asyncio to transcribe our mic’s audio in real time.

Prerequisites:

  • Install PyAudio
  • Install Websockets

Install PyAudio

We can install PyAudio with 

pip install pyaudio

How to install PyAudio on Mac

You may get an error message about portaudio not being found. You can install portaudio with Homebrew

brew install portaudio


After installing portaudio, you should be able to install PyAudio as shown above. 

How to install PyAudio on Windows

You may get an error message about Microsoft Visual C++ being required. In that case, you’ll need to find the PyAudio wheel (.whl file) that matches your Python version and architecture, download it, and install it as shown above, replacing "pyaudio" with the name of your .whl file.
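For example, assuming the wheel you downloaded is named PyAudio-0.2.11-cp39-cp39-win_amd64.whl (the exact filename depends on your Python version and system, so yours will likely differ), the install command would look like

pip install PyAudio-0.2.11-cp39-cp39-win_amd64.whl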

Install Websockets

We can install websockets with

pip install websockets

Get your AssemblyAI API Key

After installing PyAudio and websockets, we’ll need to get our API key from AssemblyAI. You will find your key on your AssemblyAI console.


If you want to follow this tutorial and you’re on a free account, you can upgrade by going to Billing in the console and clicking Upgrade.
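The code later in this tutorial loads the key with "from configure import auth_key", so create a file called configure.py in the same folder as your script and put your key in it. A minimal version looks like this, with a placeholder where your actual key goes:

# configure.py
# replace the placeholder below with your AssemblyAI API key
auth_key = "YOUR_ASSEMBLYAI_API_KEY"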


Steps:

  1. Open up a stream with Python PyAudio
  2. Make an async function that will send and receive data
  3. Use async to run in a loop (automatically time out after 1 minute)
  4. Run it in the terminal and talk

Open up a stream with Python PyAudio

The first thing we’ll do is open up a stream using the Python PyAudio library that we just installed above. To do this, we’ll need to specify the number of frames per buffer, the sample format, the number of channels, and the sample rate. Our stream will use 3200 frames per buffer, PyAudio’s 16-bit format, 1 channel, and a sample rate of 16000 Hz.

import pyaudio
 
FRAMES_PER_BUFFER = 3200
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
p = pyaudio.PyAudio()
 
# starts recording
stream = p.open(
   format=FORMAT,
   channels=CHANNELS,
   rate=RATE,
   input=True,
   frames_per_buffer=FRAMES_PER_BUFFER
)
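Each buffer holds 3200 frames at 16000 Hz, which works out to 0.2 seconds of audio. If you want to confirm the microphone stream is working before wiring it up to the websocket, you can read a single buffer from it. This is just an optional sanity check, not part of the final script:

# optional sanity check: read one buffer (0.2 s of audio) from the mic
chunk = stream.read(FRAMES_PER_BUFFER)
print(len(chunk))  # 6400 bytes: 3200 frames * 2 bytes per 16-bit sample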

Make an async function that will send and receive data

Now that we’ve opened our Python PyAudio stream and can get audio from the mic into our program, we need to connect it to the AssemblyAI real-time websocket to transcribe it. We’ll start by making an asynchronous function that schedules sending and receiving data concurrently. Inside that function we’ll open a websocket connection and define two more asynchronous functions that use the open websocket. We’ll use Python asyncio’s gather function to concurrently wait for both the send and receive functions to execute.

Our asynchronous send function will continuously read one buffer of audio at a time from the PyAudio stream (the 3200 frames per buffer we configured above), encode it as base64, decode that into a string, wrap it in a JSON message, and send it to the AssemblyAI websocket endpoint, stopping only if the connection is closed.

Our asynchronous receive function will continuously await results from the endpoint and print out the transcribed text as it arrives.

Then we use asyncio.gather to wait for both functions to execute concurrently.


import websockets
import asyncio
import base64
import json
from configure import auth_key
 
# the AssemblyAI endpoint we're going to hit
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"
 
async def send_receive():
    print(f'Connecting websocket to url {URL}')
    async with websockets.connect(
        URL,
        extra_headers=(("Authorization", auth_key),),
        ping_interval=5,
        ping_timeout=20
    ) as _ws:
        await asyncio.sleep(0.1)
        print("Receiving SessionBegins ...")
        session_begins = await _ws.recv()
        print(session_begins)
        print("Sending messages ...")

        async def send():
            while True:
                try:
                    # read one buffer of audio from the mic and base64-encode it
                    data = stream.read(FRAMES_PER_BUFFER)
                    data = base64.b64encode(data).decode("utf-8")
                    json_data = json.dumps({"audio_data": str(data)})
                    await _ws.send(json_data)
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    assert e.code == 4008
                    break
                except Exception as e:
                    assert False, "Not a websocket 4008 error"
                await asyncio.sleep(0.01)

            return True

        async def receive():
            while True:
                try:
                    # print the transcript text from each message the endpoint sends back
                    result_str = await _ws.recv()
                    print(json.loads(result_str)['text'])
                except websockets.exceptions.ConnectionClosedError as e:
                    print(e)
                    assert e.code == 4008
                    break
                except Exception as e:
                    assert False, "Not a websocket 4008 error"

        send_result, receive_result = await asyncio.gather(send(), receive())

Use async to run in a loop (automatically time out after 1 minute)

We’ve finished writing our function that will both send and receive data asynchronously between our PyAudio stream and the AssemblyAI websocket. Now we’ll use asyncio to run it until we get a timeout or an error, using the following line.

asyncio.run(send_receive())
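asyncio.run is available in Python 3.7 and newer; it creates an event loop, runs send_receive until it finishes or raises, and then closes the loop for us.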


Run it in the terminal and talk

That’s all the coding we need to do. Now we open up our terminal and run our program. As we talk into the microphone, the transcribed text prints to the terminal in real time.
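Assuming you saved the script from this tutorial as, say, transcribe.py (the filename is up to you), running it looks like

python transcribe.py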
