DeepSpeech for Dummies - A Tutorial and Overview

What is DeepSpeech and how does it work? This post shows basic examples of how to use DeepSpeech for asynchronous and real time transcription.

What is DeepSpeech? DeepSpeech is a neural network architecture first published by a research team at Baidu. In 2017, Mozilla created an open source implementation of this paper - dubbed “Mozilla DeepSpeech”.

The original DeepSpeech paper from Baidu popularized the concept of “end-to-end” speech recognition models. “End-to-end” means that the model takes in audio, and directly outputs characters or words. This is compared to traditional speech recognition models, like those built with popular open source libraries such as Kaldi or CMU Sphinx, that predict phonemes, and then convert those phonemes to words in a later, downstream process.

The goal of “end-to-end” models, like DeepSpeech, was to simplify the speech recognition pipeline into a single model. In addition, the theory introduced by the Baidu research paper was that training large deep learning models, on large amounts of data, would yield better performance than classical speech recognition models.

Today, the Mozilla DeepSpeech library offers pre-trained speech recognition models that you can build with, as well as tools to train your own DeepSpeech models. Another cool feature is the ability to contribute to DeepSpeech’s public training dataset through the Common Voice project.

In the below tutorial, we’re going to walk you through installing and transcribing audio files with the Mozilla DeepSpeech library (which we’ll just refer to as DeepSpeech going forward).

Basic DeepSpeech Example

DeepSpeech is easy to get started with. As discussed in our overview of Python Speech Recognition in 2021, you can download and get started with DeepSpeech using Python’s package installer, pip. If you have cURL installed, you can also download DeepSpeech’s pre-trained English model files from the DeepSpeech GitHub repo. Notice that the files we’re downloading below are the ‘.scorer’ and ‘.pbmm’ files.

# Install DeepSpeech
pip3 install deepspeech
 
# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
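
The code later in this tutorial will look for these files under a ‘./models/v0.9.3’ directory inside your working directory. That layout is just the convention we’re assuming here, so feel free to change it as long as you update the paths in the code to match.

# Move the downloaded model files into ./models/v0.9.3
mkdir -p models/v0.9.3
mv deepspeech-0.9.3-models.pbmm deepspeech-0.9.3-models.scorer models/v0.9.3/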

A quick heads up: as of late September 2021, DeepSpeech only supports 16 kilohertz (kHz) .wav files.

Let’s go through some example code on how to asynchronously transcribe speech with DeepSpeech. If you’re on a Linux distribution, you’ll need to install Sound eXchange (sox). Sox can be installed with either ‘apt’ on Ubuntu/Debian or ‘dnf’ on Fedora, as shown below.

sudo apt install sox

or

sudo dnf install sox
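
Sox is also an easy way to get audio into the 16 kHz, mono, 16-bit ‘.wav’ format DeepSpeech expects. As a quick example (the filenames here are just placeholders), the following converts a recording into a compatible file:

# Convert an input file to 16 kHz, mono, 16-bit WAV
sox long_recording.wav -r 16000 -c 1 -b 16 output.wav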

Now let’s also install the Python libraries we’ll need to get this to work. We’re going to need the DeepSpeech library, webrtcvad for voice activity detection, and pyqt5 for accessing multimedia (sound) capabilities on desktop systems. We already installed DeepSpeech earlier; we can install the other two libraries with pip like so:

pip install webrtcvad pyqt5

Now that we have all of our dependencies, let’s create a transcriber. When we’re finished, we will be able to transcribe any 16 kHz ‘.wav’ audio file.

Before we get started on building our transcriber, make sure the model files we downloaded earlier are saved under the ‘./models/v0.9.3’ directory inside your working directory (that is the path the code later in this tutorial points at). The first thing we’re going to do is create a voice activity detection (VAD) function and use that to extract the parts of the audio file that have voice activity.
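
The snippets in the rest of this section are meant to live together in one script, and a few of them use modules that aren’t imported right where they’re used. Here are the imports we’re assuming sit at the top of that script (some snippets below also repeat the import they rely on most directly):

import collections
import contextlib
import glob
import os
import wave
from timeit import default_timer as timer

import numpy as np
import webrtcvad
from deepspeech import Model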

How can we create a VAD function? We’re going to need a function to read in the ‘.wav’ file, a way to generate frames of audio, and a way to create a buffer to collect the parts of the audio that have voice activity. Frames of audio are objects that we construct that contain the byte data of the audio, the timestamp in the total audio, and the duration of the frame. Let’s start by creating our wav file reader function.

All we need to do is open the given file, assert that the number of channels, sample width, and sample rate are what we need, and finally read the frames and return them as PCM data along with the sample rate and duration. We’ll use ‘contextlib’ to make sure the wav file is closed when we’re done with it.

We’re expecting audio files with 1 channel, a sample width of 2, and a sample rate of either 8000, 16000, or 32000. We calculate duration as the number of frames divided by the sample rate.

import contextlib
 
def read_wave(path):
   """Reads a .wav file.
 
   Takes the path, and returns (PCM audio data, sample rate, duration).
   """
   with contextlib.closing(wave.open(path, 'rb')) as wf:
       num_channels = wf.getnchannels()
       assert num_channels == 1
       sample_width = wf.getsampwidth()
       assert sample_width == 2
       sample_rate = wf.getframerate()
       assert sample_rate in (8000, 16000, 32000)
       frames = wf.getnframes()
       pcm_data = wf.readframes(frames)
       duration = frames / sample_rate
       return pcm_data, sample_rate, duration

Now that we have a way to read in the wav file, let’s create a frame generator to generate individual frames containing the size, timestamp, and duration of a frame. We’re going to generate frames in order to ensure that our audio is processed in reasonably sized clips and to separate out segments with and without speech.

The below generator function takes the frame duration in milliseconds, the PCM audio data, and the sample rate as inputs. It uses that data to create an offset starting at 0, a frame size, and a duration. While we have not yet produced enough frames to cover the entire audio file, the function will continue to yield frames and add to our timestamp and offset.

class Frame(object):
   """Represents a "frame" of audio data."""
   def __init__(self, bytes, timestamp, duration):
       self.bytes = bytes
       self.timestamp = timestamp
       self.duration = duration
 
def frame_generator(frame_duration_ms, audio, sample_rate):
   """Generates audio frames from PCM audio data.
 
   Takes the desired frame duration in milliseconds, the PCM data, and
   the sample rate.
 
   Yields Frames of the requested duration.
   """
   n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)
   offset = 0
   timestamp = 0.0
   duration = (float(n) / sample_rate) / 2.0
   while offset + n < len(audio):
       yield Frame(audio[offset:offset + n], timestamp, duration)
       timestamp += duration
       offset += n

Now that we can generate frames of audio, we’ll create a function called vad_collector to separate out the parts of audio with and without speech. It takes the sample rate, the frame duration in milliseconds, the padding duration in milliseconds, a webrtcvad.Vad object, and a collection of audio frames. Like frame_generator, it is itself a generator: it yields chunks of PCM audio data that contain speech.

The first thing we’re going to do in this function is compute the number of padding frames and create a ring buffer using a collections.deque. Ring buffers are commonly used for buffering data streams; here the deque holds a sliding window of the most recent frames.
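
If you haven’t used a deque this way before, the key detail is the maxlen argument: once the deque is full, appending a new item silently drops the oldest one, which is exactly the sliding-window behavior the collector relies on. A minimal, standalone illustration:

import collections

ring_buffer = collections.deque(maxlen=3)
for i in range(5):
    ring_buffer.append(i)
    print(list(ring_buffer))
# Prints: [0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]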

We’ll have two states, triggered and not triggered, to indicate whether or not the VAD collector function should be adding frames to the list of voiced frames or yielding that list in bytes.

Starting with an empty list of voiced frames and a not-triggered state, we loop through each frame. If we are not in a triggered state, we add the frame, along with its speech/non-speech flag, to the ring buffer. If, after adding the new frame, more than 90% of a full buffer’s worth of frames is classified as speech, we enter the triggered state, append the buffered frames to the voiced frames list, and clear the buffer.

If the function is already in a triggered state when we process a frame, we append that frame to the voiced frames list regardless of whether it is speech, and we also append it, along with its speech flag, to the ring buffer. If the buffer then becomes more than 90% non-speech, we switch back to the not-triggered state, yield the voiced frames as bytes, and clear both the voiced frames list and the ring buffer. Finally, if any voiced frames remain once we run out of input, we yield them as bytes.

def vad_collector(sample_rate, frame_duration_ms,
                 padding_duration_ms, vad, frames):
   """Filters out non-voiced audio frames.
 
   Given a webrtcvad.Vad and a source of audio frames, yields only
   the voiced audio.
 
   Uses a padded, sliding window algorithm over the audio frames.
   When more than 90% of the frames in the window are voiced (as
   reported by the VAD), the collector triggers and begins yielding
   audio frames. Then the collector waits until 90% of the frames in
   the window are unvoiced to detrigger.
 
   The window is padded at the front and back to provide a small
   amount of silence or the beginnings/endings of speech around the
   voiced frames.
 
   Arguments:
 
   sample_rate - The audio sample rate, in Hz.
   frame_duration_ms - The frame duration in milliseconds.
   padding_duration_ms - The amount to pad the window, in milliseconds.
   vad - An instance of webrtcvad.Vad.
   frames - a source of audio frames (sequence or generator).
 
   Returns: A generator that yields PCM audio data.
   """
   num_padding_frames = int(padding_duration_ms / frame_duration_ms)
   # We use a deque for our sliding window/ring buffer.
   ring_buffer = collections.deque(maxlen=num_padding_frames)
   # We have two states: TRIGGERED and NOTTRIGGERED. We start in the
   # NOTTRIGGERED state.
   triggered = False
 
   voiced_frames = []
   for frame in frames:
       is_speech = vad.is_speech(frame.bytes, sample_rate)
 
       if not triggered:
           ring_buffer.append((frame, is_speech))
           num_voiced = len([f for f, speech in ring_buffer if speech])
           # If we're NOTTRIGGERED and more than 90% of the frames in
           # the ring buffer are voiced frames, then enter the
           # TRIGGERED state.
           if num_voiced > 0.9 * ring_buffer.maxlen:
               triggered = True
               # We want to yield all the audio we see from now until
               # we are NOTTRIGGERED, but we have to start with the
               # audio that's already in the ring buffer.
               for f, s in ring_buffer:
                   voiced_frames.append(f)
               ring_buffer.clear()
       else:
           # We're in the TRIGGERED state, so collect the audio data
           # and add it to the ring buffer.
           voiced_frames.append(frame)
           ring_buffer.append((frame, is_speech))
           num_unvoiced = len([f for f, speech in ring_buffer if not speech])
           # If more than 90% of the frames in the ring buffer are
           # unvoiced, then enter NOTTRIGGERED and yield whatever
           # audio we've collected.
           if num_unvoiced > 0.9 * ring_buffer.maxlen:
               triggered = False
               yield b''.join([f.bytes for f in voiced_frames])
               ring_buffer.clear()
               voiced_frames = []
   # If we have any leftover voiced audio when we run out of input,
   # yield it.
   if voiced_frames:
       yield b''.join([f.bytes for f in voiced_frames])

That’s all we need to do to read in our wav file and use it to generate clips of PCM audio with voice activity detection. Now let’s create a segment generator that returns not just the segments of audio byte data, but also the metadata needed to transcribe them. This function takes the ‘.wav’ file path and an aggressiveness level, filters out all the audio frames where it does not detect voice, and keeps the parts of the audio file with voice. It returns a tuple of the segments, the sample rate of the audio file, and the length of the audio file.

'''
Generate VAD segments. Filters out non-voiced audio frames.
@param wavFile: Input wav file to run VAD on.
@param aggressiveness: How aggressively to filter out non-speech (0-3).
 
@Retval:
Returns tuple of
   segments: a bytearray of multiple smaller audio frames
             (The longer audio split into multiple smaller ones)
   sample_rate: Sample rate of the input audio file
   audio_length: Duration of the input audio file
 
'''
def vad_segment_generator(wavFile, aggressiveness):
   print("Caught the wav file @: %s" % (wavFile))
   audio, sample_rate, audio_length = read_wave(wavFile)
   assert sample_rate == 16000, "Only 16000Hz input WAV files are supported for now!"
   vad = webrtcvad.Vad(int(aggressiveness))
   frames = frame_generator(30, audio, sample_rate)
   frames = list(frames)
   segments = vad_collector(sample_rate, 30, 300, vad, frames)
 
   return segments, sample_rate, audio_length

Now that we’ve handled the wav file and have created all the functions necessary to turn a wav file into segments of voiced PCM audio data that DeepSpeech can process, let’s create a way to load and resolve our models.

We’ll create two functions, load_model and resolve_models. As its name suggests, load_model loads a model, returning the DeepSpeech Model object, the model load time, and the scorer load time. It takes a model file and a scorer file, creates a DeepSpeech ‘Model’ object from the ‘model’ parameter, enables the external scorer, and measures how long each takes to load with a timer (we’re assuming timeit’s default_timer imported as timer).

The resolve_models function takes the name of the directory the models are in, grabs the first file ending in ‘.pbmm’ and the first file ending in ‘.scorer’, and returns their paths.

'''
Load the pre-trained model into the memory
@param models: Output Graph Protocol Buffer file
@param scorer: Scorer file
 
@Retval
Returns a list [DeepSpeech Object, Model Load Time, Scorer Load Time]
'''
def load_model(models, scorer):
   model_load_start = timer()
   ds = Model(models)
   model_load_end = timer() - model_load_start
   print("Loaded model in %0.3fs." % (model_load_end))
 
   scorer_load_start = timer()
   ds.enableExternalScorer(scorer)
   scorer_load_end = timer() - scorer_load_start
   print('Loaded external scorer in %0.3fs.' % (scorer_load_end))
 
   return [ds, model_load_end, scorer_load_end]
 
'''
Resolve directory path for the models and fetch each of them.
@param dirName: Path to the directory containing pre-trained models
 
@Retval:
Returns a tuple containing each of the model files (pb, scorer)
'''
def resolve_models(dirName):
   pb = glob.glob(dirName + "/*.pbmm")[0]
   print("Found Model: %s" % pb)
 
   scorer = glob.glob(dirName + "/*.scorer")[0]
   print("Found scorer: %s" % scorer)
 
   return pb, scorer

Being able to segment out the speech from our wav file, and load up our models, is all the preprocessing we need leading up to doing the actual Speech-to-Text conversion.

Let’s now create a function to transcribe our speech segments. This function has three parameters: the DeepSpeech object (returned from load_model), the audio data, and fs, the sampling rate of the audio. All it does, other than keep track of processing time, is call the DeepSpeech object’s stt function on the audio.

'''
Run Inference on input audio file
@param ds: Deepspeech object
@param audio: Input audio for running inference on
@param fs: Sample rate of the input audio file
 
@Retval:
Returns a list [Inference, Inference Time]
 
'''
def stt(ds, audio, fs):
   inference_time = 0.0
   audio_length = len(audio) * (1 / fs)
 
   # Run Deepspeech
   print('Running inference...')
   inference_start = timer()
   output = ds.stt(audio)
   inference_end = timer() - inference_start
   inference_time += inference_end
   print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length))
 
   return [output, inference_time]

Alright, all our support functions are ready to go. Let’s do the actual Speech-to-Text conversion.

In our “main” function below, we’ll directly provide the path to the models we downloaded and moved into the ‘./models/v0.9.3’ directory at the beginning of this tutorial.

We can ask the user for the level of aggressiveness for filtering out non-voice, or just automatically set it to 1 (from a scale of 0-3). We’ll also need to know where the audio file is located.

After that, all we have to do is use the functions we made earlier to load and resolve our models, load up the audio file, and run the Speech-to-Text inference on each segment of audio. The rest of the code below is just for debugging purposes to show you the filename, the duration of the file, how long it took to run inference on a segment, and the load times for the model and the scorer.

The function will save your transcript to a ‘.txt’ file next to the audio file, as well as print the transcription in the terminal.

def main():
   # need audio, aggressive, and model
   # Point to a path containing the pre-trained models & resolve ~ if used
   model = './models/v0.9.3'
   dirName = os.path.expanduser(model)
 
   audio = input("Where is your audio file located?")
   aggressive = 1 #input("What level of non-voice filtering would you like? (0-3)")
 
   # Resolve all the paths of model files
   output_graph, scorer = resolve_models(dirName)
 
   # Load output_graph, alphabet and scorer
   model_retval = load_model(output_graph, scorer)
 
   title_names = ['Filename', 'Duration(s)', 'Inference Time(s)', 'Model Load Time(s)', 'Scorer Load Time(s)']
   print("\n%-30s %-20s %-20s %-20s %s" % (title_names[0], title_names[1], title_names[2], title_names[3], title_names[4]))
 
   inference_time = 0.0
 
   waveFile = audio
   segments, sample_rate, audio_length = vad_segment_generator(waveFile, aggressive)
   transcript_file = os.path.splitext(waveFile)[0] + ".txt"
   f = open(transcript_file, 'w')
   print("Saving Transcript @: %s" % transcript_file)
   for i, segment in enumerate(segments):
       # Run deepspeech on the chunk that just completed VAD
       print("Processing chunk %002d" % (i,))
       audio = np.frombuffer(segment, dtype=np.int16)
       output = stt(model_retval[0], audio, sample_rate)
       inference_time += output[1]
       print("Transcript: %s" % output[0])
 
       f.write(output[0] + " ")
 
   # Summary of the files processed
   f.close()
 
   # Extract filename from the full file path
   filename, ext = os.path.splitext(os.path.basename(waveFile))
   print("************************************************************************************************************")
   print("%-30s %-20s %-20s %-20s %s" % (title_names[0], title_names[1], title_names[2], title_names[3], title_names[4]))
   print("%-30s %-20.3f %-20.3f %-20.3f %-0.3f" % (filename + ext, audio_length, inference_time, model_retval[1], model_retval[2]))
   print("************************************************************************************************************")
   print("%-30s %-20.3f %-20.3f %-20.3f %-0.3f" % (filename + ext, audio_length, inference_time, model_retval[1], model_retval[2]))
 
if __name__ == '__main__':
   main()
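
If you save everything above into a single script, for example transcribe_async.py (the filename here is just a placeholder), alongside the ‘./models/v0.9.3’ directory, you can run it and enter the path to a 16 kHz ‘.wav’ file when prompted:

# Run the transcriber; it will ask where your audio file is located
python3 transcribe_async.py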

That’s it! That’s all we have to do to use DeepSpeech to do Speech Recognition on an audio file. That’s a surprisingly large amount of code. A while ago, I also wrote an article on how to do this in much less code with the AssemblyAI Speech-to-Text API. You can read about how to do Speech Recognition in Python in under 25 lines of code if you don’t want to go through all of this code to use DeepSpeech.

Basic DeepSpeech Real-Time Speech Recognition Example

Now that we’ve seen how we can do asynchronous Speech Recognition with DeepSpeech, let’s also build a real time Speech Recognition example. Just like before, we’ll start with installing the right requirements. Similar to the asynchronous example above, we’ll need webrtcvad, but we’ll also need pyaudio, halo, numpy, and scipy.

Halo gives us a spinner to indicate that the program is streaming; numpy and scipy are used to resample our audio to the right sampling rate.

pip install deepspeech webrtcvad pyaudio halo numpy scipy
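
As with the asynchronous example, the snippets in this section are meant to live together in one script. Here are the imports we’re assuming sit at the top of it:

import collections
import logging
import queue
import wave

import deepspeech
import numpy as np
import pyaudio
import webrtcvad
from halo import Halo
from scipy import signal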

How will we build a real time Speech Recognition program with DeepSpeech? Just as we did in the example above, we’ll need to separate out voice activity detected segments of audio from segments with no voice activity. If the audio frame has voice activity, then we’ll feed it into the DeepSpeech model to be transcribed.

Let’s make an object for our voice-activity-detected audio frames; we’ll call it VADAudio (voice activity detection audio). To start, we’ll define the audio format, the processing sample rate, the number of channels, and the number of blocks per second for our class.

class VADAudio(object):
   """Filter & segment audio with voice activity detection."""
 
   FORMAT = pyaudio.paInt16
   # Network/VAD rate-space
   RATE_PROCESS = 16000
   CHANNELS = 1
   BLOCKS_PER_SECOND = 50

Every class needs an __init__ function. The __init__ function for our VADAudio class, defined below, takes five parameters: a callback, a device, an input rate, a file, and an aggressiveness level. The callback, device, and file default to None; the input rate defaults to RATE_PROCESS and the aggressiveness defaults to 3.

By default, the input sampling rate is the RATE_PROCESS we defined on the class above. Inside __init__ we also define a nested helper called proxy_callback, which returns a tuple of None and the pyaudio “continue” flag, but calls the callback function on the incoming data before returning; hence the name proxy_callback.

Upon initialization, if no callback was passed in, we set ‘callback’ to a small function that puts incoming data into the instance’s buffer queue. We initialize an empty queue for that buffer, set the device and input rate to the values passed in, and set the sample rate to the class’s RATE_PROCESS. Then we derive the block size (the processing sample rate divided by the number of blocks per second) and the input block size (the input rate divided by the number of blocks per second). Blocks are the discrete segments of audio data that we will work with.

Next, we create a PyAudio object and a webrtcvad.Vad object with the aggressiveness passed in (default 3, the strongest filtering of background noise). We then declare a dictionary of keyword arguments for the audio stream: format (the class’s FORMAT), channels (the class’s CHANNELS), rate (the input rate), input set to True, frames_per_buffer (the input block size calculated above), and stream_callback (the proxy_callback function we just defined).

We set the chunk size to None for now. If a device was passed in, we add an input_device_index keyword argument set to it. The device is the index of the input device as defined by pyaudio; this is only necessary if you want to use an input device other than your computer’s default. If no device was passed in but a file object was, we change the chunk size to 320 and open the file to read as bytes. Finally, we open and start a PyAudio stream with the keyword arguments we built.

def __init__(self, callback=None, device=None, input_rate=RATE_PROCESS, file=None, aggressiveness=3):
       def proxy_callback(in_data, frame_count, time_info, status):
           if self.chunk is not None:
               in_data = self.wf.readframes(self.chunk)
           callback(in_data)
           return (None, pyaudio.paContinue)
       if callback is None: callback = lambda in_data: self.buffer_queue.put(in_data)
       self.buffer_queue = queue.Queue()
       self.device = device
       self.input_rate = input_rate
       self.sample_rate = self.RATE_PROCESS
       self.block_size = int(self.RATE_PROCESS / float(self.BLOCKS_PER_SECOND))
       self.block_size_input = int(self.input_rate / float(self.BLOCKS_PER_SECOND))
       self.pa = pyaudio.PyAudio()
       self.vad = webrtcvad.Vad(aggressiveness)
 
       kwargs = {
           'format': self.FORMAT,
           'channels': self.CHANNELS,
           'rate': self.input_rate,
           'input': True,
           'frames_per_buffer': self.block_size_input,
           'stream_callback': proxy_callback,
       }
 
       self.chunk = None
       # if not default device
       if self.device:
           kwargs['input_device_index'] = self.device
       elif file is not None:
           self.chunk = 320
           self.wf = wave.open(file, 'rb')
 
       self.stream = self.pa.open(**kwargs)
       self.stream.start_stream()

Our VADAudio class will have six more methods: resample, read_resampled, read, write_wav, a frame generator, and a voice-activity-detected segment collector. Let’s start with the resample method. Not every microphone supports DeepSpeech’s native 16 kHz processing rate, so this method takes audio data and an input sample rate, and returns the data resampled to 16 kHz as bytes.

def resample(self, data, input_rate):
       """
       Microphone may not support our native processing sampling rate, so
       resample from input_rate to RATE_PROCESS here for webrtcvad and
       deepspeech
 
       Args:
           data (binary): Input audio stream
           input_rate (int): Input audio rate to resample from
       """
        data16 = np.frombuffer(data, dtype=np.int16)
        resample_size = int(len(data16) / self.input_rate * self.RATE_PROCESS)
        resample = signal.resample(data16, resample_size)
        resample16 = np.array(resample, dtype=np.int16)
        return resample16.tobytes()

Next, we’ll make the read and read_resampled functions together because they do basically the same thing. The read function “reads” the audio data, and the read_resampled function will read the resampled audio data. The read_resampled function will be used to read audio that wasn’t sampled at the right sampling rate initially.

def read_resampled(self):
       """Return a block of audio data resampled to 16000hz, blocking if necessary."""
       return self.resample(data=self.buffer_queue.get(),
                            input_rate=self.input_rate)
 
   def read(self):
       """Return a block of audio data, blocking if necessary."""
       return self.buffer_queue.get()

The write_wav function takes a filename and data. It opens a file with the filename and allows writing of bytes with a sample width of 2 and a frame rate equal to the instance’s sample rate, and writes the data as the frames before closing the wave file.

def write_wav(self, filename, data):
       logging.info("write wav %s", filename)
       wf = wave.open(filename, 'wb')
       wf.setnchannels(self.CHANNELS)
       # wf.setsampwidth(self.pa.get_sample_size(FORMAT))
       assert self.FORMAT == pyaudio.paInt16
       wf.setsampwidth(2)
       wf.setframerate(self.sample_rate)
       wf.writeframes(data)
       wf.close()

Before we create our frame generator, we’ll set a property for the frame duration in milliseconds using the block size and sample rate of the instance.

frame_duration_ms = property(lambda self: 1000 * self.block_size // self.sample_rate)

Now, let’s create our frame generator. It yields either the raw data from the microphone/file or the resampled data, using the read and read_resampled methods we just defined. If the input rate equals the processing rate, it simply yields the raw data; otherwise it yields the resampled data.

def frame_generator(self):
       """Generator that yields all audio frames from microphone."""
       if self.input_rate == self.RATE_PROCESS:
           while True:
               yield self.read()
       else:
           while True:
               yield self.read_resampled()

The final function we’ll need in our VADAudio is a way to collect our audio frames. This function takes a padding in milliseconds, a ratio that controls when the function “triggers” similar to the one in the basic async example above, and a set of frames that defaults to None.

The default value for padding_ms is 300, and the default ratio is 0.75. The padding determines the size of the sliding window of frames we keep around a potential utterance, and a ratio of 0.75 means that if 75% of the frames in that window contain speech, we enter the triggered state. If no frames are passed in, we call the frame generator function we created earlier. We define the number of padding frames as the padding in milliseconds divided by the frame duration in milliseconds that we derived earlier.

The ring buffer for this example is a deque with a maximum length equal to the number of padding frames, and we start in a not-triggered state. We loop through each frame, returning if we hit a frame shorter than 640 bytes (at 16 kHz with 16-bit samples, 640 bytes is one 20-millisecond block, so anything shorter is an incomplete block). For every full frame, we check whether the audio contains speech.

Now, we execute the same algorithm we did above for the basic example in order to collect audio frames that contain speech. While not triggered, we append speech frames to the ring buffer, triggering the state if the amount of speech frames to the total frames is above the threshold or ratio we passed in earlier.

Once triggered, we yield each frame in the buffer and clear the buffer. In a triggered state, we immediately yield the frame, and then append the frame to the ring buffer. We then check the ring buffer for the ratio of non-speech frames to speech frames and if that is over our predefined ratio, we untrigger, yield a None frame, and then clear the buffer.

def vad_collector(self, padding_ms=300, ratio=0.75, frames=None):
       """Generator that yields series of consecutive audio frames comprising each utterence, separated by yielding a single None.
           Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.
           Example: (frame, ..., frame, None, frame, ..., frame, None, ...)
                     |---utterence---|        |---utterence---|
       """
       if frames is None: frames = self.frame_generator()
       num_padding_frames = padding_ms // self.frame_duration_ms
       ring_buffer = collections.deque(maxlen=num_padding_frames)
       triggered = False
 
       for frame in frames:
           if len(frame) < 640:
               return
 
           is_speech = self.vad.is_speech(frame, self.sample_rate)
 
           if not triggered:
               ring_buffer.append((frame, is_speech))
               num_voiced = len([f for f, speech in ring_buffer if speech])
               if num_voiced > ratio * ring_buffer.maxlen:
                   triggered = True
                   for f, s in ring_buffer:
                       yield f
                   ring_buffer.clear()
 
           else:
               yield frame
               ring_buffer.append((frame, is_speech))
               num_unvoiced = len([f for f, speech in ring_buffer if not speech])
               if num_unvoiced > ratio * ring_buffer.maxlen:
                   triggered = False
                   yield None
                   ring_buffer.clear()

Alright - we’ve finished creating all the functions for the audio class we’ll use to stream to our DeepSpeech model and get real time Speech-to-Text transcription. Now it’s time to create a main function that we’ll run to actually do our streaming transcription.

First we’ll give our main function the location of our model and scorer. Then we’ll create a VADAudio object with aggressiveness, device, rate, and file passed in.

Using the vad_collector method we created earlier, we get the frames and set up our spinner indicator. We then use the DeepSpeech model we just loaded to create a stream. After initializing an empty byte array called wav_data, we go through each frame.

For each frame, if the frame is not None, we show the spinner and feed the audio content into our stream. (If we had passed in an option to save the audio as a .wav file, this is also where wav_data would be extended; in this simplified version the byte array goes unused.) If the frame is None, we stop the spinner, end the “utterance”, print the recognized text, and open a new stream.

def main():
   # Load DeepSpeech model
   model = 'models/v0.9.3/deepspeech-0.9.3-models.pbmm'
   scorer = 'models/v0.9.3/deepspeech-0.9.3-models.scorer'
 
   print('Initializing model...')
   print("model: %s", model)
   model = deepspeech.Model(model)
   if scorer:
       print("scorer: %s", scorer)
       model.enableExternalScorer(scorer)
 
   # Start audio with VAD
   vad_audio = VADAudio(aggressiveness=3,
                        device=None,
                        input_rate=DEFAULT_SAMPLE_RATE,
                        file=None)
   print("Listening (ctrl-C to exit)...")
   frames = vad_audio.vad_collector()
 
   # Stream from microphone to DeepSpeech using VAD
   spinner = Halo(spinner='line')
   stream_context = model.createStream()
   wav_data = bytearray()
   for frame in frames:
       if frame is not None:
           if spinner: spinner.start()
           stream_context.feedAudioContent(np.frombuffer(frame, np.int16))
       else:
           if spinner: spinner.stop()
           print("end utterence")
           text = stream_context.finishStream()
           print("Recognized: %s" % text)
           stream_context = model.createStream()
 
if __name__ == '__main__':
   DEFAULT_SAMPLE_RATE = 16000
   main()
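
Assuming you saved the streaming example as something like stream_deepspeech.py (again, just a placeholder name) next to your models directory, you can start transcribing from your default microphone like so:

# Start streaming transcription from the default microphone (Ctrl-C to exit)
python3 stream_deepspeech.py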

Just like the asynchronous Speech-to-Text transcription, real-time transcription with DeepSpeech takes an awful lot of code. If you don’t want to manage all this code, you can check out our guide on how to do real-time Speech Recognition in Python in much less code using the AssemblyAI Speech-to-Text API.

Conclusion

This ends part one of our DeepSpeech overview and tutorial. In this tutorial, we went over how to do basic Speech Recognition on a .wav file, and how to do Speech Recognition in real time, with DeepSpeech. Part two will be about training your own models with DeepSpeech, and how accurately it performs. It will be coming soon - so be on the lookout for that!

For more information, follow us @assemblyai and @yujian_tang on Twitter, and subscribe to our newsletter.