
Getting Started with ESPnet

ESPnet is the premier end-to-end, open-source speech processing toolkit. This easy-to-follow guide will help you get started using ESPnet for Speech Recognition.


Perhaps more than any other subfield of Deep Learning, speech processing has historically required a lot of specialized knowledge. Converting acoustic speech waves into textual human language, for example, is a difficult problem which has previously required several moving parts like acoustic models, pronunciation models, language models, etc.

As a result of the Deep Learning boom of the 2010s, end-to-end neural speech recognition systems have become feasible to train and consequently risen to popular use. Toolkits like ESPnet provide a foundation for research into such models which can then be trained and put into the hands of laymen, further democratizing the previously specialist-heavy field of speech processing.

In this article, we will provide an introduction to ESPnet and demonstrate how to use pretrained models for Automatic Speech Recognition (ASR), and then compare the results against some other popular ASR frameworks/platforms. Let's dive in!

Introduction

As mentioned above, ESPnet is an end-to-end speech processing toolkit that covers a wide range of speech processing applications, including:

  1. Automatic Speech Recognition (ASR)
  2. Text-to-Speech (TTS)
  3. Speaker Diarization
  4. Speech Translation, and
  5. Speech Enhancement

ESPnet was originally built on Kaldi, another open-source speech processing toolkit. With the release of ESPnet 2, the dependency on Kaldi has been removed entirely, although ESPnet 2 retains Kaldi-style data preparation for consistency.

The great thing about ESPnet is that it is written in Python, which is the language of choice for many Machine Learning practitioners and enthusiasts, in contrast to Kaldi which is written in C++. Both Kaldi and ESPnet offer several pretrained models, making it easy to incorporate elements of Speech Processing into applications by a more general audience. Let's take a look at how to utilize such a model in Python now.

How to Use Pretrained ESPnet Models for Speech Recognition

In this tutorial, we'll be transcribing an audio file of the first line of the Gettysburg Address. The audio clip is attached below:

(Audio: Gettysburg Address, first line, 0:10)
Unfortunately, ESPnet can be tricky to get working, so for this tutorial we will be using Ubuntu 18.04.6 LTS, an ISO of which can be found here. You can spin up a virtual machine in VMware using this ISO and follow along with this tutorial exactly.

First, we need to install some necessary packages. Open a terminal and execute the following commands:

sudo apt update
sudo apt -y upgrade
sudo apt -y install ffmpeg sox cmake git virtualenv libfreetype6-dev gcc
sudo apt -y install python3-dev libxml2-dev libxmlsec1-dev

Now, clone the tutorial repo and navigate into it:

git clone https://github.com/AssemblyAI-Examples/intro-to-espnet.git
cd intro-to-espnet

Next, create a virtualenv and install all necessary packages:

virtualenv venv -p /usr/bin/python3.6
source venv/bin/activate
pip install --no-cache-dir -r requirements.txt

Now that we are done with setup, we can transcribe our audio file in one line by simply executing speech2text.py:

python3 speech2text.py

The ground truth transcription and generated transcription will both be printed to the console. We'll examine how speech2text.py works below, but first let's take a look at the results and compare them to other ASR options.

Results

The ground truth transcription can be seen below for reference:

Ground Truth

FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH ON THIS CONTINENT, A NEW NATION, CONCEIVED IN LIBERTY, AND DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

Next up we have the transcript generated by the best ESPnet model (and the default one in speech2text.py) - a Transformer-based model which yielded 0 errors:

Transcription (Model 1) - 0% WER

FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH ON THIS CONTINENT, A NEW NATION, CONCEIVED IN LIBERTY, AND DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

We also show the transcript generated by another pretrained ESPnet model, in this case a Conformer-based model which yielded 8 errors:

Transcription (Model 2) - 27% WER

FOUR SCORES IN 7 YEARS AGO OUR FATHERS BROTH AND THIS CONTINENT A NEW NATIONS CONSUMED TO LIBERTY ARE DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

ESPnet offers many other pretrained models which can be explored. We tried several of them in the same way, but they yielded highly inaccurate results and have therefore been omitted.

How Does ESPnet Compare to other ASR Solutions?

To see how this pretrained ESPnet model stacks up against other ASR solutions, we first consider Kaldi, for which we use its pretrained LibriSpeech ASR Model. For a complete guide on using this model and getting started with Kaldi Speech Recognition, see the linked article.

The LibriSpeech pretrained Kaldi model yields 5 errors, which corresponds to a 17% WER. The corresponding transcript can be seen below:

FOUR SCORE AN SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH UND IS CONTINENT A NEW NATION CONCEIVED A LIBERTY A DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

Beyond the open-source options of ESPnet and Kaldi, we also test against several Cloud Speech-to-Text APIs for comparison. The results are summarized in the below table:

ASR Framework / Platform       Word Error Rate

ESPnet - Model 1               0%
ESPnet - Model 2               27%
Kaldi - LibriSpeech Model      17%
AssemblyAI                     0%
Amazon Transcribe              0%
Google Cloud Speech-to-Text    0%

As we can see, from a WER perspective, some ESPnet models are very strong and competitive with other offerings. For simple transcription, ESPnet is therefore a good open-source choice. For those looking for options beyond just a low error rate, like high readability or audio intelligence insights, other options may be more fruitful.
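The WER figures above are word-level edit distance divided by the number of reference words. As a minimal sketch (not the scoring tool used to produce the table), a WER function might look like:

```python
# A minimal word error rate (WER) sketch: word-level Levenshtein distance
# divided by the number of reference words. For illustration only.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions against a four-word reference give a 50% WER
print(word_error_rate("four score and seven", "four scores in seven"))  # 0.5
```

Note that production scoring tools also normalize text (casing, punctuation, number formats) before computing the distance, which can shift the reported WER noticeably.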

Code Breakdown

Now that we have seen how to use ESPnet pretrained models by calling a simple Python script, let's explore the script itself to get an understanding of what's going on under the hood. N.B.: elements of speech2text.py were adapted from official ESPnet Jupyter notebooks.

Imports

First, as usual, we import all of the packages we'll need. When only performing inference with pretrained models, ESPnet can be used very easily through Hugging Face and/or Zenodo. The espnet_model_zoo package provides this functionality: we import ModelDownloader, which offers a simple way to fetch models stored on Hugging Face or Zenodo. The espnet2 package provides the inference class we'll need to use the fetched models.

import subprocess as s
import os
import string
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

Downloading the Pretrained Model

After all imports, it is time to download the pretrained model. The tag variable stores the location of the model which we want to download, as listed here. By default, we use the best model, although two others have been included and commented out for those curious to try other models.

After specifying the model to download, we create a Speech2Text inference object, using an instance of the ModelDownloader class to download the model specified by tag. The remaining arguments simply specify parameters of the inference - see here for more details. Note that device can be changed to cuda if using a GPU.

# BEST MODEL:
tag = "Shinji Watanabe/librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best"
# SECOND BEST MODEL:
#tag = 'Shinji Watanabe/spgispeech_asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_unnorm_bpe5000_valid.acc.ave'
# EXTREMELY POOR MODEL:
#tag = "kamo-naoyuki/wsj"

d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack(tag),
    device="cpu",  # change to "cuda" if using a GPU
    minlenratio=0.0,
    maxlenratio=0.0,
    ctc_weight=0.3,
    beam_size=10,
    batch_size=0,
    nbest=1
)

Helper Functions

Before using our inference object to generate a transcript, we create two helper functions. First, we create text_normalizer(), which returns input text uppercased and stripped of all punctuation.

def text_normalizer(text):
    text = text.upper()
    return text.translate(str.maketrans('', '', string.punctuation))

Additional Details

The text_normalizer() method makes use of a translation table. When initialized with three arguments, a translation table maps the characters in the first argument sequentially to the characters in the second argument, and deletes the characters in the final argument (by mapping them to None). Unlisted characters are mapped to themselves. The translate() method then applies the translation table to a target string.

As an example, the translation table str.maketrans('abD', 'A1f', 'EF') maps a to A, b to 1, and D to f, deletes E and F, and leaves every other character unchanged. Applying this table to the string abcDEF therefore results in the string A1cf.
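This behavior can be checked directly; the specific character mappings below are illustrative:

```python
# Build a translation table: 'a'->'A', 'b'->'1', 'D'->'f';
# 'E' and 'F' are deleted; all other characters map to themselves.
table = str.maketrans("abD", "A1f", "EF")

result = "abcDEF".translate(table)
print(result)  # A1cf
```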

Focusing back on the text_normalizer() function, we observe that it first converts the text to uppercase with text = text.upper(). Next, it creates a translation table with str.maketrans('', '', string.punctuation), which maps every punctuation character to None, deleting it. Finally, return text.translate(...) uses this translation table to strip our text of all punctuation and return it.

Second, we create a function to generate and return transcripts given an audio filepath. soundfile.read() reads in our audio data, and speech2text then generates hypotheses from it. We isolate the best prediction with nbests[0] and return the transcript along with the sample rate of the audio file.

def get_transcript(path):
    speech, rate = soundfile.read(path)
    nbests = speech2text(speech)
    text, *_ = nbests[0]
    return text, rate
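The line text, *_ = nbests[0] uses starred unpacking: each hypothesis is a tuple whose first element is the transcript text, and the remaining elements (token-level details) are discarded. A minimal sketch with stand-in values (the tuple contents here are fabricated for illustration, not real model output):

```python
# Each hypothesis tuple begins with the transcript text, followed by
# token-level details we don't need. These values are illustrative stand-ins.
nbests = [("FOUR SCORE AND SEVEN", ["FOUR", "SCORE"], [12, 34], None)]

# Keep the first tuple element, discard the rest
text, *_ = nbests[0]
print(text)  # FOUR SCORE AND SEVEN
```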

Transcribing the Audio File

With the helper functions defined, we are ready to transcribe our audio file. For every file in our audio folder, we pass the file through the get_transcript() function, first converting it to a .wav file with ffmpeg if need be given that our models can only process .wav files.

path = os.path.join(os.getcwd(), 'egs')
files = os.listdir(path+'/audio')

for file in files:
    if not file.endswith('.wav'):
        # Convert to .wav and change file extension to .wav
        os.chdir(path+'/audio')
        # Quote paths so filenames containing spaces survive the shell
        s.run(f"ffmpeg -i '{file}' '{file.split('.')[0]}.wav'", shell=True, check=True)
        os.chdir('../..')
        file = file.split('.')[0]+'.wav'
        
        # Transcribe and delete generated file
        text, est_rate = get_transcript(f'{path}/audio/{file}')
        os.remove(f'{path}/audio/{file}')
    else:
        text, est_rate = get_transcript(f'{path}/audio/{file}')

After transcription, we read in the true corresponding transcription in the text folder of each audio file, and then print out both the ground truth and hypothesis transcripts:

    # Fetch true transcript
    label_file = file.split('.')[0]+'.txt'
    with open(f'{path}/text/{label_file}', 'r') as f:
        true_text = f.readline()
    # Print true transcript and hypothesis
    print(f"\n\nReference text: {true_text}")
    print(f"ASR hypothesis: {text_normalizer(text)}\n\n")

That's all there is to it! Using pretrained models with ESPnet is very straightforward and helps bring speech processing into the hands of a more general audience with just a few lines of code.

Final Words

While we looked only at ASR in this tutorial, ESPnet has pretrained models for a variety of other tasks, including Text-to-Speech, Speaker Diarization, and Noise Reduction. Check out the ESPnet documentation for more information.

For more guides and tutorials on NLP and general Machine Learning, feel free to check out more of our blog, or follow our newsletter.

Follow the AssemblyAI Newsletter