Kaldi Speech Recognition for Beginners - A Simple Tutorial

Want to learn how to use Kaldi for Speech Recognition? Check out this simple tutorial to start transcribing audio in minutes.

In this tutorial, we’ll use the open-source speech recognition toolkit Kaldi in conjunction with Python to automatically transcribe audio files. By the end of the tutorial, you’ll be able to get transcriptions in minutes with one simple command!

Important Note

For this tutorial, we are using Ubuntu 20.04.3 LTS (x86_64 ISA). If you are on Windows, the recommended procedure is to install a virtual machine and follow this tutorial exactly on a Debian-based distro (preferably the exact one mentioned above - you can find an ISO here).

Before we can get started with Kaldi for Speech Recognition, we'll need to perform some installations.

Installations

Prerequisites

The most notable prerequisite is time and space. The Kaldi installation can take hours, and consumes almost 40 GB of disk space, so prepare accordingly. If you need transcriptions ASAP, check out the Cloud Speech-to-Text APIs section!

Automatic Installation

If you would like to manually install Kaldi and its dependencies, you can move on to the next subsection. If you are comfortable with an automatic installation, you can follow this subsection.

You will need wget and git installed on your machine in order to follow along. wget comes installed natively on most Linux distributions, but you may need to open a terminal and install git with

(base) ryan@ubuntu:~$ sudo apt install git-all

Next, navigate into the directory in which you would like to install Kaldi, and then fetch the installation script with

(base) ryan@ubuntu:~$ wget https://raw.githubusercontent.com/AssemblyAI/kaldi-asr-tutorial/master/setup.sh

This command downloads the setup.sh file, which effectively just automates the manual installation below. Be sure to open this file in a text editor and inspect it to make sure you understand it and are comfortable running it. You can then perform the setup with

(base) ryan@ubuntu:~$ sudo bash setup.sh

Install Note

If you have multiple CPUs, you can perform a parallel build by supplying the number of processors you would like to use. For example, to use 4 CPUs, enter sudo bash setup.sh 4

Running the above command will install all of Kaldi's dependencies, and then Kaldi itself. At one point (several minutes into the installation) you will be asked to confirm that all dependencies are installed. We suggest checking and confirming, but if you are following along on a fresh Ubuntu 20.04.3 LTS install (perhaps on a virtual machine), then you can skip confirming by instead running

(base) ryan@ubuntu:~$ yes | sudo bash setup.sh

In this case, you do not need to interact with the terminal at all during installation. The installation will likely take several hours, so you can leave it running and come back later. Once it is complete, enter the project directory with

(base) ryan@ubuntu:~$ cd ./kaldi/egs/kaldi-asr-tutorial/s5

and then move on to transcribing an audio file.

Manual Installation

Before manually installing Kaldi, we’ll need to install some additional packages. First, open a terminal, and run the following commands:

(base) ryan@ubuntu:~$ sudo apt update && sudo apt upgrade

(base) ryan@ubuntu:~$ yes | sudo apt install unzip git-all

(base) ryan@ubuntu:~$ pkgs="wget
g++
make
automake
autoconf
sox
gfortran
libtool
subversion
python2.7
python3.8
zlib1g-dev"

(base) ryan@ubuntu:~$ yes | sudo apt-get install $pkgs

Additional Information

  • You can copy these commands and paste them into the terminal by right-clicking in the terminal and selecting “Paste”.
  • We’ll also need Intel MKL; if you don’t already have it, we will install it later via Kaldi.

Installing Kaldi

Now we can get started installing Kaldi for Speech Recognition. First, we need to clone the Kaldi repository. In the terminal, navigate to the directory in which you’d like to clone the repository. In this case, we are cloning to the Home directory.

Run the following command:

(base) ryan@ubuntu:~$ git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream

Installing Tools

To begin our Kaldi installation, we’ll first need to perform the tools installation. Navigate into the tools directory with the following command:

(base) ryan@ubuntu:~$ cd ./kaldi/tools

and then install Intel MKL if you don’t already have it. This will take time - MKL is a large library.

(base) ryan@ubuntu:~/kaldi/tools$ yes | extras/install_mkl.sh

Now we check that all dependencies are installed:

(base) ryan@ubuntu:~/kaldi/tools$ extras/check_dependencies.sh

Given our preparatory installations, you should get a message telling you that all dependencies are indeed installed. If not, the output will list the dependencies that are missing. Install any remaining packages you need, and then rerun extras/check_dependencies.sh. New required installations may appear as a result of the dependencies you just installed, so continue alternating between these two steps (checking for missing dependencies and installing them) until you receive a message saying that all dependencies are installed ("all OK.").

Finally, run make. See the install note below if you have a multi-CPU build.

(base) ryan@ubuntu:~/kaldi/tools$ make CXX=g++

Install Note

If you have multiple CPUs, you can do a parallel build by supplying the "-j" option to make in order to expedite the install. For example, to use 4 CPUs, enter make -j 4

Installing Src

Next, we need to perform the src install. First, cd into the src directory:

(base) ryan@ubuntu:~/kaldi/tools$ cd ../src

Then run the following commands. See the install note below if you have a multi-CPU build. This build may take several hours on uniprocessor systems.

(base) ryan@ubuntu:~/kaldi/src$ ./configure --shared

(base) ryan@ubuntu:~/kaldi/src$ make depend CXX=g++

(base) ryan@ubuntu:~/kaldi/src$ make CXX=g++

Install Note

Again, you can supply the -j option to both make depend and make if you have multiple CPUs in order to expedite the install. For example, to use 4 CPUs, enter make depend -j 4 and make -j 4

Cloning the Project Repository

Now it’s time to clone the project repository provided by AssemblyAI, which hosts the code required for the remainder of the tutorial. The project repository follows the structure of the other folders in kaldi/egs (the “examples” directory in Kaldi root) and includes additional files to automate the transcription generation for you.

Navigate into the egs folder, clone the project repository, and then navigate into the s5 subdirectory:

(base) ryan@ubuntu:~/kaldi/src$ cd ../egs

(base) ryan@ubuntu:~/kaldi/egs$ git clone https://github.com/AssemblyAI/kaldi-asr-tutorial.git

(base) ryan@ubuntu:~/kaldi/egs$ cd kaldi-asr-tutorial/s5

Additional Information

At this point you can delete all other folders in the egs directory. They take up about 10 GB of disk space, but consist of other examples that you may want to check out after this tutorial.

Transcribing an Audio File - Quick Usage

Now we’re ready to get started transcribing an audio file! We’ve provided everything you need to automatically transcribe a .wav file in a single line of code.

For a minimal example, all you need to do is run

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ python3 main.py

This command will transcribe the provided example audio file gettysburg.wav - a 10-second .wav file containing the first line of the Gettysburg Address. The command will take several minutes to execute, after which you will find the transcription in kaldi-asr-tutorial/s5/out.txt.
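
You can print the transcription directly in the terminal, for example:

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ cat out.txt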

Important Note

You will need an internet connection the first time you run main.py in order to download the pre-trained models.

If you would like to transcribe your own .wav file, first place it in the s5 subdirectory, and then run:

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ python3 main.py gettysburg.wav

Replace gettysburg.wav with the name of your file. If the only .wav file in the s5 subdirectory is your target audio file, you can simply run python3 main.py without specifying the filename.

This automated process will work best with a single speaker and a relatively short audio file. For more complicated usage, you’ll have to read the next section and modify the code to suit your needs, following along with the Kaldi documentation.

Resetting the Directory

Each time you run main.py it will call reset_directory.py, which removes all files/folders generated by main.py (except the downloaded tarballs of the pre-trained models) in order to start each run with a clean slate. This means that your out.txt transcription will be deleted if you call main.py on another file, so be sure to move out.txt to another directory if you would like to keep it before transcribing another file.
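
For example, to keep the Gettysburg transcript before transcribing another file (the destination path here is just an arbitrary example):

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ mv out.txt ~/gettysburg_transcript.txt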

If you interrupt the main.py execution while the pre-trained models are downloading, you will receive errors downstream. In this case, run the following command to completely reset the directory (i.e. remove the pre-trained model tarballs in addition to the files/folders removed by reset_directory.py):

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ python3 reset_directory_completely.py

Transcribing an Audio File - Understanding the Code

If you’re interested in understanding how Kaldi's Speech Recognition generated the transcription in the previous section, then read on!

We’re going to dive into main.py in order to understand the entire process of generating a transcription with Kaldi. Keep in mind that our use case is a toy example to showcase how to use pre-trained Kaldi models for ASR. Kaldi is a very powerful toolkit that accommodates much more complicated usage, but it has a sizable learning curve, so learning how to apply it properly to more advanced tasks will take some time.

Also, we’ll give brief overviews of the theory behind what’s going on in different sections, but ASR is a complicated topic, so by nature our conversation will be surface level!

Let’s get started.

Imports

We kick things off with some imports. First, we import reset_directory, which runs reset_directory.py and clears the directory of the files/folders generated by previous runs of main.py so we can start with a clean slate. Then we import subprocess so we can issue bash commands, as well as some other packages which we’ll use for OS navigation and file manipulation.

import reset_directory
import subprocess as s
import os
import sys
import glob

Argument Validation

Next, we perform some argument validation. We ensure that there is a maximum of one additional argument passed to main.py; and, if there is one, we ensure that it is a .wav file. If there is no argument given, then we simply choose the first .wav file found by glob.glob, if such a file exists.

We save the filename (with and without extension) in variables for later use.

if len(sys.argv) == 1:
    try:
        FILE_NAME_WAV = glob.glob("*.wav")[0]
    except:
        raise ValueError("No .wav file in the root directory")
elif len(sys.argv) == 2:
    FILE_NAME_WAV = list(sys.argv)[1]
    if FILE_NAME_WAV[-4:] != ".wav":
        raise ValueError("Provided filename does not end in '.wav'")
else:
    raise ValueError("Too many arguments provided. Aborting")

FILE_NAME = FILE_NAME_WAV[:-4]

Kaldi File Generation

Now it’s time to create some standard files that Kaldi requires to generate transcriptions. We save the s5 directory path into a variable so that we can easily navigate back to it, and then create and navigate into a data/test directory in which we will store our data.

ORIGINAL_DIRECTORY = os.getcwd()

# Make data/test dir
os.makedirs("./data/test")
os.chdir("./data/test")

The first file we’ll generate is called spk2utt, which maps speakers to their utterances. For our purposes, we assume that there is one speaker and one utterance, so the file is easy to generate automatically.

with open("spk2utt", "w") as f:
    f.write("global {0}".format(FILE_NAME))

Next, we create the inverse mapping in the utt2spk file. Note that this file is one-to-one, unlike the one-to-many nature of spk2utt (one speaker may have multiple utterances, but each utterance can have only one speaker). For our purposes it is once again easy to generate this file:

with open("utt2spk", "w") as f:
    f.write("{0} global".format(FILE_NAME))

The last file we create is called wav.scp. It maps audio file identifiers to their system paths. We again generate this file automatically.

wav_path = os.getcwd() + "/" + FILE_NAME_WAV
with open("wav.scp", "w") as f:
    f.write("{0} {1}".format(FILE_NAME, wav_path))
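
To make this concrete: for the provided example file, FILE_NAME is gettysburg, so the three generated files end up containing something like the following (the exact path in wav.scp depends on where you cloned Kaldi):

spk2utt:   global gettysburg
utt2spk:   gettysburg global
wav.scp:   gettysburg /home/ryan/kaldi/egs/kaldi-asr-tutorial/s5/data/test/gettysburg.wav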

Finally, we return to the root (s5) directory:

os.chdir(ORIGINAL_DIRECTORY)

Additional Information

Note that these are not the only possible input files that Kaldi can use, just the bare minimum. For more advanced usage, such as gender mapping, check out the Kaldi documentation.

MFCC Configuration File Modification

To perform ASR with Kaldi on our audio file, we must first determine some method of representing this data in a format that a Kaldi model can handle. For this, we use Mel-frequency cepstral coefficients (MFCCs). MFCCs are a set of coefficients that define the mel-frequency cepstrum of the audio, which is obtained by taking the Fourier transform of the signal, mapping its power spectrum onto the nonlinear mel scale, taking the logarithm, and then applying a cosine transform. If that sounds confusing, don’t worry - it’s not necessary to understand for the purposes of generating transcriptions! The important thing to know is that MFCCs are a low-dimensional representation of an audio signal, inspired by human auditory processing.

There is a configuration file that we use when we are generating MFCCs, located in ./conf/mfcc_hires.conf. The only thing we need to know from a practical standpoint is that we must modify this file to list the proper sample rate for our input .wav file. We do this automatically as follows:

First, we call a subprocess which opens a bash shell and uses sox to get the audio information of the .wav file. Then, we perform string manipulation to isolate the sample rate of the .wav file.

bash_out = s.run("soxi {0}".format(FILE_NAME_WAV), stdout=s.PIPE, text=True, shell=True)
cleaned_list = bash_out.stdout.replace(" ","").split('\n')
sample_rate = [x for x in cleaned_list if x.startswith('SampleRate:')]
sample_rate = sample_rate[0].split(":")[1]
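
For example, if gettysburg.wav were a 16 kHz mono file, the relevant portion of soxi's output would look something like this, and sample_rate would end up holding the string "16000":

Input File     : 'gettysburg.wav'
Channels       : 1
Sample Rate    : 16000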

Next, we open and read the MFCC configuration file so that we can modify it

with open("./conf/mfcc_hires.conf", "r") as mfcc:
    lines = mfcc.readlines()

And identify the line that sets the sample frequency and isolate it.

line_idx = [lines.index(l) for l in lines if l.startswith('--sample-frequency=')]
line = lines[line_idx[0]]

Next, we reformat this line to list the sample rate of our .wav file as identified by the soxi command.

line = line.split("=")
line[1] = sample_rate + line[1][line[1].index(" #"):]
line = "=".join(line)

Finally, we replace the relevant line in the lines list, collapse this list back into a string, and then write this string to the MFCC configuration file.

lines[line_idx[0]] = line
final_str = "".join(lines)
with open("./conf/mfcc_hires.conf", "w") as mfcc:
    mfcc.write(final_str)
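
The net effect is that the sample-frequency line in conf/mfcc_hires.conf now lists your file's rate while preserving its trailing comment, so it ends up reading something like:

--sample-frequency=16000 # <original comment preserved>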

Feature Extraction

Now we can get started processing our audio file. First, we open a file for logging our bash outputs, which we will use for every bash command going forward. Then we copy our .wav file into the ./data/test directory and copy the whole ./data/test directory into a new directory (./data/test_hires) for processing.

with open("main_log.txt", "w") as f:
    bash_out = s.run("cp {0} data/test/{0}".format(FILE_NAME_WAV), stdout=f, text=True, shell=True)
    
    bash_out = s.run("utils/copy_data_dir.sh data/test data/test_hires", stdout=f, text=True, shell=True)

Next, we generate MFCC features using our data and the configuration file we previously modified.

    bash_out = s.run("steps/make_mfcc.sh --nj 1 --mfcc-config "
                     "conf/mfcc_hires.conf data/test_hires", stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • steps/make_mfcc.sh: specifies the location of the shell script which generates MFCCs
  • --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
  • --mfcc-config conf/mfcc_hires.conf: specifies the location of the configuration file we previously modified
  • data/test_hires: specifies the data folder containing the relevant data we will operate on

This command generates the conf, data, and log directories as well as the feats.scp, frame_shift, utt2dur, and utt2num_frames files (all within the data/test_hires directory)

After this, we compute the cepstral mean and variance normalization (CMVN) statistics on the data, which minimizes the distortion caused by noise contamination. That is, CMVN helps make our ASR system more robust against noise.

    bash_out = s.run("steps/compute_cmvn_stats.sh data/test_hires", stdout=f, text=True, shell=True)
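
If you're curious what CMVN amounts to, here is a minimal numpy sketch of the idea (purely illustrative - this is not Kaldi's implementation and is not part of main.py):

import numpy as np

def apply_cmvn(feats):
    """Standardize each MFCC dimension over the utterance.

    feats: array of shape (num_frames, num_coefficients)
    """
    mean = feats.mean(axis=0)  # per-coefficient mean
    std = feats.std(axis=0)    # per-coefficient standard deviation
    return (feats - mean) / np.maximum(std, 1e-10)  # guard against zero variance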

Finally, we use the fix_data_dir.sh shell script to ensure that the files within the data directory are properly sorted and filtered, and also to create a data backup in data/test_hires/.backup.

    bash_out = s.run("utils/fix_data_dir.sh data/test_hires", stdout=f, text=True, shell=True)

Pre-trained Model Download and Extraction

Now that we have performed MFCC feature extraction and CMVN normalization, we need a model to pass the data through. In this case we will be using the Librispeech ASR Model, found in Kaldi’s pre-trained model library, which was trained on the LibriSpeech dataset. This model is composed of four submodels:

  1. An i-vector extractor
  2. A TDNN-F based chain model
  3. A small trigram language model
  4. An LSTM-based model for rescoring

To download these models, we first check to see if these tarballs are already in our directory. If they are not, we download them using wget

    for component in ["chain", "extractor", "lm"]:
        tarball = "0013_librispeech_v1_{0}.tar.gz".format(component)
        if tarball not in os.listdir():
            bash_out = s.run('wget http://kaldi-asr.org/models/13/{0}'.format(tarball), stdout=f, text=True, shell=True)

and extract them using tar.

    bash_out = s.run('for f in *.tar.gz; do tar -xvzf "$f"; done', stdout=f, text=True, shell=True)

This creates the exp/nnet3_cleaned, exp/chain_cleaned, data/lang_test_tgsmall, and exp/rnnlm_lstm_1a directories.

  • nnet3_cleaned is the i-vector extractor directory
  • chain_cleaned is the chain model directory
  • tgsmall is the small trigram language model directory
  • and rnnlm_lstm_1a is the LSTM-based rescoring model directory
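
If you want to verify the downloads by hand, all three tarballs should be present in the s5 directory:

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ ls 0013_librispeech_v1_*.tar.gz
0013_librispeech_v1_chain.tar.gz  0013_librispeech_v1_extractor.tar.gz  0013_librispeech_v1_lm.tar.gz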

Warning

If the wget process is interrupted during download, you will run into errors downstream. In this case, run the command below in the terminal to delete any model tarballs that are present and completely reset the directory. We call reset_directory.py rather than reset_directory_completely.py by default so we don't have to download the models (~430 MB compressed) each time we run main.py.

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ python3 reset_directory_completely.py

Decoding Generation

Extracting i-vectors

Next up, we’ll extract i-vectors, which are used to identify different speakers. Even though we have only one speaker in this case, we extract i-vectors anyway for the general use case, and because they are expected downstream.

We create a directory to store the i-vectors and then run a bash command to extract them:

    os.makedirs("./exp/nnet3_cleaned/ivectors_test_hires")
    bash_out = s.run("steps/online/nnet2/extract_ivectors_online.sh --nj 1 "
                     "data/test_hires exp/nnet3_cleaned/extractor exp/nnet3_cleaned/ivectors_test_hires",
                     stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • steps/online/nnet2/extract_ivectors_online.sh: specifies the location of the shell script which extracts the i-vectors
  • --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
  • data/test_hires: specifies the location of the data directory
  • exp/nnet3_cleaned/extractor: specifies the location of the extractor directory
  • exp/nnet3_cleaned/ivectors_test_hires: specifies the location to store the i-vectors

Constructing the Decoding Graph

In order to get our transcription, we need to pass our data through the decoding graph. In our case, we will construct a fully-expanded decoding graph (HCLG) that represents the language model, lexicon (pronunciation dictionary), context-dependency, and HMM structure in the model.

Additional Information

The output of the decoding-graph construction is a Finite State Transducer that has word-ids on the output and transition-ids (indices that resolve to pdf-ids) on the input.

HCLG stands for a composition of functions, where

  • H contains HMM definitions, whose inputs are transition-ids and outputs are context-dependent phones
  • C is the context-dependency, that takes in context-dependent phones and outputs phones
  • L is the lexicon, which takes in phones and outputs words
  • and G is an acceptor that encodes the grammar or language model, which both takes in and outputs words

The end result is our decoding, in this case a transcription of our single utterance.
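
Written as a composition, this is roughly HCLG = H ∘ C ∘ L ∘ G (with determinization and minimization applied along the way), so a single path through the graph maps a sequence of transition-ids all the way to a sequence of words.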

Before we can pass our data through the decoding graph, we need to construct it. We create a directory to store the graph, and then construct it with the following command.

    os.makedirs("./exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall")
    bash_out = s.run("utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov "
                     "data/lang_test_tgsmall exp/chain_cleaned/tdnn_1d_sp exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall",
                     stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • utils/mkgraph.sh: specifies the location of the shell script which constructs the decoding graph
  • --self-loop-scale 1.0: scales self-loops by the specified value relative to the language model [1]
  • --remove-oov: removes out-of-vocabulary (OOV) words
  • data/lang_test_tgsmall: specifies the location of the language directory
  • exp/chain_cleaned/tdnn_1d_sp: specifies the location of the model directory
  • exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall: specifies the location to store the constructed graph

Decoding using the Generated Graph

Now that we have constructed our decoding graph, we can finally use it to generate our transcription!

First we create a directory to store the decoding information, and then decode using the following command.

    os.makedirs("./exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall")
    bash_out = s.run("steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 --nj 1 "
                     "--online-ivector-dir exp/nnet3_cleaned/ivectors_test_hires "
                     "exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall "
                     "data/test_hires exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall",
                     stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • steps/nnet3/decode.sh: specifies the location of the shell script which runs the decoding
  • --acwt 1.0: sets the acoustic scale. The default is 0.1, but this is not suitable for chain models [2]
  • --post-decode-acwt 10.0: scales the acoustics by 10 so that the regular scoring script works (necessary for chain models)
  • --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
  • --online-ivector-dir exp/nnet3_cleaned/ivectors_test_hires: specifies the i-vector directory
  • exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall: specifies the location of the graph directory
  • data/test_hires: specifies the location of the data directory
  • exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall: specifies the location to store the decoding information

Transcription Retrieval

It’s time to retrieve our transcription! The transcription lattice is stored as a GNU zip file in the decode_test_tgsmall directory, among other files (including word-error rates if you have input a Kaldi text file).

We store the paths of our zip file and of the graph's words.txt file, and then pass these into a command variable which stores our bash command. This command unzips our zip file, and then writes the optimal path through the lattice (the transcription) to a file called out.txt in our s5 directory.

    gz_location = "exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall/lat.1.gz"
    words_txt_loc = "{0}/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/words.txt".format(ORIGINAL_DIRECTORY)
    command = "../../../src/latbin/lattice-best-path " \
              "ark:'gunzip -c {0} |' " \
              "'ark,t:| utils/int2sym.pl -f 2- " \
              "{1} > out.txt'".format(gz_location, words_txt_loc)
    bash_out = s.run(command, stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • ../../../src/latbin/lattice-best-path: specifies the location of the binary which finds the best path through the lattice to generate the decoding
  • ark:'gunzip -c {0} |': pipes the command to unzip the lattice file to shell via popen() [3]
  • 'ark,t:| utils/int2sym.pl -f 2- {1} > out.txt': writes the decoding to out.txt [4]
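
For reference, out.txt contains the utterance ID followed by the decoded words on a single line, so for the example file it looks something like this (truncated):

gettysburg FOUR SCORE AN SEVEN YEARS AGO OUR FATHERS ...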

Let’s take a look at how our generated transcription compares to the true transcription!

Real:

FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH ON THIS CONTINENT A NEW NATION CONCEIVED IN LIBERTY AND DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

Transcription:

FOUR SCORE AN SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH UND IS CONTINENT A NEW NATION CONCEIVED A LIBERTY A DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL

Out of 30 words we had 5 errors, yielding a word error rate of about 17%.
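
As a quick sanity check on that number: word error rate is (substitutions + deletions + insertions) divided by the number of words in the reference transcript, and here all five errors are substitutions (AND→AN, ON→UND, THIS→IS, IN→A, AND→A), so WER = (5 + 0 + 0) / 30 ≈ 16.7%.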

Rescoring with LSTM-based Model

We can rescore with the LSTM-based model using the below command:

    command = "../../../scripts/rnnlm/lmrescore_pruned.sh --weight 0.45 --max-ngram-order 4 " \
              "data/lang_test_tgsmall exp/rnnlm_lstm_1a data/test_hires " \
              "exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore"
    bash_out = s.run(command, stdout=f, text=True, shell=True)

Additional Information

More information about the arguments of the bash command can be found here:

  • ../../../scripts/rnnlm/lmrescore_pruned.sh: specifies the location of the shell script which runs the rescoring [5]
  • --weight 0.45: specifies the interpolation weight for the RNNLM
  • --max-ngram-order 4: approximates the lattice rescoring by merging histories in the lattice if they share the same n-gram history, which prevents the lattice from exploding exponentially
  • data/lang_test_tgsmall: specifies the old language model directory
  • exp/rnnlm_lstm_1a: specifies the RNN language model directory
  • data/test_hires: specifies the data directory
  • exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall: specifies the input decoding directory
  • exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore: specifies the output decoding directory

We again output the transcription to a .txt file, in this case called out_rescore.txt:

    command = "../../../src/latbin/lattice-best-path " \
              "ark:'gunzip -c exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore/lat.1.gz |' " \
              "'ark,t:| utils/int2sym.pl -f 2- " \
              "{0}/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/words.txt > out_rescore.txt'".format(ORIGINAL_DIRECTORY)
    bash_out = s.run(command, stdout=f, text=True, shell=True)
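
You can compare the two transcripts with a quick diff - if rescoring changed nothing, the command prints no output:

(base) ryan@ubuntu:~/kaldi/egs/kaldi-asr-tutorial/s5$ diff out.txt out_rescore.txt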

In our case, rescoring did not change our generated transcription, but it may improve yours!

Advanced Kaldi Speech Recognition

Hopefully this tutorial gave you an understanding of the Kaldi basics and a jumping-off point for more complicated NLP tasks! We just used a single utterance and a single .wav file, but we might also consider cases where we want to do speaker identification, audio alignment, or more.

You can also go beyond using pre-trained models with Kaldi. For example, if you have data to train your own model, you could make your own end-to-end system, or integrate a custom acoustic model into a system that uses a pre-trained language model. Whatever your goals, you can use the building blocks identified in this article to help you get started!

There are a ton of different ways to process audio to extract useful information, and each way offers its own subfield rich with task-specific knowledge and a history of creative approaches. If you want to dive deeper into Kaldi to build your own complicated NLP systems, you can check out the Kaldi documentation here.

Cloud Speech-to-Text APIs

Kaldi is a very powerful and well-maintained framework for NLP applications, but it’s not designed for the casual user. It can take a long time to understand how Kaldi operates under the hood, an understanding that is necessary to put it to proper use.

Consequently, Kaldi is not designed for plug-and-play speech processing applications. This can pose difficulties for those who don’t have the time or know-how to customize and train NLP models, but who want to implement speech recognition in larger applications.

If you want to get high quality transcripts in just a few lines of code, AssemblyAI offers a fast, accurate, and easy-to-use Speech-to-Text API. You can sign up for a free API token here and gain access to state-of-the-art models that provide:

  • Core Transcription
    • Asynchronous Speech-to-Text
    • Real-Time Speech-to-Text
  • Audio Intelligence

Grab a token and check out the AssemblyAI docs to get started.
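
As a rough sketch of what that looks like with Python's requests library (the endpoint paths and response fields below reflect the v2 REST API; check the AssemblyAI docs for the current interface):

# Illustrative sketch only - see the AssemblyAI docs for the current API
import time
import requests

headers = {"authorization": "YOUR_API_TOKEN"}

# Upload the local audio file
with open("gettysburg.wav", "rb") as f:
    upload = requests.post("https://api.assemblyai.com/v2/upload", headers=headers, data=f)

# Request a transcript for the uploaded audio
job = requests.post("https://api.assemblyai.com/v2/transcript",
                    headers=headers, json={"audio_url": upload.json()["upload_url"]})
transcript_id = job.json()["id"]

# Poll until the transcript is ready, then print it
while True:
    result = requests.get("https://api.assemblyai.com/v2/transcript/" + transcript_id,
                          headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text"))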

Footnotes

1) Link to "Scaling of transition and acoustic probabilities" in the Kaldi documentation

2) Link to "Decoding with 'chain' models" in the Kaldi documentation

3) Link to "Extended filenames: rxfilenames and wxfilenames" in the Kaldi documentation

4) Link to "Table I/O" in the Kaldi documentation

5) Link to the lmrescore_pruned.sh script in the Kaldi ASR GitHub repo

6) For other beginner resources on getting started with Kaldi, check out this, this, or this resource. Elements from these sources have been adapted for use within this article.