How to evaluate Speech Recognition models

Speech Recognition models are key in extracting useful information from audio data. Learn how to properly evaluate speech recognition models in this easy-to-follow guide.

Speech Recognition models produce a transcript of speech in audio data. This transcript can be used in its own right, or further processed for other purposes, like using Large Language Models to analyze the contents of the speech. For example, below we ask LeMUR to summarize a 73-minute State of the Union address in a bulleted list with two bullet levels:

Transcripts can be used in conjunction with Large Language Models to flexibly extract or generate information from spoken language

Summary

  • The economy and job growth
    • 12 million new jobs created in the last two years
    • Infrastructure law funding 20,000 projects to rebuild infrastructure
    • Chips and Science Act creating manufacturing jobs
  • Bipartisan legislation
    • Over 300 bipartisan bills signed
    • Infrastructure law and bills helping veterans
  • Lowering costs for Americans
    • Capping insulin costs at $35/month for seniors
    • Limiting out-of-pocket drug costs to $2,000 for seniors
    • Cutting shipping costs by 90%
    • Banning junk fees like resort fees
  • Tackling climate change
    • Investing in clean energy and protecting against natural disasters
  • Making the wealthy and corporations pay their fair share
    • No billionaire should pay lower taxes than teachers/firefighters

Regardless of the intended use case, the performance of the Speech Recognition model used to produce these transcripts is key. If we want transcripts as the end product, a better model will produce higher-quality transcripts. If we want to go further and analyze the transcript with e.g. an LLM, then an accurate transcript is necessary to avoid analyzing incorrect or incomplete information.

Properly evaluating Speech Recognition models therefore becomes paramount for pipelines that process audio data. Below, we’ll explore how to properly evaluate Speech Recognition models, which we’ll see can prove to be a tricky task.

Ultimately, proper evaluation comes down to adopting a scientific approach in order to ensure that the results are accurate, repeatable, and reliable. Let’s explore how to design such an evaluation procedure now.

Evaluation metrics for Speech Recognition models

First and foremost, when we develop any AI model, we need a metric by which to evaluate it. For example, accuracy is a common metric for classifiers which measures the fraction of classifications that the model gets correct on a testing dataset.

This classifier sorts integers into even and odd numbers. In this case, we evaluate performance with the accuracy metric.

While it would be nice to have one metric that encompasses all aspects of the model, no such metric exists. Even a metric like accuracy, which intuitively sounds like an all-encompassing metric, has issues that preclude it from being a holistic measure of performance.

A problem with accuracy

One problem with accuracy is that it does not align with our human preferences when working with imbalanced datasets. For example, let’s consider a population of people, some of whom have a rare disease (red) and most of whom do not (green):

If we work in the healthcare sector, we may be interested in creating a classifier to identify people with the disease so that we may treat it. We want this classifier to sort people into those who do not have the disease, and those who do have the disease:

If someone develops a classifier that has 96% accuracy, we may at first think it is a near-perfect model. But, we can actually make such a classifier now without much thought. In fact, we don’t even need to use any Machine Learning methods - we design our classifier to simply always predict “no disease”:

While this model has a high accuracy, it does not do what we want it to do. In particular, we want to have a very high probability of classifying people who do have the disease as such. We are much more willing to forgive false positives than false negatives because these two errors have very different costs. A false positive (i.e. classifying someone who doesn’t have the disease as someone who does) has the cost of additional time and resources to further investigate the issue; however, a false negative (i.e. classifying someone who does have the disease as someone who doesn’t) has the cost of a medical issue going unaddressed and, in the worst case, a life. We therefore would prefer a classifier like the one below:

We prefer such a classifier, despite the fact that its accuracy is lower than that of the previous one (only 84% in this case). If we instead evaluate these classifiers according to recall, the model that our human preferences say is better is also the one the metric says is better - in contrast to accuracy.
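To make this concrete, here is a minimal sketch in Python comparing the two metrics on a toy, imbalanced population. The class counts are hypothetical and chosen only to reproduce the 96% and 84% accuracies mentioned above:

```python
# Toy illustration of why accuracy misleads on imbalanced data.
# Labels: 1 = has the disease, 0 = does not (counts are hypothetical).
y_true = [1] * 4 + [0] * 96              # 4 sick people, 96 healthy

y_naive = [0] * 100                      # always predicts "no disease"
y_better = [1] * 20 + [0] * 80           # flags all 4 sick people, plus 16 false alarms

def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def recall(truth, pred):
    true_pos = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    return true_pos / sum(t == 1 for t in truth)

print(accuracy(y_true, y_naive), recall(y_true, y_naive))    # 0.96 0.0
print(accuracy(y_true, y_better), recall(y_true, y_better))  # 0.84 1.0
```

The "always predict no disease" classifier wins on accuracy but has zero recall, while the classifier we actually prefer catches every positive case.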

Ultimately, which metrics are used to evaluate a model is a matter of choice, and this choice implicitly reflects what we value in terms of the model's performance. For example, a metric like accuracy implicitly assumes that we value all errors equally. Let's explore which evaluation metrics allow us to measure the performance of Automatic Speech Recognition (ASR) models in a meaningful way now, starting with the ubiquitous word error rate.

Word error rate

The most common evaluation metric for ASR models is word error rate (WER). WER measures the fraction (or percentage) of errors a Speech Recognition model makes at the word level relative to a ground-truth transcript created by a human transcriber. A model with a lower WER is therefore preferred over a model with a higher WER, all things equal.

In particular, WER counts insertions, in which the model transcribes a word that was not spoken; deletions, in which the model fails to transcribe a word that was spoken; and substitutions, in which the model transcribes a word incorrectly.

A ground truth transcript along with one example of each of the 3 types of error that WER counts

The sum of these counts is then divided by the total number of words in the ground-truth transcript to get the word error rate. When represented as a percentage, WER corresponds to the number of errors an ASR model can be expected to make per 100 words spoken.

Calculation of WER for an example model output given the ground truth transcription
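For readers who want to compute WER themselves, below is a minimal sketch of the calculation as word-level edit distance. In practice an open-source library is commonly used for this, but the underlying logic is the same:

```python
# A minimal word error rate (WER) sketch using word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,     # match / substitution
                          d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1)           # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion out of six ground-truth words -> WER of ~16.7%
print(wer("the cat sat on the mat", "the cat sat on mat"))
```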

Beyond WER

It may seem like there’s not much more to say - we have a metric which counts errors, so we can evaluate our models, choose the best one, and be off to the races. Unfortunately, the evaluation process is not this simple, as our intuition may tell us after examining the problem with accuracy above.

While WER is a good metric to get a general sense of the performance of a Speech Recognition system, it has problems when used in isolation as the sole evaluation metric. Given two models, it is not always the case that the one with the lower WER produces transcripts that a human would consider preferable. This is the case for a variety of reasons.

For instance, disfluencies, or filler words, are the words spoken by people when pausing to think during speech; words such as “um” or “uh” are disfluencies. The sentence “I want to go to the … uh … store today” contains the disfluency “uh”, which signals to a listener that the speaker is pausing to think.

To some people in some contexts, this signal may be valuable information and would therefore be important to communicate via text in the transcript. On the other hand, other people in other contexts may consider this information irrelevant or distracting, and would prefer a transcript without disfluencies. A problem arises when there is disagreement between the model and the ground truth transcriber about whether this information is valuable. If a Speech Recognition model transcribes disfluencies but the human who generated the “ground-truth” transcript did not (or vice versa), then the WER of the ASR model will be artificially inflated:

The presence of disfluencies can artificially inflate WER

In contrast to the transcript in the figure above, the transcript “I have to grow” would have a lower WER; but it does not capture the meaning of the sentence as well as the previous transcript with a higher WER.

This issue, among others (one of which we will discuss next), precludes WER from constituting a holistic measure of a model's performance. While WER is commonly used as the primary evaluation metric in academic settings out of convenience, Speech Recognition models designed for real-world use require greater care in evaluation. Let's now look at a different metric, one that measures an aspect of model performance that is particularly important for industry use cases.

Proper noun evaluation

At the beginning of the last section, we noted that WER may at first appear to be an all-encompassing metric because it counts errors. What we did not consider was the relative magnitudes of these errors. In addition to potential problems with disfluencies, another aspect in which WER is deficient is that it counts all errors as equivalent, which is often not the case. Consider the below transcripts:

The two transcripts have the same WER, but transcript B is certainly preferable to transcript A. Proper nouns like names carry greater information than common words like articles (“the”, “a”, “an”). Therefore, two models with similar WERs may be equivalent in terms of the number of errors they make, but not in terms of the magnitude of these errors.

Therefore, to supplement simple WER analyses, we need to include another metric that captures performance with respect to these important proper nouns. We arrive at another choice - we’ve decided that we want to pay attention to proper noun performance, but we have not yet decided how to pay attention to it. That is, we must choose another metric by which to measure proper noun performance specifically. What metric should we choose?

At first, we may consider looking at WER just for proper nouns, where we measure WER considering only the proper nouns in the ground-truth transcript, rather than all words. Unfortunately, this metric does not fully align with our expectations.

Consider two models, Model A and Model B. Let’s say that the models have the same overall WER, but Model B has a lower proper noun WER than Model A. From these numbers, we may be tempted to conclude that Model B is better, but consider these example transcripts:

Clearly, Model A’s transcript is preferred (especially to Jon, who will likely forgive a misspelling more than being mistaken for another person). Our human preferences do not align with the metric that we have chosen to evaluate proper noun performance. What’s the cause of this misalignment?

The reason is that WER is a coarse metric that affords no notion of similarity. That is, the inherent right/wrong discontinuity of WER does not recognize that “John” is much closer to “Jon” than "Andrew" is, and therefore "John" is significantly preferable to transcribe. Instead, WER penalizes these two misspellings as “equivalently incorrect”.

Therefore, to capture proper noun performance we need a better metric - one that provides a notion of similarity and operates at the character (or sub-word) level. The Jaro-Winkler distance satisfies these requirements nicely - its notion of closeness (operating at the character level) will allow our model to get “partial credit” for transcribing "Jon" as "John", unlike WER.

When transcribing the word “Jon”, both WER and JW will yield zero cost if the word is properly transcribed. However, WER treats all errors equally, while JW yields a lower cost for less significant errors.
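Here is a quick sketch of this "partial credit" behavior, assuming the open-source jellyfish library for the Jaro-Winkler computation (any implementation of the metric behaves the same way):

```python
# Character-level similarity scores for candidate transcriptions of "Jon".
# Assumes the jellyfish library (pip install jellyfish); in older versions
# the function is named jaro_winkler rather than jaro_winkler_similarity.
import jellyfish

reference = "Jon"
for candidate in ["Jon", "John", "Andrew"]:
    score = jellyfish.jaro_winkler_similarity(reference, candidate)
    print(f"{candidate:>6}: {score:.2f}")

# A word-level metric like WER scores "John" and "Andrew" as equally wrong,
# whereas Jaro-Winkler gives "John" most of the credit and "Andrew" very little.
```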

Finally, let's summarize how we arrived at the Jaro-Winkler (JW) metric for proper nouns. Initially, we recognized a failure of WER in that all errors are equally costly, despite the fact that not all words are equally important. In particular, proper nouns have a disproportionate impact on the meaning of language. After narrowing our focus to proper nouns, we recognized another failure of WER - that all types of errors are equally costly. That is, all misspellings are penalized equally, despite some obscuring the meaning of the sentence significantly more than others.

The first concern is which words are incorrectly transcribed (operating at the word level), and the second concern is how these words are incorrectly transcribed (operating at the character level). The former failure is WER not giving greater weight to more important words; the latter is WER not giving greater weight to more egregious misspellings.

Proper Averaging

Now we move beyond the choice of metrics themselves to discuss a detail of how we actually calculate these metrics. Consider evaluating a model on two audio files, one which recites part of the Declaration of Independence, and one which recites the beginning of the Gettysburg Address. To find the overall average WER for this model on these files, we may be tempted to simply average the per-file WERs:

Visually, we may be able to tell that this result does not quite line up. In particular, looking at all of the text together, it does not appear that there is a mistake roughly one in every ten words (as there would be with a 9.1% WER). What's the issue?

During this inspection we were notably looking at all of the text together. The "file breaks" are somewhat artificial and should be meaningless to the final metric result. Let's imagine merging the audio files into one, passing that through the Speech Recognition system, getting the same results, and then calculating WER for this single audio file.

In this case, we get a different WER - what happened? Recall that WER is a rate, so we cannot simply ignore the percentage sign when we take averages. The true WER is the ratio of total errors across the dataset to total words in the dataset, so we need to convert from rates back to counts. Therefore, in the average calculation we must weight each file's WER by the number of words in that file, and then divide by the total number of words to get the overall WER. As we can see below, we get the proper WER when we take a weighted average in this way.
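As a concrete sketch, here is the difference between the naive average and the word-count-weighted average in Python. The per-file numbers are hypothetical placeholders, not the values from the figures above:

```python
# Per-file WERs and word counts (hypothetical values for illustration)
files = [
    {"wer": 0.05, "num_words": 200},
    {"wer": 0.15, "num_words": 40},
]

# Naive average: implicitly assumes every file has the same length
naive_avg = sum(f["wer"] for f in files) / len(files)

# Correct average: total errors divided by total words
total_errors = sum(f["wer"] * f["num_words"] for f in files)
total_words = sum(f["num_words"] for f in files)
weighted_avg = total_errors / total_words

print(f"naive: {naive_avg:.1%}, weighted: {weighted_avg:.1%}")  # naive: 10.0%, weighted: 6.7%
```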

Proof

In the proof below, mistakes are denoted by m, word counts are denoted by w, and WERs are denoted by r. A subscript of i denotes the i-th file in a dataset, and a subscript of total denotes a value across the entire dataset.
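With this notation, the relationship the proof rests on can be written out as follows (a reconstruction from the definitions above, using the fact that each file's WER is its mistake count divided by its word count, i.e. m_i = r_i * w_i):

```latex
r_{\text{total}}
  = \frac{m_{\text{total}}}{w_{\text{total}}}
  = \frac{\sum_i m_i}{w_{\text{total}}}
  = \frac{\sum_i r_i \, w_i}{w_{\text{total}}}
```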

By taking a standard average, we were implicitly assuming that all files are the same length, i.e. that w_i = w_total / n.
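Under that assumption, the weighted average above does indeed reduce to the standard (unweighted) average - again, a reconstruction of the step the proof describes:

```latex
r_{\text{total}}
  = \frac{\sum_i r_i \,(w_{\text{total}} / n)}{w_{\text{total}}}
  = \frac{1}{n} \sum_i r_i
```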

So, when calculating your own metrics make sure you are actually evaluating what you think you are evaluating, for example by taking weighted averages.

With our discussion of metrics thoroughly exhausted, we can move on to another critical aspect of Speech Recognition model evaluation - datasets.

Evaluation datasets

As we saw above, the metrics we choose to evaluate an ASR model are critically important. Also of critical importance is what we calculate these metrics against, i.e. the datasets we choose to use for evaluation. Let's explore this topic now.

Relevance

First, evaluation datasets must be relevant to the intended use case of the Speech Recognition model. If a model is evaluated on a dataset of podcast data, then the results will not be completely reflective of expected performance if the same model is applied to, say, telephone audio.

Consistency

Up to this point, we have discussed several considerations when evaluating Speech Recognition models. All of this information is relevant to evaluate one model in isolation; however, there are other considerations if we are comparing models.

In particular, a proper scientific evaluation procedure requires consistency. To test the differences in performance between models, we must ensure that all other variables (i.e. those beyond model identity) remain constant. One of these variables is the dataset on which models are evaluated. For example, the below data makes it seem like Model B is superior to Model A (considering only WER):

However, what hasn't been reported is that Model A was evaluated on Dataset X, and Model B was evaluated on Dataset Y. If Dataset X is simply a more challenging dataset (e.g. a noisier one), then the apparent discrepancy in performance could be a result of the difference in datasets rather than a difference between the models. We have not isolated the change in model, and so performance discrepancies cannot be attributed to this change alone.

By not directly comparing against the same dataset, apparent model performance can be misleading

Noisy public datasets

So far we've seen that we desire relevant, consistent datasets in the evaluation procedure. Two models that we wish to compare may report benchmarks on private datasets; so, if we are to compare them directly, we must run our own evaluations using public datasets. The problem is that these public datasets are often not relevant to real-world use cases. Many "academic" datasets contain audio that is more "sterile" than what is seen in real-world applications, making the results of the evaluation not fully reflective of expected real-world performance.

We therefore reach an apparent impasse - to ensure a consistent testing procedure, we must use public datasets; but these public datasets are too clean for accurate evaluation. How do we resolve this conundrum?

The solution is to simulate real-world datasets using public datasets. We do this by adding noise to the audio files. The amount of noise that we add is a choice, and we can monitor performance as a function of noise to see how different models fare.
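Below is a minimal sketch of what adding noise at a chosen level can look like in practice, assuming the waveform is available as a NumPy float array (white Gaussian noise is used here purely for illustration; evaluations can also mix in recorded background noise):

```python
# Add white noise to an audio signal at a chosen signal-to-noise ratio (SNR),
# so a clean public dataset can approximate noisier real-world audio.
import numpy as np

def add_noise(signal: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    # Scale the noise power so that 10 * log10(signal_power / noise_power) == snr_db
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: evaluate the same files at several SNR levels
# for snr in [40, 30, 20, 10, 5]:
#     noisy = add_noise(waveform, snr_db=snr)
#     ... transcribe `noisy` and compute WER ...
```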

Below is an example plot of the average WER of two models reported as a function of the Signal-to-Noise Ratio (SNR). As we can see, in the high SNR range (i.e. low noise added) where models are commonly evaluated, the differences in performance are less apparent. However, in the low SNR range (more noise added) where models are commonly applied, the differences in performance are more apparent. In this way, we can use public datasets to probe which model is more robust to the noise seen in real-world applications.

Averaging across datasets

Above, we saw the importance of weighting our average WER calculation by file length to get the true dataset WER. A similar procedure must occur when we are averaging across multiple datasets. Below we see an example chart of a model’s per-dataset WER along with its overall average performance across datasets.

The average value may at first seem unreasonable given the height of the bar for Dataset Y; however, this unreasonableness stems from yet another implicit assumption, this time that the datasets are of the same size. If Dataset Y is much smaller than Dataset X, then the model’s performance on it does not carry as much weight in terms of evaluating the model’s overall performance. The average value is guaranteed to be higher than the top of the shortest bar and beneath the top of the tallest, but where it falls in that range is completely a function of the relative sizes of the datasets. Therefore, we must similarly weight by dataset size when computing averages across datasets.

Speech Recognition normalizers

Last but not least, there is another crucial step to ASR evaluation to discuss - the process of normalization. Recall that evaluation happens by comparing an ASR model’s transcript to a human-generated transcript. If the model transcribes “they are” but the human transcribes “they’re”, then we will incur a WER cost despite the fact that, for many purposes, humans would not consider this an error. We incorporate such preferences through our choice of a normalizer.

A normalizer is a tool that takes in a model's output and "normalizes" it to a representation that allows for a fair comparison. It can expand or standardize contractions, remove disfluencies, standardize spellings, and more. In essence, the normalizer is what "ignores" discrepancies that we don't care about (e.g. "they are" vs "they're") to bring the model transcript and human transcript to a level playing field so that we can properly evaluate the model based on what we do care about.

Our normalizer implicitly encodes information about which types of errors we seek to evaluate

For proper scientific comparison of Speech Recognition models, we must ensure that all aspects of the evaluation process are consistent between different models. Therefore we should also ensure that we are using a consistent normalizer when evaluating different models, just like how we must use a consistent dataset as explored above.

Previously we saw that metrics reported on unavailable, private datasets are not scientific given that the testing procedure is not repeatable. Therefore, for proper scientific evaluation in this case, we must run our own evaluations using public datasets to compare models. Similarly, if models report metrics using a private normalizer, we must run our own evaluations using a public normalizer to compare models. An open-source normalizer such as the Whisper normalizer will suffice for such purposes. It is of critical importance to use a consistent normalizer when comparing Speech Recognition models to ensure that the results are accurate and scientific.
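For example, both transcripts can be passed through the same open-source normalizer before scoring. The sketch below assumes the openai-whisper package, which ships an English text normalizer (exact behavior varies by version):

```python
# Apply one consistent normalizer to both transcripts before computing WER.
# Assumes the openai-whisper package (pip install openai-whisper).
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()

ground_truth = "They're going to the store, um, today."
model_output = "they are going to the store today"

# After normalization (lowercasing, punctuation removal, contraction and
# filler-word handling), the two transcripts can be compared fairly.
print(normalize(ground_truth))
print(normalize(model_output))
```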

Summary

We’ve explored the scientific approach to properly evaluating Speech Recognition models. In particular, we saw the importance of ensuring that all aspects of the evaluation pipeline are consistent between evaluations, where the only variable that is changing is the model being used. To this end, we examined the importance of using a consistent dataset and a consistent normalizer for fair evaluations.

Additionally, we observed the importance of evaluating using a dataset that reflects the type of audio that the model is expected to see in its real-world application. Given the lack of publicly-available datasets for many such applications, we saw how the addition of noise to public datasets can simulate this behavior to better probe real-world performance when comparing models.

Finally, we saw how the choice of metric is of utmost importance when evaluating ASR models. The choice of metric(s) implicitly encodes what we value as a “good” transcription. In particular, we saw that WER, while a good measure of overall performance, fails to capture the magnitude of errors and instead counts the number of errors. This is especially important in real-world applications, where the accurate transcription of proper nouns is highly important. To address this, a dedicated metric just for proper nouns was found to be a necessary evaluation tool. We found that proper noun WER is too coarse of a metric for this purpose. Instead, Jaro-Winkler is a suitable alternative which importantly affords a fine-grained notion of similarity, resulting in a metric that better aligns with our preferences as humans.

To summarize, the below figure depicts a poor evaluation and comparison procedure for two ASR models:

An improper ASR model evaluation pipeline

While this next figure depicts a proper scientific evaluation and comparison procedure for two ASR models:

A proper ASR model evaluation pipeline

If you enjoyed this article, feel free to check out some of our other articles to learn about the Emergent Abilities of Large Language Models or How ChatGPT actually works. Alternatively, feel free to subscribe to our newsletter to stay in the loop when we release new content like this.