What is Word Error Rate?
Word Error Rate is a measure of how accurately an Automatic Speech Recognition (ASR) system performs. Quite literally, it calculates how many “errors” are in the transcription text produced by an ASR system, compared to a human transcription.
Broadly speaking, it’s important to measure the accuracy of any machine learning system. Whether it’s a self-driving car, an NLU system like Amazon Alexa, or an Automatic Speech Recognition system like the ones we develop at AssemblyAI, if you don’t know how accurate your machine learning system is, you’re flying blind!
In the field of Automatic Speech Recognition, the Word Error Rate has become the de facto standard for measuring how accurate a speech recognition model is. A common question we get from customers is “What’s your WER?”. In fact, when our company was accepted into Y Combinator back in 2017, one of the first questions the YC partners asked us was “What’s your WER?”
How To Calculate Word Error Rate (WER)
The actual math behind calculating a Word Error Rate is the following:
WER = (S + D + I) / N
What we are doing here is adding up the number of Substitutions (S), Deletions (D), and Insertions (I), and dividing by the Number of Words (N) in the human transcription.
So for example, let’s say the following sentence is spoken:
If our Automatic Speech Recognition (ASR) is not very good, and predicts the following transcription:
Then our Word Error Rate (WER) would be 50%! That’s because there was 1 Substitution out of the 2 words spoken: “there” was substituted with “bear”. Let’s say our ASR system instead only predicted:
And for some reason didn’t even predict a second word. In this case our Word Error Rate (WER) would also be 50%! That’s because there is a single Deletion: only 1 word was predicted by our ASR system when 2 words were actually spoken.
The lower the Word Error Rate, the better. You can think of word accuracy as 1 minus the Word Error Rate. So if your Word Error Rate is 20%, then your word accuracy, i.e. how accurate your transcription is, is 80%.
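As a rough sketch of the calculation described above, the Python function below computes WER as the word-level edit distance (the minimum number of substitutions, deletions, and insertions) divided by the number of words in the human transcription. The function name `wer` and the two-word example sentences are illustrative assumptions (the text only tells us that “there” was replaced by “bear”), and no text normalization is applied.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative stand-in sentences (assumed, not taken from the article).
print(wer("hello there", "hello bear"))  # 0.5 (one substitution out of two words)
print(wer("hello there", "hello"))       # 0.5 (one deletion out of two words)
```

Word accuracy, as described above, would then simply be `1 - wer(reference, hypothesis)`.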
Is Word Error Rate a Good Measure of Speech Recognition Systems?
As with everything, it is not black and white. Overall, Word Error Rate can tell you how “different” the automatic transcription was compared to the human transcription, and generally, this is a reliable metric to determine how “good” an automatic transcription is.
For example, take the following:
Spoken text: "Hi my name is Bob and I like cheese. Cheese is very good."
Predicted text by model 1: "Hi my frame is knob and I bike leafs. Cheese is berry wood"
WER: 46%
Predicted text by model 2: "Hi my name is Bob and I bike cheese. Cheese is good."
WER: 15%
As we can see, model 2 has a lower WER of 15%, and is obviously way more accurate to us as humans than the predicted text from model 1. This is why, in general, WER is a good metric for determining the accuracy of an Automatic Speech Recognition system.
However, take the following example:
Spoken: "I like to bike around"
Model 1 prediction: "I liked to bike around"
WER: 20%
Model 2 prediction: "I like to bike pound"
WER: 20%
In the above example, both Model 1 text and Model 2 text have a WER of 20%. But Model 1 clearly results in a more understandable transcription compared to Model 2. That’s because Model 1’s error (“liked” instead of “like”) still leaves the transcription legible and easy to understand, whereas Model 2’s error (“pound” instead of “around”) makes the transcription text illegible (i.e., “word soup”).
To further illustrate the shortcomings of Word Error Rate, take the following example:
Spoken: "My name is Paul and I am an engineer"
Model 1 prediction: "My name is ball and I am an engineer"
WER: 11.11%
Model 2 prediction: "My name is Paul and I'm an engineer"
WER: 22.22%
In this example, Model 2 does a much better job producing an understandable transcription, but it has double (!!) the WER compared to Model 1. That’s because writing “I’m” in place of “I am” counts as two errors (one Substitution and one Deletion), while “ball” in place of “Paul” counts as only one. This difference in WER is especially pronounced in this example because the text contains so few words, but still, this illustrates some “gotchas” to be aware of when reviewing WER.
What these above examples illustrate is that the Word Error Rate calculation is not “smart”. It is literally just looking at the number of substitutions, deletions, and insertions that appear in the automatic transcription compared to the human transcription. This is why WER in the “real world” can be so problematic.
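To make the counting concrete, here is a minimal sketch that reports the substitutions, deletions, and insertions from one minimum-error alignment, applied to the “Paul” example above. The function name `error_counts` and the dictionary keys are assumptions chosen for illustration; when several alignments are equally cheap, the split between error types can differ, although the total does not.

```python
def error_counts(reference: str, hypothesis: str) -> dict:
    """Count S, D, I along one minimum-error word alignment."""
    ref, hyp = reference.split(), hypothesis.split()

    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for aligning the first i reference words with the first j hypothesis words.
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # hypothesis empty: everything is a deletion
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # reference empty: everything is an insertion

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # words match: no new error
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep whichever option has the fewest total errors.
            table[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    _, s, d, n = table[-1][-1]
    return {"substitutions": s, "deletions": d, "insertions": n,
            "reference_words": len(ref)}


spoken = "My name is Paul and I am an engineer"
print(error_counts(spoken, "My name is ball and I am an engineer"))
# {'substitutions': 1, 'deletions': 0, 'insertions': 0, 'reference_words': 9}
print(error_counts(spoken, "My name is Paul and I'm an engineer"))
# {'substitutions': 1, 'deletions': 1, 'insertions': 0, 'reference_words': 9}
```

The second result gives (1 + 1 + 0) / 9, or roughly 22.22%, even though a human reader would consider that transcription the better one.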
Take for example the simple mistake of not normalizing the casing in the human transcription and automatic transcription.
Human transcription: "I live in New York"
Automatic transcription: "i live in new york"
WER: 60%
In this example, we see that the automatic transcription has a WER of 60% (!!) even though it perfectly transcribed what was spoken. Simply because the human transcription capitalizes “I”, “New”, and “York” while the automatic transcription is all lowercase, the WER algorithm sees 3 of the 5 words as completely different words!
This is a major “gotcha” we see in the wild, and it’s why we internally always normalize our human transcriptions and automatic transcriptions when computing a WER, through things like:
- Lowercasing all text
- Removing all punctuation
- Changing all numbers to their written form ("7" -> "seven")
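The normalization steps listed above could be sketched roughly as follows. This is only an illustration, not the internal pipeline: the function name `normalize` is an assumption, and the `DIGIT_WORDS` table is a deliberately tiny, hypothetical stand-in that only handles single digits (real number-to-words conversion needs much more care).

```python
import string

# Hypothetical, minimal lookup: only standalone single digits are converted.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Translation table that deletes every ASCII punctuation character.
# Note: string.punctuation does not include curly quotes or other
# non-ASCII punctuation, so those would pass through unchanged.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and spell out single digits."""
    text = text.lower()
    text = text.translate(_STRIP_PUNCTUATION)
    return [DIGIT_WORDS.get(word, word) for word in text.split()]


human = normalize("I live in New York")
automatic = normalize("i live in new york")
print(human == automatic)            # True
print(normalize("I have 7 cats."))   # ['i', 'have', 'seven', 'cats']
```

After this step both transcriptions from the casing example are identical word for word, so their WER would be 0% rather than 60%. Note that stripping punctuation also turns “I'm” into “im”, so contractions would still need separate handling.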
Alternatives to Word Error Rate
Unfortunately, Word Error Rate is the best metric we have today to determine the accuracy of an Automatic Speech Recognition system. There have been some alternatives proposed, but none have stuck in the research or commercial communities. A common technique is to weight Substitutions, Insertions, and Deletions differently, for example by adding 0.5 for every Deletion instead of 1.0.
However, unless the weights are standardized, it’s not really a “fair” way to compute WER. System 1 could be reporting a much lower WER because it used lower “weights” for Substitutions, for example, compared to System 2.
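A small sketch shows how much the choice of weights can move the reported number. The function `weighted_wer` and its parameter names are assumptions for illustration, and it takes error counts that are assumed to have already been obtained from an alignment; it is not a standardized metric.

```python
def weighted_wer(substitutions: int, deletions: int, insertions: int,
                 reference_words: int,
                 w_sub: float = 1.0, w_del: float = 1.0,
                 w_ins: float = 1.0) -> float:
    """WER where each error type contributes its own (assumed) weight."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    weighted_errors = (w_sub * substitutions
                       + w_del * deletions
                       + w_ins * insertions)
    return weighted_errors / reference_words


# Same errors (1 substitution, 1 deletion, 9 reference words), different weights.
standard = weighted_wer(1, 1, 0, 9)                # all weights 1.0
discounted = weighted_wer(1, 1, 0, 9, w_del=0.5)   # deletions count half
print(f"{standard:.2%} vs {discounted:.2%}")       # 22.22% vs 16.67%
```

The transcription is identical in both cases; only the weights changed, which is why numbers reported with different weights are not directly comparable.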
That’s why, for the time being, Word Error Rate is here to stay, but it’s important to keep the pitfalls we demonstrated in mind when calculating WER yourself!