A Survey on End-To-End Speech Recognition Architectures in 2021


Accuracy is the most important characteristic of a speech recognition system. While AssemblyAI’s production end-to-end approach for speech recognition is able to provide better accuracy than other commercial grade speech recognition systems, improvements could still be made to achieve human performance. As part of our core research and development efforts to continue pushing the state of the art of speech recognition accuracy, in this post, we explore speech recognition architectures that are gaining new popularity in both research and industry settings.

1. Introduction

As we continue our core research and development efforts to push the boundaries of state-of-the-art accuracy, we begin exploring architectures for end-to-end speech recognition that are gaining popularity, not only in the research domain but also in production settings in industry ([1], [2]). We are especially interested in architectures that have shown production-level accuracy matching or surpassing that of conventional hybrid DNN-HMM ([3], [4]) speech recognition systems. With that in mind, we survey the following two architectures: Listen Attend and Spell (LAS, [5]), and Recurrent Neural Network Transducers (RNNT, [6]).

We survey these new architectures from a production perspective. In section 2, we report on accuracy comparisons from some of the latest research done on both LAS and RNNT ([7], [8], [2]). We also compare these techniques against Connectionist Temporal Classification (CTC, [9]) ([7], [1]), which is a more widely known end-to-end approach, and is what early versions of AssemblyAI’s speech recognition system were based on. We also look at recent research that combines RNNT and LAS into a single system ([8], [2]).

In section 3, keeping our focus on production grade speech recognition, we compare LAS and RNNT models from the perspective of feature parity. We focus on contextualization ([10], [1]), inverse text normalization ([11], [1], [8]), word timestamps, and the possibility of doing real time speech recognition ([1], [2]).

We finish this post in section 4 with high-level conclusions about our survey, as well as next steps for future blog posts.

2. Accuracy

While human parity speech recognition accuracy has been reached for some research datasets ([12]), word error rates (WER) for industrial production-grade applications in challenging acoustic environments are far from human level. One example of such a domain is spontaneous telephone conversations, where, even in research datasets, human level parity has not yet been achieved ([13]).

We set our focus in this section on LAS and RNNT systems tested on production data, and not on research datasets. We avoid looking into research datasets as they are easily overfit by models that do not generalize well to real-life audio signals. We want to keep our focus on datasets that contain challenging characteristics such as channel noise, compression, ambient noise, crosstalk, accents and speaker diversity.
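Since WER is the metric used in every comparison that follows, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are our own assumptions for illustration; the surveyed papers may normalize text differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("call john smith", "call jon smith"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.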

2.1 LAS vs RNNT

When comparing the accuracy of LAS and RNNT based architectures, the consensus seems to be that LAS has better accuracy than RNNT. [7] compares both LAS and RNNT models using 12500 hours of production voice-search traffic, artificially distorted by adding noise and by simulating room acoustics. WER is reported on both dictation and voice search utterances. Both LAS and RNNT use the same encoder neural network architecture and size. Their parameters, however, are different, and were initialized to those of a converged CTC-trained network with the same encoder architecture. The decoders also have the same architecture, number of layers and size, and both output the same set of 26 lowercase letters plus numerals, space and punctuation symbols. As a result, the only difference in terms of number of parameters lies in the RNNT joint network and the LAS attention mechanism. Neither the LAS nor the RNNT model uses any sort of external language model. Table 1 summarizes their results, showing LAS performing better in general.

Table 1: Accuracy of RNNT vs LAS

[8] also shows LAS performing better in general. They emphasize that RNNT models lag in quality, while LAS shows competitive performance when compared to hybrid DNN-HMM models ([11]). [8] also compares the WERs of LAS with RNNT. Similar to [7], the encoder and decoder networks are the same for both RNNT and LAS, differing only in the joint and attention networks. Both predict outputs from a set of 4096 word pieces. The training data used is the same for both models and consists of 30,000 hours of voice search training data, corrupted with added noise and room acoustics. The models are tested on two test sets: one made of utterances shorter than 5.5 seconds (Short Utts.) and another one made of utterances longer than 5.5 seconds (Long Utts.). Results showing LAS performing better can be seen in table 1. LAS shows degrading accuracy on long utterances, which is attributed to the attention mechanism as explained in [14].

While [2] does not explicitly compare the accuracy rates of LAS and RNNT models, they do mention, similar to [8], that under low latency constraints the accuracy of LAS outperforms conventional DNN-HMM models, while RNNT models do not.

2.2 LAS and RNNT vs CTC

Comparisons with CTC based models are not as simple as comparisons just between LAS and RNNT. The main reason is that a CTC model depends heavily on an external language model to reach acceptable accuracy. LAS and RNNT models do not need an external language model because the model itself includes a decoder component.

[7] compares LAS, RNNT and CTC models without any external LM, where the encoder architecture and size are the same across all three models. CTC’s accuracy without any external LM, which is shown in table 2, is significantly lower than that of RNNT and LAS. The rest of the experimental details are the same as those described in section 2.1.

[1] compares RNNT and CTC, but with significant differences in each architecture. The CTC model consists of 6 LSTM layers with each layer having 1200 cells and a 400 dimensional projection layer. The model outputs 42 phoneme targets through a softmax layer. Decoding is performed with a 5-gram first pass language model and a second pass LSTM LM rescoring model. The RNNT model’s encoder consists of 8 LSTM layers with 2048 cells each and a 640 dimensional projection layer. A time reduction layer of factor 2 is inserted after the second layer of the encoder. The prediction network has 2 LSTM layers each with 2048 cells and a 640 dimensional projection layer. The joint network has 640 hidden units. The output layer models 4096 word pieces. The size of the RNNT model is 120MB, while the size of the CTC model is 130MB.

[1] performs training with 27500 hours of voice search and dictation data, artificially corrupted to simulate room acoustics and noise. Table 2 shows the comparison between their CTC and RNNT models on both a voice search and a dictation test set, showing the RNNT model being significantly better.

Table 2: Accuracy of RNNT and LAS vs CTC

2.3 Combining LAS and RNNT

While LAS is perceived as having better accuracy, RNNT models are perceived to have production quality features, such as streaming capabilities, that make them more desirable.

With the objective of bridging the accuracy gap between both architectures, [8] and [2] develop a combination of RNNT and LAS models. RNNT is used during first pass decoding, and LAS is used as a second pass rescoring model.
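To make the two pass idea concrete, the sketch below rescores an n-best list from a first pass model with a second pass scorer and re-ranks the hypotheses. It is only an illustration under our own assumptions: the function name, the linear interpolation of log-probabilities, the `weight` parameter and the toy scores are hypothetical, and [8] and [2] may combine the RNNT and LAS scores differently.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    second_pass_score: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank first pass hypotheses using a second pass score.

    nbest holds (hypothesis, first pass log-probability) pairs.
    second_pass_score stands in for the LAS decoder scoring a hypothesis.
    weight is an assumed interpolation weight, not a value from the papers.
    """
    rescored = [
        (text, (1.0 - weight) * first_pass + weight * second_pass_score(text))
        for text, first_pass in nbest
    ]
    # Highest combined log-probability first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy first pass output: the wrong spelling is ranked first.
    first_pass = [("call jon", -1.0), ("call john", -1.2)]
    # Toy stand-in for LAS log-probabilities.
    las_scores = {"call jon": -3.0, "call john": -0.5}
    best_text, best_score = rescore_nbest(first_pass, las_scores.__getitem__)[0]
    print(best_text, best_score)
```

In this toy case the second pass score is enough to promote the second hypothesis to the top, which is the effect rescoring is meant to have on first pass errors.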

The experimental details of [8] with respect to separate RNNT and LAS models are described in section 2.1. The two pass experiments use the same architecture for both LAS and RNNT. The number of parameters is the same as well, but this time the encoder parameters are shared between them. Training is done in three steps. First, an RNNT model is trained to convergence. Then the RNNT encoder is frozen and a LAS decoder is trained with it. Finally, a combined loss is used to retrain both the RNNT and LAS models (with a shared encoder) together. Table 3 shows that the rescoring approach significantly improves the accuracy of the RNNT model. It also mitigates the LAS weakness with respect to longer utterances.

[2] also implements LAS rescoring on a second pass, where the first pass is an RNNT model. Besides using search data during training they also use data from multiple other domains (e.g.: far field, phone) and accented speech from countries other than the U.S. The number of hours used for training is not specified, but the model architecture details related to LAS rescoring are similar: A shared encoder is used between the RNNT model and the LAS model, and the LAS model is used to rescore hypotheses coming from the RNNT model. They add an additional LAS encoder layer in between the shared encoder and the LAS decoder. Table 3 shows that LAS rescoring significantly improves RNNT accuracy.

More importantly, table 3 shows the combination of RNNT and LAS beating a conventional HMM-RNN hybrid model. The acoustic model of the conventional model outputs context dependent phones, and uses a phonetic dictionary with close to 800,000 words. A 5-gram language model is used during first pass decoding and a MaxEnt language model is used for second pass rescoring. The total size of the conventional model is around 87 GB. The RNNT+LAS model’s size is 0.18 GB. The RNNT model has 120 million parameters and the LAS model (both the additional encoder and the decoder) has 57 million additional parameters. Parameters are quantized to 8-bit fixed point.
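The reported sizes are consistent with the parameter counts: with 8-bit quantization each parameter takes one byte, so the two pass model's size follows directly from the number of parameters. The short calculation below works this out, using only the figures quoted above; the variable names are ours, and it ignores any storage overhead beyond the raw parameters.

```python
# Figures quoted in the text for the two pass model in [2].
rnnt_params = 120_000_000      # RNNT model
las_params = 57_000_000        # additional LAS encoder layer and decoder
bytes_per_param = 1            # 8-bit fixed point quantization

total_bytes = (rnnt_params + las_params) * bytes_per_param
two_pass_size_gb = total_bytes / 1e9

conventional_size_gb = 87      # approximate size of the conventional model
size_ratio = conventional_size_gb / two_pass_size_gb

if __name__ == "__main__":
    print(f"two pass model: {two_pass_size_gb:.2f} GB")
    print(f"conventional model is roughly {size_ratio:.0f}x larger")
```

In other words, the end-to-end model is several hundred times smaller than the conventional system it outperforms.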

Table 3: Accuracy of RNNT and LAS rescoring

3. Feature Parity

While accuracy is the most important characteristic of a speech recognition system, there are many other features that contribute to usability and cost. In this section we explore LAS and RNNT from the perspective of features such as contextualization, inverse text normalization, timestamps and real time speech recognition.

3.1 Contextualization

The words produced by a speaker during a dialog depend on the context the speaker is in. E.g., if the speaker wants to call a friend, he is very likely to say the friend’s name. The name may be made of uncommon or foreign words. These words may have had very few or no samples in the training data of the ASR system, and they may not be recognized correctly.

Contextualization in ASR is about biasing the models towards the words and phrases that belong to the context without hurting the ASR performance of non contextual sentences.

[10] implements this in a CTC system by incorporating in the language model a dynamic class that represents the contextual information. During decoding, the dynamic class is replaced with a finite state transducer (FST) containing the contextual phrases and vocabulary (e.g. contact names). On-the-fly rescoring is then applied with a language model that contains the ngrams corresponding to those contextual phrases and vocabulary. The accuracy effect of using this type of contextual mechanism is shown in table 4. Although no results are shown on generic test sets, the accuracy on contextual test sets is improved significantly.

[1] implements contextualization on RNNT models. This is done through an FST that represents the entire contextual ngram model (instead of just the vocabulary and phrases), and this model is interpolated with the RNNT model through shallow fusion during beam search decoding. The accuracy improvements, shown in table 4, are also significant. Results on generic test sets are not shown, but the improvements on contextual test sets suggest that contextualization can be successfully implemented on an RNNT framework as well.

[8] also implements contextualization through FSTs and shallow fusion in a two pass RNNT + LAS rescoring system. Since shallow fusion is equivalent to an interpolation of scores, just like rescoring, it’s fair to assume that only a single bias FST during first pass RNNT decoding is needed. Their contextual test set results are used to show the performance impact of other tuning approaches rather than contextualization itself, therefore we don’t summarize those results here.

With respect to LAS, [15] implements shallow fusion selectively on top of a LAS system. [16] goes further and implements contextualization inside the LAS framework in an all neural way, calling it CLAS. The accuracy improvements of using contextualization are significant, but are measured on artificial test sets that are either generated using TTS or where the context is generated from truth transcriptions, therefore we don’t summarize those results here.
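As a rough illustration of shallow fusion, the sketch below adds a weighted contextual score to each hypothesis' ASR log-probability and picks the best result. Everything here is a simplification under our own assumptions: real systems such as [1] apply the fusion at each step of beam search using an FST, whereas this sketch scores complete hypotheses, and the function names, the `bias_weight` and `backoff_log_prob` parameters and the toy scores are hypothetical.

```python
from typing import Dict


def shallow_fusion_score(
    asr_log_prob: float, bias_log_prob: float, bias_weight: float
) -> float:
    """Log-linear interpolation of the ASR score and the contextual score."""
    return asr_log_prob + bias_weight * bias_log_prob


def pick_best(
    asr_scores: Dict[str, float],
    bias_scores: Dict[str, float],
    bias_weight: float,
    backoff_log_prob: float = -5.0,
) -> str:
    """Return the hypothesis with the highest fused score.

    Hypotheses that the contextual model does not cover receive an assumed
    backoff score, so that contextual phrases are favored over them.
    """
    return max(
        asr_scores,
        key=lambda text: shallow_fusion_score(
            asr_scores[text],
            bias_scores.get(text, backoff_log_prob),
            bias_weight,
        ),
    )


if __name__ == "__main__":
    # Toy scores: the ASR model prefers a common word over a rare name.
    asr = {"call walking": -2.0, "call joaquin": -3.0}
    # Toy contextual model built from the user's contact list.
    contacts = {"call joaquin": -1.0}
    print(pick_best(asr, contacts, bias_weight=0.0))  # no biasing
    print(pick_best(asr, contacts, bias_weight=0.5))  # with biasing
```

The weight controls the trade-off described above: a larger value recovers more contextual phrases but risks hurting sentences that have nothing to do with the context.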

3.2 Inverse Text Normalization

Inverse text normalization converts a transcription in the spoken domain (e.g.: "one twenty three first street") to the written domain (e.g.: "123 1st st"). Speech recognition systems with conventional DNN-HMM models usually do this with separate models. The output of the ASR system, which is in the spoken domain, is passed through a separate model that does the conversion to the written domain.

One could do the same with an end-to-end ASR system. But the following work suggests that a single end-to-end system could learn all acoustic, phonetic, language and normalization models. The normalization can be learned by including numerals, space and punctuation symbols in the output layer of their models. Training data would have to be in the written domain.

Table 4: Contextualization Accuracy

[11] implements a LAS system able to match state of the art accuracy from hybrid DNN-HMM systems. This network incorporates acoustic, pronunciation and language models into a single network which outputs written form text. It explicitly mentions that a text normalization component is not needed to normalize the output of the recognizer.

[1] implements their RNNT models in a way that they output text in the written domain as well. They are able to improve their numeric output performance by including numeric utterances synthetically produced with text-to-speech (TTS) in their training data.

[8] does an error analysis comparison between a two pass RNNT/LAS model and a conventional model, arguing that one reason the two pass model is able to have good accuracy is due to having learned normalization.

3.3 Word Timestamps

Depending on the application, another feature that is very useful is providing time alignments in the speech recognition result. These are usually provided as word timestamps, meaning the beginning and end time of a recognized word in the audio stream. This feature is necessary, for example, in captioning applications for podcasts or videos.

A LAS system, in its original form, lacks the ability to produce timestamps. As described in [5], the alignment between text and audio is provided by the attention mechanism. However, the attention coefficients produced span the entire audio stream. While there has been research regarding monotonic attention mechanisms ([17], [18], [19]), which could be more promising in providing word timestamps, this research seems to be in early stages.

RNNT decoding, as described in [6], does not provide word timestamps. The decoding process provides a probability total for all time alignments for each hypothesis within the beam. However, within a single alignment, the decoder either aligns a word piece (or grapheme) to an input feature vector, or decides to consume another feature vector. Extracting time alignments should be possible with small modifications to the decoding algorithm in [6].
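To illustrate why such a modification looks feasible, the sketch below walks through a single RNNT alignment and records the frame at which each word piece is emitted. It is a hypothetical illustration, not the algorithm from [6]: we assume the alignment is available as a list of symbols where a blank means "consume the next feature vector", that word pieces starting a new word are marked with a leading underscore, and that each frame covers a fixed duration.

```python
from typing import List, Tuple

BLANK = "<blank>"  # assumed marker for "advance to the next feature vector"


def word_times_from_alignment(
    alignment: List[str], frame_seconds: float
) -> List[Tuple[str, float, float]]:
    """Turn one RNNT alignment into (word, start, end) tuples in seconds."""
    words: List[list] = []
    frame = 0
    for symbol in alignment:
        if symbol == BLANK:
            frame += 1  # the decoder consumed another feature vector
            continue
        start = frame * frame_seconds
        end = start + frame_seconds
        if symbol.startswith("_") or not words:
            # A word-initial piece opens a new word.
            words.append([symbol.lstrip("_"), start, end])
        else:
            # A continuation piece extends the current word.
            words[-1][0] += symbol
            words[-1][2] = end
    return [(text, start, end) for text, start, end in words]


if __name__ == "__main__":
    path = [BLANK, "_he", BLANK, "llo", BLANK, BLANK, "_world", BLANK]
    for word, start, end in word_times_from_alignment(path, frame_seconds=0.5):
        print(word, start, end)
```

The timings are only as good as the alignment they come from, which is the limitation that [20] sets out to address.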

[20] investigates word timings in a two pass system (as in [8] and [2]) and argues that word timings emitted by a word piece RNNT system are not as accurate as those of a low frame rate conventional model (LFR, as in [21]). They tackle this issue by adding word boundary output labels. They also augment the loss with two more terms corresponding to alignment error. They do this for both the RNNT loss and LAS. In LAS, they task a single attention head to be in charge of predicting alignments, where the time alignment of a word piece corresponds to the maximum value in the attention probability vector of that attention head. With these modifications they find LAS to be the best at computing word timings and better than an LFR baseline.
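The attention-based timing rule can be sketched in a few lines: for each word piece, take the encoder frame where the designated attention head places the most probability, and convert that frame index to seconds. The function name, the list-of-lists representation of the attention probabilities and the fixed frame duration are assumptions made for this illustration; they are not taken from [20].

```python
from typing import List


def piece_times_from_attention(
    attention_rows: List[List[float]], frame_seconds: float
) -> List[float]:
    """Return one time (in seconds) per word piece.

    attention_rows[i] is the attention probability vector over encoder
    frames that the designated head produced for the i-th word piece.
    """
    times = []
    for row in attention_rows:
        # Frame with the maximum attention probability for this piece.
        best_frame = max(range(len(row)), key=row.__getitem__)
        times.append(best_frame * frame_seconds)
    return times


if __name__ == "__main__":
    attention = [
        [0.7, 0.2, 0.1],  # first piece attends mostly to frame 0
        [0.1, 0.1, 0.8],  # second piece attends mostly to frame 2
    ]
    print(piece_times_from_attention(attention, frame_seconds=0.5))
```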

3.4 Real Time Speech Recognition

Real time speech recognition is constrained by three aspects:

  • The first aspect is defined by the ability of the decoding algorithm to provide speech recognition results as it digests feature vectors, which is known as streaming speech recognition. This is clearly not possible with LAS, but is possible with RNNT. This is explained in [8]. The attention mechanism of LAS requires the entire audio stream to be processed by the encoder before the decoder can start emitting output labels ([5]). In the case of RNNT, for each input feature vector, one or more output labels are emitted ([6]).
  • The second aspect is defined by the CPU or GPU consumption of the decoding algorithm and the models, which can be measured by the real time factor (RTF). RTF is the ratio of how long it takes to process an utterance over how long the utterance is. [1] performs symmetric parameter quantization to bring the 90th percentile RTF of RNNT models from 1.43 down to 0.51 on mobile devices.
  • The third aspect is defined by the latency of the system, which is the amount of time the user has to wait from the moment they stop speaking to the point they receive the final speech recognition result. In a mixed RNNT and LAS rescoring system, the LAS computation has to be done after RNNT decoding is finished. This means LAS adds all its time consumption to the latency. [2] reduces latency by moving parts of the LAS computation to the first RNNT pass. It also parallelizes the LAS processing of arcs from the n-best lattice. This removes almost all latency produced by LAS. Also, an end-of-speech label is included in the output layer of the RNNT model, effectively allowing the model to learn when to predict an endpoint. This removes the need for an external endpointer, and improves WER (as compared to using an external endpointer) by 10% relative.
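The RTF definition above translates directly into code. The sketch below computes the RTF per utterance and a 90th percentile over a set of utterances; an RTF below 1.0 means the system processes audio faster than it arrives. The function names, the nearest-rank percentile method and the toy measurements are our own assumptions, and [1] may compute its percentile differently.

```python
import math
from typing import List


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by the duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Toy (processing seconds, audio seconds) measurements.
    measurements = [(2.0, 4.0), (3.0, 5.0), (1.0, 4.0), (6.0, 5.0)]
    rtfs = [real_time_factor(p, a) for p, a in measurements]
    print("per-utterance RTF:", rtfs)
    print("90th percentile RTF:", percentile(rtfs, 90))
```

Reporting a high percentile rather than the mean is useful for real time applications, because the slowest utterances are the ones users notice.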

4. Conclusions

We surveyed recent research literature about Listen Attend and Spell (LAS) and Recurrent Neural Network Transducers (RNNT). We kept a perspective related to production accuracy and production features. Literature suggests that RNNT models are more accurate than CTC. It also suggests that, while LAS may be more accurate than RNNT, a mix of both can achieve better accuracy and feature parity when compared to hybrid RNN-HMM models.

The test sets used in the surveyed work may be unique and different from those of other domains, but the accuracy tables may provide good guidance for experiments that could be performed in the process of bringing LAS or RNNT architectures into our own ASR design.

Having looked at accuracy and feature parity at a high level in this blog post, in future blog posts we will focus on the implementation details of LAS and RNNT, and we will compare them to CTC.

