In the context of Deep Learning and Speech Recognition, there are three main types of neural network architectures that are widely used: Connectionist Temporal Classification (CTC), Listen-Attend-Spell (LAS) models, and Transducers (AKA RNNTs if only using Recurrent Neural Networks and their variants).
Transducers have recently become the best-performing model architecture for most Automatic Speech Recognition (ASR) tasks, surpassing CTC and LAS models, though each architecture has its advantages and disadvantages. In this article, we will examine the Transducer architecture more closely.
RNNTs, or Recurrent Neural Network Transducers, were first introduced in 2012 by Alex Graves in the paper “Sequence Transduction with Recurrent Neural Networks”. Alex Graves also (impressively) authored the famous CTC paper, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” published in 2006.
RNNTs were motivated by a limitation of CTC: its heavy dependence on an external Language Model to perform well. However, RNNTs did not receive serious attention until the 2018 paper “Streaming End-to-end Speech Recognition for Mobile Devices”, which demonstrated that RNNTs could deliver accurate speech recognition on mobile devices.
This paper revitalized research efforts into the Transducer architecture, which propelled Transducer models to achieve state-of-the-art accuracy over the past few years.
To understand Transducers, we can refer to the original paper by Alex Graves, “Sequence Transduction with Recurrent Neural Networks.” RNNTs were created to solve some of the shortcomings of CTC models, which required an external Language Model to perform well.
A CTC model has one module, the Encoder, used to model acoustic features.
CTC-modeled predictions are assumed to be independent of each other. To make this concept more concrete, let's walk through a simple example.
Imagine a CTC model outputs the transcript "I have for apples". As a human reading this, you can immediately spot the error. "I have for apples" should be "I have four apples". For a CTC model, "for" is just as sensical as "four", as they are phonetically similar. Since a CTC model’s outputs are conditionally independent of each other, the output of the word "for" does not take into consideration the surrounding context of words "I have ... apples".
This flaw is an inherent property of the CTC loss function. To help mitigate this issue, certain neural network layers like Recurrent Neural Networks or Transformers can, internally, learn to model the surrounding context from the acoustic features. However, because the CTC loss function still assumes that the outputs are conditionally independent of each other, a CTC model can still make these types of linguistic mistakes, and is, in theory, less accurate overall because the loss itself does not incorporate context between output labels.
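One way to see the independence assumption is in greedy CTC decoding. The sketch below is a minimal illustration, not any particular library's implementation: the function name `ctc_greedy_decode` and the per-frame probabilities are invented, and whole words are used as labels for readability (real models typically emit characters or subword units). Each frame's label is picked on its own, then repeats are merged and blanks are dropped.

```python
BLANK = "<blank>"


def ctc_greedy_decode(frames):
    """Pick the best label for each frame independently, then collapse the path."""
    # Step 1: per-frame argmax. No frame sees what any other frame chose.
    best_path = [max(scores, key=scores.get) for scores in frames]

    # Step 2: merge consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return decoded


# Invented per-frame probabilities (assumption, for illustration only).
frames = [
    {"I": 0.9, BLANK: 0.1},
    {"I": 0.6, BLANK: 0.4},
    {BLANK: 0.8, "have": 0.2},
    {"have": 0.9, BLANK: 0.1},
    {"for": 0.50, "four": 0.45, BLANK: 0.05},  # acoustically a near tie
    {BLANK: 0.9, "apples": 0.1},
    {"apples": 0.95, BLANK: 0.05},
]

print(" ".join(ctc_greedy_decode(frames)))  # I have for apples
```

Nothing in the decoding step can prefer "four" over "for" based on the neighbouring words, because the choice at that frame only looks at that frame's scores.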
Because of these shortcomings, CTC models require an external Language Model, trained separately on millions to billions of sentences, to correct any linguistic errors the CTC model may output.
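A common way to apply an external Language Model is to rescore a list of candidate transcripts. The following is a minimal sketch under stated assumptions: `rescore`, `toy_lm_log_prob`, the bigram table, the `lm_weight` value, and all scores are hypothetical and exist only to illustrate the idea. A real Language Model would be trained on a large text corpus.

```python
# Hypothetical bigram log-probabilities standing in for a trained Language Model.
TOY_BIGRAM_LOG_PROBS = {
    ("have", "four"): -1.0,
    ("have", "for"): -4.0,
}


def toy_lm_log_prob(text, unseen=-2.0):
    """Sum bigram log-probabilities; unseen word pairs get a flat default score."""
    words = text.split()
    return sum(TOY_BIGRAM_LOG_PROBS.get(pair, unseen) for pair in zip(words, words[1:]))


def rescore(hypotheses, lm_log_prob, lm_weight=0.5):
    """Return the transcript with the best combined acoustic + weighted LM score."""
    def combined(item):
        text, acoustic_log_prob = item
        return acoustic_log_prob + lm_weight * lm_log_prob(text)

    return max(hypotheses, key=combined)[0]


# Candidate transcripts with invented acoustic log-probabilities.
n_best = [
    ("I have for apples", -3.0),   # slightly preferred by the acoustic model
    ("I have four apples", -3.2),
]

print(rescore(n_best, toy_lm_log_prob, lm_weight=0.0))  # I have for apples
print(rescore(n_best, toy_lm_log_prob, lm_weight=0.5))  # I have four apples
```

With the Language Model switched off (`lm_weight=0.0`), the acoustically preferred but wrong transcript wins. Once the Language Model's score is mixed in, the linguistically sensible transcript comes out on top.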
Compared to a CTC model, an RNNT model has three modules that are trained jointly: the Encoder, the Predictor, and the Joint network. Each of the three has its own purpose.
The Encoder models the acoustic features of speech, and the Predictor acts as a Language Model to learn language information from the training data. Finally, the Joint network takes in the predictions from the Encoder and Predictor to produce a label.
The Predictor and the Joint network condition on the labels that have already been emitted, so each new prediction depends on the previous ones. Training these modules jointly makes an external Language Model unnecessary for high accuracy.
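The interaction between the three modules can be sketched with a toy greedy decoder. This is an illustration under heavy assumptions, not a trained model: the "Encoder" and "Predictor" are hard-coded lookup tables with invented scores, whole words are used as labels, and the joint step simply adds the two score vectors (in practice the Joint network is a learned layer). All names here (`encoder_scores`, `predictor_scores`, `joint`, `transducer_greedy_decode`) are hypothetical.

```python
BLANK = "<blank>"
START = "<sos>"
VOCAB = [BLANK, "I", "have", "for", "four", "apples"]

# Toy "Encoder" output: acoustic scores per frame (invented numbers).
ENCODER_FRAMES = [
    {"I": 2.0},
    {"have": 2.0},
    {"for": 2.0, "four": 1.9},  # acoustically, "for" is slightly ahead
    {"apples": 2.0},
]

# Toy "Predictor": scores for the next label given the previous label.
PREDICTOR_TABLE = {
    START: {"I": 0.0},
    "I": {"have": 0.0},
    "have": {"for": 0.0, "four": 1.0},  # "have four" is the likelier pair
    "for": {"apples": 0.0},
    "four": {"apples": 0.0},
    "apples": {},
}


def encoder_scores(frame):
    """Labels the frame does not mention get a low default; blank stays neutral."""
    return {label: frame.get(label, 0.0 if label == BLANK else -5.0) for label in VOCAB}


def predictor_scores(previous_label):
    """Continuations not listed for the previous label are strongly penalized."""
    allowed = PREDICTOR_TABLE[previous_label]
    return {label: 0.0 if label == BLANK else allowed.get(label, -10.0) for label in VOCAB}


def joint(enc, pred):
    """Combine acoustic and label-history scores (simple addition in this toy)."""
    return {label: enc[label] + pred[label] for label in VOCAB}


def transducer_greedy_decode(frames, max_symbols_per_frame=5):
    hypothesis = []
    previous = START
    for frame in frames:
        enc = encoder_scores(frame)
        for _ in range(max_symbols_per_frame):
            scores = joint(enc, predictor_scores(previous))
            best = max(scores, key=scores.get)
            if best == BLANK:
                break  # nothing more to emit for this frame; move on
            hypothesis.append(best)
            previous = best  # the next prediction now depends on this label
    return hypothesis


print(" ".join(transducer_greedy_decode(ENCODER_FRAMES)))  # I have four apples
```

On the third frame the acoustic scores alone favour "for", but because the Predictor has seen "have" as the previous label, the combined score favours "four". A blank output tells the decoder to advance to the next audio frame, while a non-blank output updates the Predictor's input and lets the decoder try again on the same frame.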
At AssemblyAI, we’ve recently transitioned our core transcription model from a CTC model to a Transducer model, and achieved substantially greater accuracy. Rather than Recurrent Neural Networks, however, our model uses Transformers, in particular the Conformer variant.
In the original paper, “Sequence Transduction with Recurrent Neural Networks” by Alex Graves, RNN layers were used in the model architecture.
Recurrent Neural Networks have since been dethroned as the go-to architecture to model sequential data. Transformers, first introduced in the paper “Attention Is All You Need”, have been at the center of attention when it comes to NLP and speech research.
The ability of Transformers to model global features from sequential data is what makes them so powerful. However, for speech, it makes sense to look not only at the global features in audio data, but at local features as well, since acoustic features are more likely to be correlated with adjacent features than with those that are far away.
The Conformer is a variant of the Transformer that was first introduced in the paper “Conformer: Convolution-augmented Transformer for Speech Recognition”. This paper’s thesis is that by interleaving convolution layers with the self-attention layers of a Transformer, the model is forced to pay attention to both local and global features, getting the best of both worlds!
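The local-versus-global distinction can be made concrete with a toy example. The sketch below is a deliberate simplification and an assumption on our part, not the Conformer as published: it works on a sequence of single numbers, uses a fixed smoothing kernel instead of learned convolution weights, uses single-head attention without learned projections, and leaves out the feed-forward and normalization layers of a real Conformer block. The function names are hypothetical.

```python
import math


def local_conv(seq, kernel):
    """1-D convolution with zero padding: each output only sees nearby inputs."""
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(seq):
                total += weight * seq[j]
        out.append(total)
    return out


def self_attention(seq):
    """Scalar dot-product self-attention: each output mixes every input."""
    out = []
    for query in seq:
        scores = [query * key for key in seq]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        norm = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, seq)) / norm)
    return out


def conformer_like_block(seq, kernel):
    """Attention (global) followed by convolution (local), each with a residual."""
    attended = [x + a for x, a in zip(seq, self_attention(seq))]
    return [x + c for x, c in zip(attended, local_conv(attended, kernel))]


kernel = [0.25, 0.5, 0.25]
seq = [1.0, 0.5, -0.3, 0.8, 0.0, 0.2]
changed = seq[:-1] + [2.0]  # alter only the last, far-away element

# The convolution output at position 0 cannot see the distant change...
print(local_conv(seq, kernel)[0] == local_conv(changed, kernel)[0])  # True
# ...but the self-attention output at position 0 can.
print(self_attention(seq)[0] == self_attention(changed)[0])  # False
```

Stacking the two, as `conformer_like_block` does, gives each position access to both kinds of information: fine-grained detail from its neighbours and context from the whole sequence.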
What We’ve Learned Experimenting with CTC and Transducers
Both CTC models and Transducers perform well in the real world. However, Transducer type models are clearly the way to go, as they are the leaders in accuracy, beating CTC and LAS speech recognition architectures. Over the last few years, and hundreds of experiments, here are some advantages and disadvantages we found with CTC versus Transducers.
CTC advantages
- CTC models are easier to train! A CTC model has a single module, the Encoder. This simplicity makes CTC really easy to train.
- There are more resources available for CTC models. Since CTC models have been the most popular architecture for Speech Recognition for so long, there is a large amount of research and open source tooling to help you quickly build and train them.
CTC disadvantages
- CTC models converge slower! Although CTC models are easier to train, we have noticed that they converge much more slowly than Transducer models. In our experiments, CTC models have consistently needed more training iterations than Transducers to reach similar results.
- CTC models are worse with proper nouns. Proper nouns are rare words that may or may not be in the training corpus, and CTC models seem to be less accurate on them.
- CTC models require an external Language Model (LM) to perform well.
Transducer advantages
- Transducer models converge faster! With faster convergence, we are able to run more experiments, shortening the feedback loop for deep learning research.
- Transducer models are more accurate even with fewer parameters. Overall, we’ve found Transducer models to be more accurate on proper nouns (+24.47%) and in overall transcription accuracy (+18.72%), even when the Transducer model is smaller than a comparable CTC model.
Transducer disadvantages
- Transducer models are harder to train. The complexity of having to jointly train three networks increases the surface area for bugs!
- Transducer models have a larger memory footprint, making it harder to train larger models.
- Transducer models have fewer resources online to take advantage of. We are pretty much at the bleeding edge here with Conformers and Transducer models, so searching for answers online usually returns nothing.
For Transducers, in our experience, the pros outweigh the cons. By switching to Transducers at AssemblyAI, our research team has been able to explore the bleeding edge of ASR research and provide our customers and developers with state-of-the-art accuracy through our Speech-to-Text API.