Improved Accuracy on AssemblyAI’s Real Time Speech-to-Text API

AssemblyAI's new real time Speech Recognition system has improved accuracy through new training and tokenization methods.

A while back, we demonstrated how easy it is to do real time Speech Recognition with AssemblyAI’s Speech-to-Text API. Now we have an update to announce: our real time Speech Recognition system has improved accuracy while keeping the same model architecture. How did we do it? With upgrades to our training methods and our vocabulary tokenization method.

Improved Training Loss

There are three popular end-to-end training methods for Speech Recognition in the industry: Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNNT), and Attention Encoder-Decoder (AED) models. Our real time model is trained mainly with a CTC loss, with an added AED loss for stability. The new real time system adds an intermediate CTC loss and a bidirectional loss for the AED component.
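The training objective described above can be sketched as a weighted sum of the three loss terms. The function and weight values below are purely illustrative assumptions, not AssemblyAI’s actual configuration:

```python
def joint_loss(ctc_final: float, ctc_intermediate: float, aed: float,
               inter_weight: float = 0.3, aed_weight: float = 0.3) -> float:
    """Combine the final-layer CTC loss, an intermediate-layer CTC loss,
    and the AED loss into one training objective.

    The weights are hypothetical; real systems tune them empirically.
    """
    return ((1 - inter_weight) * ctc_final
            + inter_weight * ctc_intermediate
            + aed_weight * aed)
```

In practice each argument would be a tensor produced by a deep-learning framework’s CTC and cross-entropy loss functions, but the weighting idea is the same: the intermediate CTC term regularizes the lower encoder layers without replacing the final-layer objective.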

Traditionally, the CTC loss is calculated only at the last layer. Computing an additional CTC loss at an intermediate layer improves the quality of the model’s lower-level representations, which acts as a form of regularization. AED models are usually given only a unidirectional representation of the text. For example, the English sentence “the quick brown fox jumps over the lazy dog” would be read in from left to right, just like that. The bidirectional loss introduces a second representation that is read from right to left, like “dog lazy the over jumps fox brown quick the”. This bidirectional encoding gives the model a richer picture of what language looks like.
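The right-to-left representation is just the token sequence reversed. A minimal sketch of how the second view of the sentence above could be produced:

```python
# Forward (left-to-right) view of the sentence the AED decoder normally sees.
sentence = "the quick brown fox jumps over the lazy dog"
forward = sentence.split()

# Backward (right-to-left) view used by the bidirectional loss:
# the same tokens, reversed.
backward = list(reversed(forward))
# " ".join(backward) -> "dog lazy the over jumps fox brown quick the"
```

During training, one decoder direction predicts the forward sequence and the other predicts the reversed one, so the encoder must produce representations useful for both.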

Altered Vocabulary Tokenization

Tokenization is one of the most important parts of Speech Recognition. In our prior real time model, we used a “blank” or “separator” token to represent the space between words. When training a model, we want it to be “hard” for the model to learn representations so it doesn’t overfit. However, we conjectured that learning a representation for the space between words is both unnecessarily hard and not very useful: predicting word boundaries contributes little to the overall goal of predicting the whole input. Our new model instead learns the starts of words, marking vocabulary entries with a “start” token, an approach derived from research on end-to-end streaming Speech Recognition.

For example, in the sentence “I am not a robot” the old model may have learned the structure

[“I”, “_”, “am”, “_”, “not”, “_”, “a”, “_”, “robot”] 

while the new model would learn something like  

[“_I”, “_am”, “_not”, “_a”, “_ro”, “bot”] 

where “bot” has no start marker in front of it, so we know it continues the previous word rather than beginning a new one.
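Reconstructing text from the new scheme is straightforward: a start marker means “begin a new word”, and anything else attaches to the word in progress. A small sketch (the function name and marker character are our own for illustration; subword tokenizers commonly use a special prefix character for this):

```python
def detokenize(tokens, marker="_"):
    """Join start-marked subword tokens back into a plain sentence.

    Tokens beginning with `marker` start a new word; all other
    tokens are appended to the current word.
    """
    text = ""
    for tok in tokens:
        if tok.startswith(marker):
            text += " " + tok[len(marker):]  # new word: add a space
        else:
            text += tok                      # continuation: no space
    return text.strip()

# The new model's output from the example above:
detokenize(["_I", "_am", "_not", "_a", "_ro", "bot"])  # -> "I am not a robot"
```

Note that the space is recovered for free from the start markers; the model never has to predict a separator token of its own.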

Conclusion

Our new real time Speech Recognition system is more accurate. The improvement didn’t come from a new type of model, but rather from changes to the way we train the model and the way we structure its vocabulary. The new training methods introduce more stability and regularization for more consistent predictions, while the new vocabulary tokenization makes sentence structure easier to predict. For more information, follow @assemblyai or @yujian_tang.