Deep Learning

Deep Shallow Fusion for RNN-T Personalization

This week’s Deep Learning Research paper is “Deep Shallow Fusion for RNN-T Personalization.”

What’s Exciting About this Paper

End-to-end deep learning ASR models can produce highly accurate transcriptions, but they are much harder to personalize. Because they are trained end to end, they lack the composability of traditional pipelines, where separate acoustic, language, and pronunciation models can each be adapted on their own. That lack of composability makes personalization difficult, so these models struggle to accurately predict custom vocabularies and rare proper nouns. This paper walks through methods that improve the accuracy of end-to-end deep learning models on proper nouns and rare words.

Key Findings

The paper describes four different techniques for improving proper-noun recognition. Two in particular stood out to us: they are simpler than the others but still produce solid accuracy gains. With this suite of training tricks, you can improve a model's ability to predict proper nouns and rare words.

  • Subword regularization: During training, instead of directly feeding the most probable prediction from the previous timestep into the current timestep, you can sample from a list of n-best outputs and use that as input. This keeps the model from overfitting on high-frequency words and should improve predictions for rarer words (see the sketch after this list).
  • Grapheme-to-Grapheme (G2G): You can use a G2G model to augment your dataset! G2G models can transform a word into alternative spellings with similar pronunciations, such as “Kaity” → “Katie.” Using G2G to generate additional pronunciations for decoding led to significant improvement in rare-name recognition (a sketch of this augmentation also follows the list).
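
To make the first bullet concrete, here is a minimal PyTorch-style sketch of sampling the fed-back token from the n-best predictions instead of always taking the argmax. The function name, the `n_best` parameter, and the temperature `tau` are our own illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def sample_prev_token(logits: torch.Tensor, n_best: int = 4, tau: float = 1.0) -> torch.Tensor:
    """Pick the token fed into the next timestep by sampling from the
    n-best predictions of the previous timestep, rather than the argmax.

    logits: (vocab_size,) tensor of unnormalized scores from the previous step.
    """
    # Take the n-best candidates from the previous timestep's output.
    topk_logits, topk_ids = logits.topk(n_best)
    # Renormalize over just those candidates and sample one of them.
    probs = F.softmax(topk_logits / tau, dim=-1)
    idx = torch.multinomial(probs, num_samples=1)
    return topk_ids[idx]

# Illustrative usage with random logits standing in for a real model's output:
logits = torch.randn(1000)
prev_token = sample_prev_token(logits)
```

During training you would apply this to the previous timestep's logits before embedding the token for the current timestep; at inference time you would revert to standard argmax or beam decoding.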
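And a sketch of what G2G-based augmentation might look like in practice. There is no standard G2G library, so the `ToyG2G` class below is a hypothetical stand-in for a trained G2G model; the point is just how spelling variants expand a personalized name list.

```python
class ToyG2G:
    """Toy stand-in for a trained G2G model: maps a word to alternative
    spellings with similar pronunciations. A real model would generate
    these variants rather than look them up."""
    _table = {"Kaity": ["Katie", "Katey"], "Jon": ["John"]}

    def alternatives(self, word: str, n: int) -> list[str]:
        return self._table.get(word, [])[:n]

def augment_vocabulary(names: list[str], g2g, n_variants: int = 3) -> list[str]:
    """Expand a personalized name list with G2G spelling variants so that
    rare names get extra coverage at decoding time."""
    augmented = set(names)
    for name in names:
        augmented.update(g2g.alternatives(name, n_variants))
    return sorted(augmented)

print(augment_vocabulary(["Kaity", "Jon"], ToyG2G()))
# ['John', 'Jon', 'Kaity', 'Katey', 'Katie']
```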

Our Takeaways

End-to-end ASR models tend to overfit to high-frequency words, which makes it hard for them to predict rare words. By augmenting the data with G2G and adding a little randomness to the training regime, you can reduce that overfitting and raise the model's probability of predicting low-frequency words like proper nouns.