Deep Learning

Deep Learning Paper Recaps - Modality Matching and Masked Autoencoders

This week’s Deep Learning Paper Recaps cover "MAESTRO: Matched Speech Text Representations through Modality Matching" and "Masked Autoencoders that Listen".

MAESTRO: Matched Speech Text Representations through Modality Matching

What’s Exciting About This Paper?

This paper explores a new, efficient method for learning unified representations from the speech and text modalities. The researchers' proposed method outperforms the previous state of the art (SOTA) on ASR tasks.

Figure: Monolingual ASR results - Source
Figure: Multilingual ASR results - Source

Key Findings

In the paper, the modality matching algorithm learns to unify speech and text representations. This makes it possible to incorporate lexical information from text-only inputs and helps improve ASR performance in both monolingual and multilingual setups.

It only takes a small amount of supervised data to effectively unify representations.
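To make the idea concrete, here is a minimal sketch of a modality-matching objective: paired speech and text are embedded by separate encoders into a shared space, the text sequence is stretched to the speech frame rate, and an MSE loss pulls the two embedding sequences together. The module names, sizes, and the naive nearest-neighbor upsampling (standing in for MAESTRO's learned duration/alignment mechanism) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a modality-matching loss between speech and text encoders.
# All sizes and modules are illustrative assumptions, not MAESTRO's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityMatcher(nn.Module):
    def __init__(self, speech_dim=80, vocab_size=1000, shared_dim=256):
        super().__init__()
        # Speech branch: frame features (e.g., log-mel) -> shared embedding space
        self.speech_encoder = nn.GRU(speech_dim, shared_dim, batch_first=True)
        # Text branch: token ids -> shared embedding space
        self.text_embed = nn.Embedding(vocab_size, shared_dim)
        self.text_encoder = nn.GRU(shared_dim, shared_dim, batch_first=True)

    def forward(self, speech_frames, text_tokens):
        # speech_frames: (batch, num_frames, speech_dim); text_tokens: (batch, num_tokens)
        speech_emb, _ = self.speech_encoder(speech_frames)              # (B, T_speech, D)
        text_emb, _ = self.text_encoder(self.text_embed(text_tokens))  # (B, T_text, D)

        # Crudely upsample text embeddings to the speech frame rate so the two
        # sequences can be compared position by position (a stand-in for a
        # learned duration model / alignment).
        text_upsampled = F.interpolate(
            text_emb.transpose(1, 2), size=speech_emb.size(1), mode="nearest"
        ).transpose(1, 2)

        # Modality-matching loss: paired speech and text should land on
        # nearby points in the shared space.
        return F.mse_loss(speech_emb, text_upsampled)
```

With an objective of this kind, text-only utterances can be pushed through the text branch alone, injecting lexical knowledge into the shared space that the speech branch also uses.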

Our Takeaways

Learning joint speech and text representations allows for knowledge sharing between the two modalities. Since text-only data is easier to collect, it could help improve ASR model accuracy, especially for languages with limited speech resources.

Masked Autoencoders that Listen

What’s Exciting About This Paper?

In this paper, the authors extend the image-based masked autoencoder (MAE) to audio and achieve state-of-the-art results.

The model works by splitting the mel spectrogram into patches, masking ~80% of them, passing only the unmasked patches through the encoder, and then passing the full set of patches, restored to their original order, to the decoder. The objective is to reconstruct the entire spectrogram, although the MSE loss is computed only on the masked patches and their ground-truth counterparts.
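The following is a minimal PyTorch sketch of that masking-and-reconstruction pipeline. The dimensions, network depths, and the omission of positional embeddings are simplifying assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of masked-spectrogram-autoencoder pretraining.
# Sizes and module choices are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class SpectrogramMAE(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=192, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)   # flattened patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2)
        self.reconstruct = nn.Linear(embed_dim, patch_dim)    # token -> flattened patch

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened mel-spectrogram patches
        # (positional embeddings omitted here for brevity)
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Randomly shuffle patch indices and keep only the visible ~20%
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep_idx, mask_idx = perm[:, :num_keep], perm[:, num_keep:]
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encode only the visible patches
        encoded = self.encoder(self.patch_embed(visible))

        # Re-insert mask tokens and restore the original patch order for the decoder
        tokens = torch.cat(
            [encoded, self.mask_token.expand(B, N - num_keep, -1)], dim=1)
        restore = perm.argsort(dim=1)
        tokens = torch.gather(
            tokens, 1, restore.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

        # Reconstruct every patch, but compute the loss only on the masked ones
        pred = self.reconstruct(self.decoder(tokens))
        masked_pred = torch.gather(pred, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        masked_true = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        return nn.functional.mse_loss(masked_pred, masked_true)
```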

The paper demonstrates that masked autoencoders can be as good as recent contrastive self-supervised learning approaches.

Key Findings

Masked autoencoder pretraining can be extended beyond static images to temporal data such as audio and video. The extremely high masking ratios may also lead to models that are more robust in terms of both quality and bias.

The authors also show that local attention works better than global attention in the speech domain. This could be explained by the fact that neighboring features in spectrograms are highly correlated.
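As a toy illustration (not code from the paper), a local attention pattern can be expressed as a banded mask over the sequence of spectrogram patches, so that each patch only attends to its nearby neighbors:

```python
# Illustrative sketch: a banded mask that restricts self-attention to a local
# window of neighboring spectrogram patches. Window size is an assumption.
import torch

def local_attention_mask(num_patches: int, window: int = 4) -> torch.Tensor:
    # True marks positions a patch is allowed to attend to.
    idx = torch.arange(num_patches)
    return (idx[None, :] - idx[:, None]).abs() <= window

# torch.nn.MultiheadAttention expects True to mean "masked out",
# so the mask is inverted before use as `attn_mask`.
mask = ~local_attention_mask(num_patches=64, window=4)
```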

Our Takeaways

Masked autoencoders are back - they can produce results competitive with recent contrastive models. A high masking ratio reduces feature overfitting and bias, encouraging models to learn general, robust features.