Deep Learning Paper Recap - Automatic Speech Recognition

This week’s Deep Learning Paper Recaps are Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition and Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

What’s Exciting About this Paper

This paper proposes a novel method to generate adversarial audio samples that are not only stealthy but also remain effective when played over the air.

Key Findings

The adversarial samples generated have the following properties (a rough sketch of the basic attack loop follows the list):

  • Imperceptible: The attacked audio sounds extremely similar to the original audio such that a human cannot differentiate between the two.
  • Robust: The attacked audio remains effective even when it is played over the air, for example when a sample is played through a speaker, recorded by a microphone, and then fed to the model.
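The paper's full method also relies on psychoacoustic masking (for imperceptibility) and simulated room reverberation (for over-the-air robustness), which are omitted here. Below is only a minimal sketch of the underlying targeted-attack loop, assuming a PyTorch ASR model `asr_model` that returns per-frame log-probabilities and is trained with a CTC loss; all names and hyperparameters are illustrative.

```python
# Minimal sketch of a targeted adversarial attack on an ASR model (illustrative only).
# Assumes `asr_model(audio)` returns log-probs of shape (T, 1, vocab_size).
import torch

def targeted_attack(asr_model, audio, target_ids, steps=1000, lr=1e-3, eps=0.05):
    delta = torch.zeros_like(audio, requires_grad=True)   # adversarial perturbation
    optimizer = torch.optim.Adam([delta], lr=lr)
    ctc_loss = torch.nn.CTCLoss(blank=0)                  # blank index is an assumption

    for _ in range(steps):
        perturbed = audio + delta
        log_probs = asr_model(perturbed)                   # (T, 1, vocab_size)
        input_lengths = torch.tensor([log_probs.size(0)])
        target_lengths = torch.tensor([target_ids.size(0)])
        # Minimize CTC loss toward the attacker's target transcription
        loss = ctc_loss(log_probs, target_ids.unsqueeze(0),
                        input_lengths, target_lengths)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                        # keep the perturbation small

    return (audio + delta).detach()
```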

Our Takeaways

While the imperceptible attack is stealthy and has a high success rate, the combined imperceptible + robust attack still needs improvement, currently reaching only about a 50% success rate. The attack also appears to be weakened by resampling the audio.

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

What’s Exciting About this Paper

Fine-tuning the pretrained wav2vec model for each downstream task leads to one large model per task, which is expensive to deploy. This paper shows that inserting adapters greatly reduces the number of parameters that need to be updated during fine-tuning: only about 10% of the parameters are fine-tuned, so roughly 90% of the model can be reused across downstream tasks.

Key Findings

The authors insert adapter layers in each of the transformer encoder blocks of the wav2vec model. Inside each adapter layer, they apply a linear down-projection followed by a linear up-projection. They also add skip connections.
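A minimal PyTorch sketch of such an adapter layer is shown below. The hidden size of 768 matches the wav2vec 2.0 base encoder; the bottleneck size, activation, and any normalization are assumptions here and may differ from the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: linear down-projection, non-linearity,
    linear up-projection, and a residual (skip) connection."""
    def __init__(self, hidden_dim=768, bottleneck_dim=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection
        self.activation = nn.ReLU()

    def forward(self, x):
        # The skip connection keeps the encoder's original representation intact,
        # so an untrained adapter starts close to the identity mapping.
        return x + self.up(self.activation(self.down(x)))
```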

The authors trained the entire model with English speech data. Then, they ran two experiments with French speech data:

  • They fine-tuned the entire network (95.6% of the parameters)
  • They fine-tuned only the adapter layers (9.2% of the parameters); a rough sketch of this setup follows the list
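The adapter-only setup can be reproduced conceptually by freezing every parameter except those in the adapter layers before fine-tuning. A minimal sketch, assuming the adapter modules above have been inserted into a model available as `model` and registered with "adapter" in their parameter names (an illustrative convention, not the paper's code):

```python
# Freeze everything except the adapter parameters before fine-tuning.
# Assumes adapter submodules were registered with "adapter" in their names.
for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Fine-tuning {100 * trainable / total:.1f}% of the parameters")
```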

Results are measured in word error rate (WER) on French test data and show similar performance for both experiments:

  • 40.2% WER when fine-tuning the whole network
  • 39.4% WER when fine-tuning only the adapter layers

Our Takeaways

Adapters show that fine-tuning the whole model is not required for downstream tasks. Instead, it is enough to fine-tune adapter layers carefully inserted into the model.

Using adapters would allow us to reuse 90% of the parameters for each downstream task rather than deploying one fine-tuned model per task.