Deep Learning

Review - Speech processing Universal Performance Benchmark Review

This week’s Deep Learning Research paper is “SUPERB: Speech processing Universal PERformance Benchmark.”

What’s Exciting About this Paper

The trend in natural language processing (NLP) and computer vision (CV) has been to pre-train on large amounts of unlabeled data and then fine-tune for various downstream tasks. A key gap in developing pre-trained models for speech applications has been the lack of systematic tasks and benchmarks. This paper introduces, for the first time, a set of tasks and benchmarks for measuring how well pre-trained models generalize across various speech tasks.

Key Findings

  • Speech Tasks: The paper groups its speech tasks into four categories: content-based tasks, which include Automatic Speech Recognition (ASR) and Keyword Spotting; speaker-based tasks, which include Speaker Diarization; semantic-based tasks, which include Intent Classification; and paralinguistics-based tasks. Many of these tasks and benchmarks have been around in the speech community for a while, but this is the first time they have been put together to measure the performance of pre-trained models.
  • HuBERT Large outperforms Wav2Vec2.0 Large: The paper also evaluates both generative and discriminative models pre-trained on the Libri-Light and LibriSpeech datasets on the above tasks. HuBERT Large, which has a similar number of parameters to Wav2Vec2.0 Large, outperforms it in eight of the twelve tasks, which is surprising. With minimal adaptation, HuBERT Large also achieves state-of-the-art results in some of these tasks.
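
To make the two findings above concrete, here is a minimal Python sketch of how the task grouping and a "model A beats model B on N tasks" count could be represented. It is an illustration only, not the benchmark's actual code: the names `TASK_CATEGORIES`, `HIGHER_IS_BETTER`, `category_of`, and `count_wins` are hypothetical, only the tasks named in this review are listed, and the metric directions assume the conventional metrics (error rates for ASR and diarization, accuracy for keyword spotting and intent classification).

```python
from typing import Dict, List

# Task grouping as described in this review (hypothetical identifiers).
# Only tasks the review names are listed; it names none for paralinguistics.
TASK_CATEGORIES: Dict[str, List[str]] = {
    "content": ["automatic_speech_recognition", "keyword_spotting"],
    "speaker": ["speaker_diarization"],
    "semantics": ["intent_classification"],
    "paralinguistics": [],
}

# Assumption: metric direction per task. Error rates are lower-is-better,
# accuracies are higher-is-better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "automatic_speech_recognition": False,  # e.g. word error rate
    "keyword_spotting": True,               # e.g. accuracy
    "speaker_diarization": False,           # e.g. diarization error rate
    "intent_classification": True,          # e.g. accuracy
}


def category_of(task: str) -> str:
    """Return the category a task belongs to."""
    for category, tasks in TASK_CATEGORIES.items():
        if task in tasks:
            return category
    raise KeyError(f"unknown task: {task}")


def count_wins(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> int:
    """Count the tasks on which model A strictly beats model B.

    Only tasks present in both score dictionaries are compared; ties
    count for neither model.
    """
    wins = 0
    for task in scores_a.keys() & scores_b.keys():
        a, b = scores_a[task], scores_b[task]
        if HIGHER_IS_BETTER[task]:
            if a > b:
                wins += 1
        elif a < b:
            wins += 1
    return wins


# Toy numbers for illustration only; these are NOT results from the paper.
toy_model_a = {"automatic_speech_recognition": 5.0, "keyword_spotting": 0.90}
toy_model_b = {"automatic_speech_recognition": 6.0, "keyword_spotting": 0.95}
toy_wins = count_wins(toy_model_a, toy_model_b)
```

Tracking the metric direction per task matters here because a single "higher is better" comparison would miscount tasks scored by error rate.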

Our Takeaways

This paper introduces a simple framework for measuring the performance of pre-trained self-supervised models on various speech tasks. This should fuel research in representation learning and general speech processing.