Review - Speech processing Universal Performance Benchmark Review

This week’s Deep Learning Research paper is “SUPERB: Speech processing Universal PERformance Benchmark.”

What’s Exciting About this Paper

The trend in natural language processing (NLP) and computer vision (CV) has been to pre-train on large amounts of unlabeled data and then fine-tune for various downstream tasks. One of the key gaps in developing pre-trained models for speech applications has been the lack of systematic tasks and benchmarks. This paper introduces, for the first time, a set of tasks and benchmarks that measure how well pre-trained models generalize across various speech tasks.

Key Findings

Speech Tasks: The paper groups its speech tasks into four categories: content-based tasks, which include Automatic Speech Recognition (ASR) and Keyword Spotting; speaker-based tasks, which include Speaker Diarization; semantic-based tasks, which include intent classification; and paralinguistics-based tasks. Many of these tasks and benchmarks have been around in the speech community for a while. This is the first time they have been put together to measure the performance of pre-trained models.
HuBERT Large outperforming Wav2Vec2.0 Large: The paper also evaluates both generative and discriminative models trained on the Libri-Light and LibriSpeech datasets on the above tasks. HuBERT Large, which has a similar number of parameters to Wav2Vec2.0 Large, outperforms it in eight of the twelve tasks, which is surprising. With minimal adaptation, HuBERT Large also achieves state-of-the-art results on some of these tasks.
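
A comparison such as "outperforms it in eight of the twelve tasks" depends on metric direction: accuracy is higher-is-better, while error rates such as WER (ASR) and DER (diarization) are lower-is-better. The following is a minimal sketch of how such a tally could be computed. The function name `count_wins`, the task keys, and every score below are hypothetical placeholders for illustration; they are not numbers from the paper.

```python
from typing import Dict, Tuple

# task name -> (score of model A, score of model B, higher_is_better)
Scores = Dict[str, Tuple[float, float, bool]]


def count_wins(scores: Scores) -> Tuple[int, int, int]:
    """Return (wins_a, wins_b, ties) across tasks, respecting metric direction."""
    wins_a = wins_b = ties = 0
    for _task, (a, b, higher_is_better) in scores.items():
        if a == b:
            ties += 1
        elif (a > b) == higher_is_better:
            # A is ahead: larger on a higher-is-better metric,
            # or smaller on a lower-is-better metric.
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties


# Hypothetical placeholder scores, NOT results reported in the paper.
example_scores: Scores = {
    "keyword_spotting_acc": (0.95, 0.96, True),       # accuracy: higher is better
    "asr_wer": (0.04, 0.05, False),                   # word error rate: lower is better
    "diarization_der": (0.06, 0.06, False),           # diarization error rate: lower is better
    "intent_classification_acc": (0.98, 0.95, True),  # accuracy: higher is better
}

print(count_wins(example_scores))  # (2, 1, 1)
```

Keeping the direction flag next to each score avoids the common mistake of treating a lower error rate as a loss when tallying wins across mixed metrics.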

Our Takeaways

This paper introduces a simple framework to measure the performance of pre-trained self-supervised models on various speech tasks. This should fuel research in representation learning and general speech processing.
