This week’s Deep Learning Paper Review is ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
What’s Exciting About this Paper
In the world of transformer-type models, and Natural Language Processing in particular, it has become apparent that increasing model size when pretraining natural language representations often improves performance on downstream tasks. However, this approach has limits: at some point, GPU/TPU memory constraints and longer training times make it difficult to keep increasing model size.
This paper aims to address this problem by proposing two parameter-reduction techniques that significantly reduce the memory consumption and training time of BERT. The first technique is a factorization of the embedding parameters, which decouples the WordPiece embedding size from the hidden layer size and thus significantly reduces the number of embedding parameters. The second technique is cross-layer parameter sharing, where the feed-forward network and attention parameters are shared across layers, which further decreases the total number of parameters that need to be trained.
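To make the two techniques concrete, here is a rough back-of-the-envelope parameter count. It is a sketch under stated assumptions, not the paper's exact accounting: the vocabulary size (30,000), hidden size (768), embedding size (128) and layer count (12) are illustrative values, the per-layer count of 12 x H^2 covers only the attention and feed-forward weight matrices (biases, LayerNorm and position embeddings are ignored), and the function names are made up for this example.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding weights.

    Without factorization the table is V x H. With factorization it is
    a V x E table followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(num_layers, hidden_size, share_across_layers=False):
    """Approximate encoder weights (assumption: 12 * H^2 per layer).

    4 * H^2 for the attention projections (Q, K, V, output) plus
    8 * H^2 for a feed-forward block with inner size 4H.
    """
    per_layer = 12 * hidden_size ** 2
    # With cross-layer sharing, one set of weights is reused by every layer.
    return per_layer if share_across_layers else num_layers * per_layer


# Illustrative values (assumptions, see the paragraph above).
V, H, E, L = 30_000, 768, 128, 12

unfactorized = embedding_params(V, H)
factorized = embedding_params(V, H, E)
unshared = encoder_params(L, H)
shared = encoder_params(L, H, share_across_layers=True)

print(f"embeddings: {unfactorized:,} -> {factorized:,}")
print(f"encoder:    {unshared:,} -> {shared:,}")
print(f"total:      {unfactorized + unshared:,} -> {factorized + shared:,}")
```

Under these assumptions, most of the savings come from sharing the encoder weights, while the embedding factorization matters more as the hidden size grows, because the vocabulary table no longer scales with it.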
Here are the main configurations of BERT and ALBERT models analyzed in this paper:
The configurations above were tested across a range of different standard benchmarks for downstream tasks. Experiments show that ALBERT’s best configuration establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The table below shows how different ALBERT and BERT configurations compare when tested on some of these benchmarks.
One thing to note here is that even though ALBERT-xxlarge performs better than BERT-large across the board in the table above, it trains significantly more slowly. The authors acknowledge this and explain that ALBERT-xxlarge is a much larger network, even though it has significantly fewer parameters. However, the authors also trained the two configurations for roughly the same amount of wall-clock time, instead of the same number of training steps, and ALBERT-xxlarge still outperformed BERT-large, as shown in the table below.
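The distinction between "fewer parameters" and "larger network" can be sketched with a crude count. Sharing weights across layers reduces what has to be stored, but every layer still runs in the forward pass, so the work per token does not shrink. The snippet below assumes the layer counts and hidden sizes reported in the paper's configuration table (BERT-large: 24 layers, hidden size 1024; ALBERT-xxlarge: 12 layers, hidden size 4096) and the same 12 x H^2 weights-per-layer approximation commonly used for transformer encoders. It ignores embeddings, biases, attention-score computation and hardware efficiency, so it is a proxy for compute, not a prediction of the measured slowdown.

```python
def layer_weights(hidden_size):
    # Assumption: 4 * H^2 attention weights + 8 * H^2 feed-forward weights.
    return 12 * hidden_size ** 2


def stored_encoder_weights(num_layers, hidden_size, shared):
    # Shared layers keep a single copy of the weights in memory.
    return layer_weights(hidden_size) * (1 if shared else num_layers)


def applied_weights_per_token(num_layers, hidden_size):
    # Every layer is executed whether or not its weights are shared,
    # so this proxy for forward-pass work ignores sharing entirely.
    return num_layers * layer_weights(hidden_size)


# (num_layers, hidden_size, shared) as recalled from the paper's table.
configs = {
    "BERT-large": (24, 1024, False),
    "ALBERT-xxlarge": (12, 4096, True),
}

for name, (layers, hidden, shared) in configs.items():
    stored = stored_encoder_weights(layers, hidden, shared)
    applied = applied_weights_per_token(layers, hidden)
    print(f"{name:15s} stored={stored:>13,} applied per token={applied:>13,}")
```

By this count, ALBERT-xxlarge stores fewer encoder weights than BERT-large but applies several times more of them per token, which is consistent with the authors' explanation for the slower training.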
This paper shows that there are ways to increase a model's size (number of layers) without increasing the number of parameters, and still achieve state-of-the-art performance. This is a powerful idea, and one that can be leveraged specifically in scenarios with limited GPU/TPU memory. It will be exciting to see new research aimed at reducing the number of hyperparameters as well as speeding up training and inference for ALBERT-like transformers.