This week’s Deep Learning Paper Recap is Prune Once For All: Sparse Pre-Trained Language Models
What’s Exciting About This Paper?
Model pruning is one of the key ways to compress a Deep Learning model, and the pruning techniques differ based on the model architectures. This paper introduces an architecture-agnostic method of training sparse pre-trained language models.
This method lets us prune only once, during the pre-training phase, and not worry about pruning again during fine-tuning. The researchers also propose a fine-tuning mechanism that leverages distillation to achieve the best compression-to-accuracy ratio.
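To make the distillation idea concrete, here is a minimal sketch of a generic knowledge-distillation loss in plain Python. It is not the authors' implementation: the function names, the temperature, and the `alpha` weighting are illustrative assumptions, and the recap does not specify the exact loss used in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher (dense model)
    q = softmax(student_logits, temperature)  # student (sparse model)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(task_loss, student_logits, teacher_logits,
                  alpha=0.5, temperature=2.0):
    """Blend the ordinary task loss with the distillation term.

    alpha is a hypothetical mixing weight chosen for illustration.
    """
    kd = distillation_loss(student_logits, teacher_logits, temperature)
    return alpha * task_loss + (1.0 - alpha) * kd
```

The distillation term is zero when the sparse student matches the dense teacher exactly and grows as their predictions diverge, which is what pushes the pruned model to recover the teacher's behavior.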
Fine-tuning pruned (sparse) models usually leads to either poor results or a low sparsity ratio. That’s why modern pruning approaches like Gradual Magnitude Pruning (GMP) apply pruning during the fine-tuning phase.
But the problem with this approach is that each time we fine-tune, we have to consider both the task and model architecture to choose the pruning technique.
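For readers unfamiliar with Gradual Magnitude Pruning, the sketch below shows the general idea: weights with the smallest absolute values are zeroed, and the target sparsity is ramped up over training steps. The cubic ramp is the schedule commonly associated with GMP; the function names and the toy weight list are assumptions made for illustration, not code from the paper.

```python
def sparsity_at_step(step, start_step, end_step, final_sparsity,
                     initial_sparsity=0.0):
    """Cubic schedule: sparsity rises quickly at first, then levels off."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction zeroed."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Toy example: ramp a 10-weight "layer" toward 90% sparsity.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.15, 0.6, 0.02, -0.4]
for step in (0, 50, 100):
    target = sparsity_at_step(step, start_step=0, end_step=100, final_sparsity=0.9)
    weights = magnitude_prune(weights, target)
    print(step, round(target, 4), weights)
```

In a real training loop, gradient updates would happen between pruning steps so the surviving weights can compensate for the ones that were removed.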
With the proposed pre-training and fine-tuning mechanism, we can save time by pruning only once. Here is what the whole pipeline looks like:
This technique leads to the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT. The best scores were achieved with 85% and 90% weight pruning.
They also tried Quantization-Aware Training (QAT) with 85% pruning, which led to a model that is both smaller and more accurate than the 90% pruned model.
These pre-trained pruned models can be used to obtain fine-tuned pruned models without the burden of task-specific pruning.
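One simple way to fine-tune a sparse model without re-pruning is to freeze the positions of the zeros and only update the surviving weights. The sketch below illustrates that idea with a fixed mask and a plain SGD step. This recap does not describe how the sparsity is preserved during fine-tuning, so treat the mechanism, the function names, and the learning rate as assumptions.

```python
def sparsity_mask(weights):
    """1.0 where a weight survived pruning, 0.0 where it was pruned."""
    return [0.0 if w == 0.0 else 1.0 for w in weights]

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Plain SGD update that keeps pruned positions at zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

# Toy example: a pre-pruned weight vector and made-up gradients.
pretrained = [0.5, 0.0, -0.3, 0.0]
mask = sparsity_mask(pretrained)   # computed once, before fine-tuning
grads = [1.0, 1.0, -1.0, 2.0]      # hypothetical task gradients
finetuned = masked_sgd_step(pretrained, grads, mask)
print(finetuned)
```

Because the mask is computed once from the pre-trained sparse weights, the sparsity ratio after fine-tuning is the same as before it, regardless of the downstream task.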
This approach saves us the time and effort of pruning the model for every task, in the same way that pre-trained Deep Learning models save us from training from scratch. Here, we simply start fine-tuning from the pruned model instead of the dense one.