This week's Deep Learning Paper Review is VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.
What’s Exciting About This Paper
Until recently, the dominant self-supervised learning (SSL) techniques in computer vision were based on contrastive objectives. This roughly means that a model is trained to 1) minimize the distance between representations of similar images (such as two augmented views of the same source image) while 2) pushing apart the representations of different images.
By doing this, the model learns useful representations in the absence of labels, and these can then be transferred to downstream tasks such as image classification.
One may ask why it is necessary to push apart the representations of different images in addition to pulling together the representations of similar ones. Wouldn't the pulling alone work just as well, and be more elegant in its simplicity?
It turns out that if one only applies the pulling term, then the model will simply learn a trivial solution to the problem - it will produce the same representation for all input images. The inclusion of the pushing term (which makes the objective contrastive) effectively prevents this representation collapse.
However, several SSL techniques have recently emerged that remove the need for contrastive (negative) samples: BYOL, Barlow Twins, and VICReg. All three achieve similar performance, but I will be discussing VICReg because, in my opinion, it is the most theoretically sound of the three.
First, here's the problem formulation:
Given a minibatch of images, we will apply stochastic data augmentation twice to generate two branches X and X'. Let Z and Z' be the representations (encoder output) for X and X'. Let i represent the batch index - zi and zi' are the representations of two different augmentations of the same image. Let j represent the feature dimension index - zij represents the jth feature of representation zi.
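To make the notation concrete, here is a minimal NumPy sketch of the shapes involved (the batch size, feature dimension, and random values are hypothetical placeholders, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4  # batch size and feature dimension (hypothetical values)

# Z and Z_prime hold the representations for the two augmented branches.
Z = rng.normal(size=(n, d))
Z_prime = rng.normal(size=(n, d))

# Row i: representations of two augmentations of the same source image.
# Column j: the j-th feature of each representation.
print(Z[0].shape, Z[:, 0].shape)  # (4,) (8,)
```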
VICReg stands for Variance-Invariance-Covariance Regularization, and the loss objective contains the three terms in its name. The picture below shows the architecture and loss terms. Note that Y is an intermediate representation that will be used for downstream tasks, whereas the VICReg loss is applied to the final representation Z.
The invariance term simply minimizes the distance between the representations of the two different augmentations of the same image. This should be intuitive; however, note that without the additional regularization terms, this term by itself leads to representation collapse.
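As a sketch (not the paper's exact code), the invariance term can be written as a mean squared distance between paired rows; the function name and the use of NumPy are my own:

```python
import numpy as np

def invariance_loss(z, z_prime):
    """Mean squared Euclidean distance between paired representations."""
    return np.mean(np.sum((z - z_prime) ** 2, axis=1))

# Identical branches give zero loss -- the collapse the text warns about
# would drive this term to zero trivially.
z = np.array([[1.0, 2.0], [3.0, 4.0]])
print(invariance_loss(z, z))        # 0.0
print(invariance_loss(z, z + 1.0))  # 2.0
```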
The variance term is a hinge loss that keeps the standard deviation of each feature across the representations in a batch above a certain threshold γ (the paper applies the hinge to a regularized standard deviation rather than the raw variance). Intuitively, this says: we want the values of each feature to vary somewhat across the representations in the batch, which ensures the representations do not all collapse to the same vector.
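A minimal sketch of the hinge, assuming γ = 1 and a small ε inside the square root for numerical stability (these defaults are my assumption here):

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge on the regularized per-feature standard deviation across the batch."""
    std = np.sqrt(z.var(axis=0) + eps)            # std of each feature over the batch
    return np.mean(np.maximum(0.0, gamma - std))  # penalize features below gamma

# A collapsed batch (all rows identical) is heavily penalized;
# a batch with spread-out features pays nothing.
collapsed = np.ones((4, 3))
spread = np.array([[0.0, 10.0], [10.0, 0.0], [0.0, 10.0], [10.0, 0.0]])
print(variance_loss(collapsed) > 0.9)  # True
print(variance_loss(spread))           # 0.0
```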
The covariance term pushes the covariance between every pair of distinct features, computed across the batch, toward zero. When the covariance between two variables is close to zero, the variables are (linearly) uncorrelated. This term therefore encourages the different features of a representation to be decorrelated from one another.
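A sketch of this term as the sum of squared off-diagonal entries of the batch covariance matrix, scaled by the feature dimension; the helper name is mine:

```python
import numpy as np

def covariance_loss(z):
    """Sum of squared off-diagonal covariance entries, divided by feature dim."""
    n, d = z.shape
    zc = z - z.mean(axis=0)                 # center each feature over the batch
    cov = (zc.T @ zc) / (n - 1)             # d x d batch covariance matrix
    off_diag = cov - np.diag(np.diag(cov))  # zero out the diagonal
    return np.sum(off_diag ** 2) / d

# Perfectly correlated features are penalized; uncorrelated ones are not.
correlated = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
uncorrelated = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(covariance_loss(correlated))    # 1.0
print(covariance_loss(uncorrelated))  # 0.0
```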
Why is this a desirable objective? Well, suppose our representation has nine dimensions, and suppose all nine features are perfectly correlated with each other (the value of one feature tells us the values of the other eight). In that case, there is no reason to have nine features - one would do. In other words, when the features are correlated, LESS information is packed into the same amount of space. Conversely, by forcing the features to be uncorrelated, we encourage them to encode more information.
To tie it all up, the VICReg objective is simply a weighted sum of the three terms.
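Putting the pieces together, here is a sketch of the full objective. The coefficients λ = μ = 25 and ν = 1 are the defaults I recall from the paper, and the helper structure is my own:

```python
import numpy as np

def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Weighted sum of invariance, variance, and covariance terms."""
    # Invariance: pull paired representations together.
    inv = np.mean(np.sum((z - z_prime) ** 2, axis=1))

    def var_term(x):
        # Variance: hinge on the regularized per-feature standard deviation.
        return np.mean(np.maximum(0.0, gamma - np.sqrt(x.var(axis=0) + eps)))

    def cov_term(x):
        # Covariance: penalize off-diagonal batch covariance entries.
        n, d = x.shape
        xc = x - x.mean(axis=0)
        cov = (xc.T @ xc) / (n - 1)
        return (np.sum(cov ** 2) - np.sum(np.diag(cov) ** 2)) / d

    return (lam * inv
            + mu * (var_term(z) + var_term(z_prime))
            + nu * (cov_term(z) + cov_term(z_prime)))

# Two identical, well-spread, uncorrelated branches incur (nearly) zero loss,
# while a collapsed batch is heavily penalized by the variance term.
z = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
print(vicreg_loss(z, z) < 1e-6)  # True
```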
Empirically, VICReg performs better than contrastive techniques. It performs only slightly worse than other non-contrastive techniques (BYOL, SwAV), but in my opinion it is more interesting and has greater potential due to its simplicity and theoretical transparency.