
Unifying Supervised and Self-Supervised Learning

image generated from craiyon.com

Unsupervised representation learning, also known as self-supervised learning, has been shown to be useful in a wide variety of domains and tasks. The most common approach is to use some kind of contrastive loss function that compares positive and negative samples. We show a connection with supervised learning by starting from a standard supervised loss and transforming it into a self-supervised contrastive loss.

Preliminaries

A DNN can be interpreted as computing similarities between embeddings. Assume we are solving a problem with a linear model of the form

\[y = w^{T}x\]

If we are dealing with binary classification, then the sign of \(y\) determines the class. The weight vector is orthogonal to the decision boundary, and the bias determines the distance of the decision boundary from the origin. Note that I have omitted the bias in the equation above; we can always fold it into the model implicitly by adding a dummy dimension to \(x\) with value one [1].
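
As a minimal numpy sketch of this bias-folding trick (illustrative only; the variable names are my own):

```python
import numpy as np

# Toy binary linear classifier: the sign of w^T x picks the class.
w = np.array([2.0, -1.0])      # weight vector, orthogonal to the decision boundary
b = 0.5                        # bias, shifts the boundary away from the origin
x = np.array([1.0, 3.0])

y_explicit = w @ x + b         # explicit bias term

# Fold the bias into the weights by appending a dummy feature with value one.
w_aug = np.append(w, b)
x_aug = np.append(x, 1.0)
y_folded = w_aug @ x_aug

assert np.isclose(y_explicit, y_folded)
print(np.sign(y_explicit))     # predicted class: +1 or -1
```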

When we are dealing with more than two classes, we have a weight matrix \(W\) instead of a single weight vector. Each column of the matrix is a vector associated with a particular class. In this multiclass case we can make a similar geometric interpretation. The decision boundary between any two classes is no longer orthogonal to any single weight vector; it is orthogonal to the difference between their weight vectors. It is now the differences between weight vectors, rather than the individual vectors, that carry the interpretation [1].

An alternative interpretation is to think of the weight vector associated with each class as an exemplar for that class. Under this interpretation, the decision boundary between two classes maximizes the margin between their exemplars (modulo the bias controlling the relative distance to each class). We now have an interpretation of the weight vectors themselves, as opposed to just their differences [2]. The goal of the linear classifier is to distill the training data from each class into a single point.
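
A small sketch of the multiclass case is below, checking that the decision between two classes depends only on the difference of their weight vectors; this is a toy example with random weights, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_classes = 4, 3
W = rng.normal(size=(num_features, num_classes))   # one column (exemplar) per class
x = rng.normal(size=num_features)

scores = W.T @ x
print(scores.argmax())                              # predicted class: most similar exemplar by dot product

# The decision between classes i and j depends only on the difference of their exemplars.
i, j = 0, 1
assert np.isclose(scores[i] - scores[j], (W[:, i] - W[:, j]) @ x)
```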

Moving beyond linear classifiers, our DNN models, especially when trained with cross-entropy and a softmax layer, can be thought of as a feature extractor followed by logistic regression. In this interpretation, the entire DNN prior to the final classification layer is a feature extractor that we can call \(f(x)\), and what follows is a linear classifier, or logistic regression, on top of the feature extractor [3]. The power of the DNN is the joint learning of the feature extraction and the classification.

Following from the interpretation of the weight vectors as exemplars, we can say that a DNN computes the similarity between the input and a set of learned embeddings in some learned feature space. More formally, let \(f(x)\) be the output of a DNN before the final layer. The final score \(z\) prior to the softmax is a linear transform of \(f(x)\).

\[z = W^{T} f(x)\]

We can compute each individual element of \(z\) as

\[\begin{equation} z_{i} = w_{i}^{T} f(x) \label{eq:z_score} \end{equation}\]

where \(w_{i}\) is a column of the weight matrix \(W\). We have one \(z_{i}\), and therefore one \(w_{i}\), for each class. We can think of \(w_{i}\) as the learned embedding for class \(i\). The score \(z_{i}\) measures how similar the output of the network \(f(x)\) is to the embedding \(w_{i}\) in terms of the dot product. Most approaches to self-supervised learning rely on the computation of similarity between embeddings. We now see that our standard supervised models perform the same function.
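
A toy sketch of this split between feature extractor and final linear layer follows; the two-layer network here is my own stand-in for \(f(x)\), not any particular architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_hidden, num_classes = 8, 16, 5

W1 = rng.normal(size=(dim_hidden, dim_in))          # hidden layer of the "feature extractor"
W2 = rng.normal(size=(dim_hidden, num_classes))     # columns w_i act as learned class embeddings

def f(x):
    """Everything before the final classification layer."""
    return np.maximum(W1 @ x, 0.0)

x = rng.normal(size=dim_in)
features = f(x)
z = W2.T @ features                                 # z_i = w_i^T f(x)

# Each logit is literally a dot-product similarity to a class embedding.
for i in range(num_classes):
    assert np.isclose(z[i], W2[:, i] @ features)
```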

Cross Entropy as a Contrastive Loss Function

We typically think of cross-entropy loss as maximizing the posterior of the true class given the data or, equivalently, minimizing the negative log of that posterior.

\[\mathcal{L}_{xe} = - \log(p(y|x))\]

Normally \(p(y|x)\) is computed by a softmax function over the final output of the model, \(z\). As discussed above, \(z\) is the product of the final weight matrix and the output of the previous layer. However, there are many variants of this loss function that look like cross-entropy but have other characteristics. In an attempt to unify these loss functions, we can make cross-entropy more generic.

\[\begin{equation} \mathcal{L}_{xe} = \sum_{b} - \log \left(\frac{\exp(\text{sim}(f(x_b), w_c)/\tau)}{\sum_{k}\exp(\text{sim}(f(x_b), w_k)/\tau)} \right) \label{generic-xe} \end{equation}\]

The outer summation indexed by \(b\) is over the examples in the batch, \(x_b\) is a training sample, and the \(w\) are the final weight vectors. \(w_{c}\) is the weight vector associated with the correct class, and the summation indexed by \(k\) is over all classes. The parameter \(\tau\) is the temperature. When the similarity function is the dot product and \(\tau\) is one, we recover the standard cross-entropy function used in training a DNN.
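
The generic form is easy to express in code. The sketch below assumes the embeddings \(f(x_b)\) have already been computed; the function names and the cosine variant are mine, chosen only to show how the similarity function and the temperature plug in.

```python
import numpy as np

def generic_xent(embeddings, W, labels, sim, tau=1.0):
    """Generic cross-entropy over a batch: -log softmax of sim(f(x_b), w_k)/tau at the true class."""
    loss = 0.0
    for x_b, c in zip(embeddings, labels):
        logits = np.array([sim(x_b, w_k) for w_k in W.T]) / tau
        logits -= logits.max()                        # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        loss += -log_probs[c]
    return loss

dot = lambda a, b: a @ b
cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))            # batch of embeddings f(x_b)
W = rng.normal(size=(16, 10))           # one column per class
y = np.array([3, 1, 7, 0])

print(generic_xent(X, W, y, sim=dot, tau=1.0))      # standard cross-entropy
print(generic_xent(X, W, y, sim=cosine, tau=0.1))   # NT-Xent-style variant
```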

We can view cross-entropy as a contrastive loss function. In order to simplify the equations, we can redefine the \(z\) score from Equation \eqref{eq:z_score} as

\[z_{i} = \text{sim}(f(x), w_{i}).\]

Our cross-entropy loss—temporarily ignoring the summation over the batch and the temperature—becomes

\[\mathcal{L}_{xe} = - \log \left(\frac{\exp(z_{c})}{\sum_{i=1}^{|Z|} \exp(z_{i})} \right)\]

We can now separate this into two terms, a similarity term and a contrastive term.

\[\begin{equation} \mathcal{L}_{xe} = - \underbrace{ \log (\exp(z_{c}))}_{\text{similarity term} } + \underbrace{ \log\left(\sum_{i=1}^{|Z|} \exp(z_{i})\right)}_{\text{contrastive term}} \label{eq:xe_contrastive} \end{equation}\]

We have a true score \(z_{c}\) whose value we want to be as large as possible, and it is compared against the \(z_{i}\) from all other classes. Note that the contrastive term also contains the similarity term. Many formulations of contrastive losses include the positive example inside the contrastive term in this way, but the choice is not universal [4]. If we remove the similarity term from the contrastive term, the cross-entropy loss becomes the negative log of the odds ratio [5]. Now that we have a generic formulation for cross-entropy and a relationship with contrastive loss functions, we can analyze the relationship with individual loss functions and models.
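
Both the split and the effect of dropping the positive from the contrastive term can be checked numerically; this is a small sketch with random scores, nothing more.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)      # scores z_i = sim(f(x), w_i)
c = 3                        # index of the correct class

# Standard form: negative log softmax at the correct class.
xent = -np.log(np.exp(z[c]) / np.exp(z).sum())

# Split form: similarity term plus contrastive term.
similarity_term = -z[c]                       # -log(exp(z_c))
contrastive_term = np.log(np.exp(z).sum())    # log sum_i exp(z_i), which still contains z_c
assert np.isclose(xent, similarity_term + contrastive_term)

# Dropping the positive from the contrastive term gives the negative log odds instead.
negatives = np.delete(z, c)
log_odds_loss = -np.log(np.exp(z[c]) / np.exp(negatives).sum())
p = np.exp(z[c]) / np.exp(z).sum()
assert np.isclose(log_odds_loss, -np.log(p / (1.0 - p)))
```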

Relationship with wav2vec 2.0

wav2vec 2.0 [6] uses a contrastive objective function. We define it using slightly different notation here. Let \(f(x)\) be the prediction of the context vector from the input. Let \(Q\) be the set of negative examples together with the positive example \(q_c\). The final score \(z\) is defined as

\[z = \text{sim}(f(x), Q)\]

where sim is a similarity function; in the case of wav2vec, it is cosine similarity. The loss is known as the normalized temperature-scaled cross-entropy (NT-Xent) [7], which is itself a variant of InfoNCE [8]. Ultimately, the loss looks like cross-entropy.

\[\begin{equation} \mathcal{L}_{\text{w2v}} = - \log \left( \frac{\exp(\text{sim}(f(x),q_c)/\tau)}{\sum_{k} \exp(\text{sim}(f(x),q_k)/\tau)}\right) \label{nt-xent} \end{equation}\]

\(\tau\) is the temperature parameter, less than 1 in this case. The equation has the same form as our generic cross-entropy function in Equation \eqref{generic-xe}; the only differences are that we have left out the summation over the batch and that the vectors \(q\) take the place of \(w\). While we refer to the matrix \(W\) as the set of exemplars and wav2vec refers to \(Q\) as the set of negative samples, they serve the same function: both are learned quantizations of the feature space. For self-supervised learning the quantization is learned without labels, while for supervised learning the quantization is based on the class labels.
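
A sketch of this loss for a single prediction is below. It follows Equation \eqref{nt-xent} with cosine similarity, but it is only an illustration, not the wav2vec 2.0 implementation; the shapes and the number of distractors are arbitrary choices of mine.

```python
import numpy as np

def nt_xent(context, positive, negatives, tau=0.1):
    """NT-Xent-style loss for one prediction: cosine similarity to the positive
    versus the positive plus sampled negatives."""
    candidates = np.vstack([positive[None, :], negatives])      # positive sits at index 0
    cos = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context))
    logits = cos / tau
    logits -= logits.max()                                      # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
c_t = rng.normal(size=256)               # f(x): the predicted context vector
q_pos = rng.normal(size=256)             # the true quantized target
q_neg = rng.normal(size=(100, 256))      # distractors sampled from other time steps
print(nt_xent(c_t, q_pos, q_neg, tau=0.1))
```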

Relationship with HuBERT

While HuBERT [9] initially appears quite different from the wav2vec approach, we show that the two are similar. Start from the wav2vec 2.0 objective function, Equation \eqref{nt-xent}. Instead of cosine similarity, we will use the dot product. We will also remove the temperature parameter (or, equivalently, set \(\tau\) to 1).

\[\mathcal{L}_{\text{HuBERT}} = - \log \left( \frac{\exp(f(x)^{T} q_c)}{\sum_{k} \exp(f(x)^{T} q_k)}\right)\]

For wav2vec, the set of negative samples \(Q\) is sampled anew for each minibatch. For HuBERT we use a fixed set of negative samples, which we represent as \(W\).

\[\mathcal{L}_{\text{HuBERT}} = - \log \left( \frac{\exp(f(x)^{T} w_c)}{\sum_{k} \exp(f(x)^{T} w_k)}\right)\]

Note that this now looks exactly like the traditional form of cross-entropy for training DNN classifiers. Ultimately, the primary difference between HuBERT and wav2vec is the similarity function and the definition of negative samples.
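
To make the correspondence concrete, here is a sketch that scores \(f(x)\) against a fixed codebook with the dot product and \(\tau = 1\); with those choices the loss is ordinary softmax cross-entropy over pseudo-labels. The sizes and names are mine, not taken from the HuBERT code.

```python
import numpy as np

def xent_over_codebook(feature, W, target):
    """Cross-entropy of f(x) against a fixed set of embeddings W (one column per pseudo-class).
    With dot-product similarity and tau = 1 this is ordinary softmax cross-entropy."""
    z = W.T @ feature
    z -= z.max()                          # numerical stability
    return -(z[target] - np.log(np.exp(z).sum()))

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 500))          # fixed "negative samples": one embedding per cluster id
f_x = rng.normal(size=256)               # masked-frame representation f(x)
pseudo_label = 42                        # cluster assignment from, say, k-means
print(xent_over_codebook(f_x, W, pseudo_label))
```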

Unifying Supervised and Contrastive Learning

We have shown that we can interpret our cross-entropy trained DNN classifiers in a contrastive framework. The only difference between HuBERT and supervised learning is the selection of class labels. For supervised learning the labels come from some labeling mechanism that we assume is correct. For HuBERT the labels are pseudo-classes derived from clustering. In both cases the association between training points and their quantized representations (\(w_i\)) remains fixed, though the representations themselves are updated at each minibatch.

wav2vec is similar, but there are a few key differences. The set of negative examples is much larger and each minibatch samples a subset from the full set. The association between training points and quantized representation can also be updated at each minibatch.

Whether our training is supervised or self-supervised with a contrastive loss, the key question is how to select the negative samples. With supervised learning the decision is easy because we have been given predefined labels. For self-supervised learning we have to make this decision based on other information.

References

  1. Christopher M. Bishop (2006): Pattern Recognition and Machine Learning, Springer.

  2. While the interpretation of the weight vectors as exemplars feels obvious, I am unaware of any discussion of it in the standard texts. The texts I have consulted mention only the geometric interpretation, and I would be very interested in seeing examples of other interpretations.

  3. This interpretation is why some were dismissive of deep learning, calling it “fancy regression.” 

  4. Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny (2021): Barlow Twins: Self-Supervised Learning via Redundancy Reduction, ICML.

  5. This can be seen by dividing the numerator and denominator by \(\sum_{i} \exp(z_{i})\), which turns the ratio into \(p(y\vert x) / (1 - p(y\vert x))\). Note that removing the similarity term does change the gradient.

  6. Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli (2020): wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS.

  7. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton (2020): A Simple Framework for Contrastive Learning of Visual Representations, ICML.

  8. Aaron van den Oord, Yazhe Li, and Oriol Vinyals (2018): Representation Learning with Contrastive Predictive Coding, arXiv preprint arXiv:1807.03748.

  9. Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed (2021): HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arXiv preprint arXiv:2106.07447. 
