
Reconstruction vs. Contrastive Loss

image generated from https://huggingface.co/spaces/stabilityai/stable-diffusion

An alternative approach to contrastive learning is reconstruction. The basic idea is to reconstruct some representation of the input after the input has been masked or augmented. One approach is the traditional denoising autoencoder. The target representation can also be quantized, as in DeCoAR 2.0 [1]. Because the reconstruction-based approach relies on a loss like mean squared error (MSE), it is more limited than a contrastive approach, as the following derivation shows.

Consider the following derivation of MSE in matrix form. Let the model compute some representation \(f(x_{i})\) of the input. This representation is compared with a target representation \(e_{i}\), and we want to minimize the squared difference, which can be expanded in terms of dot products.

\[\begin{align}
\mathcal{L}_{mse} & = \frac{1}{n} \sum_{i} \|f(x_{i}) - e_{i}\|^{2}\\
& = \frac{1}{n} \sum_{i} (f(x_i) - e_i)^{T}(f(x_i) - e_i) \\
& = \frac{1}{n} \sum_{i} \left( f(x_i)^{T}f(x_i) - f(x_i)^{T}e_i - e_{i}^{T}f(x_i) + e_{i}^{T}e_{i} \right) \\
& = \frac{1}{n} \sum_{i} \left( f(x_i)^{T}f(x_i) - 2f(x_i)^{T}e_i + e_{i}^{T}e_{i} \right) \\
& = \frac{1}{n} \sum_{i} \left( \|f(x_i)\|^{2} + \|e_{i}\|^{2} - 2f(x_i)^{T}e_i \right)
\end{align}\]
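To make the algebra concrete, here is a minimal numpy sketch of my own (not from the original derivation); `f_x` and `e` are stand-ins for \(f(x_i)\) and \(e_i\), and the check simply confirms the expansion above on random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                      # hypothetical batch size and embedding dimension
f_x = rng.normal(size=(n, d))     # stand-in for the learned representations f(x_i)
e = rng.normal(size=(n, d))       # stand-in for the target representations e_i

# direct form: mean over the batch of the squared difference
mse_direct = np.mean(np.sum((f_x - e) ** 2, axis=1))

# expanded form: squared norms minus twice the dot product
mse_expanded = np.mean(
    np.sum(f_x ** 2, axis=1) + np.sum(e ** 2, axis=1) - 2.0 * np.sum(f_x * e, axis=1)
)

assert np.allclose(mse_direct, mse_expanded)
```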

The loss is minimized when \(f(x_i)\) and \(e_i\) are identical. If we assume all vectors are normalized, the loss is maximized when \(f(x_i) = -e_i\) for all \(i\). If the vectors are unnormalized, the loss can grow arbitrarily large as the norm of \(e_i\) increases. The final loss is biased by the squared norms of the vectors, but it is ultimately controlled by the dot product between the learned representation \(f(x_i)\) and the target representation \(e_i\). If all vectors were normalized to unit length, the final loss would reduce to \(2(1 - f(x_i)^{T}e_i)\) averaged over the batch, i.e., a scaled version of one minus the cosine similarity. This is similar to keeping only the numerator term in cross entropy, or only the similarity term in any of the contrastive loss functions. The primary difference between the reconstruction loss and the contrastive loss is the lack of negative samples in the former. Both still contain a term that seeks to maximize similarity with the target representation.
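To see the normalization claim concretely, here is another small sketch (again with made-up names and random data) checking that after unit-normalizing both vectors, the per-example squared error equals twice one minus the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)
f_x = rng.normal(size=(8, 16))
e = rng.normal(size=(8, 16))

# project both sets of vectors onto the unit sphere
f_hat = f_x / np.linalg.norm(f_x, axis=1, keepdims=True)
e_hat = e / np.linalg.norm(e, axis=1, keepdims=True)

sq_err = np.sum((f_hat - e_hat) ** 2, axis=1)   # per-example squared error
cos_sim = np.sum(f_hat * e_hat, axis=1)         # per-example cosine similarity

# ||f - e||^2 = 2 - 2 f.e when both vectors have unit norm
assert np.allclose(sq_err, 2.0 * (1.0 - cos_sim))
```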

Given this view, reconstruction seems a poor replacement for contrastive learning, though contrastive learning has potential issues of its own. In the supervised case, this would be equivalent to maximizing \(f(x_i)^{T}w_c\), where \(w_c\) is the classifier weight for the true class \(c\) of example \(i\), without any consideration of the relationship to the alternative classes. We would not want to replace our cross-entropy loss with such a function in the supervised case. Since the reconstruction loss lacks a contrastive term, what keeps it from suffering a collapse of the representation space? The target representations \(e_i\) are typically fixed, so the loss has no control over them; because the targets are fixed and distinct, matching them prevents the learned representations from collapsing to a single point.
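To make the missing contrastive term explicit, here is a rough sketch in numpy; the variable names, shapes, and the InfoNCE-style form of the contrastive loss are my own choices for illustration, not a reference implementation from any particular paper. The reconstruction loss only pulls each \(f(x_i)\) toward its own target, while the contrastive loss also pushes it away from the other targets in the batch.

```python
import numpy as np

def reconstruction_loss(f_x, e):
    # mean squared error: only a "pull toward the target" term
    return np.mean(np.sum((f_x - e) ** 2, axis=1))

def infonce_loss(f_x, e, temperature=0.1):
    # cosine similarities between every representation and every target
    f_hat = f_x / np.linalg.norm(f_x, axis=1, keepdims=True)
    e_hat = e / np.linalg.norm(e, axis=1, keepdims=True)
    logits = f_hat @ e_hat.T / temperature                       # (n, n) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; off-diagonal entries act as negatives
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(2)
f_x, e = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(reconstruction_loss(f_x, e), infonce_loss(f_x, e))
```

The structural difference is visible in the code: removing the denominator (the sum over negatives) from the contrastive loss leaves only a similarity term with the target, which is exactly the role the dot product plays in the MSE expansion above.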

References

  1.  Shaoshi Ling and Yuzong Liu (2020): DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization, arXiv preprint arXiv:2012.06659. 
