Self-Supervised Learning - Pretext Tasks

$$\gdef \sam #1 {\mathrm{softargmax}(#1)}$$ $$\gdef \vect #1 {\boldsymbol{#1}} $$ $$\gdef \matr #1 {\boldsymbol{#1}} $$ $$\gdef \E {\mathbb{E}} $$ $$\gdef \V {\mathbb{V}} $$ $$\gdef \R {\mathbb{R}} $$ $$\gdef \N {\mathbb{N}} $$ $$\gdef \relu #1 {\texttt{ReLU}(#1)} $$ $$\gdef \D {\,\mathrm{d}} $$ $$\gdef \deriv #1 #2 {\frac{\D #1}{\D #2}}$$ $$\gdef \pd #1 #2 {\frac{\partial #1}{\partial #2}}$$ $$\gdef \set #1 {\left\lbrace #1 \right\rbrace} $$
🎙️ Yann LeCun

Success story of supervision: Pre-training

In the past decade, one of the major success recipes for many different Computer Vision problems has been learning visual representations by performing supervised learning for ImageNet classification. And, using these learned representations, or learned model weights as initialization for other computer vision tasks, where a large quantum of labeled data might not be available.

However, getting annotations for a dataset of the magnitude of ImageNet is immensely time-consuming and expensive. Example: ImageNet labeling with 14M images takes roughly 22 human years.

Because of this, the community started to look for alternate labeling processes, such as hashtags for social media images, GPS locations, or self-supervised approaches where the label is a property of the data sample itself.

But an important question that arises before looking for alternate labeling processes is:

How much labeled data can we get after all?

  • If we search for all images with object-level category and bounding box annotations then there are roughly 1 million images.
  • Now, if the constraint for bounding box coordinates is relaxed, the number of images available jumps to 14 million (approximately).
  • However, if we consider all images available on the internet, there is a jump of 5 orders in the quantity of data available.
  • And, then there is data apart from images, which requires other sensory input to capture or understand.

Fig. 1: Variation in available data quantum basis complexity of annotation

Hence, drawing from the fact that ImageNet specific annotation alone took 22 human years worth of time, scaling labeling to all internet photos or beyond is completely infeasible.

Problem of Rare Concepts /Long Tail Problem

Generally, the plot of the distribution of labels for internet images looks like a long tail. That is most of the images correspond to very few labels, while there exist a large number of labels for which very few images exist. Thus, getting labeled samples for categories towards the end of the tail requires huge quantities of data to be labeled because of the nature of the distribution of categories.

Fig. 2: Variation in distribution of available images with labels

Problem of Different Domains:

This method of ImageNet pre-training and fine-tuning on downstream task gets even murkier when the down-stream task images belong to a completely different domain, such as medical imaging. And, obtaining a dataset of the quantum of ImageNet for pre-training for different domains is not possible.

What is self-supervised Learning

Two ways to define self-supervised learning

  • Basis supervised learning definition, i.e. the network follows supervised learning where labels are obtained in a semi-automated manner, without human input.
  • Prediction problem, where a part of the data is hidden, and rest visible. Hence, the aim is to either predict the hidden data or to predict some property of the hidden data.

How self-supervised learning differs from supervised learning and unsupervised learning:

  • Supervised learning tasks have pre-defined (and generally human-provided) labels,
  • Unsupervised learning has just the data samples without any supervision, label or correct output.
  • Self-supervised learning derives its labels from a co-occurring modality for the given data sample or from a co-occurring part of the data sample itself.

Self-supervised Learning in Natural Language Processing


  • Given, an input sentence, the task involves predicting a missing word from that sentence, which is specifically omitted for the purpose of building a pre-text task.
  • Thus, the set of labels becomes all possible words in the vocabulary, and, the correct label is the word that was omitted from the sentence.
  • Thus, the network can then be trained using regular gradient-based methods to learn word-level representations.

Why Self Supervised Learning

  • Self-supervised learning enables learning representations of the data by just observations of how different parts of the data interact.
  • Thereby drops the requirement of huge annotated data.
  • Additionally, enables to leverage multiple modalities that might be associated with a single data sample.

Self-supervised Learning in Computer Vision

Generally, computer vision pipelines that employ self-supervised learning involve performing two tasks, a pre-text task and a text (downstream) task.

  • The text (downstream) task can be anything like classification or detection task, with insufficient annotated data samples.
  • The Pre-text task is the self-supervised learning task solved to learn visual representations, with the aim of using the learned representations, or model weights obtained in the process for downstream task.

Developing Pre-text tasks

  • Pre-text tasks for computer vision problems can be developed using either images, video, or video and sound.
  • In each pre-text task, there is part visible and part hidden data, while the task is to predict either the hidden data or some property of the hidden data.

Example pre-text tasks, predicting relative position of image patches

  • Input: 2 image patches, one is the anchor image patch while the other is the query image patch.
  • Given the 2 image patches, the network needs to predict the relative position of the query image patch with respect to the anchor image patch.
  • Thus this problem can be modeled as an 8-way classification problem, since there are 8 possible locations for a query image, given an anchor.
  • And the label for this task can be automatically generated by feeding the relative position of query patch with respect to the anchor.

Fig. 3: Relative positions task

Visual representations learned by relative position prediction task

We can evaluate the effectiveness of the learned visual representations by checking nearest neighbors for a given image patch basis feature representations provided by the network. For computing nearest neighbors of a given image patch

  • Compute the CNN features for all images in the dataset, that will act as the sample pool for retrieval.
  • Compute CNN features for the required image patch.
  • Identify nearest neighbors of the feature vector of the required image, from the pool of feature vectors of images available for retrieval.

Relative positioning task finds out image patches that are very similar to the input image patch, while maintains invariance to factors such as object color. Thus, the relative positioning task is able to learn visual representations, where representations for image patches with similar visual appearance are close in the representation space.

Fig. 4: Relative positions nearest neighbours

Predicting Rotation of Images

  • Predicting rotations is one of the most popular pretext task which has a simple and straightforward architecture and requires minimal sampling.
  • We apply rotations of 0, 90, 180, 270 degrees to the image and send these rotated images to the network to predict what sort of rotation was applied to the image and the network simply performs a 4-way classification to predict the rotation.
  • Predicting rotations do not make any semantic sense, we are just using this pretext task as a proxy to learn some features and representations to be used in a downstream task.

Fig. 5: Rotations of Image

Why rotation helps or why it works?

It has been proven that it works empirically. The intuition behind it is in order to predict the rotations, model needs to understand the rough boundaries and representation of an image. For example it will have to segregate the sky from the water or sand from the water or will understand that trees grow upwards and so on.


Fig. 6: Colorisation

In this pretext task, we predict the colors of a grey image, it can be formulated for any image we just remove the color and feed this greyscale image to the network to predict its color. This task is useful in some respects like for colorizing the old greyscale movies we can apply this pretext task. The intuition behind is the network needs to understand some meaningful information like that the trees are green, the sky is blue and so on.

It is important to note that color mapping is not deterministic, and several possible true solutions exist. So, for an object if there are several possible colors then the network will color it as grey which is the mean of all possible solutions, there have been recent works using Variational Auto Encoders and latent variables for diverse colorization.

Fill in the blanks

We hide a part of an image and predict the hidden part from the remaining surrounding part of the image. This works because the network will learn the implicit structure of the data like to represent that cars run on roads, buildings are composed of windows and doors and so on.

Pretext Tasks for videos

Videos are composed of sequences of frames and this notion is the idea behind supervision, which can be leveraged for some pretext tasks like predicting the order of frames, fill in the blanks and object tracking.

Shuffle and Learn

Fig. 7: Interpolation

Given a bunch of frames we extract three frames and if they are extracted in the right order we label it as positive and if they are shuffled label it as negative, this now becomes a binary classification problem to predict if the frames are in the right order or not. So given a start and end point, we check if the middle is a valid interpolation of the two.

Fig. 8: Shuffle & Learn architecture

We can use a triplet siamese network, where the three frames are independently fed forward and then we concatenate the generated features and perform the binary classification to predict if the frames are shuffled or not.

Fig. 9: Nearest Neighbors Representation

Again, we can use the Nearest Neighbors alorithm to visualize what our networks are learning. In the fig. 5 above first we have a query frame which we feed-forward to get a feature representation and look at the nearest neighbors in that feature representation and while comparing we can observe a stark difference between ImageNet, Shuffle and Learn and Random.

ImageNet is good at collapsing the entire semantic as it could figure that it is a gym scene for the first input, similarly, it could figure that it is an outdoor scene with grass etc. Whereas when we observe Random we can see that it gives high importance to the background color.

On observing Shuffle and Learn it is not immediately clear if it is focusing on the color or on the semantic concept. After further inspection and observing various examples, it was observed that it is looking at the pose of the person, like in the first image the person is upside down and in second the feet is in a particular position similar to query frame, ignoring the scene or background color. The reasoning behind it is that our pretext task was predicting if the frames are in the right order or not and to do this the network needs to focus on what is moving in the scene, in this case, the person.

It was verified quantitatively by fine-tuning this representation to the task of human key-point estimation, where given a human we predict where certain key points like nose, left shoulder, right shoulder, left elbow, right elbow, etc. are. This method is useful for tracking and pose estimation.

Fig. 10: Key point Estimation comparison

In the figure, we have compared the results for supervised ImageNet and Self-Supervised Shuffle and Learn on FLIC and MPII datasets and we can see that Shuffle and Learn gives good results for key-point estimation.

Pretext Tasks for videos and sound

Video and Sound are multi-modal where we have two modalities or sensory inputs one for video and one for sound. Where we try to predict whether the given video clip corresponds to the audio clip or not.

Fig. 11: Video and sound sampling

Given a video with audio of a drum, sample the video frame with corresponding audio and call it a positive set and next take the audio of a drum and the video frame of a guitar and tag it as a negative set. Now we can train a network to solve this as a binary classification problem.

Fig. 12: Architecture

Architecture: Pass the video frames to the vision subnetwork and pass the audio to the audio subnetwork, which gives 128-dimensional features and embeddings, we then fuse them together and solve it as a binary classification problem predicting if they correspond with each other or not.

It can be used to predict what in the frame might be making a sound, the intuition is if it is a sound of a guitar, the network roughly needs to understand how the guitar looks and similarly drums.

Understanding what the “pretext” task learns

  • Pretext tasks should be complementary

    • Let’s take for example the pretext tasks Relative Position and Colorization. We can boost performance by training a model to learn both pretext tasks as shown below:

Fig. 13: Comparison of disjoint vs combined training of Relative Position and Colorization pretext tasks. ResNet101. (Misra)
  • A single pretext task may not be the right answer to learn SS representations

  • Pretext tasks vary greatly in what they try to predict (difficultly)

    • Relative position is easy since it’s a simple classification
    • Masking and fill-in is far harder –> better representation
    • Contrastive methods generate even more info than pretext tasks.
  • Question: How do we train multiple pre-training tasks?

    • The pretext output will depend on the input. The final fully-connected layer of the network can be swapped depending on the batch type.
    • For example: A batch of black-and-white images is fed to the network in which the model is to produce a colored image. Then, the final layer is switched, and given a batch of patches to predict relative position.
  • Question: How much should we train on a pretext task?

  • Rule of thumb: Have a very difficult pretext task such that it improves the downstream task
  • In practice, the pretext task is trained, and may not be re-trained. In development, it is trained as part of the entire pipeline.

Scaling Self-Supervised Learning

Jigsaw Puzzles

  • Partition an image into multiple tiles and then shuffle these tiles. The model is then tasked with un-shuffling the tiles back to the original configuration. (Noorozi & Favaro, 2016)

    • Predict which permutation was applied to the input
    • This is done by creating batches of tiles such that each tile of an image is evaluated independently. The convolution output are then concatenated and the permutation is predicted as below:

Fig. 14: Siamese network architecture for a Jigsaw pretext task. Each tile is passed through independently, with encodings concatenated to predict a permutation. (Misra)
  • Considerations:
    1. Use a subset of permutations. (I.e.: From 9!, use 100)
    2. The N-Way ConvNet uses shared parameters
    3. The problem complexity is the size of the subset. The amount of information you are predicting.
  • Sometimes, this method can perform better on downstream tasks than supervised methods, since the network is able to learn some concepts about the geometry of its input.

  • Shortcomings: Few Short Learning: Limited number of training examples

    • Self supervised representations are not as sample efficient

Evaluation: Fine-tuning vs Linear Classifier

This form of evaluation is a kind of Transfer Learning.

  • Fine Tuning: When applying to our downstream task, we use our entire network as an initialization for which to train a new one, updating all the weights.

  • Linear Classifier: On top of our pretext network, we train a small linear classifier to perform our downstream task, leaving the rest of the network intact.

A Good representation should transfer with a little training

  • It is helpful to evaluate the pretext learning on a multitude of different tasks. We can do so by extracting the representation created by different layers in the network as fixed features and evaluating their usefulness across these different tasks.
    • Measurement: Mean Average Precision (mAP) –The precision averaged across all the different tasks we are considering.
    • Some examples of these tasks include: Object Detection (using Fine-Tuning), Surface Normal Estimation (See NYU-v2 Dataset)
  • What does each layer learn?
    • Generally, as the layers become deeper, the mean average precision on downstream tasks using their representations will increase
    • However, the final layer will see a sharp drop in the mAP due to the layer becoming overly specialized.
      • This contrasts with supervised networks, in that the mAP generally always increases with depth of layer.
      • This shows that the pretext task is not well-aligned to the downstream task.

📝 Aniket Bhatnagar, Dhruv Goyal, Cole Smith, Nikhil Supekar
06 Apr 2020