The "Dino" paper explained.🔗

In this article we explain and discuss the paper on explicit semantic image features that emerge in self-supervised Vision Transformers:

"Emerging Properties in Self-Supervised Vision Transformers: ArXiv

The architecture is based on a pair of networks where the student learns to predict the teacher's output. While the student network is trained by gradient descent with a cross-entropy loss, the teacher network is updated with an exponential moving average (EMA) of the student network's weights. Collapse is avoided by centering and sharpening the teacher outputs. Unexpectedly, the learned self-attention maps of the final layer contain class-specific features that enable unsupervised object segmentation.

TL;DR🔗

  • DINO seems like an evolutionary successor to solutions such as SimCLR, MoCo or the more recent BYOL.
  • In contrast to convnets or supervised ViT, the image attention maps/features of an SSL ViT (self-DIstillation with NO labels - DINO) contain an explicit image segmentation map. See Figure 1.
  • Student and teacher networks both have the same architecture -- just with different parameters.
  • SSL ViT features become a very good k-NN classifier -- thanks to the momentum encoder and multi-crop augmentation.
  • DINO directly predicts the output of a teacher network -- built with a momentum encoder -- by using a standard cross-entropy loss.
  • Compared to the remaining self-supervised learning (SSL) solutions, the most important observations are:
    • The student-teacher setup is the self-distillation element, using a momentum teacher with centering and sharpening.
    • Centering and sharpening alone are enough to avoid collapse.
    • No contrastive learning, batch normalization, or negative samples are used.
  • DINO as an SSL framework works flexibly with both ViT and convnets (the paper uses ResNet-50).
  • For ViT with DINO, the smaller the patch (e.g. 8×8 instead of 16×16), the better the performance -- without additional parameters.
  • DINO achieves 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
  • Applications:
    • classification, zero-shot learning, video instance segmentation (where the self-attention maps are competitive), image copy detection, etc. (See more)

Figure 1. Self-attention from a Vision Transformer with 8×8 patches trained with no supervision: object attention maps are available explicitly from the SSL ViT's features (Source).

Contribution of paper:🔗

  • The paper follows the general strategy of eliminating contrastive learning from SSL - as advised in "Self-supervised learning: The dark matter of intelligence" by Yann LeCun.
  • The authors investigate the impact of self-supervised pre-training on ViT features, with the following results:
    • Self-supervised ViT features match and even outperform convnets on poorly labeled data sets.
    • Self-supervised ViT features contain explicit scene layout and object boundaries (in the self-attention modules of the last block, see Figure 1). (Such presence of segmentation masks is known to be shared across different SSL methods.)
  • Self-supervised ViT features perform especially well with a $k$-NN classifier, achieving 78.3% top-1 accuracy on ImageNet without fine-tuning, a linear classifier, or data augmentation -- based only on the momentum encoder and multi-crop augmentation.
  • Smaller image patches ($8\times8$) in ViT increase the quality of the resulting features and improve performance without adding parameters, at the cost of lower throughput and higher memory usage.

Objective, or goal for the algorithm🔗

The goal of the paper is to investigate how supervision in Vision Transformers impacts their performance. Image-level supervision reduces the rich visual information of an image to a single concept from a predefined set of object categories. Using a self-supervised approach instead (similarly to what BERT/GPT do in the NLP pre-training phase), for both ViT and convnets, provides better performance and eliminates the need for past developments in the field such as a predictor (BYOL), a contrastive loss (momentum contrast), or advanced normalization (contrasting cluster assignments). DINO+ViT has also outperformed SSL systems based on convnets.

Motivation🔗

Representation learning (aka feature learning) is the task of learning a data representation from raw data that can later be reused for a range of detailed tasks with only a small modification - fine-tuning - of the previously learned feature representation. Until now, the core of this area has been based on ResNets. DINO aims to improve this process for self-supervised image representation learning by using a specific double-network setup, based on the BYOL approach.

The inspiration for the DINO research comes from the effectiveness of self-supervised pre-training in Transformer-based architectures such as BERT (the pre-training phase for language modeling in NLP) and GPT (language modeling), which enriches the learning signal compared to the supervised variant of predicting a single label per sentence. The situation is the same in supervised image categorization, where the entire rich visual information of an image is reduced to a single concept selected from a set of object categories.

As ViTs have clearly proven to outperform convnets on poorly labeled datasets, DINO investigates how this particular property can justify the transformers' demand for computation and large data volumes.

Intro🔗

Since 2017, when the Transformer architecture was first introduced (Transformer: Attention is all you need - Explained), it has dominated research trends not only in NLP but (as of 2020/21) also in computer vision tasks. The paper builds on the most recent advancement, the Vision Transformer (ViT) from Google AI, but there has also been related research from Facebook AI such as the $DE$tection $TR$ansformer (DETR) or $D$ata-$e$fficient $i$mage $T$ransformers (DeiT) (the DeiT implementation is actually followed in the DINO research).

Representation learning task🔗

Image representation learning is the task of feeding an input (here an image) through a function (most commonly a ResNet-50) to get the best latent vector representation of the image. The quality of the learned representation should be high enough to solve multiple downstream vision tasks. Hence, it is useful to create a representation based on a large dataset and then use transfer learning for other tasks with less data. By adapting an architecture that was already pre-trained on a large dataset, one can fine-tune it to solve a downstream task with a much smaller dataset. A minimal sketch of this transfer-learning setup is shown below.
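The sketch below shows a linear probe on top of frozen self-supervised features in PyTorch. It assumes a pretrained DINO ViT-S/16 backbone is exposed via torch.hub as in the official facebookresearch/dino repository; the 1000-class head and the optimizer settings are illustrative, not the paper's exact evaluation protocol.

```python
# Linear probing on frozen self-supervised features (sketch).
import torch
import torch.nn as nn

# Pretrained backbone (assumed available via torch.hub, as in the DINO repo).
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()                          # keep the representation frozen
for p in backbone.parameters():
    p.requires_grad = False

linear_head = nn.Linear(384, 1000)       # ViT-S embedding dim -> e.g. 1000 classes
optimizer = torch.optim.SGD(linear_head.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():
        features = backbone(images)      # (batch, 384) latent representations
    logits = linear_head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```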

The self-DIstillation with NO labels (DINO) approach to training a ViT for the representation learning task is discussed in the paper with very extensive experiments and an ablation study.

The initially prevalent role of Transformer-based solutions in NLP has recently been adapted with success (as ViT) to the computer vision (CV) domain. However, until now, this evolution has in many cases assumed a supervised training strategy for vision tasks.

The supervised ViT architecture is an alternative to convnets in visual recognition tasks; however, it is also more demanding in terms of:

  • computation power,
  • size of the training data

Additionally, until now, unique class-specific feature properties had not been clearly exhibited.

Transformers in vision applications have become a prominent alternative to convnets, but a question arises: is this success due to the supervised nature of the pre-training phase?

To answer this question, one approach is to couple ViTs with a self-supervised architecture that would completely replace supervised pre-training in vision tasks, and that is what the paper actually explores.

Notes on Self-Supervised (no labels) Learning (SSL):

  • SSL is a method of representation learning where a supervised task is created out of unlabeled data.
  • In contrast to a "completely" unsupervised setting, SSL uses information from the dataset itself to construct pseudo-labels.
  • In SSL, supervision is induced by self-supervised tasks rather than preset prior knowledge.
  • Large-scale labeled data sets are not required for human learning, as we learn spontaneously; thus SSL has great potential to replace supervised learning for representation learning. In general, human learning is closer to few-shot learning, where we actually have only small amounts of annotated data.
  • More on SSL from Yann LeCun: "Self-supervised learning: The dark matter of intelligence"

In general, the quest of SSL is to find a clever solution that allows extracting relevant features from unlabeled data. Let's investigate how DINO approaches this challenge.

The processing strategy of the algorithm🔗

The main techniques applied in DINO are self-supervised learning (SSL), self-training and knowledge distillation.

The image dataset has no labels in this case, so DINO uses data augmentation to create pairs of images (original_image, distorted_image) for a kind of compare-and-contrast that helps learn similarity features. The second element of the pair is distorted/noised, meaning it has some alteration that decreases the amount of its visual (non-semantic) information, but only to a degree that still allows claiming that both images show the same thing. The architecture then needs to learn to ignore those augmentations (without knowing anything about the augmentation/noise technique - i.e. flipping, cropping, solarization, etc.) by learning only the representation of the underlying image.

The key is to craft an augmentation technique that preserves the semantic information of an image within its augmented version. For instance, random cropping or horizontal flipping changes pixels but retains the semantics, thus allowing a predictable representation to be learned. A minimal sketch of such a pipeline is shown below.
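The following sketch builds an augmentation pipeline with torchvision, loosely inspired by DINO's multi-crop setup; the crop sizes, scales and probabilities here are illustrative rather than the paper's exact values.

```python
# Multi-crop style augmentation pipeline (sketch).
from torchvision import transforms

flip_and_color = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
])

global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),   # large crop of the image
    flip_and_color,
    transforms.RandomApply([transforms.GaussianBlur(23)], p=0.5),
    transforms.RandomSolarize(threshold=128, p=0.2),
    transforms.ToTensor(),
])

local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),   # small crop, fed to the student only
    flip_and_color,
    transforms.ToTensor(),
])

def multi_crop(image, n_local=6):
    """Two global views plus several local views of the same image."""
    return [global_crop(image), global_crop(image)] + [local_crop(image) for _ in range(n_local)]
```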

In contrast to supervised learning, where image-level supervision reduces the information from the image to a single concept (e.g. a label) from a predefined set of object categories, DINO allows richer representation information to emerge and be particularly useful (e.g. in k-NN).

Conversely to prior solutions based on discrimination between all images in a dataset, which does not scale well with the number of images, DINO uses representations from a "mean" teacher network¹ built as a momentum encoder (MoCo - Momentum Contrast).

In terms of self-supervised learning, DINO is modeled on the metric-learning formulation of BYOL, which trains features by matching them to representations obtained with a teacher network (momentum encoder). However, DINO operates with a different similarity-matching loss than BYOL - the cross-entropy loss - and the same architecture for the student and the teacher network. Additionally, DINO presents a clear solution to prevent collapse (i.e. it prevents learning degenerate, trivial - e.g. constant - solutions produced only for the sake of minimizing the cost). A minimal sketch of the resulting objective is shown below.
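Here is a minimal PyTorch sketch of the objective, loosely following the pseudocode in the paper: each student view is trained to predict the teacher's output on the other view with a cross-entropy loss. The names are illustrative, and dino_loss is a placeholder defined further below (see the "Avoiding collapse" sketch).

```python
# DINO-style cross-view objective (sketch; dino_loss defined later).
import torch

def training_objective(view1, view2, student, teacher, center):
    with torch.no_grad():                    # the teacher receives no gradients
        t1, t2 = teacher(view1), teacher(view2)
    s1, s2 = student(view1), student(view2)
    # Cross-view prediction: student(view1) matches teacher(view2) and vice versa.
    return 0.5 * dino_loss(s1, t2, center) + 0.5 * dino_loss(s2, t1, center)
```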

DINO is an SSL method referred to in the paper as a form of Mean Teacher² self-distillation with no labels.

Figure 2. DINO overview (Source).

Momentum teacher🔗

The DINO student and teacher have the same architecture, but the teacher is built dynamically during training from past iterations of the student network: its parameters are an exponential moving average (EMA) of the student's parameters. This way, distillation is applied as a self-supervised objective during training rather than as a separate post-training compression step.
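A minimal sketch of this EMA update in PyTorch; the momentum value is illustrative (the paper follows a schedule that increases it towards 1 over training).

```python
# Momentum (EMA) teacher update (sketch).
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # The teacher's weights track an exponential moving average of the student's weights.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_((1.0 - momentum) * p_s.data)
```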

Avoiding collapse🔗

One of the main advantages of DINO (similarly to BYOL) is that there is no need for negative samples to avoid collapse. While other approaches rely on a contrastive loss, clustering constraints, a predictor or batch normalization, DINO only uses centering and sharpening of the momentum teacher outputs. The two operations have opposite characteristics and balance each other's effect, which is enough to ensure collapse avoidance.
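A minimal PyTorch sketch of centering and sharpening on the teacher side: sharpening is a low softmax temperature for the teacher, and centering subtracts a running mean of teacher outputs that is itself updated with an EMA. The temperatures and momentum values are illustrative, in the spirit of the paper rather than a definitive implementation.

```python
# Centering and sharpening of the teacher outputs (sketch).
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    # Sharpening: a low teacher temperature peaks the target distribution.
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # Centering: keep a running mean of teacher outputs, subtracted above.
    batch_mean = teacher_out.mean(dim=0, keepdim=True)
    return center * momentum + batch_mean * (1.0 - momentum)
```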

Centering🔗

Centering prevents one dimension from dominating, but on its own it encourages collapse to the uniform distribution, which is why the paper pairs it with sharpening.

Sharpening🔗

Metaphors, or analogies to other architectures describing the behavior of the algorithm🔗

Heuristics or rules of thumb🔗

Is DINO a GAN?🔗

Applications: classes of problem is the algorithm well suited🔗

The representations produced by DINO are very useful in multiple scenarios, e.g.:

  • Fine-tuning linear classifiers on top of such representations, with very good image classification results.
  • Image retrieval, as similar images get clustered together.
  • Zero-shot classification by simply running a k-NN classifier in the feature space (see the sketch below).
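A minimal sketch of such a k-NN evaluation on frozen features, using scikit-learn for brevity; note that the paper uses a weighted k-NN variant, so this is only an approximation of that setup, and the feature arrays are assumed to have been extracted from the frozen backbone beforehand.

```python
# k-NN classification on frozen features (sketch).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_features, train_labels, test_features, test_labels, k=20):
    """train/test features: (N, D) arrays of backbone embeddings."""
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
    knn.fit(train_features, train_labels)
    accuracy = np.mean(knn.predict(test_features) == test_labels)
    return accuracy
```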

Common benchmark or example datasets used to demonstrate the algorithm🔗

Succession / Following / Improving solutions.🔗

Useful resources for learning more about the algorithm:🔗


Appendix - elaborate on terminology🔗

DINO is based on numerous concepts that can be cryptic to a reader new to the topic. Those ideas are not specific to DINO and have been present in the field for a while. To help you read and refer to this article, I have prepared a clarification appendix for the concepts DINO embraces, shedding more light on the DINO background while keeping the main article concise.

What is a knowledge distillation?🔗

The intuition behind knowledge distillation is similar to the general concept of learning in machine learning: once you find the right parameters, you save them and apply them in the future. In general, you start from something that already works instead of starting from scratch.

It all started with self-training, which used a small set of initial annotations (label assignments) to label a large set of unlabeled instances. The annotations can be hard (a one-hot distribution) or soft (a continuous distribution) labels.

In short, knowledge distillation is the process of training a student model to match a teacher model's predictions. Distillation thereby compresses the knowledge into a smaller student model that remains comparable to the teacher model. (See the paper from G. Hinton et al.)

The teacher is usually a big model that is used to produce a smaller student model. Such a distilled student model is more effective than an equivalent student model trained from scratch. A minimal sketch of the classic distillation loss follows.
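The sketch below shows a classic (supervised) distillation loss in the spirit of Hinton et al., written in PyTorch: the student is trained on a mix of the hard-label loss and a soft-target loss matching the teacher's temperature-softened predictions. The temperature and mixing weight are illustrative.

```python
# Classic knowledge distillation loss (sketch).
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)                      # scale to keep gradient magnitudes comparable
    return alpha * hard + (1.0 - alpha) * soft
```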

What is Collapse?🔗

The term collapse refers to the state of a network that maps every input to the same representation, so it would always report success. For instance, a representation that is constant across views is always fully predictive of itself, as everything fed through it is treated the same. In such a case, learning a constant representation for any image, $h=\text{constant}$, would minimize the error but would not help learn a useful representation. Collapse is therefore a trivial solution that should be avoided.

Discriminative contrastive learning methods use negative samples to overcome collapse. Negative samples prevent the solution from collapsing by contrasting. This type of training aims to reduce the distance (compare) between representations of different augmented views of the same image ('positive pairs') and increase the distance (contrast) between representations of augmented views of different images ('negative pairs'). The challenge with such an approach is that it is very fragile, and the problem is two-fold. First, for positive pairs, it is important to pick the right augmentation (e.g. random cropping) to create the second image of the positive pair; choosing the right augmentation has a great impact on the quality of the learned representation. Second, retrieving negative samples is a challenge: they ought to be picked carefully³, as they critically impact the performance of the representation learning. A minimal sketch of such a contrastive objective follows.
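For illustration, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss with in-batch negatives, i.e. the kind of objective DINO avoids; the names and temperature are illustrative.

```python
# InfoNCE-style contrastive loss with in-batch negatives (sketch).
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """z_a, z_b: (N, D) embeddings of two augmented views of the same N images."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)   # diagonal = positives
    # Off-diagonal entries act as negatives; cross-entropy pulls positives together
    # and pushes the other images in the batch away.
    return F.cross_entropy(logits, targets)
```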

As contrastive methods often require comparing each example with many other examples to work well, BYOL proposed getting rid of negative samples altogether. In turn, BYOL is assumed to avoid collapse most probably due to its specific initialization, followed by small-step learning that avoids drifting from the local minimum towards the optimal (but undesired) constant solution - collapse. Therefore, for BYOL the choice of the right augmentation boils down to a trial-and-error procedure, which is a drawback due to the tedious additional effort.

As a side note, it is worth mentioning contrastive learning (CL) and negative-sample mechanisms, elaborated on in the footnotes below.

Footnotes:🔗

  1. As a response to Temporal Ensembling (which maintains an exponential moving average - EMA - of label predictions on each training example while penalizing predictions that are inconsistent with this target) and its limitation of changing targets only once per epoch (unwieldy when learning on large datasets), Mean Teacher averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling.
  2. A method that averages model weights instead of label predictions. See paper: Tarvainen et al. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
  3. Retrieving negative samples is complex in terms of picking the right set of negative samples (should they be uniformly sampled? should we buffer them? what should their order be? should we first start with only easy examples of a task and then gradually increase the task difficulty (curriculum learning)?). It often relies on large batch sizes, memory banks or customized mining strategies.

