DINOv2: Learning Robust Visual Features without Supervision

https://t.me/reading_ai, @AfeliaN

🗂️ Project Page

📄 Paper

📎 GitHub

🗓 Date: 14 Apr 2023

Main idea

  • Motivation: foundation models play an important role in NLP tasks, and similar foundation models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning.
  • Solution: DINOv2, trained on curated data from diverse sources, can produce such features. A big ViT (1B parameters) is trained and then distilled into a series of smaller models that still show good results on various tasks.

Preliminaries

Before discussing the DINOv2 model, we need a short introduction to previous work.

Dino

Main idea: DINO is a unified backbone trained in a self-supervised manner. The method shares the same overall structure as recent self-supervised approaches; however, it also shares similarities with knowledge distillation.

Knowledge distillation is a learning paradigm where a student network g_{θ_s} is trained to match the output of a given teacher network g_{θ_t}. Given an input image x, both networks output probability distributions over K dimensions, denoted P_s and P_t. The probability P is obtained by normalizing the output of the network g with a softmax function. More precisely:

Fig 1. Softmax with temperature
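
The formula behind Fig 1 (reconstructed here, following the DINO paper's definition) is

$$
P_s(x)^{(i)} = \frac{\exp\!\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\!\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)},
$$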

with τ_s > 0 a temperature parameter that controls the sharpness of the output distribution (the teacher distribution P_t is defined analogously with temperature τ_t). If you want more information about knowledge distillation, you may watch the lecture in Russian or the lecture in English.

In the case of DINO, the two networks share the same architecture with different sets of parameters. The whole algorithm is illustrated in the image and pseudocode below.

Fig 2. Self-distillation algorithm for Dino training
Fig 3. Torch pseudocode for Dino training
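
Since Fig 3 is an image, here is a reconstruction of that pseudocode from memory (details may differ from the original; augment, update, and the data loader are placeholders):

```python
# Reconstruction of the DINO training pseudocode (Fig 3).
# gs, gt: student and teacher networks; C: center vector of size K
# tps, tpt: student and teacher temperatures; l, m: EMA rates for teacher and center
gt.params = gs.params
for x in loader:                       # load a minibatch x with n samples
    x1, x2 = augment(x), augment(x)    # two random views of x
    s1, s2 = gs(x1), gs(x2)            # student outputs, n-by-K
    t1, t2 = gt(x1), gt(x2)            # teacher outputs, n-by-K
    loss = H(t1, s2) / 2 + H(t2, s1) / 2
    loss.backward()                    # back-propagate through the student only
    update(gs)                         # optimizer step for the student
    gt.params = l * gt.params + (1 - l) * gs.params          # teacher EMA update
    C = m * C + (1 - m) * torch.cat([t1, t2]).mean(dim=0)    # center update

def H(t, s):                           # cross-entropy between teacher and student
    t = t.detach()                     # stop gradient through the teacher
    s = torch.softmax(s / tps, dim=1)
    t = torch.softmax((t - C) / tpt, dim=1)   # center + sharpen the teacher output
    return -(t * torch.log(s)).sum(dim=1).mean()
```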

From the given image, a set of different views is generated. This set contains two global views that cover large parts of the input image (greater than 50%) and several local views covering small regions.

All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.

Fig 4. Multi-patch algorithm

The authors minimize the loss:

Fig 5. Loss minimization
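
The objective in Fig 5, reconstructed here following the DINO paper, is the cross-entropy between the teacher and student distributions, summed over every pair of a global view and a different view:

$$
\min_{\theta_s} \sum_{x \in \{x_1^{g},\, x_2^{g}\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\!\left(P_t(x),\, P_s(x')\right), \qquad H(a, b) = -\sum_{i} a^{(i)} \log b^{(i)},
$$

where V is the set of all generated views (global and local).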

Unlike standard knowledge distillation, the teacher is not given a priori; it is built from past iterations of the student network using an exponential moving average (EMA) of the student weights.

To avoid collapse, the authors propose centering and sharpening of the momentum teacher outputs. There are two forms of collapse:

  • regardless of the input, the model output is uniform along all the dimensions
  • output dominated by one dimension

Centering avoids the collapse induced by a dominant dimension but encourages a uniform output; sharpening induces the opposite effect. The authors show this complementarity by decomposing the cross-entropy H into an entropy term h and the Kullback-Leibler divergence ("KL") D_KL.
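
Written out (a reconstruction of the decomposition referenced above), this reads

$$
H\!\left(P_t, P_s\right) = h\!\left(P_t\right) + D_{KL}\!\left(P_t \,\|\, P_s\right),
$$

and a KL term equal to zero indicates a constant output regardless of the input, i.e., collapse; keeping both operations prevents either failure mode from dominating.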

Fig 6. Collapse study

Dinov2 Pipeline

The DINOv2 algorithm is based on DINO self-supervised pretraining, with several improvements. These improvements can be divided into two main parts: data processing and pipeline improvements.

Data preprocessing

Fig 7. Data processing

The authors find that dataset curation plays an important role in model training. They suggest an algorithm for cleaning a large dataset.

The curated dataset contains ImageNet-22k, the train split of ImageNet-1k, Google Landmarks and several fine-grained datasets.

For the uncurated data source, the authors collect a raw unfiltered dataset of images from a publicly available repository of crawled web data.

  1. From each web page in the repository, URL links of images are extracted from <img> tags.
  2. URLs that are unsafe or restricted by domains are discarded.
  3. The downloaded images are post-processed: PCA hash deduplication,
  4. NSFW filtering,
  5. and blurring of identifiable faces.

This results in 1.2B unique images.

The authors then apply several more post-processing steps:

  1. Deduplication.
  2. The curated pretraining dataset is built by retrieving images from the uncurated data source that are close to images in the curated sources (a sketch of such a retrieval step is given below).
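
A hedged sketch of the retrieval idea (not the authors' exact pipeline; the use of FAISS, the index type, and k are assumptions for illustration):

```python
# Embed every image with a self-supervised ViT, then pull uncurated images whose
# embeddings are close (cosine similarity) to the curated ones.
import faiss               # assumption: a FAISS-style nearest-neighbour search is used
import numpy as np

def retrieve_neighbours(curated_emb: np.ndarray,
                        uncurated_emb: np.ndarray,
                        k: int = 4) -> np.ndarray:
    """Return indices of uncurated images that are neighbours of curated images.
    Embeddings are L2-normalised so that inner product equals cosine similarity."""
    curated_emb = np.ascontiguousarray(curated_emb, dtype=np.float32)
    uncurated_emb = np.ascontiguousarray(uncurated_emb, dtype=np.float32)
    faiss.normalize_L2(curated_emb)
    faiss.normalize_L2(uncurated_emb)
    index = faiss.IndexFlatIP(uncurated_emb.shape[1])   # exact inner-product index
    index.add(uncurated_emb)
    _, idx = index.search(curated_emb, k)               # k nearest uncurated images per curated query
    return np.unique(idx.ravel())
```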

Additionally, as an ablation study, the authors compared the results on different data sources.

Fig 8. Ablation of source of pretraining data

Pipeline itself

Image-level objective

As in DINO, the authors use a cross-entropy loss between the features extracted from a student and a teacher network. Both features come from the class token of a ViT, obtained from different crops of the same image, and the teacher is built as an exponential moving average of past iterates of the student.

Patch-level objective (iBOT)

Additionally, the authors randomly mask some of the input patches given to the student (but not to the teacher) and apply the same cross-entropy loss at the patch level, between the student's masked patch tokens and the teacher's corresponding visible patch tokens.
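
A minimal sketch of such a patch-level loss (the notation and temperature value are illustrative, not the exact iBOT implementation):

```python
import torch

def patch_level_loss(student_patch_logits: torch.Tensor,
                     teacher_patch_probs: torch.Tensor,
                     mask: torch.Tensor,
                     tau_s: float = 0.1) -> torch.Tensor:
    """student_patch_logits, teacher_patch_probs: (B, N, K); mask: (B, N) bool, True = masked patch."""
    log_p_s = torch.log_softmax(student_patch_logits / tau_s, dim=-1)
    loss = -(teacher_patch_probs * log_p_s).sum(dim=-1)        # per-patch cross-entropy, (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)       # average only over masked patches
```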

Sinkhorn-Knopp centering

The Sinkhorn-Knopp batch normalization from SwAV replaces the teacher softmax-centering step.
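
A minimal sketch of Sinkhorn-Knopp normalization of the teacher outputs, in the spirit of SwAV (the temperature and number of iterations are assumptions):

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(teacher_logits: torch.Tensor, temp: float = 0.04,
                   n_iters: int = 3) -> torch.Tensor:
    """Turn teacher logits (B x K) into soft assignments whose mass is balanced over prototypes."""
    Q = torch.exp(teacher_logits / temp).t()   # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)        # normalize rows: equal mass per prototype
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)        # normalize columns: equal mass per sample
        Q /= B
    return (Q * B).t()                         # each row sums to 1: a distribution per sample
```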

KoLeo regularizer

The KoLeo regularizer, derived from the Kozachenko-Leonenko differential entropy estimator, encourages a uniform span of the features within a batch by penalizing features that lie very close to their nearest neighbor.
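
As a reconstruction of its definition (notation mine), for a batch of n l2-normalized features x_1, …, x_n the regularizer is

$$
\mathcal{L}_{\mathrm{koleo}} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(d_{n,i}\right), \qquad d_{n,i} = \min_{j \neq i} \lVert x_i - x_j \rVert .
$$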

Fig 9. KoLeo and iBOT losses impact

Adapting the resolution

Increasing image resolution is key to pixel-level downstream tasks such as segmentation or detection, where small objects disappear at low resolutions. However, training at high resolution is time- and memory-demanding, so the authors instead increase the resolution of images to 518 × 518 during a short period at the end of pretraining.

Fig 10. Role of resolution

Implementation details

The authors suggested several implementation improvements to increase the speed and decrease memory usage.

  • the authors implemented their own version of FlashAttention to improve memory usage and speed on the self-attention layers.
  • the DINOv2 version allows running the global crops and the local crops in the same forward pass. The lower-level components of this setup are available in the xFormers library.
  • an improved version of stochastic depth was implemented that skips the computation of the dropped residuals rather than masking the result. With high drop rates (d = 40% in this work), this allows a drastic improvement in compute efficiency and memory usage. The implementation consists of randomly shuffling the B samples over the batch dimension and slicing the first (1 − d) × B samples for the computations in the block (a sketch is given after the FSDP discussion below).
  • Fully-Sharded Data Parallel (FSDP), detailed in the next paragraph.

Minimizing the objective with the AdamW optimizer requires 4 model replicas in float32 precision: student, teacher, optimizer first moments, optimizer second moments. This sums to 16 GB of memory for a billion-parameter model such as their ViT-g. To reduce this memory footprint per GPU, the authors split the model replicas across GPUs, i.e., shard the 16 GB across GPUs using the PyTorch implementation of FSDP. Consequently, the model size is not bounded by the memory of a single GPU but by the total sum of GPU memory across compute nodes. The PyTorch implementation of FSDP brings a second advantage: it saves on cross-GPU communication costs. The weight shards are stored in float32 precision as required by the optimizer, but broadcasting weights and reducing gradients is done in float16 precision for the backbone (MLP head gradients are reduced in float32 to avoid training instabilities). This leads to approximately a 50% reduction in communication costs compared to the float32 gradient all-reduce operation used in DistributedDataParallel (DDP), which is used in other self-supervised pretraining methods.
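
The stochastic-depth trick from the list above can be sketched as follows (a hedged illustration, not the authors' exact code; the 1/(1 − d) residual scaling is an assumption to keep expectations consistent):

```python
import torch
import torch.nn as nn

def drop_path_by_slicing(x: torch.Tensor, block: nn.Module, drop_rate: float = 0.4) -> torch.Tensor:
    """x: (B, N, C) token batch; block: the residual branch of a transformer block.
    Instead of computing the branch for all samples and masking, compute it only
    for the kept (1 - d) * B samples after shuffling the batch dimension."""
    B = x.shape[0]
    keep = max(1, int(B * (1.0 - drop_rate)))
    kept_idx = torch.randperm(B, device=x.device)[:keep]   # shuffle, keep the first (1 - d) * B samples
    residual = block(x[kept_idx]) / (1.0 - drop_rate)       # branch computed only for kept samples
    out = x.clone()
    out[kept_idx] = x[kept_idx] + residual                  # dropped samples pass through unchanged
    return out
```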

  • Model distillation. The authors suggest distilling smaller models from the big one rather than training them from scratch. They leverage the same training loop, with a few exceptions:
  1. use a larger model as a frozen teacher,
  2. keep a spare EMA of the student that is used as the final model,
  3. remove the masking and stochastic depth,
  4. apply the iBOT loss on the two global crops.

The authors showed that even a large distilled model performs better than the same model trained from scratch.

Fig 11. Effectiveness of knowledge distillation

Pros and cons

I think this is really great work, as the authors show impressive results on various tasks, even with the small distilled models.

Results

The authors show the results of the DINOv2 model applied to various tasks and compare DINOv2 with existing pretrained backbones:

  • Self-supervised backbones
  • Weakly supervised backbones: CLIP, SWAG, OpenCLIP, EVA-CLIP

Tasks:

  • image classification

The authors evaluate the quality of the features by training a simple classifier over a frozen backbone, without finetuning the backbone weights.
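
As an illustration of this protocol, here is a minimal sketch of a linear probe over a frozen DINOv2 backbone; the torch.hub entry point comes from the official repository, while the dimensions (ViT-S/14), classes, and optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()                                   # frozen: the backbone weights are never finetuned
for p in backbone.parameters():
    p.requires_grad = False

embed_dim, num_classes = 384, 1000                # ViT-S/14 CLS dimension; e.g. ImageNet-1k classes
classifier = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.SGD(classifier.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # images: (B, 3, 224, 224), ImageNet-normalised, sides divisible by the patch size 14
    with torch.no_grad():
        feats = backbone(images)                  # (B, 384) CLS features from the frozen backbone
    loss = criterion(classifier(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```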

Fig 12. Linear evaluation for image classification task

Additionally, the authors showed the results of supervised finetuning:

Fig 13. Supervised finetuning results

Additional image and video classification results:

Fig 14. Image and video classification results
  • instance recognition

In this experiment, the authors probe the model on the task of instance-level recognition using a non-parametric approach. Images from a database are ranked according to their cosine similarity with a query image.
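
A minimal sketch of this non-parametric ranking (assuming the features have already been extracted with the frozen backbone):

```python
import torch
import torch.nn.functional as F

def rank_database(query_feat: torch.Tensor, db_feats: torch.Tensor) -> torch.Tensor:
    """query_feat: (D,), db_feats: (N, D); returns database indices sorted by cosine similarity."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), db_feats, dim=1)  # (N,)
    return torch.argsort(sims, descending=True)
```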

Fig 15. Instance-level recognition
  • image segmentation

For the semantic segmentation evaluation, the authors consider two different setups:

  • linear: a linear layer is trained to predict class logits from the patch tokens (a sketch is given after this list).
  • boosted version of the linear setup: concatenate the patch tokens of the 4 last layers, use a larger image resolution of 640, and use multiscale test-time augmentation to improve the predictions.
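
A minimal sketch of the linear setup (assumptions: patch tokens come from the frozen backbone, e.g. forward_features(...)["x_norm_patchtokens"] in the official repo; patch size 14; 150 ADE20k-style classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegHead(nn.Module):
    def __init__(self, embed_dim: int = 384, num_classes: int = 150, patch: int = 14):
        super().__init__()
        self.patch = patch
        self.linear = nn.Linear(embed_dim, num_classes)   # per-patch-token classifier

    def forward(self, patch_tokens, image_hw):
        # patch_tokens: (B, N, D) from the frozen backbone, N = (H / patch) * (W / patch)
        B, N, D = patch_tokens.shape
        h, w = image_hw[0] // self.patch, image_hw[1] // self.patch
        logits = self.linear(patch_tokens)                # (B, N, num_classes)
        logits = logits.transpose(1, 2).reshape(B, -1, h, w)
        return F.interpolate(logits, size=image_hw, mode="bilinear", align_corners=False)
```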
Fig 16. Semantic segmentation results
  • depth estimation

For depth prediction, the authors considered 3 different scenarios:

  • linear 1: extract the last layer of the frozen transformer and concatenate the [CLS] token to each patch token (a sketch is given after this list),
  • linear 4: use the same protocol as with one layer, but concatenate the tokens from 4 different layers,
  • dpt: use the DPT decoder on top of the frozen models.
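
A minimal sketch of the "linear 1" setup (hedged: the paper classifies into depth bins, while this illustration regresses a single depth value per patch for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Linear1DepthHead(nn.Module):
    def __init__(self, embed_dim: int = 384, patch: int = 14):
        super().__init__()
        self.patch = patch
        self.linear = nn.Linear(2 * embed_dim, 1)     # [CLS | patch token] -> depth value

    def forward(self, cls_token, patch_tokens, image_hw):
        # cls_token: (B, D); patch_tokens: (B, N, D) from the last frozen layer
        B, N, D = patch_tokens.shape
        h, w = image_hw[0] // self.patch, image_hw[1] // self.patch
        x = torch.cat([patch_tokens, cls_token.unsqueeze(1).expand(-1, N, -1)], dim=-1)
        depth = self.linear(x).transpose(1, 2).reshape(B, 1, h, w)
        return F.interpolate(depth, size=image_hw, mode="bilinear", align_corners=False)
```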
Fig 17. Depth estimation results
Fig 18. Depth estimation and semantic segmentation images results
Fig 19. Out of distribution depth estimation and semantic segmentation results

Some other experiments:

Fig 20. PCA components
Fig 21. Matching across images
Fig 22. Geographical fairness and diversity analysis
Fig 23. Label association fairness evaluation across gender, skin tones and age groups




