Michał Chromiak's blog

DINO v2: Match and transfer features across domains and understand relations between similar parts of different objects

"DINOv2: Learning Robust Visual Features without Supervision" - Research Paper Explained

Meta AI's DINOv2 is a foundation model for computer vision. The DINOv2 paper shows that existing pretraining methods, especially self-supervised ones, can produce all-purpose visual features (i.e., features that work across image distributions and tasks without finetuning) if trained on enough curated data from diverse sources. It turns out that DINOv2 can match patch-level features between images from different domains, in different poses, and even of different objects that share similar semantic information. This shows the ability of the DINOv2 model to transfer across domains and to understand relations between similar parts of different objects. Meta shows its strengths here and wants to combine DINOv2 with large language models.
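To make the idea of matching patch-level features a bit more concrete, here is a minimal sketch, assuming each patch is already described by a feature vector. The function names `cosine` and `match_patches` and the toy vectors are hypothetical; this is a plain nearest-neighbour match by cosine similarity, not the procedure used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_patches(feats_a, feats_b):
    """For every patch of image A, return the index of the most similar patch of image B."""
    matches = []
    for i, fa in enumerate(feats_a):
        j = max(range(len(feats_b)), key=lambda k: cosine(fa, feats_b[k]))
        matches.append((i, j))
    return matches

# Toy (made-up) patch features for two images; real features would come from the model.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 2.0], [3.0, 0.1]]
matches = match_patches(feats_a, feats_b)
print(matches)
```

Cosine similarity ignores vector length, so two patches count as a match when their features point in the same direction, regardless of scale.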

Masked autoencoder (MAE) for visual representation learning. From the author of ResNet.

"Masked Autoencoders Are Scalable Vision Learners" - Research Paper Explained

MAE is a simple autoencoding approach that reconstructs the original signal (an image) from a partial observation of it. Thanks to the successful introduction of patching in ViT, this approach has become more feasible in CV as an alternative to convnets. The MAE paper uses ViT's patch-based approach to replicate a BERT-like masking strategy on image patches. The image is split into a regular grid of non-overlapping patches, and MAE samples a subset of them uniformly at random, without replacement. MAE learns very high-capacity models that generalize well. Thanks to a very high masking ratio (e.g., 75%), the authors were able to reduce training time by more than 3x while also reducing memory consumption, which lets MAE scale better to large models such as ViT-Large/-Huge on ImageNet-1K, with 87.8% accuracy.
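The random masking step can be sketched in a few lines. This is a minimal illustration, assuming a 224x224 image cut into 16x16 patches (196 patches in total); the function name `random_patch_mask` and the `seed` parameter are hypothetical and only serve the example.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform sampling without replacement."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)

# Assumed setup: a 14x14 grid of patches, 75% of them masked.
visible, masked = random_patch_mask(14 * 14, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Only the visible quarter of the patches would be passed to the encoder, which is where the savings in time and memory come from.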

Decision Transformer: Unifying sequence modelling and model-free, offline RL

"Decision Transformer: Reinforcement Learning via Sequence Modeling" - Research Paper Explained

Can we apply the massive advancements of the Transformer approach, with its simplicity and scalability, to Reinforcement Learning (RL)? Yes, but for that one needs to approach RL as a sequence modeling problem. The Decision Transformer does that by abstracting RL as conditional sequence modeling and using the language-modeling technique of causal masking of self-attention, as in GPT, which enables autoregressive generation of trajectories from the previous tokens in a sequence. The classical RL approach of fitting value functions or computing policy gradients (needs live correction; online) has been ditched in favor of a masked Transformer yielding optimal actions. The Decision Transformer can match or outperform strong algorithms designed explicitly for offline RL, with minimal modifications to standard language modeling architectures.
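A minimal sketch of how a trajectory might be turned into a sequence for such a model. It assumes the (return-to-go, state, action) token layout described in the paper; the helper names `returns_to_go`, `interleave` and `causal_mask`, and the toy trajectory, are hypothetical.

```python
def returns_to_go(rewards):
    """Sum of future rewards from each timestep onward (undiscounted)."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def interleave(rtg, states, actions):
    """Flatten a trajectory into one token sequence: R_1, s_1, a_1, R_2, s_2, a_2, ..."""
    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens += [("return", g), ("state", s), ("action", a)]
    return tokens

def causal_mask(n):
    """Lower-triangular mask: token i may attend only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Toy trajectory with made-up states and actions.
rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)  # [3.0, 2.0, 2.0]
tokens = interleave(rtg, ["s0", "s1", "s2"], ["left", "right", "left"])
mask = causal_mask(len(tokens))
print(rtg, len(tokens))
```

Conditioning on the return-to-go is what makes the sequence model "conditional": at test time one would set the first return token to the desired return and let the model generate actions.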

MLP-Mixer: MLP is all you need... again? ...

"MLP-Mixer: An all-MLP Architecture for Vision" - Research Paper Explained

Let's try to answer the question: is it enough to have a feed-forward MLP, with matrix multiplication routines and scalar non-linearities, to compete with modern architectures such as ViT or CNNs? No need for convolution or attention? It sounds like we have been here before. However, does it mean that the researchers are lost and going around in circles? It turns out that what has changed along the way is the scale of the resources and the data, the very increase that helped ML, and especially DL, flourish over the past 5-7 years. We will discuss the paper, which shows that MLP-based solutions can replace CNNs and attention-based Transformers, with comparable scores on image classification benchmarks and with pre-training/inference costs similar to SOTA models.

DINO: Improving supervised ViT with richer learning signal from self-supervision

"Emerging Properties in Self-Supervised Vision Transformers" - Research Paper Explained

Self-DIstillation with NO labels (DINO) is a self-supervised method from Facebook AI, based on the Vision Transformer (ViT), that is able to learn representations from unlabeled data. The architecture automatically learns class-specific features, which allows unsupervised object segmentation. The paper claims that self-supervised methods adapted to ViT not only work very well, but also that self-supervised ViT features contain explicit information about the semantic segmentation of an image, which is not as clear in the case of supervised ViTs or convnets. A benefit of this observation is that such features also perform very well with k-NN classifiers. The reported performance depends heavily on two SSL components: the momentum teacher and multi-crop training. In this blog post we will explain the details of what DINO is all about.
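The momentum teacher mentioned above is commonly implemented as an exponential moving average (EMA) of the student weights. Here is a minimal sketch of that update, with weights represented as plain lists of floats; the function name `ema_update` and the toy values are hypothetical, and the default momentum of 0.996 is an assumption taken from the paper's reported base setting.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

# Toy weights: the teacher starts at 1.0 and the student sits at 0.0.
teacher = [1.0, 1.0]
student = [0.0, 0.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

The teacher receives no gradients in this scheme; it only trails the student, which is why a momentum close to 1 gives a slowly changing, more stable target.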

RL Primer

Explaining the fundamental concepts of Reinforcement Learning

The objective of RL is to maximize the reward of an agent by taking a series of actions in response to a dynamic environment. Breaking it down, the process of Reinforcement Learning involves these simple steps: Observation of the environment, deciding how to act using some strategy, acting accordingly
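The observe, decide, act cycle can be sketched as a short loop. Everything here is a hypothetical toy: `CoinFlipEnv` is a made-up environment in which the agent guesses the outcome of a biased coin, and `always_heads` is a deliberately trivial strategy.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: reward 1.0 for guessing a biased coin correctly."""

    def __init__(self, steps=10, seed=0):
        self.rng = random.Random(seed)
        self.steps = steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the observation is simply the step index

    def step(self, action):
        outcome = 1 if self.rng.random() < 0.7 else 0  # heads with probability 0.7
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        return self.t, reward, self.t >= self.steps

def always_heads(observation):
    """A fixed strategy that ignores the observation."""
    return 1

def run_episode(env, policy):
    """Observe, decide, act, and accumulate the reward until the episode ends."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(obs)                  # decide how to act
        obs, reward, done = env.step(action)  # act and observe the result
        total += reward
    return total

env = CoinFlipEnv(steps=10, seed=0)
total_reward = run_episode(env, always_heads)
print(total_reward)
```

A learning agent would replace `always_heads` with a strategy that is adjusted using the rewards it collects.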

ERNIE 2.0: A continual pre-training framework for language understanding

ERNIE 2.0 (Enhanced Representation through kNowledge IntEgration) is a new knowledge integration language representation model that aims to beat the SOTA results of BERT and XLNet. Rather than pre-training on just a few simple tasks that capture the co-occurrence of words or sentences for language modeling, ERNIE aims to explore named entities, semantic closeness and discourse relations to obtain valuable lexical, syntactic and semantic information from training corpora. ERNIE 2.0 focuses on incrementally building and learning pre-training tasks through continual multi-task learning. And it brings some interesting results.

NLP: Explaining Neural Language Modeling

Language modeling (LM) is an essential part of Natural Language Processing (NLP) tasks such as Machine Translation, Spell Correction, Speech Recognition, Summarization, Question Answering, Sentiment Analysis, etc. The goal of a language model is to compute the probability of a sentence, considered as a sequence of words. This article explains how to model language using probability and n-grams. It also discusses language model evaluation using perplexity.
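As a small illustration of both ideas, here is a sketch of a bigram model with perplexity. It assumes add-one (Laplace) smoothing and a two-sentence toy corpus, neither of which comes from the article; the function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def perplexity(sentence, model):
    """exp of the average negative log-probability per predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(math.log(bigram_prob(p, w, model))
                   for p, w in zip(tokens[:-1], tokens[1:]))
    return math.exp(-log_prob / (len(tokens) - 1))

model = train_bigram(["the cat sat", "the dog sat"])
seen = perplexity("the cat sat", model)
shuffled = perplexity("sat cat the", model)
print(seen < shuffled)  # True
```

Lower perplexity means the model finds the sentence less surprising, so the word order seen in training scores better than the shuffled one.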

The Transformer – Attention is all you need.

Transformer - more than meets the eye! Are we there yet? Well... not really, but...
How about eliminating recurrence and convolution from transduction? Solutions to sequence modeling and transduction problems (e.g. language modeling, machine translation) have been dominated by RNNs (especially gated RNNs such as LSTM), additionally employing the attention mechanism. The main sequence transduction models are based on RNNs or CNNs and include an encoder and a decoder. The new Transformer architecture, however, is claimed to be more parallelizable and to require significantly less time to train, while relying solely on attention mechanisms.
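The core attention operation of the paper, scaled dot-product attention, fits in a few lines. This is a minimal single-head sketch in plain Python without masking or learned projections; the function names and the toy inputs are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy input: both keys are identical, so the query attends to both values equally.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [2.0, 0.0]])
print(out)
```

Every query row is computed independently of the others, which is the property that makes this operation easy to parallelize, in contrast to the step-by-step processing of an RNN.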

Neural Networks Primer

When you approach a new term, you often find some Wiki page, Quora answers or blogs, and it can take some time before you find a true, ground-up, clear definition with a meaningful example. I will put here the most intuitive explanations of basic topics. Because so many aspects and terms are used across the NN area, in this post I will place condensed definitions and brief explanations, just enough to understand the intuition behind the terms that are mentioned in other posts on this blog.
