In this video we explain the research paper Vision Transformers Need Registers from Meta AI, written by authors who also worked on the DINOv2 paper. The paper describes a phenomenon in DINOv2 outputs, referred to as artifacts, that does not appear in DINOv1.
The same phenomenon was found in other large foundation computer vision models besides DINOv2, namely OpenCLIP and DeiT.
We start with essential background on visual features, then explain what the artifacts are, what their impact is, and when they appear. Afterwards, we describe the solution proposed in the paper, which adds register tokens to vision transformers to avoid these artifacts, and then look at how this method performs.

Blog post - https://aipapersacademy.com/vision-transformers-need-registers/
Paper page - https://arxiv.org/abs/2309.16588
DINOv2 video summary & full review - https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/

Chapters:
0:00 Agenda
0:53 Background
2:21 Artifacts
6:31 ViT Registers
7:40 Results
8:38 Conclusion

AI Papers Academy

Researchers from Meta AI discovered that large vision transformer models like DINOv2 develop 'attention map artifacts' — outlier high-norm tokens in background regions that store global information instead of local patch data. This degrades tasks like object discovery. The fix, called 'registers', adds extra tokens to the input sequence that the model uses to store global information instead of hijacking image patch tokens. Registers are discarded at output. Results show registers nearly eliminate artifacts and improve object discovery by ~20 points for DINOv2, with modest gains in segmentation and depth estimation, though DINOv1 still outperforms DINOv2+registers on object discovery. Classification and other tasks see only marginal improvements, so the memory/latency cost may not always be justified.
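
To make the register idea above concrete, here is a minimal sketch in plain Python. It only illustrates the bookkeeping: extra tokens are appended to the input sequence, the whole sequence goes through the encoder, and the register outputs are dropped at the end. The names `add_registers`, `split_outputs` and `toy_encoder` are hypothetical, the encoder is a stand-in rather than a real transformer, and placing the registers at the end of the sequence is an assumption made for illustration. In a real model the registers would be learned parameters.

```python
import random


def add_registers(patch_tokens, cls_token, registers):
    """Build the input sequence: [CLS] + patch tokens + register tokens."""
    return [cls_token] + list(patch_tokens) + list(registers)


def split_outputs(output_tokens, num_patches, num_registers):
    """Keep the [CLS] and patch outputs; discard the register outputs."""
    assert len(output_tokens) == 1 + num_patches + num_registers
    cls_out = output_tokens[0]
    patch_out = output_tokens[1:1 + num_patches]
    return cls_out, patch_out


def toy_encoder(tokens):
    """Stand-in for a transformer (assumption): blend each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


rng = random.Random(0)
dim, num_patches, num_registers = 8, 16, 4

# Fake patch embeddings; a real model would compute these from the image.
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
cls_token = [0.0] * dim
# Registers would be learnable parameters; here they are small random vectors.
registers = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(num_registers)]

sequence = add_registers(patches, cls_token, registers)
outputs = toy_encoder(sequence)
cls_out, patch_out = split_outputs(outputs, num_patches, num_registers)

print(len(sequence), len(outputs), len(patch_out))
```

The point of the sketch is that downstream tasks only ever see `cls_out` and `patch_out`. The registers lengthen the sequence during the forward pass, which is where the extra memory and latency cost comes from.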

Vision Transformers Need Registers - Fixing a Bug in DINOv2?