In this video, we break down Meta AI’s DINOv3, the latest advancement in computer vision foundation models. Much like large language models in NLP, DINOv3 is designed as a general-purpose backbone for computer vision.

We'll thoroughly explain the self-supervised learning process that was used to train DINOv3.

We'll cover both the DINO and iBOT losses, which were already part of DINOv2.

Finally, we'll explain the main innovation in DINOv3's training - Gram Anchoring.

📝Full Review: https://aipapersacademy.com/dinov3
📄Paper: https://arxiv.org/abs/2508.10104
___________________
🔔 Subscribe for more AI paper reviews!

📩 Join the newsletter → https://aipapersacademy.com/newsletter/

Become a patron - https://www.patreon.com/aipapersacademy

The video was edited using VideoScribe - https://tidd.ly/44TZEiX
___________________
Chapters:
0:00 Introduction
1:00 What Is A Foundation Model?
2:47 DINOv3 Results
3:57 Data Curation
5:45 The DINO Loss
8:05 The iBOT Loss
9:39 DINOv2 Scaling Issues
11:00 Gram Anchoring
12:32 Gram Anchoring Results

AI Papers Academy

Meta AI's DINOv3 is the next generation of the DINOv2 vision foundation model, scaling from 1B to 7B parameters and from 142M to 1.7B training images. It produces frozen embeddings that can be reused across tasks such as segmentation and depth estimation without fine-tuning. Training combines the DINO loss (global image understanding via self-distillation) and the iBOT loss (patch-level local detail). A key challenge when scaling was the degradation of patch-level consistency during long training runs. The solution is Gram Anchoring, a new loss component that uses an earlier model checkpoint as a 'Gram teacher' and enforces that pairwise relationships between patch features remain consistent, preventing dense task performance from collapsing while preserving global classification accuracy. DINOv3 outperforms DINOv2, Google's SigLIP 2, and Meta's Perception Encoder across most vision benchmarks.
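
To make the Gram Anchoring idea more concrete, here is a minimal NumPy sketch of a loss that compares pairwise patch similarities between a student and a 'Gram teacher'. It is an illustration under stated assumptions, not the training code from the paper: the function names, the L2 normalization of patch features, the averaging over matrix entries, and the random toy features are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each patch feature vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def gram_matrix(patch_features):
    # patch_features: (num_patches, dim).
    # Entry (i, j) is the cosine similarity between patch i and patch j.
    x = l2_normalize(patch_features)
    return x @ x.T


def gram_anchoring_loss(student_patches, gram_teacher_patches):
    # Penalize differences between the two Gram matrices, i.e. differences
    # in pairwise patch relationships rather than in the features themselves.
    # Averaging over entries is an assumption chosen for this sketch.
    diff = gram_matrix(student_patches) - gram_matrix(gram_teacher_patches)
    return float(np.mean(diff ** 2))


# Toy data: 16 patches with 32-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 32))

# A rotated copy changes every feature vector but keeps all pairwise
# similarities, so the loss should stay near zero.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
rotated_student = teacher @ rotation

# A noisy copy changes the pairwise similarities, so the loss should grow.
drifted_student = teacher + rng.normal(size=(16, 32))

loss_same = gram_anchoring_loss(teacher, teacher)
loss_rotated = gram_anchoring_loss(rotated_student, teacher)
loss_drifted = gram_anchoring_loss(drifted_student, teacher)
print(loss_same, loss_rotated, loss_drifted)
```

The rotated case illustrates the design choice: the student's features are free to keep changing as long as the similarity structure between patches stays close to that of the earlier checkpoint, which is how global features can keep improving while patch-level consistency is preserved.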

DINOv3 Paper Explained: The Computer Vision Foundation Model