Meta AI's DINOv3 is the successor to the DINOv2 vision foundation model, scaling from 1B to 7B parameters and from 142M to 1.7B training images. It produces frozen embeddings that can be reused across tasks such as segmentation and depth estimation without fine-tuning. Training combines the DINO loss (global image understanding via self-distillation) with the iBOT loss (patch-level local detail). A key challenge when scaling was that patch-level consistency degraded during long training runs. The solution is Gram Anchoring, a new loss component that uses an earlier model checkpoint as a 'Gram teacher' and enforces that pairwise relationships between patch features stay consistent with that checkpoint, which prevents dense-task performance from collapsing while preserving global classification accuracy. DINOv3 outperforms DINOv2, Google's SigLIP 2, and Meta's Perception Encoder across most vision benchmarks.
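
The Gram Anchoring idea described above can be illustrated with a minimal sketch. This is not Meta's implementation: the function names, the use of L2-normalized patch features, and the mean-squared-difference reduction are assumptions made for illustration, and a real training setup would compute this on tensors in a deep learning framework rather than on Python lists. The sketch only shows how a loss can compare pairwise patch similarities (the Gram matrix) of the current model against those of an earlier checkpoint.

```python
import math


def l2_normalize(rows):
    """Scale each patch feature vector to unit L2 norm (assumed preprocessing)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        out.append([v / norm for v in row])
    return out


def gram_matrix(patches):
    """Pairwise cosine similarities between all patch features (P x P)."""
    unit = l2_normalize(patches)
    return [[sum(a * b for a, b in zip(p, q)) for q in unit] for p in unit]


def gram_anchoring_loss(student_patches, teacher_patches):
    """Mean squared difference between the student's Gram matrix and the
    Gram matrix of the earlier checkpoint (the 'Gram teacher').

    Hypothetical helper: the averaging over P*P entries is an assumption.
    """
    gs = gram_matrix(student_patches)
    gt = gram_matrix(teacher_patches)
    n = len(gs)
    total = sum((gs[i][j] - gt[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


# Toy example: 3 patches with 2-dimensional features.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]    # same features rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # every patch looks identical

loss_rotated = gram_anchoring_loss(rotated, teacher)
loss_collapsed = gram_anchoring_loss(collapsed, teacher)
```

In this toy example the rotated features incur essentially no loss, because the loss constrains only the relationships between patches and leaves the features themselves free to change. The collapsed features, where all patches have become indistinguishable, are penalized.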

13m watch time
