Researchers from Meta AI found that large vision transformer models such as DINOv2 develop "attention map artifacts": a small number of outlier, high-norm tokens in low-information background regions that the model repurposes to store global information, at the expense of the local information for those patches. These artifacts degrade tasks that rely on clean attention maps, such as object discovery. The fix, called "registers", adds extra tokens to the input sequence that give the model a dedicated place to store global information, so it no longer hijacks image patch tokens. The registers are discarded at output. In the reported results, registers nearly eliminate the artifacts and improve object discovery by ~20 points for DINOv2, with modest gains in segmentation and depth estimation, though DINOv1 still outperforms DINOv2 with registers on object discovery. Classification and other tasks see only marginal improvements, and each register lengthens the token sequence, so the added memory and latency cost may not always be justified.
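As a rough illustration of the mechanism described above, here is a minimal pure-Python sketch of the token bookkeeping: registers are added to the input sequence, dropped at output, and high-norm outliers are flagged by a simple norm threshold. The function names, the `[CLS] + registers + patches` ordering, the toy values, and the threshold of 10.0 are all assumptions made for illustration, not the authors' implementation. In a real model the registers would be learned embeddings and the identity step would be the transformer layers.

```python
import math

def add_registers(cls_token, patch_tokens, register_tokens):
    # Assumed input layout: [CLS] + registers + patches.
    return [cls_token] + list(register_tokens) + list(patch_tokens)

def strip_registers(output_tokens, num_registers):
    # Registers are discarded at output; only CLS and patch tokens are kept.
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out

def high_norm_indices(tokens, threshold):
    # Indices of tokens whose L2 norm exceeds the (hypothetical) threshold.
    return [i for i, tok in enumerate(tokens)
            if math.sqrt(sum(x * x for x in tok)) > threshold]

# Toy data: 4 patch tokens of dimension 2, one of them an outlier.
patches = [[0.1, 0.2], [0.3, 0.1], [30.0, 40.0], [0.2, 0.2]]
cls = [0.0, 0.0]
registers = [[0.0, 0.0] for _ in range(2)]  # learned embeddings in practice

sequence = add_registers(cls, patches, registers)

# A real model would run its transformer layers here; identity stands in.
outputs = sequence

cls_out, patch_out = strip_registers(outputs, num_registers=len(registers))
outliers = high_norm_indices(patch_out, threshold=10.0)
```

The sketch also makes the cost visible: the sequence grows by one token per register, which is where the extra memory and latency come from.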