Glitches in the Attention Matrix


Vision Transformers (ViTs) and large language models suffer from "high-norm artifacts" or "attention sinks": anomalous tokens whose norms are 2-10x larger than normal, emerging in middle-to-late layers, primarily in low-information areas. These artifacts stem from the softmax function forcing attention weights to sum to 1: a query can never attend to nothing, so models dump their surplus attention onto a few uninformative tokens.

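To see the mechanism concretely, here is a minimal PyTorch sketch (with invented logit values) contrasting standard softmax attention, which must distribute a total weight of 1 even when no token is relevant, with the sigmoidal gating covered in section 7, which drops that constraint:

```python
import torch

# Hypothetical attention logits for one query: every key is irrelevant,
# so all scores are strongly negative.
logits = torch.tensor([-8.0, -12.0, -12.0, -12.0])

# Softmax must normalize the weights to sum to 1, so the attention mass
# has to land somewhere: the least-irrelevant token becomes a "sink".
softmax_w = torch.softmax(logits, dim=-1)
print(softmax_w)        # tensor([0.9479, 0.0174, 0.0174, 0.0174])
print(softmax_w.sum())  # tensor(1.) -- forced, regardless of the logits

# A sigmoidal gate scores each token independently, with no sum-to-1
# constraint, so every weight can shrink toward 0 instead.
sigmoid_w = torch.sigmoid(logits)
print(sigmoid_w)        # tensor([3.3535e-04, 6.1442e-06, 6.1442e-06, 6.1442e-06])
print(sigmoid_w.sum())  # ~0.00035 -- the query is free to attend to (almost) nothing
```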
12 min read · towardsdatascience.com
Table of contents
1. Discovery of the Artifacts in ViTs with DINOv2
2. The Register Solution: Vision Transformers Need Registers (2024)
3. The Denoising Solution: Denoising Vision Transformers (2024)
4. The Distillation Solution: Self-Distilled Registers (2025)
5. The Mechanistic Solution: Test-Time Registers (2025)
6. Relationship between ViT High-Norm Artifacts and LLM Attention Sinks
7. Removing the Artifacts with Sigmoidal Gating: Gated Attention (2025)
8. Conclusion
