Glitches in the Attention Matrix


Vision Transformers (ViTs) and large language models suffer from "high-norm artifacts" or "attention sinks": anomalous tokens whose norms are 2-10x larger than normal, emerging in middle-to-late layers, primarily in low-information areas. These artifacts stem from the softmax function forcing attention weights to sum to 1: a query can never attend to nothing, so models dump their surplus attention onto a few uninformative tokens.

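To see the mechanism concretely, here is a minimal PyTorch sketch (with invented logit values) contrasting standard softmax attention, which must distribute a total weight of 1 even when no token is relevant, with the sigmoidal gating covered in section 7, which drops that constraint:

```python
import torch

# Hypothetical attention logits for one query: every key is irrelevant,
# so all scores are strongly negative.
logits = torch.tensor([-8.0, -12.0, -12.0, -12.0])

# Softmax must normalize the weights to sum to 1, so the attention mass
# has to land somewhere: the least-irrelevant token becomes a "sink".
softmax_w = torch.softmax(logits, dim=-1)
print(softmax_w)        # tensor([0.9479, 0.0174, 0.0174, 0.0174])
print(softmax_w.sum())  # tensor(1.) -- forced, regardless of the logits

# A sigmoidal gate scores each token independently, with no sum-to-1
# constraint, so every weight can shrink toward 0 instead.
sigmoid_w = torch.sigmoid(logits)
print(sigmoid_w)        # tensor([3.3535e-04, 6.1442e-06, 6.1442e-06, 6.1442e-06])
print(sigmoid_w.sum())  # ~0.00035 -- the query is free to attend to (almost) nothing
```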
12 min read · towardsdatascience.com
Table of contents
1. Discovery of the Artifacts in ViTs with DINOv2
2. The Register Solution: Vision Transformers Need Registers (2024)
3. The Denoising Solution: Denoising Vision Transformers (2024)
4. The Distillation Solution: Self-Distilled Registers (2025)
5. The Mechanistic Solution: Test-Time Registers (2025)
6. Relationship between ViT High-Norm Artifacts and LLM Attention Sinks
7. Removing the Artifacts with Sigmoidal Gating: Gated Attention (2025)
8. Conclusion
