Vision Transformers (ViT) make transformers practical for images by splitting each image into fixed-size patches rather than treating every pixel as a token. Feeding raw pixels would create an impractically large quadratic attention matrix (e.g., 65K×65K for a 256×256 image). Patches are flattened, projected via a trainable linear layer, combined with positional embeddings, and fed as a sequence to a standard transformer. A special learnable class token aggregates global image information for classification. Compared to CNNs, ViT has lower inductive bias — self-attention allows every patch to attend to every other patch at every layer, and even positional embeddings are learned from scratch rather than hand-designed.
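
The sketch below shows the input pipeline described above (patching, linear projection, class token, positional embeddings) in PyTorch. It is a minimal illustration, not the reference ViT implementation: the class name `ViTEmbedder` and the hyperparameters (image size 224, patch size 16, embedding dimension 768) are assumptions chosen for concreteness.

```python
import torch
import torch.nn as nn


class ViTEmbedder(nn.Module):
    """Illustrative ViT input embedding: patches -> tokens (assumed names/sizes)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Trainable linear projection of flattened patches, expressed here as a
        # strided convolution (an equivalent and common formulation).
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable class token that aggregates global image information.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learned positional embeddings for the class token plus all patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                         # x: (B, C, H, W)
        x = self.proj(x)                          # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, N, D) sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend class token
        return x + self.pos_embed                 # add positional embeddings


# A 224x224 image becomes 14*14 + 1 = 197 tokens of dimension 768,
# which are then fed to a standard transformer encoder.
tokens = ViTEmbedder()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

For classification, the transformer's output at the class-token position is typically passed to a small MLP head; the patch tokens carry per-patch features.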
