Researchers from ETH Zurich systematically investigated which components of the standard transformer block architecture can be removed without degrading performance or training efficiency. Key findings include: residual connections can be eliminated by constraining self-attention initialization, value and projection matrices can be fixed as identity matrices without performance loss, and switching from sequential to parallel attention/feedforward sub-blocks allows dropping remaining residual connections. However, removing normalization layers entirely hurts fine-tuning performance. The resulting simplified architecture matches standard transformer training efficiency on language modeling tasks, offering a path toward cheaper and more accessible NLP model training.
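To make the architectural changes concrete, here is a minimal, hypothetical PyTorch sketch of a block with parallel attention/feed-forward sub-blocks and the value and output projections fixed to identity (i.e., omitted). This is not the authors' code: the class name, dimensions, and layout are illustrative assumptions, and the sketch keeps a residual connection and normalization for simplicity, whereas the paper's full recipe additionally drops the remaining skips via its constrained attention initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedParallelBlock(nn.Module):
    """Illustrative sketch only: parallel attention/MLP sub-blocks with the
    value (W_V) and output (W_O) projections fixed to identity, i.e. removed.
    Normalization is retained, since the summary notes removing it hurts
    fine-tuning; the residual is kept here purely to keep the sketch simple."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.norm = nn.LayerNorm(d_model)
        # Only query and key projections remain; W_V and W_O are identity.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # Feed-forward sub-block, run in parallel with attention.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        h = self.norm(x)

        # Attention with an identity value projection: the "values" are just
        # the normalized inputs, split into heads.
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = h.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)  # no output projection (identity)

        # Parallel sub-blocks: attention and feed-forward both read from h,
        # and their outputs are combined in a single step.
        return x + attn + self.ff(h)
```

As a usage note, the block can be dropped into a decoder stack in place of a standard pre-norm block; the main practical difference is that two of the four attention weight matrices simply disappear, which is where part of the reported efficiency gain comes from.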
Table of contents
The Surging Popularity and Increasing Scrutiny Over Transformer Efficiency
The Multi-Headed Self-Attention Mechanism Behind Transformers
The Standard Transformer Block Architecture
A Methodical Exploration Removing Non-Essential Components
Significance - A Promising Step Towards Cheaper Transformer Training