Researchers from ETH Zurich systematically investigated which components of the standard transformer block architecture can be removed without degrading performance or training efficiency. Key findings include: residual connections can be eliminated by constraining self-attention initialization, value and projection matrices can be fixed as identity matrices without performance loss, and switching from sequential to parallel attention/feedforward sub-blocks allows dropping remaining residual connections. However, removing normalization layers entirely hurts fine-tuning performance. The resulting simplified architecture matches standard transformer training efficiency on language modeling tasks, offering a path toward cheaper and more accessible NLP model training.
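To make the architectural changes concrete, here is a minimal, hypothetical PyTorch sketch of a block with parallel attention/feed-forward sub-blocks and the value and output projections fixed to identity (i.e., omitted). This is not the authors' code: the class name, dimensions, and layout are illustrative assumptions, and the sketch keeps a residual connection and normalization for simplicity, whereas the paper's full recipe additionally drops the remaining skips via its constrained attention initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedParallelBlock(nn.Module):
    """Illustrative sketch only: parallel attention/MLP sub-blocks with the
    value (W_V) and output (W_O) projections fixed to identity, i.e. removed.
    Normalization is retained, since the summary notes removing it hurts
    fine-tuning; the residual is kept here purely to keep the sketch simple."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.norm = nn.LayerNorm(d_model)
        # Only query and key projections remain; W_V and W_O are identity.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        # Feed-forward sub-block, run in parallel with attention.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        h = self.norm(x)

        # Attention with an identity value projection: the "values" are just
        # the normalized inputs, split into heads.
        q = self.q_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = h.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)  # no output projection (identity)

        # Parallel sub-blocks: attention and feed-forward both read from h,
        # and their outputs are combined in a single step.
        return x + attn + self.ff(h)
```

As a usage note, the block can be dropped into a decoder stack in place of a standard pre-norm block; the main practical difference is that two of the four attention weight matrices simply disappear, which is where part of the reported efficiency gain comes from.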
Table of contents
The Surging Popularity and Increasing Scrutiny Over Transformer Efficiency
The Multi-Headed Self-Attention Mechanism Behind Transformers
The Standard Transformer Block Architecture
A Methodical Exploration Removing Non-Essential Components
Significance - A Promising Step Towards Cheaper Transformer Training