Adam-mini is a recently introduced optimizer that significantly reduces memory usage when training large language models while maintaining or improving performance. The standard Adam optimizer keeps two state values for every parameter, the first-moment and second-moment estimates, so its optimizer state takes up roughly twice as much memory as the model weights themselves. Adam-mini addresses this by partitioning model parameters into blocks based on the Hessian structure of transformers and assigning a single effective learning rate to each block, rather than one per parameter. This partitioning reduces memory usage by 45% to 50% compared with Adam and improves throughput by nearly 50%, making the training of large models more efficient and accessible, especially for researchers with limited GPU resources.
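
To make the block idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `adam_mini_step`, the flat-list layout, and the `(start, end)` block partition are all hypothetical, and the real method derives its blocks from the transformer's Hessian structure. The sketch only shows the memory trick: the first moment stays per-parameter, while the second moment shrinks to one scalar per block, taken here as the mean of squared gradients in that block.

```python
import math

def adam_mini_step(params, grads, m, v_block, blocks, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified block-wise update (illustrative only).

    params, grads, m: flat lists of floats of equal length.
    v_block: one second-moment scalar per block.
    blocks: list of (start, end) index ranges; a hypothetical partition.
    """
    for b, (start, end) in enumerate(blocks):
        # One second-moment scalar per block: mean of squared gradients.
        mean_sq = sum(g * g for g in grads[start:end]) / (end - start)
        v_block[b] = beta2 * v_block[b] + (1 - beta2) * mean_sq
        v_hat = v_block[b] / (1 - beta2 ** step)  # bias correction
        for i in range(start, end):
            # First moment is still stored per parameter, as in Adam.
            m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
            m_hat = m[i] / (1 - beta1 ** step)
            # Every parameter in the block shares lr / sqrt(v_hat).
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


params = [1.0, 2.0, 3.0, 4.0]
grads = [0.1, -0.2, 0.3, 0.4]
blocks = [(0, 2), (2, 4)]        # two hypothetical blocks
m = [0.0] * len(params)          # per-parameter first moment
v_block = [0.0] * len(blocks)    # 2 scalars instead of 4 per-parameter values
adam_mini_step(params, grads, m, v_block, blocks, step=1)
```

In this toy case the second-moment state drops from four values to two; with blocks that each cover many parameters, that state becomes negligible, which would account for a saving of close to half of Adam's optimizer memory.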

From marktechpost.com (4m read time)