Researchers from Apple and EPFL have introduced AdEMAMix, an optimizer that combines two Exponential Moving Averages (EMAs) of past gradients: a fast-decaying EMA that tracks recent gradients and a slow-decaying EMA that retains much older gradient information. By mixing the two, AdEMAMix converges faster than single-EMA optimizers such as AdamW, reaching comparable loss with fewer training tokens and lower compute cost, while also improving final model performance and reducing training instabilities.
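To make the two-EMA idea concrete, here is a minimal NumPy sketch of an AdEMAMix-style update step. It keeps Adam's fast momentum and second-moment estimates, adds a slow gradient EMA, and mixes the two momenta in the numerator. The hyperparameter names (`b1`, `b2`, `b3`, `alpha`) and the toy quadratic problem are illustrative choices, not the authors' exact implementation, which also schedules `alpha` and `b3` during training.

```python
import numpy as np

def ademamix_step(theta, g, state, lr=1e-3,
                  b1=0.9, b2=0.999, b3=0.9999,
                  alpha=5.0, eps=1e-8):
    """One AdEMAMix-style update: Adam's fast EMA plus a slow EMA of gradients."""
    state["t"] += 1
    t = state["t"]
    # Fast EMA (as in Adam) reacts quickly to recent gradients.
    state["m1"] = b1 * state["m1"] + (1 - b1) * g
    # Slow EMA (b3 close to 1) retains much older gradient information.
    state["m2"] = b3 * state["m2"] + (1 - b3) * g
    # Second-moment EMA, as in Adam.
    state["nu"] = b2 * state["nu"] + (1 - b2) * g ** 2
    # Bias-correct the fast EMA and second moment.
    m1_hat = state["m1"] / (1 - b1 ** t)
    nu_hat = state["nu"] / (1 - b2 ** t)
    # Mix fast and slow momentum in the numerator of the adaptive step.
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(nu_hat) + eps)

# Toy usage: minimize f(theta) = ||theta||^2 for a 3-d parameter vector.
theta = np.array([1.0, -2.0, 0.5])
state = {"t": 0, "m1": np.zeros(3), "m2": np.zeros(3), "nu": np.zeros(3)}
for _ in range(2000):
    grad = 2 * theta  # gradient of the quadratic objective
    theta = ademamix_step(theta, grad, state, lr=0.05)
```

The slow EMA lets very old gradients keep contributing to each step, which is the mechanism the authors credit for needing fewer tokens to reach a given loss.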