NanoGPT Slowrun is an open benchmarking effort by Q Labs focused on data-efficient learning algorithms for language models. Unlike speedrun benchmarks, which optimize wall-clock time, Slowrun trains on a fixed dataset of 100M FineWeb tokens with unlimited compute and rewards the algorithm that reaches the lowest validation loss. Community contributions have already pushed data efficiency from 2.4x to 5.5x relative to modded-nanogpt in just days. Key findings so far: the Muon optimizer outperforms AdamW, SOAP, and MAGMA; multi-epoch training needs aggressive regularization (weight decay up to 16x the standard value, plus dropout); and shuffling at the start of each epoch, learned value embedding projections, SwiGLU activations, and model ensembling all help. Open research directions include second-order optimizers, diffusion models, curriculum learning, and alternatives to gradient descent. The project aims for 10x data efficiency in the short term and potentially 100x by year-end.
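
Two of the findings above, shuffling at the start of each epoch and model ensembling, can be illustrated with a small sketch. This is not taken from the Slowrun code: the function names `epoch_orders` and `ensemble_nll` are hypothetical, and averaging the models' predicted probabilities is only one common way to ensemble; the summary does not say which method the benchmark entries use.

```python
import math
import random


def epoch_orders(num_examples, num_epochs, seed=0):
    """Yield a fresh random permutation of example indices for each epoch.

    Hypothetical helper: reshuffling at the start of every epoch means the
    model sees the same fixed dataset in a different order on each pass.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(num_examples))
        rng.shuffle(order)  # new permutation at epoch start
        yield order


def ensemble_nll(per_model_probs, targets):
    """Mean negative log-likelihood of the probability-averaged ensemble.

    per_model_probs[m][i] is model m's probability distribution over the
    vocabulary at position i; targets[i] is the correct token id.
    Assumption: the ensemble averages probabilities, not logits.
    """
    num_models = len(per_model_probs)
    total = 0.0
    for i, target in enumerate(targets):
        avg_p = sum(probs[i][target] for probs in per_model_probs) / num_models
        total -= math.log(avg_p)
    return total / len(targets)


if __name__ == "__main__":
    for epoch, order in enumerate(epoch_orders(num_examples=8, num_epochs=3)):
        print(f"epoch {epoch}: {order}")

    # Two toy "models" predicting over a 2-token vocabulary at 2 positions.
    model_a = [[0.9, 0.1], [0.2, 0.8]]
    model_b = [[0.5, 0.5], [0.6, 0.4]]
    print("ensemble loss:", ensemble_nll([model_a, model_b], targets=[0, 1]))
```

Because the logarithm is concave, the loss of the averaged distribution is never worse than the average of the individual models' losses, which is one reason probability averaging is a natural baseline when compute is unlimited and data is fixed.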

3m read time · From qlabs.sh
Table of contents
What we've found so far
Update: 5.5x Data Efficiency
Directions we think are wide open
