Researchers propose using Neural Cellular Automata (NCA) as a synthetic pre-pre-training data source for language models, bypassing the need for natural language text. NCA trajectories — generated by randomly sampled neural networks acting as transition rules on a grid — are tokenized and used to train transformers via next-token prediction. With only 164M tokens, NCA pre-pre-training outperforms training on natural language (C4) and other synthetic data across web text, math, and code benchmarks. Even when C4 is given 10× more data (1.6B tokens), NCA still converges 1.4× faster and achieves 5% better final perplexity. The key mechanism: NCA sequences contain no semantic shortcuts, forcing models to develop in-context rule inference and robust induction heads in attention layers. Optimal NCA complexity varies by target domain, offering a new lever for targeted, compute-efficient training. The long-term vision is foundation models that acquire reasoning from fully synthetic data before learning semantics from a small curated corpus.
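To make the described pipeline concrete, here is a minimal sketch (not the authors' code) of the idea: a randomly initialized MLP acts as the NCA transition rule, a trajectory is rolled out on a small grid, and the quantized cell states are flattened into a token stream suitable for next-token prediction. The grid size, hidden width, 3x3 neighborhood, and vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 16            # grid size (assumed for illustration)
HIDDEN = 32           # width of the randomly sampled transition network
VOCAB = 16            # number of discrete cell states after quantization
STEPS = 8             # rollout length

# Randomly sample the NCA transition rule: a tiny MLP mapping a cell's
# 3x3 neighborhood (9 scalars) to the cell's next continuous state.
W1 = rng.normal(scale=0.5, size=(9, HIDDEN))
b1 = rng.normal(scale=0.1, size=HIDDEN)
W2 = rng.normal(scale=0.5, size=(HIDDEN, 1))

def step(grid: np.ndarray) -> np.ndarray:
    """Apply the transition rule to every cell (toroidal boundary)."""
    padded = np.pad(grid, 1, mode="wrap")
    # Gather each cell's 3x3 neighborhood into shape (H*W, 9).
    neigh = np.stack(
        [padded[i:i + H, j:j + W] for i in range(3) for j in range(3)],
        axis=-1,
    ).reshape(-1, 9)
    hidden = np.tanh(neigh @ W1 + b1)
    return np.tanh(hidden @ W2).reshape(H, W)

def tokenize(grid: np.ndarray) -> np.ndarray:
    """Quantize continuous cell states in [-1, 1] into VOCAB discrete tokens."""
    bins = np.linspace(-1.0, 1.0, VOCAB + 1)[1:-1]
    return np.digitize(grid, bins).reshape(-1)

grid = rng.uniform(-1.0, 1.0, size=(H, W))
tokens = []
for _ in range(STEPS):
    tokens.append(tokenize(grid))
    grid = step(grid)

sequence = np.concatenate(tokens)      # one training sequence of token ids
print(sequence.shape, sequence[:20])   # e.g. (2048,) plus the first 20 tokens
```

A transformer trained with a standard next-token objective on many such sequences (each from a freshly sampled rule) has no semantic shortcuts available; predicting the next token requires inferring the underlying transition rule in context, which is the mechanism the post credits for the transfer.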

From hanseungwook.github.io
Table of contents
- We're running out of text
- Neural Cellular Automata as synthetic fuel
- The surprising payoff
- What drives the transfer?
- A purer training signal
- Beyond one-size-fits-all
- Citation
