Researchers propose using Neural Cellular Automata (NCA) as a synthetic pre-pre-training data source for language models, bypassing the need for natural language text. NCA trajectories — generated by randomly sampled neural networks acting as transition rules on a grid — are tokenized and used to train transformers via next-token prediction. With only 164M tokens, NCA pre-pre-training outperforms training on natural language (C4) and other synthetic data across web text, math, and code benchmarks. Even when C4 is given 10× more data (1.6B tokens), NCA still converges 1.4× faster and achieves 5% better final perplexity. The key mechanism: NCA sequences contain no semantic shortcuts, forcing models to develop in-context rule inference and robust induction heads in attention layers. Optimal NCA complexity varies by target domain, offering a new lever for targeted, compute-efficient training. The long-term vision is foundation models that acquire reasoning from fully synthetic data before learning semantics from a small curated corpus.
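The summary above describes the data pipeline only at a high level. As a rough illustration of the idea (not the authors' actual code), the sketch below rolls out a toy NCA whose transition rule is a randomly initialized two-layer MLP over each cell's 3x3 neighborhood, quantizes the grid states into discrete tokens, and concatenates them into one sequence suitable for next-token prediction; the grid size, rollout length, vocabulary size, and quantization scheme are all hypothetical choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 8      # grid side length (illustrative choice)
STEPS = 16    # number of rollout steps per trajectory
VOCAB = 16    # discrete token vocabulary after quantization

# Randomly sampled "transition rule": a tiny two-layer MLP applied to each
# cell's 3x3 neighborhood (9 inputs -> 32 hidden units -> 1 output).
W1 = rng.normal(0.0, 1.0, (9, 32))
W2 = rng.normal(0.0, 1.0, (32, 1))

def step(state: np.ndarray) -> np.ndarray:
    """Apply the random local rule to every cell, with wrap-around edges."""
    new = np.empty_like(state)
    for i in range(GRID):
        for j in range(GRID):
            neigh = np.array([
                state[(i + di) % GRID, (j + dj) % GRID]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
            ])
            new[i, j] = np.tanh(np.tanh(neigh @ W1) @ W2)[0]
    return new

def tokenize(state: np.ndarray) -> np.ndarray:
    """Quantize cell values in [-1, 1] into VOCAB discrete tokens, flattened."""
    edges = np.linspace(-1.0, 1.0, VOCAB + 1)[1:-1]
    return np.digitize(state.ravel(), edges)

# Roll out one trajectory and concatenate the per-step token grids into a
# single sequence, ready for a transformer trained with a next-token objective.
state = rng.uniform(-1.0, 1.0, (GRID, GRID))
tokens: list[int] = []
for _ in range(STEPS):
    tokens.extend(int(t) for t in tokenize(state))
    state = step(state)

print(len(tokens))  # 8 * 8 * 16 = 1024 tokens from one randomly sampled rule
```

In this toy version, each randomly sampled rule yields a different trajectory, so a large corpus can be generated by drawing many rules; a model trained on such sequences can only lower its loss by inferring the active rule in context, which is the mechanism the summary credits for the transfer.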
Table of contents
We're running out of text
Neural Cellular Automata as synthetic fuel
The surprising payoff
What drives the transfer?
A purer training signal
Beyond one-size-fits-all
Citation