Researchers propose using Neural Cellular Automata (NCA) as a synthetic pre-pre-training data source for language models, bypassing the need for natural language text. NCA trajectories — generated by randomly sampled neural networks acting as transition rules on a grid — are tokenized and used to train transformers via next-token prediction. With only 164M tokens, NCA pre-pre-training outperforms training on natural language (C4) and other synthetic data across web text, math, and code benchmarks. Even when C4 is given 10× more data (1.6B tokens), NCA still converges 1.4× faster and achieves 5% better final perplexity. The key mechanism: NCA sequences contain no semantic shortcuts, forcing models to develop in-context rule inference and robust induction heads in attention layers. Optimal NCA complexity varies by target domain, offering a new lever for targeted, compute-efficient training. The long-term vision is foundation models that acquire reasoning from fully synthetic data before learning semantics from a small curated corpus.
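To make the described pipeline concrete, here is a minimal sketch (not the authors' code) of the idea: a randomly initialized MLP acts as the NCA transition rule, a trajectory is rolled out on a small grid, and the quantized cell states are flattened into a token stream suitable for next-token prediction. The grid size, hidden width, 3x3 neighborhood, and vocabulary size are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 16            # grid size (assumed for illustration)
HIDDEN = 32           # width of the randomly sampled transition network
VOCAB = 16            # number of discrete cell states after quantization
STEPS = 8             # rollout length

# Randomly sample the NCA transition rule: a tiny MLP mapping a cell's
# 3x3 neighborhood (9 scalars) to the cell's next continuous state.
W1 = rng.normal(scale=0.5, size=(9, HIDDEN))
b1 = rng.normal(scale=0.1, size=HIDDEN)
W2 = rng.normal(scale=0.5, size=(HIDDEN, 1))

def step(grid: np.ndarray) -> np.ndarray:
    """Apply the transition rule to every cell (toroidal boundary)."""
    padded = np.pad(grid, 1, mode="wrap")
    # Gather each cell's 3x3 neighborhood into shape (H*W, 9).
    neigh = np.stack(
        [padded[i:i + H, j:j + W] for i in range(3) for j in range(3)],
        axis=-1,
    ).reshape(-1, 9)
    hidden = np.tanh(neigh @ W1 + b1)
    return np.tanh(hidden @ W2).reshape(H, W)

def tokenize(grid: np.ndarray) -> np.ndarray:
    """Quantize continuous cell states in [-1, 1] into VOCAB discrete tokens."""
    bins = np.linspace(-1.0, 1.0, VOCAB + 1)[1:-1]
    return np.digitize(grid, bins).reshape(-1)

grid = rng.uniform(-1.0, 1.0, size=(H, W))
tokens = []
for _ in range(STEPS):
    tokens.append(tokenize(grid))
    grid = step(grid)

sequence = np.concatenate(tokens)      # one training sequence of token ids
print(sequence.shape, sequence[:20])   # e.g. (2048,) plus the first 20 tokens
```

A transformer trained with a standard next-token objective on many such sequences (each from a freshly sampled rule) has no semantic shortcuts available; predicting the next token requires inferring the underlying transition rule in context, which is the mechanism the post credits for the transfer.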

From hanseungwook.github.io
Table of contents
- We're running out of text
- Neural Cellular Automata as synthetic fuel
- The surprising payoff
- What drives the transfer?
- A purer training signal
- Beyond one-size-fits-all
- Citation
