AI labs are running out of public internet data to train on, and their aggressive moves into free coding tools, billion-dollar acquisitions (Cursor, Windsurf), and distillation attacks on competitors all stem from the same root cause: the need for new, high-quality training data. Every interaction with a free AI coding assistant generates expert-annotated workflow data that doesn't exist on the public internet. Frontier models are converging in quality because they were trained on the same data, and that convergence is driving API prices down 60-80%. The author argues models will fully commoditize within 1-2 years, with differentiation shifting to execution speed, integrations, and privacy, and advises developers to be deliberate about which tools they use and what data they're comfortable sharing.
Table of contents
The data is gone
Strategy 1: Build tools that generate data
Strategy 2: Acquire the data moat
Strategy 3: Distill from competitors
What this means for developers
The commodity floor