Stanford research analyzing 120,000 developers finds that AI coding tools yield a median 10% productivity gain, but outcomes vary dramatically between teams: top performers compound their gains while struggling teams fall behind. Key findings: codebase cleanliness (tests, documentation, modularity) correlates strongly with AI productivity gains (R² = 0.40), while token usage volume correlates only weakly (0.20), suggesting that the quality of AI usage matters more than the quantity. A case study of 350 engineers showed 14% more PRs but 9% lower code quality and 2.5x more rework, resulting in no net productivity gain. The research also introduces an AI engineering practices benchmark that detects AI usage patterns in codebases, and it proposes measuring ROI through engineering output (estimated by an ML model that replicates expert-panel evaluations) plus guardrail metrics for quality and rework, rather than through simple PR counts.
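To make the "output plus guardrails" idea concrete, here is a minimal Python sketch that replays the case study figures above. It is an illustration under stated assumptions, not the researchers' method. The names `PeriodMetrics` and `evaluate_rollout` and the two guardrail thresholds are hypothetical, and the baseline values are normalized placeholders chosen only so that the relative changes match the summary (14% more output, 9% lower quality, 2.5x rework). PR count stands in for output here purely to reuse those numbers; the research itself proposes a model-based output score instead.

```python
from dataclasses import dataclass


@dataclass
class PeriodMetrics:
    """Aggregate team metrics for one measurement period (hypothetical units)."""
    output: float   # engineering output score, higher is better
    quality: float  # code quality score, higher is better
    rework: float   # volume of rework, lower is better


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (0.14 means +14%)."""
    return (after - before) / before


def evaluate_rollout(
    before: PeriodMetrics,
    after: PeriodMetrics,
    max_quality_drop: float = 0.05,  # assumed threshold, not from the research
    max_rework_ratio: float = 1.5,   # assumed threshold, not from the research
) -> dict:
    """Report the output change together with quality and rework guardrails."""
    output_change = relative_change(before.output, after.output)
    quality_change = relative_change(before.quality, after.quality)
    rework_ratio = after.rework / before.rework
    guardrails_ok = (
        quality_change >= -max_quality_drop and rework_ratio <= max_rework_ratio
    )
    return {
        "output_change": output_change,
        "quality_change": quality_change,
        "rework_ratio": rework_ratio,
        "guardrails_ok": guardrails_ok,
    }


# Normalized placeholder baselines reproducing the case study's relative changes.
before = PeriodMetrics(output=100.0, quality=100.0, rework=1.0)
after = PeriodMetrics(output=114.0, quality=91.0, rework=2.5)

result = evaluate_rollout(before, after)
for name, value in result.items():
    print(f"{name}: {value}")
```

With these inputs the output change looks positive on its own, but both guardrails are breached, which is the pattern the case study describes: a raw PR count would report a gain that the quality and rework metrics do not support.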