Millions of GitHub projects, but how many have excellent code? Training LLMs on 98% garbage means 2% good stuff, if you're lucky. A critical look at AI data quality. #AICode #DataQuality #LLMTraining #GitHub

Serious CTO offers insights, tutorials, and resources for Chief Technology Officers (CTOs) and technology leaders. Readers can learn about technology strategy, team management, and leadership skills. With articles, interviews, and case studies, Serious CTO provides guidance and expertise for leading technology organizations and driving innovation.

The Serious CTO

A short clip raises the concern that only 1-2% of GitHub projects contain genuinely good code, which would mean LLMs trained on GitHub data learn from a corpus that is roughly 98% low-quality code. The implication is that training data quality is a fundamental challenge for AI coding tools.

Training LLMs on GitHub: The 2% Good Code Problem #shorts