A conversation between John Willis and Josh Long highlights a critical flaw in AI model benchmarks: models have essentially memorized outdated SRE benchmark tests (mostly Python-focused, from three years ago), so they score 70-80% and appear highly capable. When given modern tasks in Java, however, the same models perform like beginners. The point is that benchmark gaming creates a misleading picture of true AI capability.