Coffee + Software

A conversation between John Willis and Josh Long highlights a critical flaw in AI model benchmarks. Models have essentially memorized outdated SRE benchmark tests (mostly Python-focused, from three years ago), so they score 70-80% and appear highly capable. When given modern tasks in Java, however, the same models perform like beginners. The takeaway is that benchmark gaming creates a misleading picture of true AI capability.

Models and benchmarks, genius or failing students - John Willis and Josh Long #CoffeeSoftware