New Java Benchmark for Coding LLMs Puts GPT-5 at the Top
The Brokk Power Ranking is a new open-source benchmark that evaluates coding LLMs on 93 real-world Java tasks drawn from large codebases. GPT-5 leads in every category and at every price point, though it suffers from slower inference speeds. The benchmark addresses limitations of existing tools like SWE-bench by using fresh, complex tasks that better reflect real-world coding scenarios. Chinese models perform worse than expected, and the results show that context length and task complexity significantly affect model performance.