Marginlab provides a daily tracker for Claude Code with Opus 4.5 that monitors its results on SWE-Bench-Pro tasks to detect statistically significant degradations. The tracker runs daily benchmarks on 50 test instances using the actual Claude Code CLI (no custom harnesses), applies statistical significance testing
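
To make "statistically significant degradation" concrete, here is a minimal sketch of one common way to compare a day's pass rate against a baseline: a one-sided two-proportion z-test. The text above does not say which test Marginlab applies, so the choice of test, the function name `pass_rate_drop_p_value`, and the pass counts in the usage lines are all assumptions made for illustration.

```python
import math


def pass_rate_drop_p_value(baseline_passes, baseline_n, today_passes, today_n):
    """One-sided two-proportion z-test (an assumed test, not Marginlab's confirmed method).

    Returns the p-value for the hypothesis that today's pass rate is lower
    than the baseline pass rate. Small values suggest a real degradation.
    """
    p_base = baseline_passes / baseline_n
    p_today = today_passes / today_n

    # Pooled pass rate under the null hypothesis that both rates are equal.
    pooled = (baseline_passes + today_passes) / (baseline_n + today_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / today_n))
    if se == 0:
        # Everything passed or everything failed in both runs: no evidence of a drop.
        return 1.0

    z = (p_base - p_today) / se
    # Upper-tail probability of the standard normal distribution, P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts: 30 of 50 tasks passed at baseline, 20 of 50 passed today.
p_value = pass_rate_drop_p_value(30, 50, 20, 50)
degraded = p_value < 0.05  # 0.05 is an assumed significance threshold
```

With only 50 instances per run, the standard error of a pass rate is large, so a test like this flags only fairly big drops. That is one reason a tracker might pool several days of runs before declaring a degradation.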