Cursor's engineering team explains how they evaluate model quality using a hybrid online-offline approach. Their internal benchmark, CursorBench, is built from real Cursor sessions, using Cursor Blame to trace committed code back to the agent requests that produced it. Unlike public benchmarks such as SWE-bench, CursorBench draws its tasks from internal, real-world usage rather than public repositories.
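A rough way to picture that pipeline, as a minimal sketch: group blame entries by the agent request that produced them, and turn requests whose output survived into commits into benchmark tasks, with the committed code as the reference. The types and field names here (`AgentRequest`, `BlameEntry`, `request_id`, `min_lines`) are illustrative assumptions, not Cursor's internal schema.

```python
from dataclasses import dataclass, field

# Hypothetical records; Cursor's internal data model is not public.
@dataclass
class AgentRequest:
    request_id: str
    prompt: str       # what the user asked the agent to do
    session_id: str

@dataclass
class BlameEntry:
    commit_sha: str
    file_path: str
    line_no: int
    request_id: str | None  # agent request (if any) that produced this line

@dataclass
class BenchmarkTask:
    request_id: str
    prompt: str
    reference_files: set[str] = field(default_factory=set)

def build_tasks(requests: list[AgentRequest],
                blame: list[BlameEntry],
                min_lines: int = 5) -> list[BenchmarkTask]:
    """Trace committed lines back to agent requests and keep requests
    with a large enough committed footprint as benchmark tasks."""
    by_request: dict[str, list[BlameEntry]] = {}
    for entry in blame:
        if entry.request_id is not None:
            by_request.setdefault(entry.request_id, []).append(entry)

    prompts = {r.request_id: r.prompt for r in requests}
    tasks = []
    for request_id, entries in by_request.items():
        # Skip requests whose surviving code is too small to grade against.
        if len(entries) < min_lines or request_id not in prompts:
            continue
        tasks.append(BenchmarkTask(
            request_id=request_id,
            prompt=prompts[request_id],
            reference_files={e.file_path for e in entries},
        ))
    return tasks
```

The filtering step reflects the hedge that only agent output which actually landed in a commit makes a useful task; the `min_lines` threshold is an assumed heuristic for discarding trivial edits.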

5 min read · from cursor.com
Table of contents
- The limitations of public benchmarks
- Building CursorBench
- CursorBench shows more separation between models
- CursorBench scores align with online evals
- The next eval suite
