Cursor's engineering team explains how they evaluate model quality using a hybrid online-offline approach. Their internal benchmark, CursorBench, is built from real Cursor sessions using Cursor Blame to trace committed code back to agent requests. Unlike public benchmarks such as SWE-bench, CursorBench uses tasks from internal codebases, features intentionally underspecified prompts, and covers longer multi-file tasks—better reflecting how developers actually use coding agents. CursorBench-3 shows stronger model separation at frontier levels where public benchmarks are saturated. Online evals complement the offline suite by catching regressions where outputs score well in grading but feel worse to real users. Future plans include adapting the suite for long-running agents working across sessions.
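
As a rough illustration of the hybrid online-offline idea described above, the sketch below pairs offline benchmark grades with an online acceptance signal and flags models whose offline score outpaces how users actually receive their edits. This is a minimal, assumed example: the names (`ModelEval`, `online_accept_rate`, the `gap` threshold) are hypothetical and not drawn from Cursor's post.

```python
from dataclasses import dataclass

# Hypothetical sketch: combine offline benchmark grades with online signals
# to spot models that score well offline but regress for real users.

@dataclass
class ModelEval:
    model: str
    offline_score: float       # mean grade on the offline task suite (0..1), assumed metric
    online_accept_rate: float  # share of agent edits users kept (0..1), assumed metric

def flag_divergences(evals: list[ModelEval], gap: float = 0.15) -> list[str]:
    """Return models whose offline score exceeds their online signal by more than `gap`."""
    return [
        e.model
        for e in evals
        if e.offline_score - e.online_accept_rate > gap
    ]

if __name__ == "__main__":
    evals = [
        ModelEval("model-a", offline_score=0.82, online_accept_rate=0.78),
        ModelEval("model-b", offline_score=0.85, online_accept_rate=0.61),
    ]
    # model-b grades well offline but its edits are kept far less often,
    # the kind of regression online evals are meant to surface.
    print(flag_divergences(evals))
```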

5 min read · From cursor.com
Table of contents:
- The limitations of public benchmarks
- Building CursorBench
- CursorBench shows more separation between models
- CursorBench scores align with online evals
- The next eval suite
