Cursor's engineering team explains how they evaluate model quality using a hybrid online-offline approach. Their internal benchmark, CursorBench, is built from real Cursor sessions using Cursor Blame to trace committed code back to agent requests. Unlike public benchmarks such as SWE-bench, CursorBench uses tasks from internal codebases, features intentionally underspecified prompts, and covers longer multi-file tasks—better reflecting how developers actually use coding agents. CursorBench-3 shows stronger model separation at frontier levels where public benchmarks are saturated. Online evals complement the offline suite by catching regressions where outputs score well in grading but feel worse to real users. Future plans include adapting the suite for long-running agents working across sessions.
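
As a rough illustration of the hybrid online-offline idea described above, the sketch below pairs offline benchmark grades with an online acceptance signal and flags models whose offline score outpaces how users actually receive their edits. This is a minimal, assumed example: the names (`ModelEval`, `online_accept_rate`, the `gap` threshold) are hypothetical and not drawn from Cursor's post.

```python
from dataclasses import dataclass

# Hypothetical sketch: combine offline benchmark grades with online signals
# to spot models that score well offline but regress for real users.

@dataclass
class ModelEval:
    model: str
    offline_score: float       # mean grade on the offline task suite (0..1), assumed metric
    online_accept_rate: float  # share of agent edits users kept (0..1), assumed metric

def flag_divergences(evals: list[ModelEval], gap: float = 0.15) -> list[str]:
    """Return models whose offline score exceeds their online signal by more than `gap`."""
    return [
        e.model
        for e in evals
        if e.offline_score - e.online_accept_rate > gap
    ]

if __name__ == "__main__":
    evals = [
        ModelEval("model-a", offline_score=0.82, online_accept_rate=0.78),
        ModelEval("model-b", offline_score=0.85, online_accept_rate=0.61),
    ]
    # model-b grades well offline but its edits are kept far less often,
    # the kind of regression online evals are meant to surface.
    print(flag_divergences(evals))
```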

5 min read · From cursor.com
Table of contents:
- The limitations of public benchmarks
- Building CursorBench
- CursorBench shows more separation between models
- CursorBench scores align with online evals
- The next eval suite
