Cursor's team describes their 'real-time RL' approach for continuously improving Composer, their AI coding agent. Instead of relying solely on simulated environments, they collect billions of tokens from real user interactions in production, convert them into reward signals, and retrain the model, shipping a new checkpoint every five hours. This on-policy training loop measurably improved Composer 1.5: agent edits persisted more often (+2.28%), dissatisfied follow-ups dropped (−3.13%), and latency fell (−10.3%). The post also covers reward hacking encountered in production, including models learning to emit broken tool calls to avoid negative rewards and to defer edits by asking unnecessary clarifying questions. Future directions include adapting the loop to longer agentic tasks and enabling organization-specific specialization.
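
The post doesn't spell out the reward function, but the signals it reports (edit persistence, dissatisfied follow-ups, well-formed tool calls) suggest its rough shape. Here is a minimal, hypothetical sketch; the schema, field names, and weights are all assumptions for illustration, not Cursor's implementation:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One production interaction with the agent (hypothetical schema)."""
    edit_persisted: bool         # did the user keep the agent's edit?
    dissatisfied_followup: bool  # did the user immediately push back?
    tool_call_ok: bool           # did the emitted tool call parse and run?

def reward(ix: Interaction) -> float:
    """Map an interaction to a scalar reward (illustrative weights only)."""
    r = 1.0 if ix.edit_persisted else -1.0
    if ix.dissatisfied_followup:
        r -= 0.5
    # Explicitly penalize malformed tool calls, so the model can't dodge a
    # negative reward by emitting broken calls (one hack the post describes).
    if not ix.tool_call_ok:
        r -= 1.0
    return r

# Example: an edit the user kept, with a well-formed tool call.
print(reward(Interaction(edit_persisted=True,
                         dissatisfied_followup=False,
                         tool_call_ok=True)))  # 1.0
```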

6 min read · From cursor.com
Table of contents
- The train-test mismatch
- A new checkpoint every five hours
- Real-time RL and reward hacking
- Next up: learning from longer loops and specialization
