Cursor's team describes their 'real-time RL' approach for continuously improving Composer, their AI coding agent. Instead of relying solely on simulated environments, they collect billions of tokens from real user interactions in production, convert them into reward signals, and retrain the model — shipping a new checkpoint

6m read time, from cursor.com
Table of contents:
The train-test mismatch
A new checkpoint every five hours
Real-time RL and reward hacking
Next up: learning from longer loops and specialization
