Cursor's team describes their 'real-time RL' approach for continuously improving Composer, their AI coding agent. Instead of relying solely on simulated environments, they collect billions of tokens from real user interactions in production, convert them into reward signals, and retrain the model, shipping a new checkpoint every five hours.

Table of contents
- The train-test mismatch
- A new checkpoint every five hours
- Real-time RL and reward hacking
- Next up: learning from longer loops and specialization