Databricks achieved state-of-the-art performance on the BIRD SQL reasoning benchmark using reinforcement learning with verifiable rewards (RLVR). Their approach reached 73.5% accuracy, surpassing the previous best score of 71.8%, while using only open models and public data. The team fine-tuned Qwen 2.5 32B Coder Instruct using their RLVR stack and the Databricks TAO offline RL method. The result illustrates why RLVR suits tasks whose correctness can be checked automatically, such as code generation: the reward comes directly from executing tests or matching labels rather than from a learned judge.
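To make "verifiable reward" concrete for text-to-SQL, here is a minimal sketch of an execution-based reward: run the model's query and a reference query against the same database and give a reward of 1.0 only if the results match. This is an illustration under stated assumptions, not Databricks' actual reward function; the names `execution_reward` and `make_toy_db`, the toy `orders` table, and the choice to ignore row order are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_reward(db: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> float:
    """Return 1.0 if both queries yield the same multiset of rows, else 0.0.

    Hypothetical helper for illustration; a real pipeline would also need
    timeouts and a read-only connection for untrusted model output.
    """
    try:
        predicted_rows = db.execute(predicted_sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        # SQL that fails to parse or execute earns no reward.
        return 0.0
    gold_rows = db.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not affect the reward.
    return 1.0 if Counter(predicted_rows) == Counter(gold_rows) else 0.0


def make_toy_db() -> sqlite3.Connection:
    """Build a small in-memory database (assumed schema, for the example only)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "ada", 10.0), (2, "bob", 25.0), (3, "ada", 5.0)],
    )
    db.commit()
    return db


if __name__ == "__main__":
    db = make_toy_db()
    gold = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    candidate = "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer DESC"
    print(execution_reward(db, candidate, gold))
```

Because the reward is computed by running the query, no human labeler or learned reward model is needed at training time; the trade-off is that two queries can agree on one database by coincidence, so a real setup would likely check against richer data.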