A deep dive into how top AI labs are evolving reinforcement learning for LLMs, tracing the path from RLHF and PPO through DeepSeek's RLVR+GRPO breakthrough to the current challenge of reward signals for non-verifiable agentic tasks. The post introduces RULER, a component of OpenPipe's open-source ART framework, which replaces custom reward functions with an LLM-as-judge that scores trajectory groups relatively — enabling GRPO-based RL training for RAG agents, customer support bots, summarization, and other tasks where deterministic verifiers don't exist. Concrete code examples show how to score trajectory groups, combine LLM-judge scores with binary verifiers, and use natural-language rubrics instead of brittle Python reward functions.
Table of contents
- Applying RL to LLMs
- DeepSeek R1 breakthrough using verifiable rewards
- The problem
- How are AI labs approaching this?
- RULER
- A rough walkthrough
- Trajectories and Groups
- Two concrete examples
- The full training loop
- Custom rubrics
- Application to non-verifiable tasks
- Practical details
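Before diving in, here is a minimal sketch of the core idea the post builds toward: instead of a hand-written reward function, an LLM judge scores all trajectories in a group relative to one another, and those relative scores feed straight into GRPO. The function name `score_trajectory_group`, the prompt format, and the choice of `gpt-4o` as judge are illustrative assumptions, not ART's actual API; the post's concrete examples use the real RULER interface from the ART framework.

```python
# Sketch of RULER-style relative scoring: an LLM judge ranks a group of
# trajectories against each other, so no custom reward function is needed.
# All names here are illustrative stand-ins, not ART's API.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_trajectory_group(trajectories: list[str], task: str) -> list[float]:
    """Ask an LLM judge to score each trajectory in [0, 1] relative to its peers."""
    numbered = "\n\n".join(
        f"### Trajectory {i + 1}\n{t}" for i, t in enumerate(trajectories)
    )
    prompt = (
        f"Task: {task}\n\n"
        "Score each trajectory from 0 to 1 relative to the others in the group.\n"
        'Respond with JSON: {"scores": [<one float per trajectory>]}\n\n'
        f"{numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)["scores"]
    return [float(s) for s in scores]

def combine(judge_score: float, passed_verifier: bool) -> float:
    """One way to mix a judge score with a binary verifier: a failed hard
    check zeroes the reward; otherwise the relative judge score stands."""
    return judge_score if passed_verifier else 0.0
```

Because GRPO computes advantages from how each trajectory scores *within its group*, the judge never needs an absolute ground truth, which is exactly what makes this workable for non-verifiable tasks.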