A deep dive into how top AI labs are evolving reinforcement learning for LLMs, tracing the path from RLHF and PPO through DeepSeek's RLVR+GRPO breakthrough to the current challenge of reward signals for non-verifiable agentic tasks. The post introduces RULER, a component of OpenPipe's open-source ART framework, which replaces custom reward functions with an LLM-as-judge that scores trajectory groups relatively — enabling GRPO-based RL training for RAG agents, customer support bots, summarization, and other tasks where deterministic verifiers don't exist. Concrete code examples show how to score trajectory groups, combine LLM-judge scores with binary verifiers, and use natural-language rubrics instead of brittle Python reward functions.
Table of contents
- Applying RL to LLMs
- DeepSeek R1 breakthrough using verifiable rewards
- The problem
- How are AI labs approaching this?
- RULER
- A rough walkthrough
- Trajectories and Groups
- Two concrete examples
- The full training loop
- Custom rubrics
- Application to non-verifiable tasks
- Practical details
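Before diving in, here is a minimal sketch of the core idea the post builds toward: instead of a hand-written reward function, an LLM judge scores all trajectories in a group relative to one another, and those relative scores feed straight into GRPO. The function name `score_trajectory_group`, the prompt format, and the choice of `gpt-4o` as judge are illustrative assumptions, not ART's actual API; the post's concrete examples use the real RULER interface from the ART framework.

```python
# Sketch of RULER-style relative scoring: an LLM judge ranks a group of
# trajectories against each other, so no custom reward function is needed.
# All names here are illustrative stand-ins, not ART's API.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_trajectory_group(trajectories: list[str], task: str) -> list[float]:
    """Ask an LLM judge to score each trajectory in [0, 1] relative to its peers."""
    numbered = "\n\n".join(
        f"### Trajectory {i + 1}\n{t}" for i, t in enumerate(trajectories)
    )
    prompt = (
        f"Task: {task}\n\n"
        "Score each trajectory from 0 to 1 relative to the others in the group.\n"
        'Respond with JSON: {"scores": [<one float per trajectory>]}\n\n'
        f"{numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)["scores"]
    return [float(s) for s in scores]

def combine(judge_score: float, passed_verifier: bool) -> float:
    """One way to mix a judge score with a binary verifier: a failed hard
    check zeroes the reward; otherwise the relative judge score stands."""
    return judge_score if passed_verifier else 0.0
```

Because GRPO computes advantages from how each trajectory scores *within its group*, the judge never needs an absolute ground truth, which is exactly what makes this workable for non-verifiable tasks.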