RULER, implemented in the open-source OpenPipe ART framework, is a response to Andrej Karpathy's critique of scalar reward functions in RL. It lets developers define reward criteria in plain English and uses an LLM to evaluate agent trajectories in place of hand-coded scoring functions. This continues the progression from RLHF to GRPO and now to natural language rewards, effectively turning RL reward engineering into prompt engineering. A demo trains a Qwen3 1.4B agent to play 2048 using this approach.
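To make the idea concrete, here is a minimal Python sketch of a plain-English reward: a rubric and a group of trajectories go into one judge prompt, and the returned scores are normalized within the group, as GRPO does with rewards. This is not the ART or RULER API. The names `build_judge_prompt`, `score_group`, `group_relative_advantages`, and `fake_judge` are hypothetical, and the judge is a stub standing in for a real LLM call.

```python
import json
from statistics import mean, pstdev
from typing import Callable, List


def build_judge_prompt(rubric: str, trajectories: List[str]) -> str:
    """Combine a plain-English rubric and a group of trajectories into one prompt."""
    numbered = "\n\n".join(
        f"Trajectory {i}:\n{t}" for i, t in enumerate(trajectories)
    )
    return (
        f"Rubric: {rubric}\n\n{numbered}\n\n"
        f"Return a JSON list of {len(trajectories)} scores between 0 and 1, "
        "one per trajectory, in order."
    )


def score_group(
    rubric: str, trajectories: List[str], judge: Callable[[str], str]
) -> List[float]:
    """Ask the judge (assumed: any callable mapping prompt -> JSON text) for scores."""
    raw = json.loads(judge(build_judge_prompt(rubric, trajectories)))
    if len(raw) != len(trajectories):
        raise ValueError("judge returned the wrong number of scores")
    # Clamp to [0, 1] in case the judge drifts outside the requested range.
    return [min(1.0, max(0.0, float(s))) for s in raw]


def group_relative_advantages(scores: List[float]) -> List[float]:
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)  # identical scores carry no learning signal
    return [(s - mu) / sigma for s in scores]


def fake_judge(prompt: str) -> str:
    """Stub in place of a real LLM call; always returns the same scores."""
    return "[0.2, 0.9, 0.5]"


rubric = "Reward games of 2048 that reach higher tiles without invalid moves."
scores = score_group(rubric, ["game A log", "game B log", "game C log"], fake_judge)
advantages = group_relative_advantages(scores)
```

Because the scores are only compared within a group, the judge needs to rank trajectories consistently rather than produce calibrated absolute values, which is part of why a plain-English rubric can stand in for a hand-coded scoring function.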

3m read time · From blog.dailydoseofds.com
Table of contents
A lesson from running AI in production
Karpathy’s prediction about RL is coming true now!
