Karpathy’s Prediction About RL is Coming True Now!
Andrej Karpathy's critique of scalar reward functions in RL is being addressed by RULER, implemented in the open-source OpenPipe ART framework. RULER lets developers define reward criteria in plain English and uses an LLM to evaluate agent trajectories instead of hand-coded scoring functions. This continues the progression from RLHF to GRPO and now to natural-language rewards, effectively turning RL reward engineering into prompt engineering. A demo trains a Qwen3 1.4B agent to play 2048 using this approach.
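To make the idea of a plain-English reward concrete, here is a minimal Python sketch of scoring a group of trajectories with a judge instead of a hand-coded function. It is not the ART or RULER API: `Trajectory`, `score_group`, `RUBRIC`, and `stub_judge` are hypothetical names invented for illustration, and `stub_judge` is a deterministic stand-in for what would be an LLM call in a real setup.

```python
from dataclasses import dataclass
from typing import Callable, List

# Plain-English reward criteria (hypothetical wording, for illustration only).
RUBRIC = "Prefer games that reach higher tiles; penalize invalid moves."


@dataclass
class Trajectory:
    transcript: str      # text log of one agent episode
    reward: float = 0.0  # filled in by the judge


# A judge takes the rubric and a list of transcripts and returns one score each.
Judge = Callable[[str, List[str]], List[float]]


def score_group(rubric: str, group: List[Trajectory], judge: Judge) -> List[Trajectory]:
    """Ask the judge to score every trajectory in the group against the rubric."""
    scores = judge(rubric, [t.transcript for t in group])
    if len(scores) != len(group):
        raise ValueError("judge must return one score per trajectory")
    for traj, score in zip(group, scores):
        # Clamp to [0, 1] so a misbehaving judge cannot produce extreme rewards.
        traj.reward = min(1.0, max(0.0, float(score)))
    return group


def stub_judge(rubric: str, transcripts: List[str]) -> List[float]:
    """Stand-in for an LLM call (assumption): favors transcripts mentioning 2048."""
    return [1.0 if "2048" in t else 0.25 for t in transcripts]


if __name__ == "__main__":
    group = [Trajectory("reached tile 2048"), Trajectory("stuck at tile 64")]
    for traj in score_group(RUBRIC, group, stub_judge):
        print(traj.reward, traj.transcript)
```

Passing the judge in as a callable keeps the rubric as the only thing that has to change when the reward criteria change, which is the sense in which reward engineering becomes prompt engineering.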
Table of contents
A lesson from running AI in production
Karpathy’s prediction about RL is coming true now!