A review of the Stanford/Synth Labs paper on Generative Reward Models, which proposes a unified framework merging RLHF and RLAIF. Unlike traditional Bradley-Terry reward models, which score each response with a linear prediction head, generative reward models produce indicator tokens (and optionally chain-of-thought reasoning) to rank LLM responses. The approach uses the Self-Taught Reasoner (STaR) method to iteratively improve the model on human feedback data, with variants that add rationalization or a DPO-style loss. Results show that the generative reward model (particularly the STaR-DPO variant) significantly outperforms Bradley-Terry models on out-of-distribution data (RewardBench) while remaining competitive on in-distribution data.
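
To make the contrast between the two reward-model styles concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the two-token "A"/"B" indicator setup, and the numbers are assumptions chosen for illustration. A Bradley-Terry model turns two scalar rewards into a preference probability, while a generative reward model can be read the same way by renormalizing the logits of its two indicator tokens.

```python
import math


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) under Bradley-Terry: sigmoid(r_A - r_B).

    reward_a / reward_b stand in for the scalar outputs of a linear
    prediction head (hypothetical values, not from the paper).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def indicator_token_prob(logit_a: float, logit_b: float) -> float:
    """P(A preferred) from a generative judge's indicator-token logits.

    Assumption: the judge answers with one of two indicator tokens,
    "A" or "B", and we renormalize over just those two logits.
    """
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    exp_a = math.exp(logit_a - m)
    exp_b = math.exp(logit_b - m)
    return exp_a / (exp_a + exp_b)


if __name__ == "__main__":
    # Hypothetical scalar rewards from a Bradley-Terry head.
    print(bradley_terry_prob(1.3, 0.4))
    # Hypothetical next-token logits for the indicator tokens "A" and "B".
    print(indicator_token_prob(2.1, -0.5))
```

A softmax over two logits equals a sigmoid of their difference, so both functions have the same form. What differs is where the numbers come from: a dedicated scalar head in one case, and the language model's own next-token distribution (optionally after generated reasoning) in the other.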

