In this video we dive into Generative Reward Models (GenRM), introduced in a fascinating recent AI research paper from researchers at Stanford University and Synth Labs. As the name implies, GenRMs are a potential improvement over the reward models currently used in Reinforcement Learning.

Notably, some authors of this paper are also behind Direct Preference Optimization (DPO), a method that does not rely on a separate reward model. However, recent research indicates that relying solely on reward-model-free methods may not be optimal in all scenarios.

Large language models (LLMs) are improved through Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Feedback (RLAIF), both of which typically train a reward model on feedback from humans or AI systems.

We start the video with a brief background on the LLM training process, focusing specifically on RLHF and RLAIF.

The Generative Reward Models (GenRM) paper suggests a unified approach that combines RLHF and RLAIF, aiming to generalize better on out-of-distribution data and to create reward models that are better aligned with user feedback.
An important method used to create GenRMs is the Self-Taught Reasoner (STaR), which we also review in the video.
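
For readers who prefer code, here is a minimal sketch of one STaR-style data collection round applied to preference data. It is an illustration under assumptions, not the paper's implementation: the `generate` callable, the `star_round` function, and the "A"/"B" labels are all hypothetical names. The idea is to sample a rationale and a verdict for each comparison, keep only the rationales whose verdict matches the human label, and then fine-tune on the kept samples before repeating (the fine-tuning step is left out here). With rationalization, a failed example is retried with the correct label given as a hint.

```python
from typing import Callable, List, Optional, Tuple

# (comparison prompt, human-preferred label such as "A" or "B") -- hypothetical format
Example = Tuple[str, str]
# generate(prompt, hint) -> (rationale, predicted_label); stands in for an LLM call
Generator = Callable[[str, Optional[str]], Tuple[str, str]]


def star_round(
    examples: List[Example],
    generate: Generator,
    use_rationalization: bool = False,
) -> List[Tuple[str, str, str]]:
    """Collect (prompt, rationale, label) samples whose verdict matches the human label."""
    kept = []
    for prompt, label in examples:
        rationale, predicted = generate(prompt, None)
        if predicted == label:
            kept.append((prompt, rationale, label))
        elif use_rationalization:
            # Retry with the correct label as a hint and keep the rationale if it now agrees.
            rationale, predicted = generate(prompt, label)
            if predicted == label:
                kept.append((prompt, rationale, label))
    return kept


def toy_generate(prompt: str, hint: Optional[str]) -> Tuple[str, str]:
    """Toy stand-in for a model: answers "A" unless it is given a hint."""
    verdict = hint if hint is not None else "A"
    return (f"reasoning about {prompt}", verdict)


data = [("q1", "A"), ("q2", "B")]
filtered = star_round(data, toy_generate)
rationalized = star_round(data, toy_generate, use_rationalization=True)
```

In a real loop, the kept samples would be used to fine-tune the model, and the round would then be repeated with the updated model.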

Paper page - https://arxiv.org/abs/2410.12832

---------------------------------------------------------------------------------------------
Support us - https://paypal.me/aipapersacademy

✉️ Join the newsletter - https://aipapersacademy.com/newsletter/

👍 Please like & subscribe if you enjoy this content
---------------------------------------------------------------------------------------------

Chapters:
0:00 Introduction
0:46 LLM Training
2:08 RLHF & RLAIF
4:00 Generative Reward Models (GenRM)
5:34 Self-Taught Reasoner (STaR)
7:05 Results

AI Papers Academy

A review of the Stanford/Synth Labs paper on Generative Reward Models, which proposes a unified framework merging RLHF and RLAIF. Unlike traditional Bradley-Terry reward models that use a linear prediction head, generative reward models produce indicator tokens (and optionally chain-of-thought reasoning) to rank LLM responses. The approach uses the Self-Taught Reasoner (STaR) method to iteratively improve the model on human feedback data, with variants using rationalization and DPO-style loss. Results show the generative reward model (particularly the STaR-DPO variant) significantly outperforms Bradley-Terry models on out-of-distribution data (RewardBench), while remaining competitive on in-distribution data.
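
The contrast between the two reward model styles can be sketched in a few lines. This is a simplified illustration under assumptions rather than the paper's code: the function names, the toy feature vectors, and the "A"/"B" indicator tokens are hypothetical. A Bradley-Terry model scores each response separately with a linear head and compares the scores, while a generative reward model reads the preference from the language model's own next-token logits for the indicator tokens.

```python
import math
from typing import Dict, Sequence


def linear_head_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Scalar reward from a linear prediction head over (toy) response features."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def indicator_token_prob(
    token_logits: Dict[str, float], indicators: Sequence[str] = ("A", "B")
) -> Dict[str, float]:
    """Generative style: softmax restricted to the indicator tokens' logits."""
    m = max(token_logits[t] for t in indicators)  # subtract the max for numerical stability
    exps = {t: math.exp(token_logits[t] - m) for t in indicators}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


# Bradley-Terry: each response is scored on its own, then the scores are compared.
weights, bias = [0.5, -1.0], 0.1
score_a = linear_head_score([2.0, 0.5], weights, bias)
score_b = linear_head_score([1.0, 1.0], weights, bias)
bt_pref_a = bradley_terry_prob(score_a, score_b)

# Generative: the model sees both responses and emits an indicator token.
gen_prefs = indicator_token_prob({"A": 2.0, "B": 0.0, "the": 5.0})
```

With two indicator tokens the restricted softmax reduces to a sigmoid of the logit difference, so the two styles differ mainly in where the numbers come from: a separate scalar head versus the model's own vocabulary distribution, optionally after chain-of-thought reasoning.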

Generative Reward Models: Merging the Power of RLHF and RLAIF for Smarter AI