Reward hacking in reinforcement learning (RL) occurs when agents exploit flaws in reward functions to obtain high rewards without genuinely completing the intended task. This issue has become a practical challenge with the rise of language models and RLHF (Reinforcement Learning from Human Feedback). Poorly designed reward functions, or imperfect learned reward models, can therefore be gamed in ways their designers never intended.
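To make the failure mode concrete, here is a minimal sketch in a hypothetical toy environment (not from any specific paper): the intended task is to walk from position 0 to a goal at position 10, but the reward pays +1 every time the agent steps onto a checkpoint at position 5. A policy that oscillates around the checkpoint farms unbounded reward without ever finishing the task.

```python
def checkpoint_reward(prev_pos: int, pos: int) -> int:
    """+1 whenever the agent steps onto the checkpoint at position 5."""
    return 1 if pos == 5 and prev_pos != 5 else 0

def run(policy, steps=20):
    """Roll out a policy (a map from position to a move of -1/0/+1)."""
    pos, total = 0, 0
    for _ in range(steps):
        prev, pos = pos, pos + policy(pos)
        total += checkpoint_reward(prev, pos)
    return pos, total  # (final position, total reward)

# Intended behavior: walk right until the goal at position 10.
intended = lambda pos: 1 if pos < 10 else 0
# Reward-hacking behavior: bounce between positions 4 and 5 forever.
hacking = lambda pos: 1 if pos < 5 else -1

print(run(intended))  # reaches the goal, collects the checkpoint once
print(run(hacking))   # never reaches the goal, yet out-earns the intended policy
```

An RL algorithm optimizing this reward would happily converge to the hacking policy: it is the literal reward, not the designer's intent, that gets maximized.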
Table of contents

- Background
- Let’s Define Reward Hacking
- Hacking RL Environment
- Hacking RLHF of LLMs
- Generalization of Hacking Skills
- Peek into Mitigations
- Citation
- References