Alibaba researchers have proposed Reward Learning on Policy (RLP), an unsupervised framework that refines a reward model using samples drawn from the policy itself, keeping the reward model on-distribution as the policy shifts during training. By keeping reward estimates accurate on the outputs the policy actually produces, RLP improves the alignment of large language models with human preferences, enhancing the safety, reliability, and effectiveness of AI-driven applications.
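The core idea, refining a reward model on samples from the current policy so its training distribution stays matched to what the policy generates, can be sketched in a toy form. The code below is an illustrative simplification, not the paper's algorithm: responses are stand-in feature vectors, `policy_sample` is a hypothetical stand-in for sampling from the policy, and preferences come from a synthetic oracle. The reward model is a linear scorer updated with a Bradley-Terry style logistic loss on preference pairs built from on-policy samples.

```python
import math
import random

random.seed(0)

# Toy "responses" are 3-d feature vectors; the synthetic preference
# oracle favors larger feature 0 (a stand-in for real human preferences).
def true_pref(a, b):
    return a[0] > b[0]

def policy_sample(n, dim=3):
    # Hypothetical stand-in for sampling responses from the current policy;
    # on-policy refinement means the reward model trains on exactly these.
    return [[random.random() for _ in range(dim)] for _ in range(n)]

def reward(w, x):
    # Linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def refine_reward_model(w, samples, lr=0.1, epochs=50):
    # Fit the reward model on preference pairs built from *policy* samples,
    # so its training distribution matches what the policy produces.
    for _ in range(epochs):
        for i in range(0, len(samples) - 1, 2):
            a, b = samples[i], samples[i + 1]
            if not true_pref(a, b):
                a, b = b, a  # ensure a is the preferred response
            # Bradley-Terry logistic gradient step on the margin r(a) - r(b).
            margin = reward(w, a) - reward(w, b)
            p = 1.0 / (1.0 + math.exp(-margin))
            g = 1.0 - p  # gradient of log-sigmoid preference likelihood
            w = [wi + lr * g * (ai - bi) for wi, ai, bi in zip(w, a, b)]
    return w

w = refine_reward_model([0.0, 0.0, 0.0], policy_sample(200))
```

After refinement, the learned weight on feature 0 dominates, i.e. the reward model has recovered the preference signal from on-policy data; in the real method the policy and reward model would be full language models rather than linear scorers.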
4 min read · From marktechpost.com