Alibaba researchers have proposed Reward Learning on Policy (RLP), an unsupervised framework that refines a reward model using samples drawn from the policy itself, keeping the reward model on-distribution as the policy shifts during training. By keeping reward estimates accurate on the outputs the policy actually produces, RLP improves the alignment of large language models with human preferences, enhancing the safety, reliability, and effectiveness of AI-driven applications.
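The core idea, refining a reward model on samples from the current policy so its training distribution stays matched to what the policy generates, can be sketched in a toy form. The code below is an illustrative simplification, not the paper's algorithm: responses are stand-in feature vectors, `policy_sample` is a hypothetical stand-in for sampling from the policy, and preferences come from a synthetic oracle. The reward model is a linear scorer updated with a Bradley-Terry style logistic loss on preference pairs built from on-policy samples.

```python
import math
import random

random.seed(0)

# Toy "responses" are 3-d feature vectors; the synthetic preference
# oracle favors larger feature 0 (a stand-in for real human preferences).
def true_pref(a, b):
    return a[0] > b[0]

def policy_sample(n, dim=3):
    # Hypothetical stand-in for sampling responses from the current policy;
    # on-policy refinement means the reward model trains on exactly these.
    return [[random.random() for _ in range(dim)] for _ in range(n)]

def reward(w, x):
    # Linear reward model: r(x) = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def refine_reward_model(w, samples, lr=0.1, epochs=50):
    # Fit the reward model on preference pairs built from *policy* samples,
    # so its training distribution matches what the policy produces.
    for _ in range(epochs):
        for i in range(0, len(samples) - 1, 2):
            a, b = samples[i], samples[i + 1]
            if not true_pref(a, b):
                a, b = b, a  # ensure a is the preferred response
            # Bradley-Terry logistic gradient step on the margin r(a) - r(b).
            margin = reward(w, a) - reward(w, b)
            p = 1.0 / (1.0 + math.exp(-margin))
            g = 1.0 - p  # gradient of log-sigmoid preference likelihood
            w = [wi + lr * g * (ai - bi) for wi, ai, bi in zip(w, a, b)]
    return w

w = refine_reward_model([0.0, 0.0, 0.0], policy_sample(200))
```

After refinement, the learned weight on feature 0 dominates, i.e. the reward model has recovered the preference signal from on-policy data; in the real method the policy and reward model would be full language models rather than linear scorers.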
4 min read · From marktechpost.com