Meta AI released a research paper on Self-Rewarding Language Models, coinciding with Mark Zuckerberg's announcement of Meta's goal to build open-source AGI. The approach lets a single LLM act as both the instruction-following model and its own reward model. Starting from a base model (Llama 2 70B) fine-tuned on instruction and evaluation datasets, the method iterates over four steps: it generates new prompts, produces multiple responses per prompt, scores those responses itself, and trains with DPO (Direct Preference Optimization) on the resulting preference pairs. Experiments show progressive improvement across iterations (M1→M3), with M3 reaching a win rate of roughly 20% against GPT-4 Turbo and outperforming several strong baselines. Both instruction-following ability and reward-modeling accuracy improve with each iteration, though the effect of further iterations remains an open research question.
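
The loop described above (sample several responses, let the same model score them, keep a preferred and a rejected response, then optimize with DPO) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` and `score` are hypothetical callables standing in for the model producing a response and judging one, `build_preference_pairs` and `dpo_loss` are names chosen here, and picking the highest- and lowest-scored candidates is an assumption about how pairs are formed. The loss is the standard per-pair DPO objective written for scalar log-probabilities.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical type aliases: in a real system both callables would wrap the same LLM.
Generate = Callable[[str], str]      # prompt -> one sampled response
Score = Callable[[str, str], float]  # (prompt, response) -> self-assigned score


def build_preference_pairs(
    prompts: List[str],
    generate: Generate,
    score: Score,
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples from self-scored candidates."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = [(score(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda sc: sc[0])
        worst = min(scored, key=lambda sc: sc[0])
        # Assumption: a prompt whose candidates all tie carries no preference signal.
        if best[0] == worst[0]:
            continue
        pairs.append((prompt, best[1], worst[1]))
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-margin))
```

In an iterative setup, the model trained on one round's pairs would supply both `generate` and `score` for the next round, which is how a sequence such as M1→M3 would arise.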