Self-Play Preference Optimization (SPPO) is a robust method for fine-tuning Large Language Models (LLMs) using Human/AI Feedback. It significantly improves over existing methods like DPO and IPO across various benchmarks.
•3m read time• From marktechpost.com
Sort: