Research comparing Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for post-training foundation models, using Llama-3.2-Vision-11B on arithmetic and navigation tasks, finds that SFT tends to memorize its training data and fail on out-of-distribution examples, while RL enables genuine generalization.