
Robert Youssef @rryssf_

MIT figured out how to make models learn new skills without forgetting old ones. no reward function needed. 🤯

the core problem with fine-tuning has always been catastrophic forgetting. you teach a model to use tools, it forgets how to do science. you teach it medicine, it forgets the tools.

supervised fine-tuning is inherently off-policy: you're forcing the model to imitate fixed examples, and every step away from its original distribution erodes something else.

the standard fix is reinforcement learning. train on the model's own outputs so it stays on-policy. but rl needs a reward function, and reward functions are either expensive, brittle, or both.

MIT's insight is deceptively simple. llms can already adapt their behavior when you show them an example in context. that's in-context learning, no weight updates needed. so what if you used that ability to create a teacher signal?

same model, two roles. the teacher sees the query plus a demonstration. the student sees only the query. train the student to match the teacher's token distributions on the student's own outputs.

imagine you could temporarily become a better version of yourself just by reading the answer key. you don't copy the answers. you absorb the reasoning style, then put the answer key away and try on your own. the "wiser you" guides the "regular you." and because both versions stay close to each other, the learning signal is gentle enough not to wreck everything else you know.

the results back this up. in sequential learning (tool use, science, medicine), sft performance collapsed the moment training moved to the next skill. sdft (self-distillation fine-tuning) retained all three, with no regression. on knowledge acquisition, sdft hit 89% strict accuracy vs sft's 80%. out-of-distribution: 98% vs 80%. that ood gap is the real story: sft memorized answers, sdft actually integrated the knowledge.

the theoretical grounding is elegant. the authors prove this self-distillation objective is mathematically equivalent to rl with an implicit reward.
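the teacher/student signal above can be sketched in a few lines. this is a minimal toy illustration, not the paper's code: the function names (`sdft_step_loss`, `kl`) and the toy logits are my own, and a real run would backprop this KL through the student while sampling the tokens from the student itself.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q), summed over the vocab axis
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def sdft_step_loss(teacher_logits, student_logits):
    """Distillation signal on student-sampled positions:
    pull the student's token distribution toward the teacher's.
    Same model, two prompts: the teacher saw (demonstration + query),
    the student saw only the query."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return kl(p_teacher, p_student).mean()

# toy example: the demo-conditioned teacher is a mildly sharpened
# version of the student, so the signal is gentle, not a hard target
rng = np.random.default_rng(0)
student_logits = rng.standard_normal((5, 16))  # 5 positions, vocab of 16
teacher_logits = 1.2 * student_logits          # hypothetical "wiser" shift

loss = sdft_step_loss(teacher_logits, student_logits)
print(float(loss) > 0)  # → True: small positive KL, never a hard override
```

because the loss is a KL between two nearby distributions of the same model, it is zero when teacher and student already agree, which is exactly why the update is gentle on everything the demo didn't touch.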
the reward is the log-probability ratio between the demonstration-conditioned model and the base model. no hand-crafted reward: the model's own in-context learning defines what "good" looks like. it's inverse rl without ever explicitly learning a reward.

scaling behavior is worth noting. at 3B parameters, sdft actually underperforms sft because the model's in-context learning is too weak. at 7B, a 4-point advantage. at 14B, 7 points. the method gets better as models get smarter, so it's going to matter more at frontier scale, not less.

the limitations are real and worth reading: 2.5x the compute cost of sft, the student sometimes inherits teacher artifacts, it doesn't work for fundamental behavioral shifts, and it requires strong in-context learning, so small models are out. these are real constraints, not footnotes.

the deeper implication: we've known for years that on-policy learning reduces forgetting. the blocker was always where the learning signal comes from without a reward. this paper's answer: from the model itself. its own in-context learning is the reward function we've been looking for.

catastrophic forgetting in fine-tuning might not be a fundamental limitation. it might be a self-inflicted consequence of off-policy training.
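that log-probability-ratio reward is easy to make concrete. a toy sketch, with my own illustrative names (`implicit_reward`) and hand-built logits standing in for the two prompt conditions:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def implicit_reward(cond_logits, base_logits, token_ids):
    """Per-token implicit reward: log p(token | query, demo) minus
    log p(token | query), evaluated on the same sampled tokens.
    Positive => the demonstration made this token more likely,
    i.e. in-context learning 'approves' of it."""
    lp_cond = log_softmax(cond_logits)
    lp_base = log_softmax(base_logits)
    idx = np.arange(len(token_ids))
    return lp_cond[idx, token_ids] - lp_base[idx, token_ids]

# toy check: a uniform base model vs a copy whose logits the demo
# nudged toward token 0 — tokens the demo favors get positive reward
base = np.zeros((3, 4))        # 3 positions, vocab of 4, uniform
cond = base.copy()
cond[:, 0] += 1.0              # demonstration boosts token 0
tokens = np.array([0, 0, 0])   # the sampled tokens being scored

r = implicit_reward(cond, base, tokens)
print(bool(np.all(r > 0)))  # → True
```

when the two conditions agree exactly the reward is zero everywhere, which matches the post's point: the signal only pushes where the demonstration actually changed the model's behavior.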
