Post-training is a crucial phase in LLM development that teaches models conversational skills and reasoning abilities through techniques like Supervised Fine Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). The guide covers the technical implementation details of these methods, including PPO algorithm mechanics, reward modeling strategies, and infrastructure considerations. It also explores recent advances in test-time reasoning and provides practical code examples for implementing these techniques in PyTorch.
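To make one of these techniques concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python (the function name and all log-probability values are illustrative, not from the guide): the policy is pushed to widen its margin over a frozen reference model on the chosen response versus the rejected one.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are summed log-probabilities of the chosen / rejected
    responses under the trained policy (pi_*) and a frozen
    reference model (ref_*); beta scales the implicit KL penalty.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy that favors the chosen response relative to the reference...
low = dpo_loss(pi_chosen=-4.0, pi_rejected=-9.0,
               ref_chosen=-5.0, ref_rejected=-5.0)
# ...versus one that favors the rejected response: the loss is higher.
high = dpo_loss(pi_chosen=-9.0, pi_rejected=-4.0,
                ref_chosen=-5.0, ref_rejected=-5.0)
```

In a real trainer these log-probabilities come from summing per-token log-softmax scores over each response, and the loss is averaged over a batch of pairs.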

36 min read · From pytorch.org
Table of contents

- Primer on post-training
- Post-training data format
- Post-training techniques
- SFT: Supervised Fine Tuning
- Beyond RLHF: a general paradigm
- Test-time compute and reasoning
- Appendix A: Diving deeper into PPO
