DeepSeek-R1 is an open model trained with reinforcement learning, specifically Group Relative Policy Optimization (GRPO), and it shows marked improvements on reasoning tasks. This tutorial reproduces a simplified version of DeepSeek-R1's training by applying GRPO to the Countdown Game, showing how the model develops self-verification and search capabilities. Training is computationally intensive, so a distributed setup is needed to run it efficiently. Detailed instructions and scripts are provided for setting up the environment, generating training samples, and performing distributed training.
Table of contents
1. Setup the development environment
2. Generate training samples with reasoning prefix from the Countdown Game
3. Train the model using GRPO (Educational part)
4. Distributed Training example for GRPO using Deepspeed and vLLM
5. Results and Training Observations
Conclusion