Microsoft's rStar-Math paper demonstrates that small language models (SLMs) can rival OpenAI's o1 model in mathematical reasoning by applying System 2 deep thinking via Monte Carlo Tree Search (MCTS). The framework uses two models: a policy model that generates reasoning step options and a process preference model (PPM) that selects the best steps using Q-values. A key innovation is code-augmented chain-of-thought, which pairs natural language reasoning steps with executable Python code to verify intermediate correctness. The system self-evolves over four training rounds using 747k competition-level math problems, bootstrapping from a 236B parameter model before transitioning to smaller models. The resulting 7B parameter model is competitive with or surpasses OpenAI o1-preview on math benchmarks.
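To make the interplay of the two ideas concrete, here is a minimal sketch of a single step-selection decision: each candidate step carries a natural-language note and Python code, candidates whose code fails to run are discarded, and the survivors are ranked by a score. This is not the paper's implementation. It is a greedy pick at one node rather than a full MCTS, and the names `CandidateStep`, `execute_step`, `select_step`, and `toy_score` are all hypothetical. In particular, `toy_score` is a placeholder for a learned PPM, and the assertion-based check is just one assumed way of verifying a step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CandidateStep:
    """One proposed reasoning step: a natural-language note plus Python code."""
    reasoning: str
    code: str


def execute_step(code: str, state: dict) -> Optional[dict]:
    """Run the step's code on a copy of the state; return None if it raises."""
    trial = dict(state)
    try:
        exec(code, trial)
    except Exception:
        return None
    return trial


def select_step(
    candidates: list[CandidateStep],
    state: dict,
    score: Callable[[CandidateStep, dict], float],
) -> Optional[tuple[float, CandidateStep, dict]]:
    """Drop candidates whose code fails, then keep the highest-scoring survivor."""
    best = None
    for cand in candidates:
        new_state = execute_step(cand.code, state)
        if new_state is None:
            continue  # code did not execute cleanly, so the step is filtered out
        s = score(cand, new_state)
        if best is None or s > best[0]:
            best = (s, cand, new_state)
    return best


def toy_score(cand: CandidateStep, new_state: dict) -> float:
    """Placeholder for a learned preference model: prefers shorter code."""
    return -float(len(cand.code))


# Toy problem (an assumption for illustration): solve 2x + 6 = 10.
candidates = [
    CandidateStep("Subtract 6, then divide by 2.",
                  "x = (10 - 6) / 2\nassert 2 * x + 6 == 10"),
    CandidateStep("Subtract 6 and forget to divide.",
                  "x = 10 - 6\nassert 2 * x + 6 == 10"),
    CandidateStep("Divide by zero by mistake.",
                  "x = 10 / 0"),
    CandidateStep("Isolate the right-hand side first, then divide.",
                  "rhs = 10 - 6\nx = rhs / 2\nassert 2 * x + 6 == 10"),
]

result = select_step(candidates, {}, toy_score)
```

In a real search, this decision would be repeated at every node of the tree, and the score would come from a trained model rather than a hand-written heuristic.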
