Alibaba's START (Self-Taught Reasoner with Tools) paper demonstrates how LLMs can integrate Python execution into their chain-of-thought reasoning. The approach injects strategic 'hints' during inference to prompt the model to write and run Python code, then refine answers based on execution results. Training involves two phases: first, a seed dataset of 12,000 samples is curated via hint-based rejection sampling (keeping only cases where tool-augmented inference succeeds but standard inference fails), then a larger 50,000-sample dataset is used for a second fine-tuning phase. The resulting START model, based on QwQ-32B, consistently outperforms same-size baselines and even beats OpenAI's o1-preview and o1-mini on math and coding benchmarks.
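The seed-data filter described above (keep a sample only when tool-augmented inference succeeds where standard inference fails) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `plain_infer`, `hinted_infer`, `is_correct`, and the `HINT` text are hypothetical placeholders standing in for the model calls, the answer checker, and the actual hints used by the authors.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Problem(NamedTuple):
    question: str
    answer: str  # reference answer used for checking


# Hypothetical hint text (an assumption; the paper's real hints are not reproduced here).
HINT = "Wait, maybe I should verify this with Python."


def select_seed_samples(
    problems: Iterable[Problem],
    plain_infer: Callable[[str], str],        # assumed: model output without tools
    hinted_infer: Callable[[str, str], str],  # assumed: model output with hint and code execution
    is_correct: Callable[[str, str], bool],   # assumed: compares an output to the reference answer
) -> List[Tuple[Problem, str]]:
    """Keep trajectories where hinted inference succeeds and plain inference fails."""
    kept = []
    for problem in problems:
        # Skip problems the model already solves without tools.
        if is_correct(plain_infer(problem.question), problem.answer):
            continue
        # Retry with the hint injected; keep the trajectory only if it is now correct.
        trajectory = hinted_infer(problem.question, HINT)
        if is_correct(trajectory, problem.answer):
            kept.append((problem, trajectory))
    return kept
```

The kept `(problem, trajectory)` pairs play the role of the seed dataset used for the first fine-tuning phase; any real pipeline would also need sampling of multiple trajectories and a sandboxed Python executor, both omitted here.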