TorchSpec is a PyTorch-native framework for training speculative decoding draft models at scale using a disaggregated architecture. It separates the inference system (which generates hidden states from a large target model) from the training system (which trains the draft model), streaming tensor data between them via RDMA or TCP using the Mooncake transfer engine. This eliminates the two main bottlenecks of existing approaches: massive disk storage requirements from offline precomputation and GPU memory pressure from co-located inference and training. Using TorchSpec, the team trained a Kimi K2.5 EAGLE-3 draft model on 600k samples (6B tokens) with 1500 H200 GPU hours, achieving over 60% throughput improvement at batch size 1. The framework supports vLLM and SGLang inference engines, long-context sequences up to 200K tokens, and is open-sourced along with the trained draft model and dataset.
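The disaggregated data flow described above can be sketched as a producer/consumer pipeline: an "inference" side streams hidden states to a "training" side through a bounded buffer instead of spilling them to disk. This is a minimal in-process sketch with hypothetical names and shapes, not TorchSpec's actual API; the real system transfers tensors across machines via RDMA or TCP using the Mooncake transfer engine.

```python
import queue
import threading

import numpy as np

HIDDEN_DIM = 16   # illustrative; real target-model hidden sizes are much larger
BATCH = 4
STOP = None       # sentinel marking end of stream


def inference_producer(q, num_batches):
    """Stand-in for the target model: emits hidden states batch by batch."""
    rng = np.random.default_rng(0)
    for step in range(num_batches):
        hidden = rng.standard_normal((BATCH, HIDDEN_DIM)).astype(np.float32)
        q.put((step, hidden))  # in TorchSpec this hop is an RDMA/TCP transfer
    q.put(STOP)


def draft_trainer(q):
    """Stand-in for the draft-model trainer: consumes streamed tensors."""
    batches_trained = 0
    while True:
        item = q.get()
        if item is STOP:
            break
        step, hidden = item
        # A real trainer would run a forward/backward pass on `hidden` here.
        assert hidden.shape == (BATCH, HIDDEN_DIM)
        batches_trained += 1
    return batches_trained


# Bounded queue: backpressure keeps inference from outrunning training,
# which is what removes the disk-storage and GPU co-location bottlenecks.
q = queue.Queue(maxsize=8)
t = threading.Thread(target=inference_producer, args=(q, 10))
t.start()
batches_trained = draft_trainer(q)
t.join()
print(batches_trained)
```

Because the buffer is bounded, neither side needs to hold (or persist) the full 6B-token corpus of hidden states at once; only a small window is in flight at any time.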

10 min read · From pytorch.org
Table of contents
- Introduction
- Background
- TorchSpec: Disaggregated Draft Model Training
- Roadmap
- Acknowledgement
