TorchSpec is a PyTorch-native framework for training speculative decoding draft models at scale using a disaggregated architecture. It separates the inference system (which generates hidden states from a large target model) from the training system (which trains the draft model), streaming tensor data between them via RDMA or TCP using the Mooncake transfer engine. This eliminates the two main bottlenecks of existing approaches: massive disk storage requirements from offline precomputation and GPU memory pressure from co-located inference and training. Using TorchSpec, the team trained a Kimi K2.5 EAGLE-3 draft model on 600k samples (6B tokens) with 1500 H200 GPU hours, achieving over 60% throughput improvement at batch size 1. The framework supports vLLM and SGLang inference engines, long-context sequences up to 200K tokens, and is open-sourced along with the trained draft model and dataset.
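The disaggregated data flow described above can be sketched as a producer/consumer pipeline: an "inference" side streams hidden states to a "training" side through a bounded buffer instead of spilling them to disk. This is a minimal in-process sketch with hypothetical names and shapes, not TorchSpec's actual API; the real system transfers tensors across machines via RDMA or TCP using the Mooncake transfer engine.

```python
import queue
import threading

import numpy as np

HIDDEN_DIM = 16   # illustrative; real target-model hidden sizes are much larger
BATCH = 4
STOP = None       # sentinel marking end of stream


def inference_producer(q, num_batches):
    """Stand-in for the target model: emits hidden states batch by batch."""
    rng = np.random.default_rng(0)
    for step in range(num_batches):
        hidden = rng.standard_normal((BATCH, HIDDEN_DIM)).astype(np.float32)
        q.put((step, hidden))  # in TorchSpec this hop is an RDMA/TCP transfer
    q.put(STOP)


def draft_trainer(q):
    """Stand-in for the draft-model trainer: consumes streamed tensors."""
    batches_trained = 0
    while True:
        item = q.get()
        if item is STOP:
            break
        step, hidden = item
        # A real trainer would run a forward/backward pass on `hidden` here.
        assert hidden.shape == (BATCH, HIDDEN_DIM)
        batches_trained += 1
    return batches_trained


# Bounded queue: backpressure keeps inference from outrunning training,
# which is what removes the disk-storage and GPU co-location bottlenecks.
q = queue.Queue(maxsize=8)
t = threading.Thread(target=inference_producer, args=(q, 10))
t.start()
batches_trained = draft_trainer(q)
t.join()
print(batches_trained)
```

Because the buffer is bounded, neither side needs to hold (or persist) the full 6B-token corpus of hidden states at once; only a small window is in flight at any time.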

10 min read · From pytorch.org
Table of contents
- Introduction
- Background
- TorchSpec: Disaggregated Draft Model Training
- Roadmap
- Acknowledgement
