vLLM v0.18.0 introduces a native hidden states extraction system via PR #33736. The feature enables efficient extraction of the internal model representations needed to train speculative decoding draft models (e.g., Eagle-3, P-Eagle, DFlash). The design reuses the existing Eagle-3 plumbing and the KV Connector API: a dummy draft model captures the verifier's hidden states into a dummy KV cache, which a custom connector then writes to disk or transfers elsewhere. This avoids the overhead of running raw Transformers or heavily patching vLLM internals. The system supports tensor and data parallelism, prefix caching, and chunked prefill. For now, only an example disk-based connector is provided; asynchronous and device-to-device connectors are planned. The Speculators library (v0.5.0) is being updated to use the new system for online draft model training.
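
To make the data flow concrete, here is a minimal, self-contained sketch of the idea: a stand-in "verifier" emits one hidden vector per token, and a connector collects those vectors across prefill chunks before writing them to disk. Everything here is an assumption made for illustration: `ToyDiskConnector`, `toy_verifier_step`, `extract`, the JSON file layout, and the chunk size are hypothetical names and choices, not the vLLM KV Connector API or the connector shipped in the PR.

```python
import json
import tempfile
from pathlib import Path


class ToyDiskConnector:
    """Hypothetical stand-in for a connector; NOT the vLLM KV Connector API."""

    def __init__(self, out_dir):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self._pending = {}  # request_id -> list of per-token hidden-state rows

    def save_chunk(self, request_id, hidden_rows):
        # With chunked prefill a request's tokens arrive over several steps,
        # so rows are appended in arrival order.
        self._pending.setdefault(request_id, []).extend(hidden_rows)

    def finish(self, request_id):
        # Flush everything captured for this request into a single file.
        rows = self._pending.pop(request_id)
        path = self.out_dir / f"{request_id}.json"
        path.write_text(json.dumps(rows))
        return path


def toy_verifier_step(token_ids, hidden_size=4):
    # Stand-in for the verifier forward pass: one fake hidden vector per token.
    return [[float(t) + 0.1 * i for i in range(hidden_size)] for t in token_ids]


def extract(connector, request_id, token_ids, chunk_size=2):
    # Process the prompt in chunks, handing each chunk's hidden states
    # to the connector, then write the completed request to disk.
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        connector.save_chunk(request_id, toy_verifier_step(chunk))
    return connector.finish(request_id)


out_dir = tempfile.mkdtemp()
saved_path = extract(ToyDiskConnector(out_dir), "req-0", [5, 6, 7])
saved_rows = json.loads(saved_path.read_text())
```

In this toy version the saved file holds one row per prompt token, regardless of how the prompt was split into chunks, which is the property a training pipeline would rely on when pairing hidden states with token IDs.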