vLLM now supports Qwen3-Next, a new foundation model with a hybrid architecture that combines Gated DeltaNet linear attention with full attention for efficient long-context processing. The model uses a high-sparsity MoE with a 1:50 activation ratio, activating only 3B of its 80B parameters per token, and includes multi-token prediction (MTP). vLLM adds specialized optimizations, including Triton kernels from Flash Linear Attention, hybrid KV-cache management, and a CUDA graph mode, for improved performance.
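To make the 1:50 activation ratio concrete, here is a minimal sketch of sparse top-k expert routing, the mechanism behind high-sparsity MoE: the router scores every expert but only the top-k run a forward pass, so per-token compute scales with k rather than the total expert count. The expert count and k below are illustrative, not Qwen3-Next's actual configuration.

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-k experts per token and softmax-normalize their weights.

    Only the selected experts execute, so compute per token scales
    with k, not with the total number of experts.
    """
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]            # (tokens, k)
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                        # weights sum to 1
    return topk_idx, w

rng = np.random.default_rng(0)
num_experts, k = 128, 8                       # illustrative high-sparsity setup
logits = rng.normal(size=(4, num_experts))    # router scores for 4 tokens
idx, w = route_topk(logits, k)
print(idx.shape, w.shape)                     # (4, 8) (4, 8)
print(f"active fraction per token: {k / num_experts:.3f}")
```

For actual serving, the blog's quickstart uses vLLM's OpenAI-compatible entrypoint, along the lines of `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4` (flags depend on your hardware).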

From blog.vllm.ai
Table of contents

- Quickstart
- Hybrid Attention: Efficient Context Modeling
- High-Sparsity MoE: Extreme Efficiency
- Multi-Token Prediction (MTP)
- Looking Ahead
- Acknowledgements
