vLLM now supports Qwen3-Next, a new foundation model with a hybrid architecture that combines Gated DeltaNet linear attention with full attention for efficient long-context processing up to 65K tokens. The model uses a high-sparsity MoE with a 1:50 activation ratio, activating only 3B of its 80B parameters per token, and includes multi-token prediction (MTP) capabilities. vLLM implements specialized optimizations for it, including Triton kernels from Flash Linear Attention, hybrid KV cache management, and CUDA graph mode for improved performance.
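As a minimal sketch of how serving might look, assuming a recent vLLM release with Qwen3-Next support and the Hugging Face model id `Qwen/Qwen3-Next-80B-A3B-Instruct` (the exact id and GPU count are illustrative):

```shell
# Launch an OpenAI-compatible server for Qwen3-Next.
# --tensor-parallel-size shards the 80B-parameter model across 4 GPUs;
# adjust to match your hardware.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
    --tensor-parallel-size 4
```

Once the server is up, any OpenAI-compatible client can send chat completions to `http://localhost:8000/v1`.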
Table of contents

- Quickstart
- Hybrid Attention: Efficient Context Modeling
- High-Sparsity MoE: Extreme Efficiency
- Multi-Token Prediction (MTP)
- Looking Ahead
- Acknowledgements