vLLM now supports Qwen3-Next, a new foundation model with a hybrid architecture that combines Gated DeltaNet linear attention with full attention for efficient long-context processing. The model uses a high-sparsity MoE with a 1:50 activation ratio, activating only 3B of its 80B parameters per token, and includes multi-token prediction (MTP). vLLM adds specialized optimizations, including Triton kernels from Flash Linear Attention, hybrid KV-cache management, and a CUDA graph mode, for improved performance.
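To make the 1:50 activation ratio concrete, here is a minimal sketch of sparse top-k expert routing, the mechanism behind high-sparsity MoE: the router scores every expert but only the top-k run a forward pass, so per-token compute scales with k rather than the total expert count. The expert count and k below are illustrative, not Qwen3-Next's actual configuration.

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-k experts per token and softmax-normalize their weights.

    Only the selected experts execute, so compute per token scales
    with k, not with the total number of experts.
    """
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]            # (tokens, k)
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                        # weights sum to 1
    return topk_idx, w

rng = np.random.default_rng(0)
num_experts, k = 128, 8                       # illustrative high-sparsity setup
logits = rng.normal(size=(4, num_experts))    # router scores for 4 tokens
idx, w = route_topk(logits, k)
print(idx.shape, w.shape)                     # (4, 8) (4, 8)
print(f"active fraction per token: {k / num_experts:.3f}")
```

For actual serving, the blog's quickstart uses vLLM's OpenAI-compatible entrypoint, along the lines of `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4` (flags depend on your hardware).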

From blog.vllm.ai
Table of contents

- Quickstart
- Hybrid Attention: Efficient Context Modeling
- High-Sparsity MoE: Extreme Efficiency
- Multi-Token Prediction (MTP)
- Looking Ahead
- Acknowledgements
