MiniMax M2.7 is a sparse Mixture-of-Experts (MoE) language model with 230B total parameters and only 10B active per token, featuring a 200K context window. NVIDIA details how to deploy it using vLLM and SGLang with specific inference optimizations, a fused QK RMS Norm kernel and an FP8 MoE kernel, that deliver up to a 2.7x speedup.
Table of contents
Building long running agents with NVIDIA NemoClaw
Inference optimizations with open source frameworks
Deploying with vLLM
Deploying with SGLang
Build with NVIDIA endpoints
Post-training with NVIDIA NeMo Framework
Get started with MiniMax M2.7
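As a preview of the vLLM deployment covered below, here is a minimal offline-inference sketch. The Hugging Face model ID, tensor-parallel size, and sampling settings are assumptions for illustration only; check the model card and the deployment sections for the exact values and recommended flags.

```python
# Hypothetical sketch: loading MiniMax M2.7 with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",   # assumed model ID; verify on Hugging Face
    tensor_parallel_size=8,          # assumed GPU count for a 230B-parameter MoE
    max_model_len=200_000,           # 200K context window noted in the article
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain Mixture-of-Experts routing in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```

The same model can also be served over an OpenAI-compatible HTTP endpoint with `vllm serve`, which is the pattern the deployment sections below walk through in more detail.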