MiniMax M2.7 is a sparse Mixture-of-Experts (MoE) language model with 230B total parameters and only 10B active per token, featuring a 200K context window. NVIDIA details how to deploy it using vLLM and SGLang with specific inference optimizations (a fused QK RMSNorm kernel and an FP8 MoE kernel) that deliver up to a 2.7x speedup.
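The post covers deployment with both vLLM and SGLang, and both frameworks expose an OpenAI-compatible HTTP API once a server is running. As a minimal sketch of what querying such a deployment might look like from Python (the endpoint URL, port, and model ID below are placeholders and assumptions, not taken from the article):

```python
# Minimal sketch: querying a vLLM or SGLang server through the
# OpenAI-compatible API that both frameworks provide. The base_url uses
# vLLM's default local port (8000); SGLang defaults to a different port.
# The model ID is hypothetical and should match whatever the server loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",  # hypothetical model ID, not from the source
    messages=[
        {"role": "user", "content": "Summarize the benefits of sparse MoE models."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the API surface is the same for both backends, only the `base_url` would change when switching between a vLLM and an SGLang deployment.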

Source: developer.nvidia.com (4 min read)
Table of contents
- Building long running agents with NVIDIA NemoClaw
- Inference optimizations with open source frameworks
- Deploying with vLLM
- Deploying with SGLang
- Build with NVIDIA endpoints
- Post-training with NVIDIA NeMo Framework
- Get started with MiniMax M2.7
