MiniMax M2.7 is a sparse Mixture-of-Experts (MoE) language model with 230B total parameters, of which only 10B are active per token, and a 200K-token context window. NVIDIA details how to deploy it using vLLM and SGLang with two inference optimizations, a fused QK RMS Norm kernel and an FP8 MoE kernel, that deliver up to 2.7x throughput improvements on NVIDIA Blackwell Ultra GPUs. The post also covers building long-running agents with NVIDIA NemoClaw and OpenShell, fine-tuning with the NeMo AutoModel library and NeMo RL, and accessing the model through NVIDIA NIM microservices or free endpoints on build.nvidia.com.
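To make the "sparse" part concrete, here is a minimal sketch of generic top-k expert routing, the usual mechanism by which an MoE layer runs only a few experts per token. It is an illustration under stated assumptions, not MiniMax M2.7's actual router: the expert count, the value of k, the toy experts, and the function names `top_k_route` and `moe_forward` are all hypothetical. The last lines compute the active share implied by the 10B and 230B figures above.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax taken over the selected experts' logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the k selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Hypothetical toy setup (not the real model): 8 experts, 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

routes = top_k_route(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)

# Share of parameters active per token, from the figures quoted above.
active_fraction = 10 / 230

print(routes)
print(output)
print(f"{active_fraction:.1%}")
```

The active share works out to roughly 4% of the total. In a real model that figure also includes parameters every token uses, such as attention and embeddings, so it is not simply the ratio of selected experts to all experts.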

4 min read. From developer.nvidia.com
Table of contents
Building long running agents with NVIDIA NemoClaw
Inference optimizations with open source frameworks
Deploying with vLLM
Deploying with SGLang
Build with NVIDIA endpoints
Post-training with NVIDIA NeMo Framework
Get started with MiniMax M2.7
