Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal, in contrast with Mamba-2's focus on training speed. Key improvements include a more expressive recurrence via exponential-trapezoidal discretization, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without increasing decode latency. At the 1.5B scale, Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and Llama-3.2-1B on prefill+decode latency across all sequence lengths. The architecture also removes the short causal convolution of Mamba-1/2, adds QKNorm for training stability, and uses RoPE to realize the complex-valued SSM. The kernels are open-sourced, written in a mix of Triton, TileLang, and CuTe DSL to maximize hardware performance on Hopper GPUs.
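To make the discretization claim concrete, here is a minimal NumPy sketch contrasting a plain exponential (Euler-style) state update with an exponential-trapezoidal one, where the input is weighted across both endpoints of each step. This is an illustration of the general trapezoidal rule for a scalar state, not the paper's exact parameterization; the function names and the scalar simplification are hypothetical.

```python
import numpy as np

def euler_step(h, a, b, x, dt):
    """Exponential Euler-style update: decay the state,
    then inject only the current input (Mamba-2 flavor)."""
    return np.exp(dt * a) * h + dt * b * x

def trapezoidal_step(h, a, b_prev, x_prev, b, x, dt):
    """Exponential-trapezoidal update (illustrative): the state
    still decays exponentially, but the input integral over the
    step is approximated by the trapezoidal rule, so each update
    blends the current and previous inputs."""
    decay = np.exp(dt * a)
    inject = 0.5 * dt * (decay * b_prev * x_prev + b * x)
    return decay * h + inject

# Toy scan over a short sequence with a scalar state
# (real SSM layers use vector/matrix-valued states).
h_e = h_t = 0.0
a, b, dt = -1.0, 1.0, 0.5
b_prev = x_prev = 0.0
for x in [1.0, 0.0, 0.0, 2.0]:
    h_e = euler_step(h_e, a, b, x, dt)
    h_t = trapezoidal_step(h_t, a, b_prev, x_prev, b, x, dt)
    b_prev, x_prev = b, x
    print(f"euler: {h_e:+.4f}  trapezoidal: {h_t:+.4f}")
```

Because the trapezoidal update looks one token back, it yields a higher-order (and thus more expressive) recurrence than the single-input exponential step, at essentially the same per-step cost.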

From together.ai · 16 min read
Table of contents

- The Mamba-3 model
- Architecture
- Empirical results
- Kernels here, there, and everywhere
- Next up
- References
