PyTorch engineers achieved 1.5x-2.5x speedups on Mamba-2's State-Space Dual module by fusing five separate GPU kernels into a single Triton kernel. The optimization eliminates kernel launch overhead, improves cache locality, and reduces memory traffic through careful synchronization using atomics. The fused kernel handles