NVIDIA's Hymba is a hybrid small language model that combines Transformer attention heads and Mamba state space model (SSM) heads in parallel within the same layer, rather than stacking them sequentially as prior hybrid models do. Attention heads provide high-resolution recall of specific tokens, while SSM heads efficiently summarize broader context. The architecture also introduces meta tokens to guide attention and mitigate attention-sink issues, uses sliding-window attention in most blocks (with full attention only in the first, middle, and last blocks), and shares key-value caches across adjacent blocks to reduce memory usage. At 1.5B parameters trained on 1.5 trillion tokens, Hymba achieves state-of-the-art results among small language models, outperforming models trained on 9+ trillion tokens.
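To make the parallel hybrid-head idea concrete, here is a minimal PyTorch sketch, not NVIDIA's implementation, of a block that feeds the same normalized input to an attention branch and an SSM-style branch side by side and fuses their normalized outputs with learnable per-branch scales. The dimensions, the `beta_attn`/`beta_ssm` mixing parameters, and the simplified gated linear recurrence standing in for real Mamba heads are all assumptions for illustration.

```python
# Hedged sketch of a parallel attention + SSM block (not the Hymba code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSSMBranch(nn.Module):
    """Toy stand-in for Mamba heads: per-channel gated linear recurrence."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); a sigmoid gate sets a per-token decay factor.
        u = self.in_proj(x)
        a = torch.sigmoid(self.gate(x))
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):  # O(T) scan, constant-size state, no attention matrix
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class ParallelHybridBlock(nn.Module):
    """Attention heads and SSM heads applied in parallel to the same input."""

    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = SimpleSSMBranch(dim)
        # Learnable scales for fusing the two branches (assumed fusion scheme).
        self.beta_attn = nn.Parameter(torch.ones(dim))
        self.beta_ssm = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # Attention branch: high-resolution recall over specific tokens (causal mask).
        seq = h.size(1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        # SSM branch: recurrent summary of the whole prefix.
        ssm_out = self.ssm(h)
        # Normalize each branch, then average with learnable scales.
        fused = 0.5 * (self.beta_attn * F.normalize(attn_out, dim=-1)
                       + self.beta_ssm * F.normalize(ssm_out, dim=-1))
        return x + fused


if __name__ == "__main__":
    block = ParallelHybridBlock()
    tokens = torch.randn(2, 16, 64)   # (batch, seq, dim)
    print(block(tokens).shape)        # torch.Size([2, 16, 64])
```

The point of the parallel layout is visible in the forward pass: both branches see the same input, so the attention branch is free to attend precisely while the SSM branch keeps a cheap running summary, rather than one branch having to consume the other's output as in sequential hybrids.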