Kimi Linear introduces a hybrid linear attention architecture built around Kimi Delta Attention (KDA), a refinement of Gated DeltaNet with a finer-grained gating mechanism. The 48B-parameter model (3B activated) supports a 1M-token context length, reduces KV cache usage by up to 75%, and achieves up to 6x higher decoding throughput than full attention. It is released as open source, with model checkpoints trained on 5.7T tokens, and reports stronger performance on long-context tasks while staying efficient by interleaving KDA layers with global MLA (Multi-Head Latent Attention) layers at a 3:1 ratio.
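The 75% figure follows from the 3:1 layer ratio, and the arithmetic can be sketched in a few lines. This is only an illustration under stated assumptions: KDA layers are taken to keep a fixed-size state that adds no per-token cache, every MLA layer is taken to cache the same amount per token, and the baseline is a model whose layers are all MLA. The function names and the placement of the MLA layer after each group of three KDA layers are hypothetical, not taken from the released code.

```python
def layer_pattern(num_layers: int, kda_per_mla: int = 3) -> list[str]:
    """Return a hypothetical layer layout with `kda_per_mla` KDA layers
    followed by one global MLA layer, repeated over the stack."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]


def kv_cache_reduction(pattern: list[str]) -> float:
    """Fraction of per-token KV cache saved relative to an all-MLA stack.

    Assumption: only MLA layers hold a cache that grows with sequence length;
    KDA layers keep a constant-size state and are counted as zero here.
    """
    cached_layers = pattern.count("MLA")
    return 1 - cached_layers / len(pattern)


pattern = layer_pattern(8)
print(pattern)
print(f"KV cache reduction: {kv_cache_reduction(pattern):.0%}")
```

With three KDA layers for every MLA layer, only one layer in four keeps a growing cache, which gives the 75% upper bound. The constant-size KDA state is ignored in this estimate, so the real saving at short context lengths would be somewhat smaller.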