ShadowKV is a high-throughput inference system for long-context Large Language Models (LLMs), developed by researchers from Carnegie Mellon University and ByteDance. It reduces GPU memory use by storing the key cache in low-rank form and offloading the value cache, which frees room for larger batch sizes. To keep decoding latency low, it relies on accurate sparse attention, raising processing speed while maintaining accuracy. Evaluations on several benchmarks show that ShadowKV handles significantly larger batch sizes while remaining computationally efficient.
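To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not ShadowKV's actual accounting. It compares the GPU bytes needed for a conventional KV cache with a layout where keys are kept as rank-r factors and values are moved off the GPU. The function names, the model shape, and the rank are all hypothetical values chosen for illustration. The sketch ignores the sparse-attention step and any auxiliary GPU buffers a real system would need.

```python
def kv_bytes_full(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """GPU bytes for a conventional KV cache: keys and values both resident."""
    per_token = n_layers * n_kv_heads * head_dim * bytes_per_elem
    return 2 * seq_len * per_token  # factor 2 = keys + values


def kv_bytes_low_rank_keys(seq_len, n_layers, n_kv_heads, head_dim, rank,
                           bytes_per_elem=2):
    """GPU bytes when keys are stored as rank-r factors and values are offloaded.

    Per layer, the (seq_len x d) key matrix, with d = n_kv_heads * head_dim,
    is replaced by two factors of shape (seq_len x rank) and (rank x d).
    Values are assumed to occupy no GPU memory in this simplified model.
    """
    d = n_kv_heads * head_dim
    per_layer = (seq_len * rank + rank * d) * bytes_per_elem
    return n_layers * per_layer


# Hypothetical configuration (assumed for illustration, not taken from the source).
cfg = dict(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)

full = kv_bytes_full(**cfg)
compact = kv_bytes_low_rank_keys(**cfg, rank=160)

print(f"full KV cache:        {full / 2**30:.2f} GiB per sequence")
print(f"low-rank keys on GPU: {compact / 2**30:.2f} GiB per sequence")
print(f"reduction factor:     {full / compact:.1f}x")
```

Under these assumed numbers, the per-sequence GPU footprint shrinks by roughly an order of magnitude, which is the kind of headroom that lets more sequences share one GPU in a batch. The real saving depends on the rank that preserves accuracy and on the extra buffers the system keeps resident.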