TokenSpeed, an open-source LLM inference engine from the LightSeek Foundation, achieved a record 580 tokens/second running Qwen3.5-397B-A17B on NVIDIA Blackwell B200 GPUs for agentic workloads. Key optimizations include hybrid prefix caching with copy-on-write semantics for GDN/Mamba architectures; index indirection that eliminates Mamba state tensor copies; CUDA multi-stream parallelism that overlaps shared and routed expert computation; aggressive kernel fusion (QK-RMSNorm, RoPE, and the gate split fused into a single Triton kernel, plus AllReduce, residual, and RMSNorm fusion); CUDA graph capture of the full decode loop; and asynchronous host-to-device (H2D) transfers. The engine also supports prefill-decode disaggregation, with unified RDMA-based state transfer for both KV caches and Mamba states. Benchmarks show 500+ tok/s across all parallelism configurations at batch size 1, 90%+ KV cache hit rates on multi-turn agentic workloads, and only ~16% throughput degradation as context length grows from 128K to 1M.
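To make the copy-on-write and index-indirection ideas concrete, here is a minimal pure-Python sketch. It is not TokenSpeed's implementation: the `StatePool` class, its method names, and the use of plain lists in place of GPU tensors are all assumptions made for illustration. Requests point at state slots through an index table, so sharing a cached prefix state only duplicates an index, and data is copied only when a request writes to a slot that another request still references.

```python
# Hypothetical sketch: StatePool and its methods are illustrative names,
# not TokenSpeed APIs. Plain lists stand in for recurrent-state tensors.
class StatePool:
    """Fixed pool of state slots addressed through an index table."""

    def __init__(self, num_slots, state_size):
        self.slots = [[0.0] * state_size for _ in range(num_slots)]
        self.refcount = [0] * num_slots
        self.free = list(range(num_slots - 1, -1, -1))  # stack of free slot ids
        self.slot_of = {}  # request id -> slot id (the indirection table)

    def _take_slot(self):
        if not self.free:
            raise RuntimeError("state pool exhausted")
        slot = self.free.pop()
        self.refcount[slot] = 1
        return slot

    def allocate(self, req):
        """Give a request a fresh, zeroed slot."""
        slot = self._take_slot()
        self.slots[slot][:] = [0.0] * len(self.slots[slot])
        self.slot_of[req] = slot

    def fork(self, parent, child):
        """Share the parent's state with the child; only an index is copied."""
        slot = self.slot_of[parent]
        self.refcount[slot] += 1
        self.slot_of[child] = slot

    def add(self, req, delta):
        """Update a request's state in place, copying first if the slot is shared."""
        slot = self.slot_of[req]
        if self.refcount[slot] > 1:  # copy-on-write: detach from the shared slot
            self.refcount[slot] -= 1
            private = self._take_slot()
            self.slots[private][:] = self.slots[slot]
            self.slot_of[req] = private
            slot = private
        state = self.slots[slot]
        for i, d in enumerate(delta):
            state[i] += d

    def read(self, req):
        return self.slots[self.slot_of[req]]

    def release(self, req):
        """Drop a request's reference; the slot is reused once nobody holds it."""
        slot = self.slot_of.pop(req)
        self.refcount[slot] -= 1
        if self.refcount[slot] == 0:
            self.free.append(slot)


pool = StatePool(num_slots=4, state_size=2)
pool.allocate("turn1")
pool.add("turn1", [1.0, 2.0])
pool.fork("turn1", "turn2")  # turn2 reuses turn1's cached state
shared_before_write = pool.slot_of["turn1"] == pool.slot_of["turn2"]
pool.add("turn2", [0.5, 0.5])  # first write triggers the copy
shared_after_write = pool.slot_of["turn1"] == pool.slot_of["turn2"]
```

In this sketch the fork is free of data movement and the copy is deferred until the first divergent write, which is the general trade-off copy-on-write makes; a real engine would apply the same bookkeeping to device-resident tensors and pass slot indices to its kernels.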

20m read time · From pytorch.org
Table of contents
1. Introduction
2. Runtime Designs and Features
3. Performance Optimizations
4. Benchmark
5. Conclusion
6. Acknowledgements
