Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed – PyTorch

TokenSpeed, an open-source LLM inference engine from the LightSeek Foundation, reached a record 580 tokens/second running Qwen3.5-397B-A17B on NVIDIA Blackwell B200 GPUs for agentic workloads. Key optimizations include hybrid prefix caching with copy-on-write semantics for GDN/Mamba architectures; elimination of Mamba state tensor copies via index indirection; CUDA multi-stream parallelism that overlaps shared and routed expert computation; aggressive kernel fusion (QK-RMSNorm, RoPE, and gate split in a single Triton kernel, plus a fused AllReduce + residual + RMSNorm); CUDA graph capture of the full decode loop; and asynchronous host-to-device (H2D) transfers. The engine also supports prefill-decode disaggregation with unified RDMA-based state transfer for both KV caches and Mamba states. Benchmarks show 500+ tok/s across all parallelism configurations at batch size 1, KV cache hit rates of 90% or higher on multi-turn agentic workloads, and only ~16% throughput degradation as context length grows from 128K to 1M tokens.
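To make the copy-on-write and index-indirection ideas in the summary concrete, here is a minimal sketch in plain Python. It is not TokenSpeed's code: the `StatePool` class, its method names, and the list-based storage are all hypothetical, chosen only to show how sequences can share a cached recurrent state by slot index and pay for a copy only when one of them needs to write.

```python
from dataclasses import dataclass, field


@dataclass
class StatePool:
    """Hypothetical pool of recurrent-state slots addressed by integer index."""
    num_slots: int
    state_dim: int
    slots: list = field(init=False)
    refcount: list = field(init=False)
    free: list = field(init=False)

    def __post_init__(self):
        self.slots = [[0.0] * self.state_dim for _ in range(self.num_slots)]
        self.refcount = [0] * self.num_slots
        # Stack of free slot indices; slot 0 is handed out first.
        self.free = list(range(self.num_slots - 1, -1, -1))

    def alloc(self):
        """Take a free slot and mark it as owned by one sequence."""
        idx = self.free.pop()
        self.refcount[idx] = 1
        return idx

    def share(self, idx):
        """Index indirection: a second sequence reuses the slot, no data copy."""
        self.refcount[idx] += 1
        return idx

    def writable(self, idx):
        """Copy-on-write: return a slot that is safe to mutate."""
        if self.refcount[idx] == 1:
            return idx  # sole owner, write in place
        new_idx = self.alloc()
        self.slots[new_idx] = list(self.slots[idx])  # the only copy made
        self.refcount[idx] -= 1
        return new_idx


pool = StatePool(num_slots=4, state_dim=2)
a = pool.alloc()
pool.slots[a] = [1.0, 2.0]   # state cached after a shared prefix
b = pool.share(a)            # a new turn reuses the prefix state by index
b = pool.writable(b)         # the new turn diverges, so it gets its own slot
pool.slots[b][0] = 9.0
```

In this sketch, sharing a prefix costs one integer increment, and the state is duplicated only at the point where a sequence diverges, which is the general pattern the summary attributes to the engine's hybrid prefix cache.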

#pytorch

#agentic-ai

#cuda

#ai-inference

#qwen

May 27 • 20m read time • From pytorch.org

Table of contents

1. Introduction
2. Runtime Designs and Features
3. Performance Optimizations
4. Benchmark
5. Conclusion
6. Acknowledgements
