Engineers at Modal profiled SGLang's scheduler while serving multimodal (VLM) inference traffic and discovered that it was calling `torch.UntypedStorage._new_shared_cuda` to reopen CUDA IPC pool handles for every tensor on every iteration, wasting ~3% of total scheduler CPU time. The fix was a plain Python dict that caches the opened pool handles, eliminating the redundant book-keeping. The result: 16% higher throughput, 13% lower mean TTFT, and 17% lower mean TPOT on Qwen2.5-VL-3B-Instruct on an H100. The optimization is merged in SGLang v0.5.10 and can be enabled via an environment variable.
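To make the idea concrete, here is a minimal sketch of the dict-as-cache pattern, not SGLang's actual code. The names `HandleCache` and `fake_open` are hypothetical: `fake_open` stands in for the expensive call that reopens a CUDA IPC pool handle, and the sketch assumes the handle is hashable and stays valid for as long as it is cached.

```python
from typing import Any, Callable, Dict, Hashable


class HandleCache:
    """Memoize an expensive 'open handle' call, keyed by the handle itself."""

    def __init__(self, open_fn: Callable[[Hashable], Any]) -> None:
        self._open_fn = open_fn
        self._cache: Dict[Hashable, Any] = {}
        self.misses = 0  # number of times the expensive call actually ran

    def get(self, handle: Hashable) -> Any:
        try:
            # Hot path: a single dict lookup, no reopening.
            return self._cache[handle]
        except KeyError:
            # Cold path: open once and remember the result.
            self.misses += 1
            opened = self._cache[handle] = self._open_fn(handle)
            return opened


def fake_open(handle: Hashable) -> dict:
    # Hypothetical stand-in for reopening a shared memory pool.
    return {"handle": handle, "storage": bytearray(16)}


cache = HandleCache(fake_open)

# Simulate many scheduler iterations that keep seeing the same two pools.
for _ in range(1000):
    for handle in (b"pool-a", b"pool-b"):
        cache.get(handle)

print(cache.misses)  # 2
```

The expensive open runs once per distinct pool rather than once per tensor per iteration. A production version would also need a policy for evicting entries when the producer frees a pool, which this sketch leaves out.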
Table of contents
Identifying host overhead
Introspecting overhead with py-spy
Avoiding book-keeping in the hot path with a Python dict
Spot-checking the fix with another flamegraph
Measuring end-to-end outcomes: 16% more throughput and 10% lower latency
This perf improvement is already in released SGLang. Try it out now!