Engineers at Modal profiled SGLang's scheduler while serving multimodal (VLM) inference traffic and found that it was calling `torch.UntypedStorage._new_shared_cuda` to reopen CUDA IPC pool handles for every tensor on every iteration, wasting ~3% of total scheduler CPU time. The fix was a plain Python dict that caches pool handles, eliminating the redundant book-keeping. The result: 16% higher throughput, 13% lower mean TTFT (time to first token), and 17% lower mean TPOT (time per output token) on Qwen2.5-VL-3B-Instruct on an H100. The optimization is merged in SGLang v0.5.10 and can be enabled via an environment variable.
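
The caching idea can be illustrated with a minimal sketch. This is not SGLang's actual code: `HandleCache` and `fake_open_pool` are hypothetical names, and `fake_open_pool` merely stands in for an expensive call that reopens a shared pool handle. The sketch assumes a handle is identified by a hashable key and stays valid for the life of the process.

```python
from typing import Callable, Dict, Generic, Hashable, TypeVar

H = TypeVar("H")


class HandleCache(Generic[H]):
    """Memoize an expensive 'open handle' call behind a plain dict."""

    def __init__(self, open_fn: Callable[[Hashable], H]) -> None:
        self._open_fn = open_fn
        self._cache: Dict[Hashable, H] = {}
        self.misses = 0  # how many times the expensive path actually ran

    def get(self, key: Hashable) -> H:
        try:
            # Hot path: a single dict lookup, no reopening.
            return self._cache[key]
        except KeyError:
            # Cold path: open once, then remember the result.
            self.misses += 1
            handle = self._open_fn(key)
            self._cache[key] = handle
            return handle


def fake_open_pool(key: Hashable) -> dict:
    # Hypothetical stand-in for the costly handle-opening call.
    return {"pool": key}


cache = HandleCache(fake_open_pool)

# Simulate many scheduler iterations touching the same two pools.
for _ in range(1000):
    for pool_key in (b"pool-a", b"pool-b"):
        cache.get(pool_key)

print(cache.misses)
```

A dict like this trades a small amount of memory for skipping the open call on every repeat access. A production version would also need to decide when entries become stale and must be evicted, which this sketch does not address.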

7m read time · From modal.com
Table of contents
Identifying host overhead
Introspecting overhead with py-spy
Avoiding book-keeping in the hot path with a Python dict
Spot-checking the fix with another flamegraph
Measuring end-to-end outcomes: 16% more throughput and 10% lower latency
This perf improvement is already in released SGLang. Try it out now!
