If we've said it once, we've said it once per millisecond: never block the GPU.

Modal

Engineers at Modal profiled SGLang's scheduler while it served multimodal (VLM) inference traffic and found that it was calling `torch.UntypedStorage._new_shared_cuda` to reopen CUDA IPC pool handles for every tensor on every iteration, wasting ~3% of total scheduler CPU time. The fix was a plain Python dict that caches the pool handles, eliminating the redundant book-keeping. The result: 16% higher throughput, 13% lower mean TTFT, and 17% lower mean TPOT on Qwen2.5-VL-3B-Instruct on an H100. The optimization is merged in SGLang v0.5.10 and can be enabled via an environment variable.
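To make the idea concrete, here is a minimal sketch of the caching pattern described above: an expensive "open" call keyed by a handle, memoized in a plain dict so it runs once per pool rather than once per tensor per iteration. It is not SGLang's actual code. `PoolHandleCache` and `fake_open_pool` are hypothetical names, and `fake_open_pool` merely stands in for the real CUDA IPC open so the example runs without a GPU. A real implementation would also need to decide when cached entries become stale, which this sketch ignores.

```python
from typing import Callable, Dict


class PoolHandleCache:
    """Memoize an expensive open call, keyed by the handle that identifies a pool.

    Hypothetical helper for illustration only; not part of SGLang or PyTorch.
    """

    def __init__(self, open_fn: Callable[[bytes], object]) -> None:
        self._open_fn = open_fn
        self._cache: Dict[bytes, object] = {}
        self.misses = 0  # counts how often the expensive open actually ran

    def get(self, handle_key: bytes) -> object:
        try:
            # Hot path: a single dict lookup, no IPC book-keeping.
            return self._cache[handle_key]
        except KeyError:
            # Cold path: open the pool once and remember it.
            self.misses += 1
            pool = self._open_fn(handle_key)
            self._cache[handle_key] = pool
            return pool


def fake_open_pool(handle_key: bytes) -> dict:
    # Assumption: stand-in for the expensive IPC open that the profile flagged.
    return {"handle": handle_key}


cache = PoolHandleCache(fake_open_pool)

# Simulate 1000 scheduler iterations, each touching tensors from two pools.
for _ in range(1000):
    for key in (b"pool-a", b"pool-b"):
        pool = cache.get(key)

print(cache.misses)  # prints 2: one open per distinct pool, not 2000
```

The design choice is to trade a small amount of memory (one entry per distinct pool) for removing repeated work from the scheduler's hot loop.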

Boosting multimodal inference performance by >10% with a single Python dictionary

Avoiding book-keeping in the hot path with a Python dict

Spot-checking the fix with another flamegraph

Measuring end-to-end outcomes: 16% more throughput and >10% lower latency

This perf improvement is already in released SGLang. Try it out now!