NVIDIA artificially restricts peer-to-peer (P2P) GPU communication to its enterprise cards. Turns out this is a software limitation, not a hardware one. I patched my drivers to remove it, hacked vLLM to take advantage of it, and got a 15-50% throughput improvement running Qwen 3.5 35b on dual RTX 3090s.

Sam McLeod

NVIDIA artificially restricts peer-to-peer (P2P) GPU communication on consumer cards like the RTX 3090. The restriction is enforced in software, not hardware. By patching the open-source NVIDIA kernel driver (aikitoria/open-gpu-kernel-modules) and bypassing vLLM's hardcoded P2P capability check, it's possible to enable direct GPU-to-GPU DMA transfers on consumer hardware. Combined with a tuned fused MoE Triton kernel config for the RTX 3090, running Qwen 3.5 35b on dual RTX 3090s yields a 10-30% throughput improvement over the unpatched setup. The post covers IOMMU passthrough configuration, driver installation, vLLM Dockerfile patching, fused_moe kernel tuning, expert parallelism benchmarks, and full Docker Compose and llama.cpp configurations.

Patching NVIDIA's driver and vLLM to enable P2P on consumer GPUs