NVIDIA Inference Transfer Library (NIXL) is an open-source, vendor-agnostic data movement library designed for large-scale distributed LLM inference. It provides a unified API for high-performance point-to-point transfers across GPU memory, CPU memory, and storage tiers (NVMe, cloud object stores). Key use cases include KV cache transfers in disaggregated serving, long-context KV cache storage, weight transfer, reinforcement learning weight streaming, and elastic expert parallelism. NIXL supports backends such as RDMA, GPU-initiated networking, GPUDirect Storage, and cloud storage (S3, Azure Blob), and is already integrated into NVIDIA Dynamo, TensorRT-LLM, vLLM, SGLang, and others. The post covers core design concepts (agents, memory registration, metadata exchange, descriptors), a step-by-step usage walkthrough, and performance benchmarking tools (NIXLBench and KVBench).
Table of contents
What is NIXL?
NIXL design
Example NIXL use case
NIXL performance benchmarking tools
Get started with NVIDIA Inference Transfer Library