NVIDIA Inference Transfer Library (NIXL) is an open-source, vendor-agnostic data movement library designed for large-scale distributed LLM inference. It provides a unified API for high-performance point-to-point transfers across GPU memory, CPU memory, and storage tiers (NVMe, cloud object stores). Key use cases include KV cache transfers in disaggregated serving, long-context KV cache storage, weight transfer, reinforcement learning weight streaming, and elastic expert parallelism. NIXL supports backends such as RDMA, GPU-initiated networking, GPUDirect Storage, and cloud object storage (S3, Azure Blob), and is already integrated into NVIDIA Dynamo, TensorRT-LLM, vLLM, SGLang, and others. The post covers core design concepts (agents, memory registration, metadata exchange, descriptors), a step-by-step usage walkthrough, and the performance benchmarking tools NIXLBench and KVBench.

13 min read time. From developer.nvidia.com
Table of contents
What is NIXL?
NIXL design
Example NIXL use case
NIXL performance benchmarking tools
Get started with NVIDIA Inference Transfer Library
