GPU-to-CPU data transfer can bottleneck batched ML inference workloads just as severely as CPU-to-GPU transfers. Using the NVIDIA Nsight Systems profiler on a DeepLabV3 segmentation model, the article demonstrates four optimizations: multi-worker output processing (58% speedup), pre-allocated buffer pools (2x speedup), asynchronous memory copies with pinned memory and CUDA events, and CUDA stream pipelining. Combined, these techniques achieve a 4x throughput improvement by eliminating GPU idle time and overlapping memory operations with kernel execution.
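As a rough illustration of the pinned-memory and stream techniques, the sketch below shows an asynchronous device-to-host copy gated by a CUDA event, using a pre-allocated pinned host buffer and a dedicated copy stream in PyTorch. The stand-in model, shapes, and variable names are assumptions for illustration, not the article's actual code.

```python
"""Minimal sketch: async D2H copy with pinned memory, a CUDA event,
and a side stream. Model and shapes are illustrative assumptions."""
import torch
import torch.nn as nn

device = torch.device("cuda")
# Stand-in for the DeepLabV3 model profiled in the article.
model = nn.Conv2d(3, 21, kernel_size=3, padding=1).to(device).eval()

batch = torch.randn(8, 3, 520, 520, device=device)

# Pre-allocated, pinned (page-locked) host buffer: device-to-host copies
# can only run asynchronously when the host side is page-locked.
host_buf = torch.empty(8, 21, 520, 520, pin_memory=True)

copy_stream = torch.cuda.Stream()   # side stream dedicated to D2H copies
done = torch.cuda.Event()

with torch.no_grad():
    out = model(batch)              # kernels enqueue on the default stream

# Make the copy stream wait for the compute stream, then launch the copy
# asynchronously so the next batch's kernels can overlap with it.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    host_buf.copy_(out, non_blocking=True)
    done.record()

# ... enqueue the next batch's GPU work here ...

done.synchronize()                  # block only when the result is needed
result = host_buf.numpy()
```

Pinning the host buffer is what makes `non_blocking=True` a genuinely asynchronous copy; with ordinary pageable memory the transfer degrades to a blocking one, which is why the buffer-pool and pinned-memory optimizations compound.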

From towardsdatascience.com