GPU-to-CPU data transfer can bottleneck batched ML inference workloads just as severely as CPU-to-GPU transfers. Using the NVIDIA Nsight Systems profiler on a DeepLabV3 segmentation model, the article demonstrates four optimizations: multi-worker output processing (58% speedup), pre-allocated buffer pools (2x speedup), asynchronous memory copies with pinned memory and CUDA events, and CUDA stream pipelining. Combined, these techniques achieve a 4x throughput improvement by eliminating GPU idle time and overlapping memory operations with kernel execution.
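As a rough illustration of the pinned-memory and stream techniques, the sketch below shows an asynchronous device-to-host copy gated by a CUDA event, using a pre-allocated pinned host buffer and a dedicated copy stream in PyTorch. The stand-in model, shapes, and variable names are assumptions for illustration, not the article's actual code.

```python
"""Minimal sketch: async D2H copy with pinned memory, a CUDA event,
and a side stream. Model and shapes are illustrative assumptions."""
import torch
import torch.nn as nn

device = torch.device("cuda")
# Stand-in for the DeepLabV3 model profiled in the article.
model = nn.Conv2d(3, 21, kernel_size=3, padding=1).to(device).eval()

batch = torch.randn(8, 3, 520, 520, device=device)

# Pre-allocated, pinned (page-locked) host buffer: device-to-host copies
# can only run asynchronously when the host side is page-locked.
host_buf = torch.empty(8, 21, 520, 520, pin_memory=True)

copy_stream = torch.cuda.Stream()   # side stream dedicated to D2H copies
done = torch.cuda.Event()

with torch.no_grad():
    out = model(batch)              # kernels enqueue on the default stream

# Make the copy stream wait for the compute stream, then launch the copy
# asynchronously so the next batch's kernels can overlap with it.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    host_buf.copy_(out, non_blocking=True)
    done.record()

# ... enqueue the next batch's GPU work here ...

done.synchronize()                  # block only when the result is needed
result = host_buf.numpy()
```

Pinning the host buffer is what makes `non_blocking=True` a genuinely asynchronous copy; with ordinary pageable memory the transfer degrades to a blocking one, which is why the buffer-pool and pinned-memory optimizations compound.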

From towardsdatascience.com