A developer shares their solution to the One Billion Row Challenge in CUDA, which runs in 16.8 seconds on a V100 GPU. The solution partitions the work by byte offsets into the input rather than by line buffers, and the CUDA kernel handles both parsing and updating the statistics. The post discusses challenges such as adapting atomic operations to floats, working with C strings on the device, and building a lookup table that maps each city string to an index. The result is a significant improvement over the pure C++ baseline.
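To make the byte-offset idea concrete, here is a minimal host-side sketch in C++. It is not the author's code: the helper name `split_on_newlines` and its signature are hypothetical, and it assumes the whole input is already in memory as a single string. It splits the buffer into roughly equal byte ranges and pushes each boundary forward to just past the next newline, so that no line straddles two ranges.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): split `data` into byte
// ranges [begin, end), aiming for `num_parts` ranges of similar size. Each
// boundary is moved forward to just past the next '\n', so every range holds
// only whole lines. Fewer ranges may be returned if lines are long.
std::vector<std::pair<std::size_t, std::size_t>>
split_on_newlines(const std::string& data, std::size_t num_parts) {
    std::vector<std::pair<std::size_t, std::size_t>> parts;
    const std::size_t n = data.size();
    if (n == 0 || num_parts == 0) return parts;

    // Target size of each range, rounded up.
    const std::size_t chunk = (n + num_parts - 1) / num_parts;

    std::size_t begin = 0;
    while (begin < n) {
        std::size_t end = begin + chunk;
        if (end >= n) {
            end = n;  // last range takes whatever is left
        } else {
            // Extend the range to the end of the line it lands in.
            const std::size_t nl = data.find('\n', end);
            end = (nl == std::string::npos) ? n : nl + 1;
        }
        parts.emplace_back(begin, end);
        begin = end;
    }
    return parts;
}
```

Each resulting `(begin, end)` pair could then be handed to one worker, which scans only its own slice of the buffer. Only two offsets per worker need to be passed around, instead of a separately allocated buffer per line.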
Table of contents
Baseline in pure C++
Work Partitioning Approach
CUDA Kernel
Profiling
Possible Optimization - Privatization using Shared Memory
Takeaway