FP8 precision improves deep learning performance by reducing memory bandwidth bottlenecks, but only works on newer GPUs. Feather is an open-source library that emulates FP8 on older hardware (RTX 20/30 series) through software-based bitwise packing, achieving 3.3x speedups by packing four FP8 values into a single FP32 container. The approach uses Triton GPU kernels to unpack and compute on-the-fly, trading minimal precision for significant bandwidth improvements in memory-bound operations like matrix-vector multiplication and attention mechanisms.

9m read timeFrom towardsdatascience.com
Post cover image
Table of contents
IntroductionMotivationGPU Memory HierarchyPacking MethodResults

Sort: