Meta researchers present Generalized Dot-Product Attention (GDPA), a GPU kernel variant of standard dot-product attention that replaces softmax with custom activation functions (e.g., GELU, SiLU) for RecSys training workloads. Built on top of Flash Attention 4, the kernel addresses three core production challenges: short, asymmetric K/V sequences; jagged (variable-length) inputs; and SFU bottlenecks from transcendental functions. Key optimizations include: (1) a simplified warp-specialized pipeline that eliminates the softmax correction stage, (2) outer-loop software pipelining for short K/V sequences, (3) a novel zigzag tile scheduling algorithm for jagged tensors that precomputes valid tiles on the CPU, and (4) an ALU-only Taylor expansion of GELU to bypass SFU bottlenecks.

On NVIDIA B200 GPUs, the optimized kernel achieves up to 2× forward and 1.6× backward speedups over the Triton baseline, reaching ~97% tensor core utilization, and up to 3.5× forward speedup over FA4 under short-K/V production traffic. Applied across the full model, these kernels deliver an end-to-end training throughput improvement of over 30%.
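To make the core idea concrete, here is a minimal PyTorch sketch, not the production CUDA kernel. The function names `gdpa_reference` and `gelu_taylor` are illustrative, and the clamp range, series order, and tail handling in the polynomial GELU are assumptions chosen for readability, not the coefficients of Meta's kernel. The sketch shows (a) what "generalized" means here: the row-wise softmax over attention scores is swapped for an elementwise activation, and (b) how a transcendental-free, mul/add-only approximation of GELU might look.

```python
import torch
import torch.nn.functional as F

def gdpa_reference(q, k, v, activation=F.silu):
    """Generalized dot-product attention: out = act(Q K^T / sqrt(d)) V.

    Because the activation is elementwise, there is no row-wise max/sum
    reduction over the scores, which is why the kernel can drop the
    softmax correction stage of the Flash Attention pipeline.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(activation(scores), v)

def gelu_taylor(x, terms=8):
    """Hypothetical ALU-only GELU via a truncated Taylor series of erf.

    GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), and
    erf(z) = (2/sqrt(pi)) * sum_n (-1)^n z^(2n+1) / (n! (2n+1)),
    so the whole activation reduces to multiplies and adds, avoiding the
    erf/tanh transcendentals that would execute on the GPU's SFU. The
    series is only accurate on a bounded range, so the input is clamped;
    a production kernel would pick the range, order, and tail handling
    (GELU(x) ~ x for large x, ~ 0 for very negative x) to meet its
    accuracy budget.
    """
    z = torch.clamp(x, -2.0, 2.0) * 0.7071067811865476  # x / sqrt(2)
    neg_z2 = -(z * z)
    term, s = z, z                       # n = 0 term: z / (0! * 1)
    for n in range(1, terms):
        term = term * neg_z2 / n         # (-1)^n z^(2n+1) / n!
        s = s + term / (2 * n + 1)       # ... / (2n+1)
    return 0.5 * x * (1.0 + 1.1283791670955126 * s)  # 2/sqrt(pi) * series

# Short, asymmetric K/V, as in the RecSys traffic described above:
q = torch.randn(4, 8, 512, 128)  # (batch, heads, q_len, head_dim)
k = torch.randn(4, 8, 32, 128)   # K/V sequences much shorter than Q
v = torch.randn(4, 8, 32, 128)
out = gdpa_reference(q, k, v, activation=gelu_taylor)
print(out.shape)                 # torch.Size([4, 8, 512, 128])
```

Note that the activation output is not normalized, so this is not a drop-in replacement for softmax attention semantically; it is the attention variant the post targets for RecSys training.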
Table of contents
2. Challenges in Real-World Training Workloads
3. Design and Optimization of GDPA Kernels for Training
4. Benchmarks
5. Conclusions
Acknowledgements
References