Meta researchers present Generalized Dot-Product Attention (GDPA), a GPU kernel variant of standard dot-product attention that replaces softmax with custom activation functions (e.g., GELU, SiLU) for RecSys training workloads. Built on top of Flash Attention 4, the kernel addresses three core challenges in production training workloads.
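To make the core idea concrete, here is a minimal NumPy sketch of generalized dot-product attention next to the softmax baseline it replaces. The function names, shapes, and scaling are illustrative assumptions, not Meta's kernel API; the point is only that the row-wise softmax is swapped for an elementwise activation such as SiLU.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gdpa(q, k, v, activation=silu):
    # Generalized dot-product attention (illustrative sketch):
    # scores pass through an elementwise activation instead of
    # a row-wise softmax.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return activation(scores) @ v

def softmax_attention(q, k, v):
    # Standard dot-product attention with a numerically stable softmax.
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because the activation is applied elementwise, GDPA needs no row-wise normalization pass over the score matrix, which is part of what makes it amenable to a Flash-Attention-style tiled kernel.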
Table of contents
2. Challenges in Real-World Training Workloads
3. Design and Optimization of GDPA Kernels for Training
4. Benchmarks
5. Conclusions
Acknowledgements
References