In-Kernel Broadcast Optimization (IKBO) is a kernel-model-system co-design technique developed at Meta that eliminates redundant user-embedding broadcasts in RecSys inference. Instead of replicating the shared user embedding across all candidates before the interaction layers, IKBO encodes broadcast semantics directly into the GPU kernels, so the replicated tensors never materialize. Two kernel deep dives are presented: Linear Compression, which achieves a ~4× speedup on H100 SXM5 through matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion via TLX; and Flash Attention, which shifts from IO-bound to compute-bound, reaching 621 BF16 TFLOPs and 2.4×/6.4× higher throughput than the non-co-designed CuTeDSL FA4-Hopper baseline. Deployed across Meta's full RecSys inference stack on both GPU and MTIA accelerators, IKBO delivers up to a 2/3 reduction in compute-intensive net latency and serves as the scalability backbone for the Meta Adaptive Ranking Model.
Table of contents
1. In-Kernel Broadcast Optimization: Eliminating Memory and Compute Redundancy
2. Kernel Deep Dive I: IKBO Linear Compression
3. Kernel Deep Dive II: IKBO Flash Attention
4. Summary of Benchmarks and Results
5. Conclusion and Future Directions
Acknowledgements
References
Appendix