In-Kernel Broadcast Optimization (IKBO) is a kernel-model-system co-design technique developed at Meta that eliminates redundant user-embedding broadcasts in RecSys inference. Instead of replicating the shared user embedding across all candidates before the interaction layers, IKBO encodes broadcast semantics directly into the GPU kernels, so the replicated tensors never materialize. Two kernel deep dives are presented: Linear Compression, which achieves a ~4× speedup on H100 SXM5 through matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion via TLX; and Flash Attention, which shifts from IO-bound to compute-bound, reaching 621 BF16 TFLOPs and 2.4×/6.4× higher throughput than the non-co-designed CuTeDSL FA4-Hopper baseline. Deployed across Meta's full RecSys inference stack on both GPU and MTIA accelerators, IKBO delivers up to a 2/3 reduction in compute-intensive net latency and serves as the scalability backbone for the Meta Adaptive Ranking Model.
Table of contents
1. In-Kernel Broadcast Optimization: Eliminating Memory and Compute Redundancy
2. Kernel Deep Dive I: IKBO Linear Compression
3. Kernel Deep Dive II: IKBO Flash Attention
4. Summary of Benchmarks and Results
5. Conclusion and Future Directions
Acknowledgements
References
Appendix