Meta has developed the Adaptive Ranking Model (ARM) to serve LLM-scale recommendation models for ads while maintaining sub-second latency and cost efficiency. The system addresses the 'inference trilemma' through three innovations: request-oriented computation sharing that transforms scaling costs from linear to sub-linear,
Table of contents
Introducing Meta Adaptive Ranking ModelInference-Efficient Model ScalingMaximizing Efficiency Through Deep Model-System CodesignReimagined Serving Infrastructure for the Reality of LLM-Scale ProductionThe Path Forward: Evolving the Adaptive Ranking Model StackAcknowledgementsSort: