DrP is Meta's automated root cause analysis platform for programmatically investigating incidents in large-scale systems. It provides an SDK for writing investigation playbooks (called analyzers), a scalable backend for executing them, and integrations with alerting and incident management tools. Used by over 300 teams at Meta, DrP runs 50,000 analyses daily and has reduced mean time to resolve (MTTR) by 20-80%. The platform includes ML algorithms for anomaly detection, time series correlation, and dimension analysis, and it supports automated post-processing that can trigger mitigation actions. Meta plans to evolve DrP into an AI-native platform as part of its broader AI4Ops vision.
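To make the three analysis techniques named above concrete, here is a minimal sketch in plain Python. It is not DrP's SDK or its algorithms, which the summary does not describe; the function names (`zscore_anomalies`, `pearson`, `rank_dimension_values`), the z-score threshold, and the sample data are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_anomalies(series, threshold=3.0):
    """Anomaly detection (illustrative): indices more than `threshold`
    population standard deviations away from the series mean."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]


def pearson(xs, ys):
    """Time series correlation (illustrative): Pearson coefficient of two
    equally long series; returns 0.0 if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_dimension_values(records):
    """Dimension analysis (illustrative): given (dimension_value, failed)
    pairs, rank values by failure rate, highest first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for value, failed in records:
        totals[value] += 1
        if failed:
            failures[value] += 1
    rates = {value: failures[value] / totals[value] for value in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: an error-rate series with one spike, and
    # per-request outcomes tagged with a made-up "region" dimension.
    error_rate = [1] * 11 + [20]
    print(zscore_anomalies(error_rate))
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
    requests = [("us", True), ("us", True), ("eu", False),
                ("eu", True), ("eu", False), ("eu", False)]
    print(rank_dimension_values(requests))
```

In an investigation, checks like these would typically be chained: flag the anomalous window, look for metrics that move with it, then narrow down which dimension value (region, host, release) accounts for most of the failures.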

Source: engineering.fb.com (7-minute read)
