Pinterest Engineering reduced Apache Spark out-of-memory (OOM) failures by 96% across tens of thousands of daily jobs. The solution combined three approaches: improved observability with detailed executor memory metrics to identify hotspots and skewed partitions; configuration tuning of Spark settings including adaptive query execution and data skew preprocessing; and Auto Memory Retries, which automatically restarts failed jobs with increased memory allocations. The rollout was staged from ad hoc to scheduled jobs, with dashboards tracking recovered jobs and cost savings. Key operational lessons included improving scheduler performance for large TaskSets and handling Apache Gluten compatibility. Future work targets proactive memory increases for high-risk stages before failures occur.
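The retry idea behind Auto Memory Retries can be illustrated with a minimal sketch. This is not Pinterest's implementation: the summary only says that failed jobs are restarted with increased memory allocations, so the growth factor, the attempt limit, the `OutOfMemoryFailure` exception, and the `submit` callable are all hypothetical names and values chosen for illustration.

```python
import math
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class OutOfMemoryFailure(Exception):
    """Hypothetical signal that a job attempt failed with an OOM error."""


def memory_schedule(base_gb: int, growth_factor: float, max_attempts: int) -> List[int]:
    """Return the memory (in GB) to request on each attempt, rounded up."""
    schedule = []
    memory = float(base_gb)
    for _ in range(max_attempts):
        schedule.append(math.ceil(memory))
        memory *= growth_factor
    return schedule


def run_with_memory_retries(
    submit: Callable[[int], T],
    base_gb: int = 4,
    growth_factor: float = 2.0,
    max_attempts: int = 3,
) -> Tuple[T, int]:
    """Call submit(memory_gb), retrying with more memory after each OOM failure.

    `submit` is a hypothetical callable; a real one might pass memory_gb to
    Spark's spark.executor.memory setting when launching the job.
    Returns the job result and the memory size that succeeded.
    Non-OOM exceptions propagate immediately, since more memory will not help.
    """
    last_error = None
    for memory_gb in memory_schedule(base_gb, growth_factor, max_attempts):
        try:
            return submit(memory_gb), memory_gb
        except OutOfMemoryFailure as err:
            last_error = err  # remember the failure, then try the next size
    raise last_error if last_error else ValueError("max_attempts must be >= 1")
```

With the assumed defaults, a job that needs 16 GB would fail at 4 GB and 8 GB and succeed on the third attempt. Capping the number of attempts keeps a job that is genuinely too large from consuming ever-growing allocations.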