Pinterest reduced out-of-memory (OOM) errors in Apache Spark by 96% through Auto Memory Retries, a feature that automatically identifies memory-intensive tasks and retries them on larger executors. The system uses a hybrid approach: it first doubles the CPU allocation per task, which gives the task a larger share of memory on an existing executor, then launches physically larger executors if needed. The implementation extends core Spark classes to support task-level resource profiles with 2x, 3x, and 4x scaling factors. This nearly eliminated OOM failures across 90k+ daily Spark jobs processing hundreds of petabytes, reducing on-call load and freeing resources without additional infrastructure costs.
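
The escalation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Pinterest's implementation and not a Spark API: the names `RetryPlan`, `plan_retry`, and `memory_per_task_mb` are hypothetical, and the rule that a retry moves to a larger executor once the scaled CPU request no longer fits on the current one is an assumption. Only the 2x, 3x, and 4x factors come from the summary.

```python
from dataclasses import dataclass
from typing import Optional

# Scaling factors named in the summary; everything else here is assumed.
SCALING_FACTORS = (2, 3, 4)


@dataclass(frozen=True)
class RetryPlan:
    factor: int                  # scaling factor applied to the task's resources
    task_cpus: int               # CPUs requested for the retried task
    needs_larger_executor: bool  # True if the request does not fit the current executor


def memory_per_task_mb(executor_memory_mb: int, executor_cores: int, task_cpus: int) -> int:
    """Approximate memory share of one task on an executor.

    An executor runs executor_cores // task_cpus tasks at once, so asking
    for more CPUs per task leaves fewer tasks to share the same memory.
    """
    slots = executor_cores // task_cpus
    if slots == 0:
        raise ValueError("task does not fit on this executor")
    return executor_memory_mb // slots


def plan_retry(oom_failures: int, base_task_cpus: int, executor_cores: int) -> Optional[RetryPlan]:
    """Pick the resources for the next retry after `oom_failures` OOM failures.

    Returns None once every scaling factor has been tried.
    """
    if oom_failures < 1:
        raise ValueError("plan_retry is only called after at least one OOM failure")
    if oom_failures > len(SCALING_FACTORS):
        return None
    factor = SCALING_FACTORS[oom_failures - 1]
    task_cpus = base_task_cpus * factor
    # Assumed rule: stay on the existing executor while the request fits.
    return RetryPlan(factor, task_cpus, needs_larger_executor=task_cpus > executor_cores)
```

With a 2-core executor and 1 CPU per task, `plan_retry(1, 1, 2)` stays on the existing executor with 2 CPUs for the task, while `plan_retry(2, 1, 2)` asks for 3 CPUs and therefore a larger executor. Likewise, `memory_per_task_mb(8000, 4, 1)` gives 2000 and `memory_per_task_mb(8000, 4, 2)` gives 4000, which is the reason doubling the per-task CPU allocation roughly doubles the memory available to that task.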

15m read time, from medium.com
Table of contents
Spark Platform
Problem Identification
Implementation
Rollout & Monitoring
Results
Learnings
Future
Conclusion
Acknowledgements
References
