Trendyol's Helyx anomaly detection service faced critical memory bottlenecks when user adoption tripled, causing 30% of training jobs to fail with OOMKilled errors. Horizontal scaling couldn't solve the problem because individual jobs were processing massive datasets that exceeded the 16GB memory limit of a single node. The team partnered with

5m read time. From medium.com
Table of contents
The “Good” Problem: Sudden Growth
Hitting the Wall: The Memory Bottleneck
Enter the ML Platform Team and Valet
The Architectural Shift: From Worker to Orchestrator
Key Improvements
Beyond Code: Cross-Team Collaboration
Conclusion
