A Node.js service running in Kubernetes suffered performance problems because of how it managed worker threads, leading to high resource consumption and server instability. The service spawned multiple workers per CPU core rather than sizing the pool to its allocated resources, and it restarted workers aggressively on errors. Together these created a positive feedback loop that overwhelmed both the campaign and translation services. The investigation found that limiting the number of worker threads and allocating resources properly could resolve the issue, which underlines the importance of careful worker management and better observability in production.
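To make the sizing problem concrete, here is a minimal sketch of capping a worker pool by the CPU allocated to the container instead of the host's core count. It is not the service's actual code: `workerCount`, `restartDelayMs`, and the `CPU_LIMIT` environment variable are hypothetical names, and the capped exponential backoff is only one common way to avoid aggressive restarts, not something the summary attributes to the fix.

```javascript
const os = require("node:os");

// Hypothetical helper: derive a worker count from the container's CPU limit.
// Falls back to the host core count only when no usable limit is supplied.
function workerCount(cpuLimit, hostCores = os.cpus().length) {
  const limit = Number.isFinite(cpuLimit) && cpuLimit > 0 ? cpuLimit : hostCores;
  // Always keep at least one worker, and never exceed the host cores.
  return Math.max(1, Math.min(hostCores, Math.floor(limit)));
}

// Hypothetical helper: capped exponential backoff between worker restarts,
// so a failing worker is not respawned in a tight loop.
function restartDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Assumption: the deployment exposes its CPU limit as CPU_LIMIT (e.g. "2").
const poolSize = workerCount(Number.parseFloat(process.env.CPU_LIMIT));
```

On a large node, `os.cpus().length` reports the host's cores, not the pod's CPU limit, so a pool sized from it can be far larger than the container can actually run.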

12m read time · From engineering.zalando.com
Table of contents
A disrupted gaming night
Not so fast Tarnished
Digging deeper
Building better observability