monday.com's engineering team shares how they use Chaos Engineering practices with Chaos Mesh to proactively test system resilience in staging. The post covers three core scenarios: network partitioning to validate circuit breakers and timeouts, pod killing to test service recovery and Kubernetes probes, and DNS chaos to simulate external dependency failures. A real example is shared where chaos testing revealed a misconfigured timeout causing thread exhaustion instead of graceful degradation. The post also outlines best practices for controlling blast radius using Chaos Mesh selectors and advocates for eventually moving controlled experiments into production.
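
The timeout failure mentioned above can be illustrated with a minimal sketch. It is not the code from the post: the function names, pool size, request count, and timeout values are all hypothetical, and a `time.sleep` stands in for a call to a dependency cut off by a network partition. The point it tries to show is that with a bounded worker pool, a timeout that is too long keeps every thread blocked, so queued requests wait far longer before they receive a fallback.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_partitioned_dependency(timeout_s: float) -> str:
    """Hypothetical stand-in for a call to an unreachable dependency.

    The dependency never answers, so the caller blocks for the full
    timeout and only then degrades to a fallback response.
    """
    time.sleep(timeout_s)
    return "fallback"


def serve_burst(timeout_s: float, requests: int = 8, workers: int = 2):
    """Serve a burst of requests on a small, fixed-size worker pool.

    Returns the responses and the wall-clock time needed to drain the burst.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda _: call_partitioned_dependency(timeout_s), range(requests))
        )
    return results, time.monotonic() - start


# Assumed values: a tight timeout versus a misconfigured one ten times longer.
tight_results, tight_elapsed = serve_burst(timeout_s=0.02)
loose_results, loose_elapsed = serve_burst(timeout_s=0.2)

print(f"tight timeout: burst drained in {tight_elapsed:.2f}s")
print(f"loose timeout: burst drained in {loose_elapsed:.2f}s")
```

Both runs end in fallbacks, but the run with the longer timeout holds each worker thread roughly ten times as long, which is the kind of behavior a network-partition experiment in staging is meant to surface before it happens in production.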

5m read time
From engineering.monday.com
Table of contents
Core Principles of Chaos Engineering
Chaos-Mesh
Controlling the Blast Radius
When Assumptions Failed
Embracing Failure to Ensure Success
