Zalando's Search & Browse team experienced a self-inflicted DoS when an internal application sent resource-intensive faceting queries on high-cardinality fields to their Elasticsearch cluster. The incident caused search slowdowns and empty results for customers. The team mitigated the impact by splitting markets across clusters and implementing load shedding, and eventually traced the issue to a bug in a maintenance workload that generated 50x the normal query volume. Key lessons included improving per-client monitoring with X-Opaque-Id headers, implementing query-level rate limiting, adding aggregation size controls, and recognizing that performance issues can stem from unexpected sources rather than the usual suspects.
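
Two of those lessons, tagging each client with an X-Opaque-Id header and capping aggregation sizes, can be illustrated with a minimal sketch. This is not Zalando's implementation: the function names, the client id, and the cap of 100 are hypothetical, and the code only builds the headers and request body without contacting a cluster. It relies on two documented Elasticsearch behaviors: the `X-Opaque-Id` request header is recorded in slow logs and task listings, and a `terms` aggregation takes a `size` parameter that defaults to 10.

```python
import copy

MAX_TERMS_SIZE = 100  # hypothetical cap, not a value from the article


def cap_terms_sizes(aggs, max_size=MAX_TERMS_SIZE):
    """Return a copy of an `aggs` dict with every terms aggregation's
    `size` clamped to max_size, including nested sub-aggregations."""
    capped = copy.deepcopy(aggs)
    for agg in capped.values():
        if not isinstance(agg, dict):
            continue
        terms = agg.get("terms")
        if isinstance(terms, dict):
            # Elasticsearch uses size 10 when the parameter is omitted.
            terms["size"] = min(terms.get("size", 10), max_size)
        for key in ("aggs", "aggregations"):
            if isinstance(agg.get(key), dict):
                agg[key] = cap_terms_sizes(agg[key], max_size)
    return capped


def build_search_request(client_id, query, aggs):
    """Build (headers, body) for a _search call, tagged per client."""
    headers = {
        "Content-Type": "application/json",
        # Lets slow logs and the tasks API attribute load to one client.
        "X-Opaque-Id": client_id,
    }
    body = {"query": query, "aggs": cap_terms_sizes(aggs)}
    return headers, body


requested_aggs = {
    "by_sku": {
        "terms": {"field": "sku", "size": 50000},
        "aggs": {"by_brand": {"terms": {"field": "brand"}}},
    }
}
headers, body = build_search_request(
    "maintenance-job", {"match_all": {}}, requested_aggs
)
```

Clamping on the client side is only one possible place for such a control; a gateway in front of the cluster could apply the same check to every caller.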

14m read time, from engineering.zalando.com
Table of contents

Who We Are
Anthology of the System Under High Load
The Incident
Immediate Actions Taken
The Markets
Additional Load Shedding: Making the Cluster Breathe Again
New Investigation and Finally, Root Cause
Before the Dawn: Cluster Recovery
The Revelation
Why wasn't this detected earlier?
Some theory on Elasticsearch DoS via Faceting Queries on High Cardinality Fields
Follow-up Actions and Lessons Learned
Useful Links