Corrosion is an open-source distributed service discovery system built with Rust, SQLite, and CRDTs that uses gossip protocols instead of distributed consensus. Developed by Fly.io to solve global state synchronization across their platform, it propagates SQLite databases using SWIM-based gossip and cr-sqlite for conflict resolution. The article details major outages caused by the system, including a deadlock bug that locked up their entire proxy fleet, and describes their iterative improvements: watchdogs for event-loop stalls, extensive testing with Antithesis, eliminating partial updates, and regionalizing clusters to reduce blast radius.
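
To make the "CRDTs instead of consensus" idea concrete, here is a minimal sketch of a last-writer-wins merge, assuming each replicated value carries a logical version and the ID of the node that wrote it. The names `Cell`, `merge_cell`, and `merge_state` are hypothetical and are not taken from Corrosion or cr-sqlite; the point is only that a merge which is commutative and idempotent lets nodes converge no matter the order in which gossip delivers updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One replicated value (hypothetical shape, for illustration only)."""
    value: str
    version: int   # logical clock, bumped on every local write
    node_id: str   # writer's ID, used only to break version ties


def merge_cell(a: Cell, b: Cell) -> Cell:
    # Pick the winner by a total order, so every node picks the same one.
    # Including the value in the key keeps the choice deterministic even
    # if two cells somehow share both version and node_id.
    return max(a, b, key=lambda c: (c.version, c.node_id, c.value))


def merge_state(local: dict, remote: dict) -> dict:
    """Merge a remote snapshot into a local one without any coordination."""
    merged = dict(local)
    for key, cell in remote.items():
        merged[key] = merge_cell(merged[key], cell) if key in merged else cell
    return merged


node_a = {"app1": Cell("10.0.0.1", 1, "node-a")}
node_b = {
    "app1": Cell("10.0.0.2", 2, "node-b"),
    "app2": Cell("10.0.0.3", 1, "node-b"),
}

# Both nodes reach the same state regardless of merge direction.
state_ab = merge_state(node_a, node_b)
state_ba = merge_state(node_b, node_a)
```

Because the merge needs no leader and no agreement round, a node can apply updates as they arrive over gossip; the trade-off is that concurrent writes are resolved by rule rather than by negotiation.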

11m read time · From fly.io
Table of contents
Our Face-Seeking Rake
Corrosion
Shit Happens
Iteration
The New System Works