Corrosion is an open-source distributed service discovery system built with Rust, SQLite, and CRDTs that relies on gossip protocols instead of distributed consensus. Fly.io developed it to synchronize global state across its platform: each node holds a SQLite database, changes propagate over SWIM-based gossip, and cr-sqlite resolves conflicts. The article details major outages the system caused, including a deadlock bug that locked up Fly.io's entire proxy fleet, and describes the iterative improvements that followed: watchdogs for event-loop stalls, extensive testing with Antithesis, eliminating partial updates, and regionalizing clusters to reduce blast radius.
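To make the "CRDTs instead of consensus" idea concrete, here is a minimal sketch of a last-writer-wins merge, one common CRDT conflict rule. It is an illustration under stated assumptions, not Corrosion's or cr-sqlite's actual algorithm: the `Cell` type, the `site_id` tie-break, and the sample addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    version: int  # logical clock, bumped on each local write (assumed scheme)
    site_id: str  # hypothetical node identifier, used only to break ties


def merge(a: Cell, b: Cell) -> Cell:
    """Pick the winner deterministically: higher (version, site_id) wins."""
    return a if (a.version, a.site_id) >= (b.version, b.site_id) else b


def merge_state(local: dict, remote: dict) -> dict:
    """Merge two key -> Cell maps without any coordination between nodes."""
    merged = dict(local)
    for key, cell in remote.items():
        merged[key] = merge(merged[key], cell) if key in merged else cell
    return merged


# Two nodes that saw different writes for the same key.
node_a = {"app-1": Cell("10.0.0.1", 2, "a")}
node_b = {"app-1": Cell("10.0.0.9", 3, "b"), "app-2": Cell("10.0.0.5", 1, "b")}

ab = merge_state(node_a, node_b)  # a receives b's state
ba = merge_state(node_b, node_a)  # b receives a's state
```

Because the merge is commutative and idempotent, both nodes end up with the same state regardless of the order in which gossip messages arrive, which is what lets a system of this kind skip a consensus round.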