Tinybird's engineering team discovered petabytes of orphaned S3 objects in their ClickHouse clusters caused by stale zero-copy replication references left behind after replica removal. After improving their garbage collector tooling, they eliminated ~45% of their cloud storage bill. However, a silent timeout during metadata snapshot collection caused the collector to incorrectly classify live data as garbage, resulting in a temporary data loss incident. Recovery was complex due to mismatches between runtime storage paths (using table UUIDs) and backup paths (using database/table names), plus complications from part mutations and multi-generation backup chains. Key lessons include the need for stronger snapshot validation, safer orchestration between collection phases, better observability, and more robust recovery procedures.
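The failure mode described above, where a partial metadata snapshot makes live objects look unreferenced, can be illustrated with a minimal sketch. This is not Tinybird's collector: `ReplicaSnapshot`, `find_orphans`, `IncompleteSnapshotError`, and the `complete` flag are hypothetical names, and the sketch assumes each replica reports the set of object keys it still references. The idea is only that the collector refuses to compute garbage unless every expected replica returned a complete snapshot.

```python
from __future__ import annotations

from dataclasses import dataclass


class IncompleteSnapshotError(RuntimeError):
    """Raised when the metadata snapshot cannot be trusted."""


@dataclass(frozen=True)
class ReplicaSnapshot:
    # Hypothetical record of what one replica reported.
    replica: str
    complete: bool  # False if collection timed out or failed part-way
    referenced_keys: frozenset[str]


def find_orphans(
    bucket_keys: set[str],
    snapshots: list[ReplicaSnapshot],
    expected_replicas: set[str],
) -> set[str]:
    """Return bucket keys referenced by no replica, or raise if unsure."""
    if not expected_replicas:
        raise IncompleteSnapshotError("no replicas expected; refusing to run")

    # Every expected replica must have reported something.
    missing = expected_replicas - {s.replica for s in snapshots}
    if missing:
        raise IncompleteSnapshotError(f"no snapshot from: {sorted(missing)}")

    # A timed-out snapshot must fail loudly instead of looking like "no references".
    incomplete = sorted(s.replica for s in snapshots if not s.complete)
    if incomplete:
        raise IncompleteSnapshotError(f"partial snapshot from: {incomplete}")

    referenced: set[str] = set()
    for snap in snapshots:
        referenced |= snap.referenced_keys
    return bucket_keys - referenced
```

With this shape, a timeout surfaces as an exception before any deletion is planned, rather than as an empty reference set that turns every object into a deletion candidate.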

6m read time
From tinybird.co
Table of contents
How do we deal with cloud orphan objects?
Why cloud objects become orphan?
Why this problem becomes expensive
Building a garbage collector
When cleanup goes wrong
Recovering deleted data
Lessons learned