Hunting orphan objects: 45% off our ClickHouse storage bill (and a near data-loss incident)


Tinybird's engineering team discovered petabytes of orphaned S3 objects in their ClickHouse clusters caused by stale zero-copy replication references left behind after replica removal. After improving their garbage collector tooling, they eliminated ~45% of their cloud storage bill. However, a silent timeout during metadata snapshot collection caused the collector to incorrectly classify live data as garbage, resulting in a temporary data loss incident. Recovery was complex due to mismatches between runtime storage paths (using table UUIDs) and backup paths (using database/table names), plus complications from part mutations and multi-generation backup chains. Key lessons include the need for stronger snapshot validation, safer orchestration between collection phases, better observability, and more robust recovery procedures.
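The failure mode described above, where an incomplete metadata snapshot makes live data look like garbage, can be illustrated with a small sketch. This is not Tinybird's collector: `Snapshot`, `find_orphans`, `IncompleteSnapshotError`, and the `complete` flag are hypothetical names chosen for illustration, and the sketch assumes that orphan detection amounts to a set difference between the keys listed in the bucket and the keys referenced by replica metadata. The point is only that a timed-out or missing snapshot should abort the run instead of silently shrinking the referenced set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical per-replica metadata snapshot (illustrative only)."""
    replica: str
    referenced_keys: frozenset[str]
    complete: bool  # False if collection timed out or was truncated


class IncompleteSnapshotError(RuntimeError):
    """Raised when the snapshot set cannot be trusted for deletion."""


def find_orphans(bucket_keys: set[str],
                 snapshots: list[Snapshot],
                 expected_replicas: set[str]) -> set[str]:
    # Refuse to classify anything if there is nothing to compare against.
    if not snapshots:
        raise IncompleteSnapshotError("no snapshots collected")

    # Every expected replica must have reported.
    missing = expected_replicas - {s.replica for s in snapshots}
    if missing:
        raise IncompleteSnapshotError(f"missing replicas: {sorted(missing)}")

    # A partial snapshot under-reports references, so live objects would
    # be misclassified as orphans. Abort rather than continue.
    partial = sorted(s.replica for s in snapshots if not s.complete)
    if partial:
        raise IncompleteSnapshotError(f"partial snapshots: {partial}")

    referenced: set[str] = set()
    for s in snapshots:
        referenced |= s.referenced_keys

    # Orphans are objects in the bucket that no replica references.
    return bucket_keys - referenced
```

In this sketch the validation happens before the set difference is computed, so a failed collection phase surfaces as an error rather than as a larger list of deletion candidates.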

#finops

#clickhouse

May 19 • 6m read time • From tinybird.co

Table of contents

How do we deal with cloud orphan objects?
Why cloud objects become orphan?
Why this problem becomes expensive
Building a garbage collector
When cleanup goes wrong
Recovering deleted data
Lessons learned