Yelp's Database Reliability Engineering team upgraded over a thousand Cassandra nodes from version 3.11 to 4.1 with zero downtime. The post covers their upgrade strategy including benchmarking (4% p99 latency improvement, 11% throughput gain, up to 58% p99 reduction on key clusters), compatibility challenges with Stargate proxy and Cassandra Source Connector, and a three-stage automated upgrade process (pre-flight, flight, post-flight). Key lessons include a Stargate 2.x regression causing slower range queries (resolved by downgrading to 1.x), schema disagreement on CDC-enabled clusters, and the value of running version-specific components in parallel during the transition. The upgrade was performed in-place via rolling restart on Kubernetes, avoiding the cost and complexity of a separate DC approach.
Table of contents
MotivationComponents1. Verifying Public Benchmarks2. Avoid Hard Blocking3. Seamless Upgrade4. Production Qualification Criteria5. Automate As Much As PossibleUpgrade strategyCompatibility ChangesOverall Upgrade ProcessPerformance ImpactSchema DisagreementImproved PerformanceFaster and More Stable RestartsSeamless Upgrade2 Comments
Sort: