Zero downtime Upgrade: Yelp’s Cassandra 4.x Upgrade Story

Yelp's Database Reliability Engineering team upgraded over a thousand Cassandra nodes from version 3.11 to 4.1 with zero downtime. The post covers their upgrade strategy including benchmarking (4% p99 latency improvement, 11% throughput gain, up to 58% p99 reduction on key clusters), compatibility challenges with Stargate proxy and Cassandra Source Connector, and a three-stage automated upgrade process (pre-flight, flight, post-flight). Key lessons include a Stargate 2.x regression causing slower range queries (resolved by downgrading to 1.x), schema disagreement on CDC-enabled clusters, and the value of running version-specific components in parallel during the transition. The upgrade was performed in-place via rolling restart on Kubernetes, avoiding the cost and complexity of a separate DC approach.

#kubernetes

#database

#backend

#apache-cassandra

Apr 13•12m read time•From engineeringblog.yelp.com

Table of contents

Motivation Components 1. Verifying Public Benchmarks 2. Avoid Hard Blocking 3. Seamless Upgrade 4. Production Qualification Criteria 5. Automate As Much As Possible Upgrade strategy Compatibility Changes Overall Upgrade Process Performance Impact Schema Disagreement Improved Performance Faster and More Stable Restarts Seamless Upgrade