DoorDash describes how it built a clusterless, stateless ML feature store to replace a costly hybrid of Redis and a relational database that could not scale further. The new system runs Apache Kvrocks (a Redis-protocol-compatible store backed by RocksDB) on commodity SSD instance-store disks, with a custom Redis Cluster Manager (RCM) that presents the cluster topology to clients without requiring the nodes to share cluster state. Data is batch-loaded from S3 as RocksDB backups, which enables stateless horizontal scaling. The design was proven through a two-phase rollout: shadow validation, then traffic migration. A key discovery was that large Redis clusters degrade client performance at 2,000+ nodes, which the team solved by having RCM return a per-client subset of nodes. The system now handles 130M HMGETs per second, serving 1.6B features within a 50 ms P999 latency target.
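The per-client subset idea can be illustrated with a small sketch. The summary does not say how RCM chooses each client's nodes, so everything here is an assumption made for illustration: the shard-to-replica layout, the function names `subset_for_client` and `_score`, and the use of rendezvous (highest-random-weight) hashing to keep each client's view stable while spreading clients across replicas. It is not DoorDash's implementation.

```python
import hashlib


def _score(client_id: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (client, node) pair."""
    digest = hashlib.sha256(f"{client_id}|{node}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def subset_for_client(client_id, replicas_by_shard, per_shard=1):
    """Return a reduced topology view for one client.

    Assumption: every shard has several interchangeable replicas, and a
    client only needs `per_shard` of them to reach all of the data.
    Rendezvous hashing ranks the replicas per client, so the same client
    always gets the same view and different clients tend to differ.
    """
    view = {}
    for shard, replicas in replicas_by_shard.items():
        ranked = sorted(
            replicas,
            key=lambda node: _score(client_id, node),
            reverse=True,
        )
        view[shard] = ranked[:per_shard]
    return view


# Hypothetical topology: 4 shards with 5 replicas each (20 nodes in total).
topology = {
    f"shard-{s}": [f"node-{s}-{r}" for r in range(5)] for s in range(4)
}

# This client is told about 8 of the 20 nodes, yet still covers every shard.
client_view = subset_for_client("feature-client-a", topology, per_shard=2)
```

With a view like this, the amount of topology a client has to track grows with the number of shards times `per_shard` instead of with the full node count, which is the property the summary attributes to the RCM subset response.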
Table of contents
DoorDash ML platform overview
ML feature store evolution
ML feature store: A new platform
Architecture overview
It’s not real until it’s real
Final thoughts