Best of Netflix: September 2025

  1. Article
    Netflix TechBlog · 30w

    Building a Resilient Data Platform with Write-Ahead Log at Netflix

    Netflix built a generic Write-Ahead Log (WAL) system to solve data consistency and reliability challenges at scale. The system provides a simple API that abstracts the underlying message queues (Kafka, SQS) and supports multiple use cases, including delayed queues, cross-region replication, and multi-partition mutations. WAL prevents data loss, corrects drift between datastores that fall out of sync (system entropy), and enables reliable retry mechanisms for real-time data pipelines. The architecture separates message producers from consumers, uses configurable namespaces for logical separation, and leverages Netflix's Data Gateway infrastructure for deployment. Key applications include EVCache cross-region replication, Live Origin's delayed delete operations, and the Key-Value service's MutateItems API with two-phase commit semantics.
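
    The pattern described above (producers append to a log, a separate consumer step applies entries, and namespaces configure behavior such as delays) can be illustrated with a minimal in-memory sketch. This is not Netflix's API: `InMemoryWal`, `Namespace`, `write`, and `drain` are hypothetical names chosen for illustration, and a Python heap stands in for the Kafka or SQS queue that a real deployment would use.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Namespace:
    """Hypothetical per-namespace config: a name plus a fixed delivery delay."""
    name: str
    delay_seconds: float = 0.0


class InMemoryWal:
    """Toy write-ahead log: producers append, a separate consumer step drains."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Namespace] = {}
        # Min-heap of (deliver_at, sequence, namespace, payload).
        self._pending: List[Tuple[float, int, str, bytes]] = []
        self._seq = itertools.count()

    def register(self, ns: Namespace) -> None:
        self._namespaces[ns.name] = ns

    def write(self, namespace: str, payload: bytes, now: float) -> int:
        """Producer side: record the message before anything is applied."""
        ns = self._namespaces[namespace]  # unknown namespaces raise KeyError
        seq = next(self._seq)
        heapq.heappush(
            self._pending, (now + ns.delay_seconds, seq, namespace, payload)
        )
        return seq

    def drain(self, now: float, apply: Callable[[str, bytes], bool]) -> int:
        """Consumer side: deliver due messages; keep those whose apply() fails."""
        delivered = 0
        retry = []
        while self._pending and self._pending[0][0] <= now:
            item = heapq.heappop(self._pending)
            _, _, namespace, payload = item
            if apply(namespace, payload):
                delivered += 1
            else:
                retry.append(item)  # stays in the log for the next drain
        for item in retry:
            heapq.heappush(self._pending, item)
        return delivered
```

    A namespace registered with `delay_seconds=10.0` behaves like a delayed queue: a message written at time 0 is skipped by `drain(5.0, ...)` and delivered by `drain(10.0, ...)`. A message whose `apply` callback returns `False` is not lost; it remains pending and is offered again on the next drain.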

  2. Article
    ByteByteGo · 32w

    How Netflix Tudum Supports 20 Million Users With CQRS

    Netflix redesigned their Tudum platform architecture to support 20 million users by replacing a traditional CQRS implementation with RAW Hollow, an in-memory object store. The original design used Kafka and Cassandra with caching layers, causing delays in editorial previews due to eventual consistency. By embedding RAW Hollow directly into microservices, they eliminated external datastores and reduced page construction time from 1.4 seconds to 0.4 seconds while enabling near-instant editorial previews. The compressed in-memory approach stores three years of data in just 130MB while maintaining strong consistency options for critical workflows.
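
    The idea of keeping the read model inside the service process, so that a write is visible to the next read without a round trip through a queue and an external datastore, can be sketched roughly as follows. This does not use or imitate the RAW Hollow API: `InProcessReadModel`, `CommandHandler`, `PageQuery`, and `UpdateArticle` are hypothetical names, and a plain dictionary stands in for the compressed in-memory store.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UpdateArticle:
    """Command: an editor changes an article's title and body."""
    article_id: str
    title: str
    body: str


class InProcessReadModel:
    """Hypothetical stand-in for an embedded in-memory object store."""

    def __init__(self) -> None:
        self._articles: Dict[str, UpdateArticle] = {}

    def apply(self, cmd: UpdateArticle) -> None:
        self._articles[cmd.article_id] = cmd

    def get(self, article_id: str) -> Optional[UpdateArticle]:
        return self._articles.get(article_id)


class CommandHandler:
    """Write side: validates commands, then updates the read model."""

    def __init__(self, read_model: InProcessReadModel) -> None:
        self._read_model = read_model

    def handle(self, cmd: UpdateArticle) -> None:
        if not cmd.title:
            raise ValueError("title must not be empty")
        self._read_model.apply(cmd)


class PageQuery:
    """Read side: builds a page from in-process data only."""

    def __init__(self, read_model: InProcessReadModel) -> None:
        self._read_model = read_model

    def render(self, article_id: str) -> str:
        article = self._read_model.get(article_id)
        if article is None:
            return "404"
        return f"{article.title}\n\n{article.body}"
```

    Because `CommandHandler` and `PageQuery` share one in-process store, a preview rendered immediately after `handle(...)` already reflects the edit. In a design where the read side is fed asynchronously through a queue and a separate database, the same preview could lag behind the write.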

  3. Article
    Netflix TechBlog · 31w

    Empowering Netflix Engineers with Incident Management

    Netflix transformed their incident management from a centralized SRE-only process to a democratized approach where all engineering teams can declare and manage incidents. They adopted Incident.io as their platform, focusing on intuitive design, internal data integration, balanced customization, and organizational investment in training. This shift resulted in 50% adoption across engineering teams within six months and fostered a culture where incidents are viewed as learning opportunities rather than scary outages.