Best of Distributed Systems · February 2026

  1. Article
    ByteByteGo · 12w

    How Uber Reinvented Access Control for Microservices

    Uber built Charter, an attribute-based access control (ABAC) system to handle authorization across thousands of microservices at microsecond latency. Traditional role-based policies couldn't express complex conditions like region-matching or ownership relationships. Charter distributes policies to services, which evaluate them locally using an embedded authfx library. Conditions are written in Google's Common Expression Language (CEL) and evaluated against attributes fetched at runtime from typed attribute stores (actor, resource, action, environment). A real-world example shows how a single ABAC policy replaced thousands of individual Kafka topic policies by dynamically checking ownership data from Uber's uOwn service. Seventy Uber services have since adopted attribute-based policies, gaining fine-grained, dynamic, and scalable authorization without code deployments.
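
To make the ownership-based policy idea concrete, here is a minimal sketch in plain Python. It is not Charter, authfx, or CEL: `Request`, `Policy`, `is_allowed`, `owns_topic`, and the `TOPIC_OWNERS` table are hypothetical names, and the in-memory table merely stands in for an ownership lookup such as the one the summary attributes to uOwn.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Request:
    """Hypothetical attribute bundle, one entry per attribute store in the summary."""
    actor: Dict[str, Any]
    resource: Dict[str, Any]
    action: str
    environment: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Policy:
    """One policy: an action plus a condition evaluated over request attributes."""
    action: str
    condition: Callable[[Request], bool]


def is_allowed(policy: Policy, request: Request) -> bool:
    # Evaluated locally, in-process: no call to a central authorization server.
    return request.action == policy.action and policy.condition(request)


# Assumption: a static table standing in for runtime ownership data.
TOPIC_OWNERS = {"payments-events": "team-payments", "rides-events": "team-rides"}


def owns_topic(request: Request) -> bool:
    # A single condition covers every topic, instead of one policy per topic.
    return TOPIC_OWNERS.get(request.resource["topic"]) == request.actor["team"]


kafka_write_policy = Policy(action="write", condition=owns_topic)

allowed = is_allowed(
    kafka_write_policy,
    Request(actor={"team": "team-payments"},
            resource={"topic": "payments-events"},
            action="write"),
)
denied = is_allowed(
    kafka_write_policy,
    Request(actor={"team": "team-rides"},
            resource={"topic": "payments-events"},
            action="write"),
)
```

Because the condition reads ownership at evaluation time, adding a new topic changes only the ownership data, not the policy.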

  2. Article
    Netflix TechBlog · 14w

    Scaling LLM Post-Training at Netflix

    Netflix built an internal post-training framework to scale LLM fine-tuning from experimentation to production. The framework abstracts infrastructure complexity across four dimensions: data (streaming, sequence packing, loss masking), model (sharding, LoRA, architecture support), compute (distributed job orchestration, checkpointing, MFU monitoring), and workflow (supporting both SFT and on-policy RL). Key engineering decisions include staying Hugging Face-compatible for interoperability, maintaining optimized internal model implementations for performance, and evolving from SPMD-only execution to hybrid orchestration for RL workflows. The system enables researchers to focus on modeling rather than distributed systems plumbing.
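
Sequence packing and loss masking, two of the data-side techniques named above, can be sketched as follows. This is an illustrative toy and not Netflix's framework: `pack_examples`, the greedy in-order packing strategy, and `pad_id=0` are all assumptions. A production packer would typically also adjust attention so that packed examples cannot attend to each other, which is omitted here.

```python
from typing import List, Tuple

Example = Tuple[List[int], List[int]]  # (prompt token ids, response token ids)
Packed = Tuple[List[int], List[int]]   # (token ids, loss mask), both of length max_len


def _pad(tokens: List[int], mask: List[int], max_len: int, pad_id: int) -> Packed:
    # Padding positions never contribute to the loss, so their mask is 0.
    gap = max_len - len(tokens)
    return tokens + [pad_id] * gap, mask + [0] * gap


def pack_examples(examples: List[Example], max_len: int, pad_id: int = 0) -> List[Packed]:
    """Greedily pack examples, in arrival order, into fixed-length sequences."""
    packed: List[Packed] = []
    tokens: List[int] = []
    mask: List[int] = []
    for prompt, response in examples:
        length = len(prompt) + len(response)
        if length > max_len:
            raise ValueError("example is longer than max_len")
        if len(tokens) + length > max_len:
            # Current sequence is full: flush it and start a new one.
            packed.append(_pad(tokens, mask, max_len, pad_id))
            tokens, mask = [], []
        tokens = tokens + prompt + response
        # Loss masking: train on response tokens only (1), ignore prompt tokens (0).
        mask = mask + [0] * len(prompt) + [1] * len(response)
    if tokens:
        packed.append(_pad(tokens, mask, max_len, pad_id))
    return packed


batches = pack_examples(
    [([1, 2], [3]), ([4], [5, 6]), ([7, 8, 9], [10])],
    max_len=6,
)
```

Here the first two examples share one sequence and the third starts a second, padded sequence, so fewer positions are wasted on padding than with one example per row.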

  3. Article
    ClickHouse · 14w

    Is it over for metrics?

    Traditional metrics are shifting from the center of observability stacks to an optimization layer. While metrics remain useful for known failure modes and system-level signals like CPU and memory, they struggle with high-cardinality debugging and require pre-defining what to measure. Modern columnar databases like ClickHouse enable efficient rollups over rich, structured event data, allowing engineers to store high-fidelity logs and traces that can be aggregated on-demand. This approach moves curation from development time to investigation time, making metrics a performance optimization rather than the primary interface for understanding production systems.
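
The shift from development-time to investigation-time curation can be illustrated with a toy rollup over raw events. This is a plain-Python sketch, not ClickHouse: the `EVENTS` rows, their field names, and the `rollup` helper are invented for illustration. The point is that the grouping dimensions are chosen when the question is asked, not when the code was instrumented.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple

Event = Dict[str, Any]

# Hypothetical structured events, stored at full fidelity.
EVENTS: List[Event] = [
    {"service": "checkout", "region": "eu", "status": 500, "customer": "c1"},
    {"service": "checkout", "region": "eu", "status": 200, "customer": "c2"},
    {"service": "checkout", "region": "us", "status": 500, "customer": "c1"},
    {"service": "search", "region": "eu", "status": 500, "customer": "c3"},
    {"service": "search", "region": "us", "status": 200, "customer": "c1"},
]


def rollup(
    events: List[Event],
    group_by: Sequence[str],
    predicate: Callable[[Event], bool],
) -> Dict[Tuple[Any, ...], int]:
    """Count matching events per combination of the requested dimensions."""
    counts: Counter = Counter()
    for event in events:
        if predicate(event):
            counts[tuple(event[key] for key in group_by)] += 1
    return dict(counts)


def is_error(event: Event) -> bool:
    return event["status"] >= 500


# The classic pre-defined metric: errors per service.
errors_by_service = rollup(EVENTS, ["service"], is_error)

# A high-cardinality question asked only during an investigation.
errors_by_customer = rollup(EVENTS, ["service", "customer"], is_error)
```

A pre-aggregated counter would only ever answer the first query; keeping the events lets the second one be answered without redeploying instrumentation.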

  4. Article
    Confluent Blog · 13w

    Apache Kafka 4.2.0 Released: Share Groups, Streams & More

    Apache Kafka 4.2.0 is now available, bringing several major improvements. Share Groups (Kafka Queues) are now production-ready, featuring a new RENEW acknowledgement type for extended processing, adaptive batching for share coordinators, configurable fetch record limits, and comprehensive lag metrics. Kafka Streams gains GA status for its server-side rebalance protocol, dead letter queue support in exception handlers, anchored wall-clock punctuation for deterministic scheduling, and explicit control over leave-group behavior on shutdown. Observability is improved with standardized CLI arguments, corrected metric naming following the kafka.COMPONENT convention, and new idle ratio metrics for controllers and MetadataLoader. Security enhancements include an allowlist connector client configuration override policy and thread-safety fixes to RecordHeader. Additional changes cover external schema support in JsonConverter, dynamic remote log manager thread pool configuration, adaptive batching in group coordinators, and rack ID exposure in the Admin API.