Datadog's Metrics Summary page suffered 7-second p90 latency due to expensive joins on 82K metrics against 817K configurations in Postgres. The root cause was using a transactional database for search workloads. The solution was Change Data Capture (CDC) using Debezium to stream Postgres WAL changes into Kafka, then into a dedicated search platform. Datadog chose asynchronous replication for resilience at scale, accepting brief replication lag as a tradeoff. To handle schema evolution safely, they built automated SQL validation and a Kafka Schema Registry enforcing backward compatibility with Avro serialization. Finally, they used Temporal to automate pipeline provisioning end-to-end, turning a one-off fix into a company-wide data replication platform supporting Postgres-to-Postgres, Postgres-to-Iceberg, Cassandra, and cross-region Kafka pipelines.

9m read timeFrom blog.bytebytego.com
Post cover image
Table of contents
Your cache isn’t the problem. How you’re using it is. (Sponsored)The Database Was Simply Doing the Wrong JobWhy Async?The Problem With Schema EvolutionFrom One Pipeline to a PlatformConclusion
1 Comment

Sort: