How Datadog Redefined Data Replication
Datadog's Metrics Summary page suffered 7-second p90 latency due to expensive joins on 82K metrics against 817K configurations in Postgres. The root cause was using a transactional database for search workloads. The solution was Change Data Capture (CDC) using Debezium to stream Postgres WAL changes into Kafka, then into a dedicated search platform. Datadog chose asynchronous replication for resilience at scale, accepting brief replication lag as a tradeoff. To handle schema evolution safely, they built automated SQL validation and a Kafka Schema Registry enforcing backward compatibility with Avro serialization. Finally, they used Temporal to automate pipeline provisioning end-to-end, turning a one-off fix into a company-wide data replication platform supporting Postgres-to-Postgres, Postgres-to-Iceberg, Cassandra, and cross-region Kafka pipelines.