Read Datadog’s playbook for running Cilium across hundreds of Kubernetes clusters and learn how IPAM tuning, native routing, safe upgrades, and datapath controls influence reliability at scale.

Cilium is an open-source project that provides networking and security for containerized applications. It uses eBPF (extended Berkeley Packet Filter) to deliver efficient networking and transparent security enforcement in cloud-native environments. Related topics include service mesh architectures, network policy enforcement, and distributed application security.

Datadog shares operational practices for running Cilium across hundreds of Kubernetes clusters. Key recommendations include preferring native routing to overlays, tuning IPAM parameters with surge allocation to prevent pod scheduling delays, gating upgrades strictly with preflight validation and connectivity tests, monitoring BPF map pressure as an early warning signal, and configuring kube-proxy replacement with Maglev hashing. The post also emphasizes treating cloud APIs as critical dependencies, keeping identity cardinality stable through label allowlists, and using kernel-level tools such as bpftrace to debug datapath issues that don't surface in standard metrics.
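To make the Maglev recommendation more concrete, here is a minimal Python sketch of Maglev-style consistent hashing as described in the published Maglev algorithm. It is not Cilium's implementation, and nothing here comes from the Datadog post: the function names (`build_maglev_table`, `pick_backend`), the SHA-256-based hash, the backend addresses, and the small table size of 251 are all assumptions chosen for illustration. The idea is that each backend fills a fixed-size lookup table by walking its own permutation of slots, so backends get nearly equal shares and a flow key always maps to the same slot.

```python
import hashlib


def _hash(value: str, seed: str) -> int:
    """Stable 64-bit hash; SHA-256 is an illustrative choice, not Cilium's."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def build_maglev_table(backends: list[str], table_size: int = 251) -> list[str]:
    """Populate a Maglev lookup table (hypothetical helper).

    table_size must be prime so that every backend's (offset, skip)
    sequence visits every slot exactly once.
    """
    backends = sorted(set(backends))  # make the result independent of input order
    if not backends:
        raise ValueError("at least one backend is required")
    if not _is_prime(table_size) or table_size < len(backends):
        raise ValueError("table_size must be a prime >= number of backends")

    offsets = [_hash(b, "offset") % table_size for b in backends]
    skips = [_hash(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_index = [0] * len(backends)  # position in each backend's permutation
    table = [None] * table_size
    filled = 0

    while True:
        # Backends take turns claiming one free slot per round, which keeps
        # their shares of the table within one slot of each other.
        for i, backend in enumerate(backends):
            while True:
                slot = (offsets[i] + next_index[i] * skips[i]) % table_size
                next_index[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == table_size:
                return table


def pick_backend(table: list[str], flow_key: str) -> str:
    """Map a flow key (for example a 5-tuple string) to a backend."""
    return table[_hash(flow_key, "flow") % len(table)]


if __name__ == "__main__":
    # Hypothetical backend addresses and flow key.
    table = build_maglev_table(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
    print(pick_backend(table, "10.1.2.3:51234->10.96.0.10:53/udp"))
```

Because every node that builds the table from the same backend set gets the same result, any node can select the same backend for a given flow without shared state. Production tables are much larger than this sketch, which reduces how many slots change owner when a backend is added or removed.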

Day 2 with Cilium: Small configurations that keep large clusters boring

Upgrade practices that keep deployments safe

Monitor control plane and datapath signals to catch issues early

Configure your datapath to be reliable at scale

Lessons learned from running Cilium at scale