Datadog shares operational practices for running Cilium across hundreds of Kubernetes clusters at enterprise scale. Key recommendations include using native routing over overlays, tuning IPAM parameters with surge allocation to prevent pod scheduling delays, implementing strict upgrade gates with preflight validation and connectivity tests, monitoring BPF map pressure as an early warning signal, and configuring kube-proxy replacement with Maglev hashing. The post emphasizes treating cloud APIs as critical dependencies, maintaining stable identity cardinality through label allowlists, and using kernel-level tools like bpftrace for debugging datapath issues that don't surface in standard metrics.
Table of contents
Avoid IPAM pitfalls at scale
Upgrade practices that keep deployments safe
Monitor control plane and datapath signals to catch issues early
Configure your datapath to be reliable at scale
Lessons learned from running Cilium at scale