Datadog engineers share six hard-won lessons from five years of running eBPF-powered workload protection at scale. The post covers: navigating kernel version and distribution compatibility pitfalls (verifier changes, hook point availability, function inlining); safely capturing kernel data (CO-RE offsets, TOCTOU races, paged-out user memory, non-linear skbs); maintaining consistent eBPF maps and user-space caches (LRU eviction quirks, blocking syscall abuse, event reordering); auditing eBPF as an attack surface (rootkit potential, helper abuse, tampering detection); handling conflicts with other eBPF tools sharing kernel resources (a real Cilium TC classifier incident); and measuring both user-space agent overhead and hidden kernel instrumentation cost. Each lesson includes concrete mitigations such as CI kernel matrices, the open-source ebpf-manager library, aggressive in-kernel event filtering, and dynamic re-hooking on module reload.

40m read time, from datadoghq.com
Table of contents
Why we chose eBPF for Workload Protection
What we evaluated before choosing eBPF
Why eBPF stood out
Six lessons from running eBPF in production
1. Navigate the edge cases of eBPF program loading and kernel hook points
2. Safely capturing and enriching kernel data reliably is harder than it looks
3. eBPF introduces an attack surface that should be monitored and audited
4. Kernel resources are shared—account for other eBPF-based tools
5. Measuring performance impact is a necessary evil and a two-step process
6. Best practices before rolling out to production—and acknowledging the risks
What’s next?
Closing thoughts
