Hardening eBPF for runtime security: Lessons from Datadog Workload Protection

Datadog engineers share six hard-won lessons from five years of running eBPF-powered workload protection at scale. The post covers: navigating kernel version and distribution compatibility pitfalls (verifier changes, hook point availability, function inlining); safely capturing kernel data (CO-RE offsets, TOCTOU races, paged-out user memory, non-linear skbs); maintaining consistent eBPF maps and user-space caches (LRU eviction quirks, blocking syscall abuse, event reordering); auditing eBPF as an attack surface (rootkit potential, helper abuse, tampering detection); handling conflicts with other eBPF tools sharing kernel resources (a real Cilium TC classifier incident); and measuring both user-space agent overhead and hidden kernel instrumentation cost. Each lesson includes concrete mitigations such as CI kernel matrices, the open-source ebpf-manager library, aggressive in-kernel event filtering, and dynamic re-hooking on module reload.

#devops

#linux

#observability

Feb 23 • 40m read time • From datadoghq.com

Table of contents

Why we chose eBPF for Workload Protection
What we evaluated before choosing eBPF
Why eBPF stood out
Six lessons from running eBPF in production
1. Navigate the edge cases of eBPF program loading and kernel hook points
2. Safely capturing and enriching kernel data reliably is harder than it looks
3. eBPF introduces an attack surface that should be monitored and audited
4. Kernel resources are shared: account for other eBPF-based tools
5. Measuring performance impact is a necessary evil and a two-step process
6. Best practices before rolling out to production, and acknowledging the risks
What’s next?
Closing thoughts
