A CPU profiler for Linux (Superluminal) triggered periodic 250ms system freezes on kernel 6.17. Through methodical debugging — minimal eBPF repro, code reading, and kernel mailing list collaboration — the root cause was traced to three bugs in the newly introduced resilient queued spinlock (rqspinlock) used by the eBPF ring buffer. The core issue: a NMI (non-maskable interrupt) could fire between a successful compare-exchange lock acquisition and the update of the held-lock table, causing the AA deadlock detection to miss recursive lock attempts and spin for the full 250ms timeout. Two additional bugs were found: deadlock checks not triggering immediately on spinwait entry (causing 1-2ms stalls), and NMI storms starving the lock holder of CPU time (causing 6-26ms stalls). All three bugs were fixed by kernel maintainers Kumar Kartikeya Dwivedi and Alexei Starovoitov and backported to kernels 6.17 and 6.18.

37m read timeFrom rovarma.com
Post cover image
Table of contents
Initial analysis #Debugging the kernel #Finding a minimal repro #Digging deeper #A relatively short long primer on spinlocks #The queued spinlock #Putting the ‘resilient’ in ‘resilient queued spinlocks’ #Race conditions on a single CPU #Checking for deadlocks…too late #Death by a thousand NMIs #Wrapping up #

Sort: