A CPU profiler for Linux (Superluminal) triggered periodic 250ms system freezes on kernel 6.17. Through methodical debugging — minimal eBPF repro, code reading, and kernel mailing list collaboration — the root cause was traced to three bugs in the newly introduced resilient queued spinlock (rqspinlock) used by the eBPF ring buffer. The core issue: a NMI (non-maskable interrupt) could fire between a successful compare-exchange lock acquisition and the update of the held-lock table, causing the AA deadlock detection to miss recursive lock attempts and spin for the full 250ms timeout. Two additional bugs were found: deadlock checks not triggering immediately on spinwait entry (causing 1-2ms stalls), and NMI storms starving the lock holder of CPU time (causing 6-26ms stalls). All three bugs were fixed by kernel maintainers Kumar Kartikeya Dwivedi and Alexei Starovoitov and backported to kernels 6.17 and 6.18.
Table of contents
Initial analysis #Debugging the kernel #Finding a minimal repro #Digging deeper #A relatively short long primer on spinlocks #The queued spinlock #Putting the ‘resilient’ in ‘resilient queued spinlocks’ #Race conditions on a single CPU #Checking for deadlocks…too late #Death by a thousand NMIs #Wrapping up #Sort: