A Java developer investigates why a Quarkus monitoring application running in Kubernetes keeps getting killed with exit code 137 (OOM kill). The talk walks through the forensic process: examining Kubernetes logs, Grafana metrics, heap dumps, and Native Memory Tracking reports. Key findings are that 150 threads at a 2 MB default stack size account for roughly 300 MB of native memory, that HTTPS buffers add further overhead, and that native memory fragmentation may be a culprit. The investigation covers Java's three memory areas (heap, non-heap, and native) and the JVM tuning parameters that affect them. It also offers practical tips: keep the container alive with extra memory so there is time to gather evidence, exacerbate the issue so it reproduces faster, and use LLMs cautiously for analysis. The root cause remains unresolved. A Netty committer suggested that Quarkus/Netty uses unsafe memory allocation that JVM tools cannot track, and recommended pmap for further investigation.
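The thread-stack arithmetic and the split between heap and non-heap memory can be sketched in a few lines of Java. This is a minimal illustration, not the code used in the talk: the class name `MemoryAreasSketch`, its helper methods, and the constant `ASSUMED_STACK_BYTES` are hypothetical, and the 2 MB stack size is taken from the summary rather than read from the running JVM. It uses only the standard `java.lang.management` beans, which report heap and non-heap usage but not native allocations.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class MemoryAreasSketch {

    // Assumption: 2 MB per thread stack, as stated in the talk summary.
    // The real value depends on the platform and on the -Xss setting.
    static final long ASSUMED_STACK_BYTES = 2L * 1024 * 1024;

    // Upper bound on memory reserved for thread stacks.
    static long estimateStackBytes(int threadCount, long stackBytesPerThread) {
        return (long) threadCount * stackBytesPerThread;
    }

    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int liveThreads = threads.getThreadCount();

        System.out.println("Heap committed (MiB):     " + toMiB(heap.getCommitted()));
        System.out.println("Non-heap committed (MiB): " + toMiB(nonHeap.getCommitted()));
        System.out.println("Live threads:             " + liveThreads);
        System.out.println("Estimated stacks (MiB):   "
                + toMiB(estimateStackBytes(liveThreads, ASSUMED_STACK_BYTES)));

        // The figure from the talk: 150 threads at 2 MB each.
        System.out.println("150 threads at 2 MB (MiB): "
                + toMiB(estimateStackBytes(150, ASSUMED_STACK_BYTES)));
    }
}
```

The stack estimate is an upper bound on reserved address space; how much of it is resident depends on how deep each thread's stack has actually grown. For a breakdown from the JVM itself, the process can be started with `-XX:NativeMemoryTracking=summary` and queried with `jcmd <pid> VM.native_memory summary`.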