This benchmarking study compares the blast radius of small (64-core) and large (128-core) Red Hat OpenShift compact clusters under 60-65% sustained utilization. Two three-node clusters were tested under both planned and unplanned failure scenarios. Large nodes recovered ~47% faster during planned maintenance (5m28s vs 10m15s), while unplanned outage recovery times were nearly identical (11m48s vs 11m05s) because they were dominated by detection timers rather than workload density. The key finding is that blast radius is governed by proportional resource scaling (CPU, network bandwidth, storage I/O) and remediation strategy, not node size alone. The article also details the configuration of the KubeDescheduler, Node Health Check (NHC), and Self Node Remediation (SNR) operators used in the test setup.
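
The headline percentages can be checked from the reported durations. The sketch below is only an illustration: the helper names `to_seconds` and `reduction_pct` are hypothetical, and it assumes that "faster" means a reduction in recovery time relative to the small-node cluster.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '5m28s' into whole seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def reduction_pct(baseline: str, improved: str) -> float:
    """Percentage reduction in recovery time relative to the baseline."""
    base = to_seconds(baseline)
    return (base - to_seconds(improved)) / base * 100


# Planned maintenance: small nodes 10m15s, large nodes 5m28s (figures from the summary).
planned = reduction_pct("10m15s", "5m28s")

# Unplanned outage: absolute gap between the two reported times, in seconds.
unplanned_gap = to_seconds("11m48s") - to_seconds("11m05s")

print(f"Planned maintenance reduction: {planned:.1f}%")
print(f"Unplanned outage gap: {unplanned_gap}s")
```

Under that assumption the planned-maintenance reduction works out to roughly 47%, while the unplanned gap is under a minute on runs of about eleven minutes, which is consistent with detection timers dominating.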

Table of contents
Goals and motivation
Test environment overview
Test methodology
Test results
Technical analysis
Blast radius revisited
Risk mitigation strategies
Key findings
KubeVirt Descheduler configuration
Node health check (NHC) policy
Self node remediation (SNR) strategy
Integrated failure management flow
Red Hat Ansible inventory and VM provisioning parameters
Automated stress injection using stress-ng
End-to-end VM lifecycle automation