Blast radius validation: Large and small Red Hat OpenShift nodes

This benchmarking study compares blast radius impact between small (64-core) and large (128-core) Red Hat OpenShift compact clusters under 60-65% sustained utilization. Two three-node clusters were tested under both planned and unplanned failure scenarios. During planned maintenance, large nodes recovered ~47% faster (5m28s vs 10m15s). Unplanned outage recovery times were nearly identical (11m48s vs 11m05s) because they were dominated by detection timers rather than workload density. The key finding is that blast radius is governed by proportional resource scaling (CPU, network bandwidth, storage I/O) and by remediation strategy, not by node size alone. The article also details the configuration of the KubeDescheduler, Node Health Check (NHC), and Self Node Remediation (SNR) operators used in the test setup.
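The percentages quoted above follow directly from the reported recovery times. The short Python sketch below reproduces that arithmetic; the helper names `to_seconds` and `percent_faster` are illustrative and do not come from the article, and only the four durations are taken from the summary.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '5m28s' into whole seconds."""
    minutes, _, rest = duration.partition("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


def percent_faster(fast: str, slow: str) -> float:
    """Relative reduction in recovery time, as a percentage of the slower run."""
    fast_s, slow_s = to_seconds(fast), to_seconds(slow)
    return (slow_s - fast_s) / slow_s * 100


# Planned maintenance: the two recovery times reported in the summary.
planned = percent_faster("5m28s", "10m15s")

# Unplanned outage: absolute gap between the two reported recovery times.
unplanned_gap = abs(to_seconds("11m48s") - to_seconds("11m05s"))

print(f"Planned maintenance: {planned:.1f}% faster")
print(f"Unplanned outage gap: {unplanned_gap} seconds")
```

The planned-maintenance reduction works out to roughly 47%, while the unplanned gap is well under a minute on runs of about eleven minutes, which is consistent with the summary's point that detection timers dominate that scenario.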

#kubernetes

#openshift

Apr 02 • 12m read time • From developers.redhat.com

Table of contents

Goals and motivation
Test environment overview
Test methodology
Test results
Technical analysis
Blast radius revisited
Risk mitigation strategies
Key findings
KubeVirt Descheduler configuration
Node health check (NHC) policy
Self node remediation (SNR) strategy
Integrated failure management flow
Red Hat Ansible inventory and VM provisioning parameters
Automated stress injection using stress-ng
End-to-end VM lifecycle automation
