Mount Mayhem at Netflix: Scaling Containers on Modern CPUs
Netflix engineers diagnosed a severe container launch bottleneck after migrating from a virtual kubelet+Docker runtime to kubelet+containerd with per-container user namespaces. The new runtime uses kernel idmap mounts, creating one mount per image layer — O(n) mount operations per container — all competing for global VFS mount locks. On r5.metal instances (dual-socket, multi-NUMA), this caused 30-second health check timeouts and system lockups. Deep profiling with perf and Intel's Top-down Microarchitecture Analysis (TMA) revealed that 95.5% of pipeline slots were stalled on contested memory accesses, with NUMA remote-memory latency and hyperthreading amplifying the contention. Benchmarks across instance types showed AMD's distributed chiplet cache architecture (m7a) scaling far better than Intel's centralized mesh (m7i), and disabling hyperthreading improved latency by 20-30%. The software fix, contributed upstream to containerd, idmap-mounts the common parent directory of all layers instead of each layer individually, reducing mount operations from O(n) to O(1) per container and eliminating the global lock as a bottleneck.
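The essence of the fix can be sketched with a small model. This is illustrative only — not the actual containerd code — and the paths and helper names are hypothetical; it simply counts how many idmap mount operations (each of which must take the global VFS mount lock) are issued under the old per-layer scheme versus the upstreamed common-parent scheme:

```python
import os

def idmap_mounts_per_layer(layers):
    """Old scheme: one idmap mount per image layer -> O(n) lock-holding ops."""
    # Each entry stands in for one mount syscall contending on the global lock.
    return [f"idmap-mount {layer}" for layer in layers]

def idmap_mount_common_parent(layers):
    """Upstreamed fix: idmap-mount the layers' shared parent once -> O(1)."""
    parent = os.path.commonpath(layers)
    return [f"idmap-mount {parent}"]

# Hypothetical snapshot layout for a 64-layer image.
layers = [f"/var/lib/containerd/snapshots/{i}/fs" for i in range(64)]

print(len(idmap_mounts_per_layer(layers)))    # 64 mounts, O(n)
print(len(idmap_mount_common_parent(layers))) # 1 mount, O(1)
```

With n layers, the per-layer scheme serializes n lock acquisitions per container launch (multiplied across concurrent launches), while the common-parent scheme takes the lock once, which is why the contention disappears regardless of image depth.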