Best of Kubernetes · January 2026

  1. Video
    Kiki's Bytes · 14w

    How I Corrupted 2 Million+ DB Records in Prod

    A migration script meant to encrypt sensitive user data in MongoDB corrupted over 2 million records by accidentally wiping out the unencrypted fields. The bug occurred because MongoDB's $set operation replaces entire nested objects rather than merging properties. Running the migration twice left both the encrypted and unencrypted fields undefined, causing queries to match millions of records and memory usage to spike in Kubernetes pods. Recovery required restoring a database backup taken before the migration and cross-referencing it with Redshift data to identify users who registered during the incident window.
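The root cause is easy to demonstrate: `$set` with a whole subdocument as the value replaces that subdocument, while `$set` with dot-notation paths merges into it. A minimal pure-Python model of those semantics (no real MongoDB client; the document shape here is invented for illustration):

```python
import copy

def mongo_set(doc, update):
    """Toy model of MongoDB's $set: a dot-notation key merges into the
    nested object, but a plain key replaces the nested object wholesale."""
    doc = copy.deepcopy(doc)
    for key, value in update.items():
        if "." in key:                      # dot path: merge into subdocument
            parts = key.split(".")
            target = doc
            for part in parts[:-1]:
                target = target.setdefault(part, {})
            target[parts[-1]] = value
        else:                               # plain key: replace the whole value
            doc[key] = value
    return doc

user = {"profile": {"ssn": "123-45-6789", "name": "Ada"}}

# Dot-notation preserves sibling fields.
safe = mongo_set(user, {"profile.ssn_encrypted": "AQIC..."})
assert safe["profile"]["name"] == "Ada"

# Setting the whole subdocument drops "name" and "ssn" entirely --
# the replace-not-merge behavior behind the incident.
buggy = mongo_set(user, {"profile": {"ssn_encrypted": "AQIC..."}})
assert "name" not in buggy["profile"]
```

In a real migration, updating with `{"profile": {...}}` instead of `{"profile.ssn_encrypted": ...}` is exactly the kind of mistake that silently destroys sibling fields.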

  2. Video
    CodeHead · 15w

    The Best Way To Learn DevOps in 2026

    Learning DevOps effectively means understanding it as a mindset for reliably moving code to production, not just a collection of tools. Start with Linux fundamentals (processes, networking, system commands) before diving into containers and orchestration. Follow a single application through its entire lifecycle: write it, containerize it, deploy it, break it intentionally, and observe what happens. Implement CI/CD pipelines for consistency, learn cloud infrastructure as code with tools like Terraform, and master observability through logs, metrics, and traces. The key is choosing the simplest architecture that works and only adding complexity when it solves real problems, not to pad a resume.

  3. Article
    MetalBear · 14w

    How Our Engineering Team Uses AI

    MetalBear's engineering team shares practical experiences using AI coding tools while building mirrord, a Kubernetes development tool in Rust. AI proves most valuable for understanding unfamiliar code, exploring architectural alternatives, and generating scripts. It struggles with complex architectures and long-running reasoning tasks. The team finds ChatGPT most reliable for iteration, Gemini best for deep research but prone to losing context, and Claude Code somewhere in between. Success depends on scoping problems tightly and controlling context rather than which model is used. AI accelerates tedious and exploratory work but cannot replace deep system understanding.

  4. Article
    freeCodeCamp · 14w

    Build Your Own Kubernetes Operators with Go and Kubebuilder

    A comprehensive 6-hour video course teaches how to build custom Kubernetes operators and controllers from scratch using Go and Kubebuilder. The course covers controller theory, Kubernetes extensibility, environment setup, API and logic building, hands-on development, and advanced internals including Informers, Caches, Finalizers, and Idempotency. A practical example demonstrates managing AWS EC2 instances directly from Kubernetes, treating Kubernetes as an SDK rather than just a deployment platform.

  5. Article
    vLLM · 15w

    Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers

    vLLM Playground is a new open-source web interface that simplifies managing and interacting with vLLM servers across platforms. It eliminates command-line complexity through container orchestration, offering one-click operations for starting servers, switching models, and configuring settings. Key features include structured outputs (JSON Schema, regex, grammar), tool/function calling, GuideLLM benchmarking integration, and access to 17+ pre-configured model recipes. The tool supports local development on macOS/Linux and enterprise deployment on Kubernetes/OpenShift with the same unified UI. Installation is straightforward via pip, with automatic container management handling the vLLM lifecycle.

  6. Article
    InfoQ · 11w

    OpenEverest: Open Source Platform for Database Automation

    Percona announced OpenEverest, an open-source platform for automated database provisioning and management on Kubernetes. Built on Kubernetes operators, it supports MySQL, PostgreSQL, and MongoDB, offering features like automated backups, scaling, and disaster recovery while avoiding vendor lock-in. The platform provides both a web UI and REST API for managing database clusters. Originally launched as Percona Everest, it's transitioning to independent open governance with plans to donate to the CNCF. The latest version adds PostgreSQL 18.1 support and NodePort networking, with future plans to support ClickHouse, Vitess, and observability integrations.

  7. Article
    Grab Tech Blog · 13w

    Docker lazy loading at Grab: Accelerating container startup times

    Grab implemented Docker image lazy loading using SOCI (Seekable OCI) technology to solve slow container startup times caused by large images. The solution achieved 4x faster image pull times on fresh nodes, 30-40% faster P95 startup times in production, and 60% improvement in download times after configuration tuning. Unlike traditional image pulls that download all layers before starting, lazy loading uses remote snapshotters to fetch data on-demand via FUSE filesystems. Grab chose SOCI over eStargz because it's natively supported on Bottlerocket OS, doesn't require image conversion, and maintains the same application startup time as standard images while dramatically reducing image pull time.

  8. Article
    Three Dots Labs · 12w

    The Distributed Monolith Trap (And How to Escape It)

    Microservices solve organizational problems more than technical ones. Starting with a well-structured modular monolith is recommended over premature service splitting. Tight coupling persists even when services communicate over HTTP instead of function calls. Key warning signs include excessive inter-service calls and frequent cross-service changes. Event storming helps identify proper boundaries by mapping domain events and behaviors. Sometimes merging tightly coupled microservices back together is the right solution. Team structure should drive architecture decisions (reverse Conway maneuver), not the other way around.

  9. Video
    TechWorld with Nana · 13w

    If I would start DevOps from 0 - How would I start and what would I learn

    A structured learning path for DevOps beginners breaks down into six phases over several months. Start with Linux fundamentals, bash scripting, and git (1-2 months). Move to cloud basics focusing on AWS compute, storage, and networking (1-2 months). Learn infrastructure as code with Terraform (1 month). Master containerization with Docker and Kubernetes (1-2 months). Build CI/CD pipelines with Jenkins, GitHub Actions, or GitLab CI (1-2 months). Finally, cover observability with Prometheus and Grafana (1 month). The key mistake to avoid is learning tools in isolation—instead, combine technologies through hands-on projects that build on each other continuously rather than starting from scratch each time.

  10. Article
    ByteByteGo · 14w

    How Lyft Built an ML Platform That Serves Millions of Predictions Per Second

    Lyft built LyftLearn Serving, an ML platform handling millions of predictions per second using a microservices architecture. Instead of a shared monolithic system, they generate independent microservices for each team via configuration templates. The platform separates data plane concerns (runtime performance, inference execution) from control plane concerns (deployment, versioning, testing). Key features include automated model self-tests, flexible library support (TensorFlow, PyTorch), and dual interfaces for engineers and data scientists. The architecture uses Flask/Gunicorn for HTTP serving, Kubernetes for orchestration, and Envoy for load balancing. Over 40 teams migrated from the legacy system, achieving team autonomy while maintaining platform consistency.

  11. Article
    Red Hat Developer · 15w

    The state of open source AI models in 2025

    2025 saw significant growth in open source AI models, particularly from Chinese labs like DeepSeek, Qwen, and Moonshot AI's Kimi K2. These models now rival proprietary options like ChatGPT while offering cost control and on-premises deployment. The landscape includes model families of various sizes (from 0.5B to 1T parameters) for different use cases: Qwen for versatility, Kimi K2 for agentic workflows and coding, OpenAI's gpt-oss for tool calling, and small language models for edge devices. Enterprise adoption is growing in regulated sectors requiring data sovereignty. Tools like Ollama, RamaLama, and vLLM make deployment accessible, from local hardware to production Kubernetes environments.

  12. Article
    Arcjet · 15w

    Arcjet's tech stack

    Arcjet's architecture combines WebAssembly modules written in Rust embedded in SDKs, a Go-based gRPC decision API for low-latency security decisions, and a region-aware data pipeline using AWS SNS, SQS, and ClickHouse. The stack includes TypeScript/Python SDKs, Valkey for rate limiting, DynamoDB for dynamic rules, and runs on AWS EKS with isolated regional deployments. Development uses devcontainers with Docker Compose and LocalStack for AWS emulation, while security is layered with automated scanning tools and dependency management.

  13. Article
    CNCF · 14w

    OpenCost: Reflecting on 2025 and looking ahead to 2026

    OpenCost, a CNCF incubating project for Kubernetes cost management, released 11 versions in 2025 with major features including Prometheus-optional operation, an AI-powered MCP server for natural language cost queries, and improved cloud provider support. The project expanded through mentorship programs that delivered integration testing, the MCP server, and KubeModel for Data Model 2.0. Looking ahead to 2026, priorities include completing KubeModel, adding AI usage costing features, and enhancing supply chain security.

  14. Article
    ByteByteGo · 13w

    How Pinterest Built An Async Compute Platform for Billions of Task Executions

    Pinterest rebuilt their asynchronous job processing platform from Pinlater to Pacer to handle billions of daily tasks. The original system suffered from database lock contention, lack of queue isolation, and inefficient sharding. Pacer introduced dedicated dequeue broker services managed by Helix, eliminated lock contention through single-broker partition ownership, implemented in-memory caching for sub-millisecond latency, adopted adaptive sharding based on queue size, and isolated worker pods on Kubernetes with custom resource allocations per queue.

  15. Article
    OctopusDeploy · 15w

    What's new in Argo CD 3.2?

    The Argo CD 3.2 GA release introduces significant improvements across multiple areas: UI enhancements including hydration status on app tiles and sortable columns, ApplicationSet controller performance upgrades with better concurrency and error reporting, new health checks for DatadogMetric resources, server-side diff support for more accurate resource comparison, and Hydrator upgrades with custom commit messages and automatic .gitattributes generation. These changes improve usability, observability, and reliability for teams running GitOps at scale.

  16. Article
    Salesforce Engineering · 12w

    Automating Global Rollback for 1.5 Trillion Requests in 10 Minutes

    Salesforce's Edge team reduced global rollback time from 8-12 hours to 10 minutes by implementing a blue-green deployment architecture on Kubernetes. The solution maintains two fully scaled deployments simultaneously, with custom autoscaling logic that evaluates CPU across both fleets to ensure capacity parity. Traffic cutover is automated through service label updates combined with explicit TCP connection draining via mutual TLS, enabling rapid recovery while preserving four-nines availability for a platform handling 1.5 trillion monthly requests and 23 petabytes of traffic across 21+ global points of presence.
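The cutover mechanism is essentially a label-selector flip: both fleets stay fully scaled, and the Service is repointed from one to the other. A toy sketch of that idea, with invented pod names and labels (the real system also drains TCP connections explicitly, which this omits):

```python
def select_endpoints(pods, selector):
    """Return pods whose labels satisfy every key/value in the selector,
    mimicking how a Kubernetes Service picks its endpoints."""
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

pods = [
    {"name": "edge-blue-0",  "labels": {"app": "edge", "color": "blue"}},
    {"name": "edge-blue-1",  "labels": {"app": "edge", "color": "blue"}},
    {"name": "edge-green-0", "labels": {"app": "edge", "color": "green"}},
    {"name": "edge-green-1", "labels": {"app": "edge", "color": "green"}},
]

service = {"app": "edge", "color": "blue"}      # live fleet
assert select_endpoints(pods, service) == ["edge-blue-0", "edge-blue-1"]

service = {"app": "edge", "color": "green"}     # one label flip = rollback
assert select_endpoints(pods, service) == ["edge-green-0", "edge-green-1"]
```

Because the standby fleet is already scaled and warm, the rollback cost is only the selector update plus connection draining, not a redeploy.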

  17. Article
    CNCF · 12w

    Introducing Kthena: LLM inference for the cloud native era

    Kthena is a new open-source sub-project of Volcano designed for LLM inference orchestration on Kubernetes. It addresses production challenges like low GPU/NPU utilization, latency-throughput tradeoffs, and multi-model management through intelligent routing, KV Cache-aware scheduling, and Prefill-Decode disaggregation. The system includes a high-performance router and controller manager that support topology-aware scheduling, gang scheduling, autoscaling, and multiple inference engines (vLLM, SGLang, Triton). Benchmarks show 2.73x throughput improvement and 73.5% TTFT reduction compared to random routing. Backed by Huawei Cloud, China Telecom, DaoCloud, and other industry partners.
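KV Cache-aware routing, in spirit, prefers the replica whose warm prefix cache overlaps most with the incoming prompt, so fewer tokens need to be prefilled from scratch. A deliberately simplified sketch of that heuristic — not Kthena's actual API; worker names and token IDs are invented:

```python
def prefix_overlap(a, b):
    """Length of the shared leading token run between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens, workers):
    """Toy KV-cache-aware router: send the request to the worker whose
    cached prefix shares the longest leading overlap with the prompt."""
    return max(workers,
               key=lambda w: prefix_overlap(prompt_tokens, w["cached_prefix"]))

workers = [
    {"name": "gpu-0", "cached_prefix": [1, 2, 3, 4]},   # warm long prefix
    {"name": "gpu-1", "cached_prefix": [1, 9]},
]

# Prompt shares 3 tokens with gpu-0's cache but only 1 with gpu-1's.
assert route([1, 2, 3, 7], workers)["name"] == "gpu-0"
```

A production router would weigh cache overlap against load and topology; this shows only the cache-affinity signal that motivates the reported throughput and TTFT gains.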

  18. Article
    CNCF · 15w

    Viettel joins CNCF as a Gold Member

    Viettel, Vietnam's largest telecommunications and technology group, has joined the Cloud Native Computing Foundation as a Gold Member. The company operates one of Southeast Asia's largest open source cloud infrastructures with extensive production deployments of OpenStack and Kubernetes supporting enterprise, government, and critical national infrastructure workloads. Viettel will contribute operational expertise from telecom environments and real-world experience operating cloud native platforms at national scale, with focus on security, data sovereignty, and large-scale production reliability.

  19. Article
    The New Stack · 15w

    Bryan Cantrill: How Kubernetes Broke the AWS Cloud Monopoly

    Kubernetes broke AWS's cloud dominance by introducing a vendor-neutral orchestration layer that eliminated API lock-in. Before 2014, AWS seemed unbeatable with five times the capacity of competitors and relentless execution. Companies felt trapped by AWS APIs, believing competitors like Google Cloud and Azure could never catch up without API compatibility. Kubernetes changed this by allowing applications to be built against its APIs instead of cloud-specific ones, enabling true multi-cloud portability. Google open-sourced Kubernetes to encourage cloud neutrality, knowing they had the most to gain as the underdog. While AWS still leads with 30% market share, the cloud market has expanded into a trillion-dollar industry with diverse participants, partly thanks to Kubernetes democratizing infrastructure orchestration.

  20. Article
    BigData Boutique blog · 12w

    OpenSearch Kubernetes Operator 3.0 - Stability and Resilience Finally Delivered

    OpenSearch Kubernetes Operator 3.0 Alpha introduces major stability improvements including quorum-safe rolling restarts, multi-namespace support, TLS certificate hot reloading, and gRPC API support. The release addresses critical production issues like upgrade deadlocks, split-brain scenarios, and cluster instability through over 100 changes. Key features include SmartScaler enabled by default, init-container and sidecar support, NFS volumes, and OpenSearch 3.0 compatibility. The API is migrating from opensearch.opster.io to opensearch.org with automatic migration handling. Breaking changes include new security defaults and enabled validation webhooks. The alpha is recommended for testing in lower environments first, with GA release planned after beta testing.

  21. Article
    CNCF · 14w

    CNCF Announces Dragonfly’s Graduation

    Dragonfly, a cloud native image and file distribution system, has graduated from CNCF after demonstrating production readiness and widespread adoption. The project uses peer-to-peer technology to distribute container images, OCI artifacts, and AI models at scale, saving up to 90% storage bandwidth and reducing launch times from minutes to seconds. Major organizations including Ant Group, Alibaba, Datadog, DiDi, and Kuaishou use Dragonfly to power large-scale container and AI workloads. Since joining CNCF, the project has seen over 3,000% growth in code contributions, expanding from 45 contributors across 5 companies to 271 contributors across 130+ companies. Future development will focus on accelerating AI model distribution using RDMA, optimizing image layouts for AI workloads, and implementing load-aware scheduling.

  22. Article
    Halodoc · 12w

    Reducing Amazon EKS Compute Costs by 35%: Migrating Production Workloads from Graviton3 to Graviton4

    Halodoc migrated their Amazon EKS workloads from Graviton3 to Graviton4 processors, achieving 35% cost savings through a data-driven approach. The migration involved two-phase validation: hardware benchmarking with Sysbench showed 28% CPU throughput improvement and 64% memory bandwidth gains, while application-level testing with JMeter demonstrated lower latency and resource utilization. By combining the processor upgrade with strategic resource right-sizing (15% CPU and 10% memory reduction), they reduced node count by 40% and maintained performance while cutting costs. The zero-downtime migration used controlled node pool rebalancing, followed by a one-week stabilization period before applying resource optimizations.

  23. Article
    Istio · 13w

    Announcing Istio 1.28.3

    Istio 1.28.3 is a patch release that fixes several bugs to improve robustness. Key fixes include resolving goroutine memory leaks in ambient mode, addressing informer failures in ambient multicluster setups that previously required istiod restarts, and fixing nftables operation crashes and pod deletion failures. The release also adds a service.selectorLabels field to the gateway Helm chart for custom service selector labels during revision-based migrations.

  24. Article
    Isovalent · 13w

    What Is Kubernetes Networking?

    Kubernetes networking enables communication between pods, services, nodes, and external resources through a flat network structure where each pod receives its own IP address. The Container Network Interface (CNI) manages pod networking, IP assignment, and routing without requiring network address translation for internal traffic. Core principles include unique pod IPs, direct pod-to-pod communication across nodes, shared network namespaces within pods, and Services that provide stable virtual IPs for load balancing. Network Policies control traffic flow between pods for security. CNI plugins like Cilium use eBPF for high-performance routing and enhanced observability, replacing traditional iptables-based approaches.
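The Network Policy model described above can be sketched as a tiny evaluator: once any policy selects a pod, that pod's ingress becomes default-deny unless some rule matches the traffic source. This toy version ignores namespaces, ports, IP blocks, and egress, which real NetworkPolicies also cover:

```python
def matches(labels, selector):
    """True if the labels satisfy every key/value in the selector."""
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(dst_pod, src_pod, policies):
    """Toy NetworkPolicy check: pods not selected by any policy accept all
    traffic; selected pods accept only sources matching an ingress rule."""
    selecting = [p for p in policies if matches(dst_pod["labels"], p["podSelector"])]
    if not selecting:
        return True                      # unselected pod: default allow
    return any(matches(src_pod["labels"], rule)
               for p in selecting for rule in p["ingress_from"])

web = {"labels": {"app": "web"}}
db  = {"labels": {"app": "db"}}
job = {"labels": {"app": "batch"}}

# One policy: only pods labeled app=web may reach pods labeled app=db.
policies = [{"podSelector": {"app": "db"}, "ingress_from": [{"app": "web"}]}]

assert ingress_allowed(db, web, policies) is True    # web -> db allowed
assert ingress_allowed(db, job, policies) is False   # batch -> db denied
assert ingress_allowed(web, job, policies) is True   # web unselected: open
```

This also illustrates why enforcement lives in the CNI plugin: something on the datapath (iptables rules, or eBPF programs in Cilium's case) has to evaluate exactly this kind of match per connection.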