Best of Cloud · January 2026

  1.
    Article
    Modal · 19w

    Keeping 20,000 GPUs healthy

    Modal manages over 20,000 GPUs across AWS, GCP, Azure, and OCI, encountering significant reliability and performance differences between cloud providers. Their GPU health system includes instance type benchmarking and selection, machine image preparation with automated testing, boot-time validation, and continuous passive monitoring (via DCGM and dmesg) plus weekly active healthchecks (DCGM diag, GPUBurn, NCCL tests). Key findings: Cloud providers vary dramatically in H100 performance (up to 50% differences), temperature management (some reaching 94°C), and ECC error rates. GPUs account for 58.7% of training failures in Meta's LLaMA 3 development, compared to just 0.5% for CPUs, highlighting the reliability gap.

  2.
    Article
    The Last Week in AWS · 19w

    AWS in 2026: The Year of Proving They Still Know How to Operate

    AWS remains financially strong with $132B annual revenue and 29% market share, but faces challenges in operational excellence and talent retention. While Azure's growth numbers are questionable due to unclear financial reporting, Google Cloud emerges as the real competitive threat with clean 34% growth and a $155B backlog. AWS's re:Invent 2025 announcements signal strategic shifts toward multi-cloud acceptance, on-premises investment, and democratized AI model training. Though AWS's AI capabilities have become credible with Nova 2 and Trainium3, execution concerns persist, particularly the October us-east-1 outage response time and 69-81% regretted attrition among senior engineers. Success in 2026 depends on whether AWS can maintain operational excellence while retaining institutional knowledge during organizational restructuring.

  3.
    Article
    Hacker News · 17w

    Europe wants to end its dangerous reliance on US internet technology

    Europe's heavy reliance on US cloud computing providers (AWS, Microsoft Azure, and Google Cloud control ~70% of the market) creates vulnerability to service disruptions from technical failures, geopolitical disputes, or cyber-attacks. Recent outages at AWS and Cloudflare demonstrated this risk. European governments are responding by investing in digital sovereignty initiatives: Schleswig-Holstein replaced 70% of its Microsoft licenses with open-source alternatives; France, Germany, the Netherlands, and Italy are developing sovereign digital platforms; and Sweden built its own collaboration system in domestic data centers. The EU is developing a cloud sovereignty framework and an upcoming Cloud and AI Development Act to keep European data under European control, treating digital infrastructure as being as critical as physical infrastructure.

  4.
    Article
    The New Stack · 20w

    Bryan Cantrill: How Kubernetes Broke the AWS Cloud Monopoly

    Kubernetes broke AWS's cloud dominance by introducing a vendor-neutral orchestration layer that eliminated API lock-in. Before 2014, AWS seemed unbeatable with five times the capacity of competitors and relentless execution. Companies felt trapped by AWS APIs, believing competitors like Google Cloud and Azure could never catch up without API compatibility. Kubernetes changed this by allowing applications to be built against its APIs instead of cloud-specific ones, enabling true multi-cloud portability. Google open-sourced Kubernetes to encourage cloud neutrality, knowing they had the most to gain as the underdog. While AWS still leads with 30% market share, the cloud market has expanded into a trillion-dollar industry with diverse participants, partly thanks to Kubernetes democratizing infrastructure orchestration.

  5.
    Article
    RUBYLAND · 20w

    Big Tech Exit

    A developer shares their journey toward digital independence from Big Tech companies, documenting current dependencies across Apple, Microsoft, and Google services. The author outlines specific 2026 goals including migrating from iCloud to self-hosted alternatives like Immich and Jellyfin, moving remaining projects from GitHub to Codeberg, testing PostmarketOS on a Fairphone 5, and setting up a Pi-hole for DNS privacy. They advocate for incremental progress using the "plus one rule" - adding alternatives alongside existing services rather than forcing immediate switches - and encourage others to start with small, manageable changes.

  6.
    Article
    Barion · 19w

    i think i'm cooked

  7.
    Article
    Where's Your Ed At · 16w

    Premium: The Hater's Guide to Oracle

    Oracle, a dominant enterprise software and database provider, has dramatically shifted from its profitable core business to massive AI infrastructure investments. The company quintupled its property and equipment from $12.8B to $67.85B between 2022 and 2025, primarily acquiring GPUs for cloud compute. Oracle signed a $300B contract with OpenAI starting in 2027, but faces severe challenges: negative cash flow of $13B last quarter, $56B in new debt, $248B in data center lease obligations, and gross margins falling from 79% to 68.54%. The company's survival now depends entirely on OpenAI's ability to pay, which would require OpenAI to generate four times Microsoft Azure's annual revenue while burning billions. Oracle also acquired 15% of TikTok, which loses billions annually. Founder Larry Ellison's wealth, tied almost entirely to Oracle stock, faces existential risk if OpenAI cannot fulfill its obligations.