A deep technical analysis of the April 2026 Bluesky outage, examining the cascade of failures that caused the incident. The post explains how large batch RPC calls caused goroutines to exhaust ephemeral TCP ports when connecting to memcached, how the TIME_WAIT state prevented port reuse, and how this led to a memory cascade as the Go runtime spawned excessive OS threads. The improvised fix involved randomly selecting from Linux's full 127.0.0.0/8 loopback range to multiply the available (IP, port) pairs. The post also covers the gomemcache connection pool behavior and the diagnostic challenges of multi-failure incidents, and frames the failure patterns using David Woods' 'Messy 9' resilience engineering framework.
Table of contents
Interpreting the error message
Saturation, part 1: ephemeral ports
Bitten by time lags – TIME_WAIT
Saturation, part 2: memory
The workaround: leveraging multiple loopbacks
Diagnostic challenges
The messy 9