A Bluesky engineer details an 8-hour intermittent outage affecting ~50% of users. The root cause was a missing `errgroup.SetLimit` call in a `GetPostRecord` RPC handler, which allowed batches of 15-20k URIs to spawn tens of thousands of goroutines simultaneously, exhausting TCP ephemeral ports via memcached connection churn. This triggered a death spiral: port exhaustion caused memcached errors; the errors produced millions of log writes per second; the logging volume led the Go runtime to spawn ~10x more OS threads; the extra threads stressed the GC into long stop-the-world pauses; and those pauses, combined with aggressive GOGC/GOMEMLIMIT settings, caused OOMs. On restart, sockets lingering in TIME_WAIT blocked new memcached connections, repeating the cycle. The band-aid fix was a custom dialer that randomized the loopback source IP to expand the available port space. The true fix was adding the missing concurrency limit. Lessons: add per-client observability, prefer Prometheus/OTEL metrics over high-volume logging, and always bound goroutine concurrency in batch endpoints.