April 2026 Outage Post-Mortem - Jim's Pckt

A Bluesky engineer details an 8-hour intermittent outage affecting ~50% of users. The root cause was a missing `errgroup.SetLimit` call in a `GetPostRecord` RPC handler, which allowed batches of 15-20k URIs to spawn tens of thousands of goroutines simultaneously, exhausting TCP ephemeral ports via memcached connection churn. This triggered a death spiral: port exhaustion caused memcached errors, which caused millions of log writes per second, which caused the Go runtime to spawn ~10x more OS threads, which stressed the GC into long stop-the-world pauses, which, combined with aggressive GOGC/GOMEMLIMIT settings, caused OOMs. On restart, sockets lingering in TIME_WAIT blocked new memcached connections, repeating the cycle. The band-aid fix was a custom dialer that randomized the loopback source IP to expand the available port space. The true fix was adding the missing concurrency limit. Lessons: add per-client observability, prefer Prometheus/OTEL metrics over high-volume logging, and always bound goroutine concurrency in batch endpoints.

#golang

#observability

Apr 10 • 7m read time • From pckt.blog

Table of contents

The Problem
The Root Cause
Death Spiral
Summary
