Railway replaced its Docker-buildx GCP autoscaler with a fleet of microVM build cells running BuildKit on bare-metal hosts, now processing 66,000 builds per hour at peak. The post details the full engineering journey: a standalone-binary refactor that collapsed four layers into one process (cutting ~20s from cached builds), networking failures caused by VM bridge issues, cgroup v2 isolation to fix builds that got stuck when runc workers starved buildkitd, and a 10-second saving from reading OCI manifest metadata instead of decompressing image layers. The migration spanned GCP and AWS before landing on bare metal, with the egress problem now resolved.

10 min read · From blog.railway.com
Table of contents
What Builder v3 actually is
The first cracks: host versus guest
The rollback
The cgroup story
The 10-second metadata win
What the picture looks like now
The cheapest build is the one we don't run
