Dev Picks the Wrong Database, Takes Down Company


Element Creations experienced a 24-hour outage of the matrix.org homeserver after an engineer accidentally deleted the production database while attempting to restore a failed server. The incident began with a hardware failure that required a database migration, but confusion over which server was the primary led to a destructive command being run on the wrong machine. Recovery took over a day for three reasons: restoring the 51TB backup was slow, the backup tool had a bug whose fix had not been applied in production, and replaying the write-ahead log was slow. The postmortem emphasizes faster backup restoration strategies, including local snapshots on copy-on-write filesystems such as ZFS, and highlights that operational errors in high-pressure situations are nearly inevitable.
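To give a feel for why a 51TB restore is slow, here is a rough back-of-the-envelope sketch. The throughput figures are hypothetical and are not taken from the postmortem; only the 51TB size comes from the summary above. The calculation uses decimal units and ignores write-ahead log replay, which would add further time.

```python
def restore_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_tb terabytes at a sustained rate in MB/s (decimal units)."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_s / 3600  # seconds -> hours


if __name__ == "__main__":
    # Hypothetical sustained restore rates, chosen only for illustration.
    for rate in (250, 500, 1000, 2000):
        print(f"{rate} MB/s -> {restore_hours(51, rate):.1f} h")
    # 250 MB/s -> 56.7 h
    # 500 MB/s -> 28.3 h
    # 1000 MB/s -> 14.2 h
    # 2000 MB/s -> 7.1 h
```

Even at a sustained 1 GB/s, copying the data alone would take roughly 14 hours, which is consistent with the postmortem's interest in local snapshots that avoid a full copy.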

14m watch time