An investigation into unexpected disk space exhaustion on an Amazon RDS Postgres instance, caused by an inactive logical replication slot. The root cause is a combination of three factors: RDS writes a heartbeat to its internal rdsadmin database every 5 minutes, RDS configures the WAL segment size as 64 MB (versus the Postgres default of 16 MB), and the archive_timeout parameter forces a switch to a new WAL segment every 5 minutes whenever there has been any database activity. Because the heartbeat counts as activity, a new segment is started every 5 minutes even on an otherwise idle database, and an inactive replication slot retains every one of these 64 MB segments, which adds up to ~18 GB/day of disk growth. The fix is to never leave replication slots unattended, to set up alerts on the WAL retained per slot, and to call pg_logical_emit_message() periodically so that an otherwise stalled slot can advance.
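The ~18 GB/day figure follows directly from the segment size and the switch interval. A minimal sketch of that arithmetic, assuming one full segment is retained per 5-minute switch (the function and constant names below are illustrative, not from the original post):

```python
# Back-of-the-envelope estimate of WAL retained by an inactive slot.
# Assumption: every forced segment switch leaves one full-size segment
# on disk, and nothing is recycled while the slot is inactive.

MINUTES_PER_DAY = 24 * 60


def retained_wal_per_day_mb(segment_size_mb: int = 64,
                            switch_interval_min: int = 5) -> int:
    """Return the MB of WAL retained per day for the given settings."""
    segments_per_day = MINUTES_PER_DAY // switch_interval_min
    return segments_per_day * segment_size_mb


if __name__ == "__main__":
    rds_mb = retained_wal_per_day_mb()          # RDS: 64 MB segments
    default_mb = retained_wal_per_day_mb(16)    # stock Postgres: 16 MB segments
    print(f"RDS (64 MB segments):     {rds_mb} MB/day = {rds_mb / 1024:.1f} GB/day")
    print(f"Default (16 MB segments): {default_mb} MB/day = {default_mb / 1024:.1f} GB/day")
```

With the RDS settings this gives 288 segments per day, or 18432 MB, which matches the ~18 GB/day in the summary; the same switch interval with the default 16 MB segment size would retain a quarter of that.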