Running database migrations safely during zero-downtime ECS rolling deployments requires strict sequencing guarantees that ECS doesn't provide natively. This post presents an event-driven architecture using ECR EventBridge events, AWS Step Functions, and digest-pinned Fargate tasks to ensure Alembic migrations run exactly once and complete successfully before any new application containers serve traffic. The approach eliminates the CI migration race condition and the init container concurrency problem (N parallel Alembic processes). Key design decisions include digest pinning over tag pinning for immutable image references, a dedicated migration task definition with DDL-privileged credentials, and using ecs:runTask.sync for event-driven task completion detection. The post also covers PostgreSQL DDL locking mitigations (lock_timeout, CREATE INDEX CONCURRENTLY) and provides Terraform skeleton code for the infrastructure.
Table of contents
The Problem: Schema Changes and Rolling Deployments
The Architecture
Why the Naive Patterns Break
Walking Through Each Step
Key Design Decisions
Terraform Skeleton
Summary