PostgreSQL's now() function returns the transaction start time, not the current wall-clock time, so its value stays frozen for the entire duration of a transaction. This caused a subtle bug in distributed locking logic inside the Emmett framework, where a retry loop made multiple calls to a stored procedure, all inside one transaction. Because now() never advanced, the timeout predicate that checks whether a previous processor owner has gone stale evaluated to the same result on every attempt, making retries useless. The fix was to use clock_timestamp() instead, which reads the actual wall clock on every call regardless of transaction boundaries. The post also reflects on why existing tests missed the bug: unit-level stored procedure tests never combined a retry policy with stale-row state, and end-to-end tests didn't hit the specific combination of crash, new instance ID, and retry timeout. The lesson is to write tests at the seam where the inner test invokes code differently from how production actually drives it.
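The frozen-clock effect described above can be reproduced without a database. The following is a minimal Python sketch, not Emmett's code and not a PostgreSQL driver API: `ManualClock`, `FakeTransaction`, `acquire_with_retries`, and all the numbers (heartbeat at 0 s, transaction start at 5 s, 10 s timeout, 3 s retry delay) are hypothetical, chosen only to show why a staleness check based on a transaction-pinned timestamp cannot change between retries.

```python
class ManualClock:
    """Hypothetical controllable clock, so the example needs no real sleeping."""

    def __init__(self, start: float) -> None:
        self.t = start

    def read(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        # Simulates the delay between retries by advancing time.
        self.t += seconds


class FakeTransaction:
    """Toy stand-in for a transaction; only the two clock behaviours are modelled."""

    def __init__(self, clock: ManualClock) -> None:
        self._clock = clock
        # Like now(): pinned once, when the transaction starts.
        self._started_at = clock.read()

    def now(self) -> float:
        return self._started_at

    def clock_timestamp(self) -> float:
        # Like clock_timestamp(): reads the clock again on every call.
        return self._clock.read()


def acquire_with_retries(clock, use_clock_timestamp, last_heartbeat=0.0,
                         timeout=10.0, retries=5, delay=3.0):
    """Return the attempt number on which the old owner looks stale, else None."""
    tx = FakeTransaction(clock)  # assumption: one transaction wraps every retry
    for attempt in range(1, retries + 1):
        current = tx.clock_timestamp() if use_clock_timestamp else tx.now()
        if current - last_heartbeat > timeout:
            return attempt
        clock.sleep(delay)
    return None


frozen = acquire_with_retries(ManualClock(5.0), use_clock_timestamp=False)
live = acquire_with_retries(ManualClock(5.0), use_clock_timestamp=True)
print(frozen, live)  # None 3
```

With the pinned timestamp, every attempt compares the same 5 seconds against the 10 second timeout, so all five retries fail. With the live clock, the third attempt sees 11 seconds elapsed and succeeds.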

11m read time
From architecture-weekly.com
Table of contents
The bug
Why my tests didn’t catch it
What I’m taking away
TLDR
