A Matrix homeserver's TLS certificate expired after 42 hours of silent renewal failures. Caddy was correctly queuing DNS-01 ACME challenge renewals every 10 minutes, but systemd-resolved was hanging on NXDOMAIN queries specifically for the takeonme.org zone while answering all other zones instantly. The root cause was a misconfigured NextDNS DoT setup in resolved.conf that triggered an obscure stub resolver bug. The immediate fix was pinning the Caddy container to use upstream DNS directly (1.1.1.1/8.8.8.8), and the permanent fix was removing the unnecessary NextDNS DoT config from the server. Key lessons: Caddy has no alerting for repeated ACME failures, its staging CA fallback can mask the real problem, DNS health checks should query actual service names not generic ones, and configuration drift from personal-device templates applied to servers can silently break things years later.

15m read timeFrom rant.mvh.dev
Post cover image
Table of contents
The alertsFirst reflex: are the containers up?What Caddy was actually doingThe DNS challenge and what it actually doesBypassing the local resolverThe shape of the bugGetting the cert renewed first, then debugging the restDrilling into the real bugWas this fleet-wide?The actual fixWhat I learned

Sort: