Deployment Recovery: Fixing Broken systemd Services

A logging update accidentally toppled every deployment service. This is the breadcrumb trail I followed to stand them back up.

The Problem

The deploy ran, nothing deployed, and systemd threw error codes like confetti. All services failed at once, which is impressive if nothing else.

Root Causes

  1. Unsupported Caddy log format. Apparently I'm better at optimism than version checking.
  2. Duplicate site configs. Past-me forgot to clean up an old file and present-me paid for it.
  3. Overzealous service restarts. Restarting journald confused Caddy and nobody was happy.
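Root cause 2 is the kind of thing a small script could have caught before the deploy. What follows is a minimal sketch, not what I actually ran: it assumes the site configs are *.caddy files sitting in one directory, and it uses a rough brace-counting heuristic rather than a real Caddyfile parser. The function names and the example hostname are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path


def site_addresses(caddy_text):
    """Return the site addresses that open top-level blocks in Caddyfile text.

    Rough heuristic, not a real parser: it tracks brace depth only and
    skips the global options block and (snippet) definitions.
    """
    addresses = []
    depth = 0
    for raw in caddy_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # crude comment stripping
        if not line:
            continue
        if depth == 0 and line.endswith("{"):
            header = line[:-1].strip()
            # An empty header is the global options block; "(name)" is a snippet.
            if header and not header.startswith("("):
                # One block may list several addresses, split by commas or spaces.
                addresses.extend(a for a in re.split(r"[,\s]+", header) if a)
        depth += line.count("{") - line.count("}")
    return addresses


def find_duplicate_sites(config_dir, pattern="*.caddy"):
    """Map each site address defined in more than one file to those file names."""
    seen = defaultdict(list)
    for path in sorted(Path(config_dir).glob(pattern)):
        for address in site_addresses(path.read_text()):
            seen[address].append(path.name)
    return {addr: sorted(files) for addr, files in seen.items() if len(files) > 1}
```

Pointed at the config directory, a non-empty result would have flagged the leftover weblog-gavin.caddy next to whichever file replaced it.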

Debugging Tactics

Recovery Checklist

  1. Remove the fancy log format and let Caddy stick with JSON.
  2. Delete the stray weblog-gavin.caddy file everywhere.
  3. Restart only Caddy during deploys.
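Step 3 can be sketched as a deploy helper that validates the config first and then touches the Caddy unit and nothing else. This is an illustration under assumptions, not my actual deploy script: the unit name "caddy" and the function name are assumed, and the caller supplies the config path.

```python
import subprocess


def restart_caddy_only(caddyfile, run=subprocess.run):
    """Validate the Caddy config, then restart the caddy unit and nothing else.

    `run` is injectable so the function can be exercised without root.
    """
    # `caddy validate` exits non-zero when the config cannot be loaded, so
    # check=True aborts here, before any service has been touched.
    run(
        ["caddy", "validate", "--config", str(caddyfile), "--adapter", "caddyfile"],
        check=True,
    )
    # Assumed unit name; journald and everything else are left alone.
    run(["systemctl", "restart", "caddy"], check=True)
```

Validating first means an unsupported log format fails the deploy loudly instead of taking the running service down with it.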

Status

✅ Services recovered. Sites respond, journal behaves, nothing tries to eat the disk.

Lessons Learned

Next Steps

  1. Add a compatible access log format.
  2. Improve error handling in deploy scripts.
  3. Eventually support rollbacks, because I'm not always this lucky.
  4. Build Nimbus to watch all of this for me.
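For items 2 and 3, one possible shape is a helper that keeps the last known-good config around and restores it when the apply step fails. This is only a sketch of the idea, not something that exists yet: the function name, the ".bak" suffix, and the `apply` callback (whatever validates and restarts Caddy) are all hypothetical.

```python
import shutil
from pathlib import Path


def deploy_with_rollback(new_config, live_config, apply):
    """Install new_config over live_config; restore the old file if apply() fails."""
    live = Path(live_config)
    backup = live.with_name(live.name + ".bak")  # hypothetical backup naming
    had_previous = live.exists()
    if had_previous:
        shutil.copy2(live, backup)
    shutil.copy2(new_config, live)
    try:
        apply()
    except Exception:
        # Put the last known-good config back before re-raising the error.
        if had_previous:
            shutil.copy2(backup, live)
        else:
            live.unlink()
        raise
```

A real version would also need to restart Caddy again after restoring the file; this sketch only covers the file swap and leaves the error visible to the caller.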

"Every outage is free education, assuming you take notes."

The services survived and I got a checklist out of it. I'll call that a win.