Deployment Recovery: Fixing Broken SystemD Services

debugging systemd recovery caddy

A logging update accidentally toppled every deployment service. This is the breadcrumb trail I followed to stand them back up.

The Problem

da ran, nothing deployed, and systemd threw error codes like confetti. All services failed at once—impressive, if nothing else.

Root Causes

Unsupported Caddy log format. Apparently I'm better at optimism than version checking.
Duplicate site configs. Past-me forgot to clean up an old file and present-me paid for it.
Overzealous service restarts. Restarting journald confused Caddy and nobody was happy.

Debugging Tactics

Read the journal before panicking.
Test components individually: Caddy first, then each deploy script.
Apply one fix at a time and verify before moving on. Future-me thanked me.

Recovery Checklist

Remove the fancy log format and let Caddy stick with JSON.
Delete the stray weblog-gavin.caddy file everywhere.
Restart only Caddy during deploys.

Status

✅ Services recovered. Sites respond, journal behaves, nothing tries to eat the disk.

Lessons Learned

Check version support before copy-pasting config blocks.
When refactoring, delete the corpse of the old config.
Restart the smallest thing that could possibly work.

Next Steps

Add a compatible access log format.
Improve error handling in deploy scripts.
Eventually support rollbacks, because I'm not always this lucky.
Build Nimbus to watch all of this for me.

"Every outage is free education, assuming you take notes."

The services survived and I got a checklist out of it. I'll call that a win.