Deployment Recovery: Fixing Broken systemd Services
A logging update accidentally toppled every deployment service. This is the breadcrumb trail I followed to stand them back up.
The Problem
da ran, nothing deployed, and systemd threw error codes like confetti. All services failed at once—impressive, if nothing else.
Root Causes
- Unsupported Caddy log format. Apparently I'm better at optimism than version checking.
- Duplicate site configs. Past-me forgot to clean up an old file and present-me paid for it.
- Overzealous service restarts. Restarting journald confused Caddy and nobody was happy.
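The duplicate-config problem is the easiest of the three to catch mechanically. Here is a rough sketch of how I might check for it, not a real Caddyfile parser: it treats any line at brace depth zero that ends in "{" as the start of a site block and reports addresses that show up in more than one file. The function names and the example.com hostnames are made up for illustration; only weblog-gavin.caddyfile comes from the actual incident.

```python
from collections import defaultdict


def site_addresses(caddyfile_text):
    """Return the top-level site addresses found in one Caddyfile's text.

    Heuristic only: a line at brace depth 0 that ends with "{" is treated
    as opening a site block. A bare "{" (global options) and "(name) {"
    (snippets) are skipped.
    """
    addresses = []
    depth = 0
    for raw in caddyfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if depth == 0 and line.endswith("{"):
            head = line[:-1].strip()
            if head and not head.startswith("("):
                # Addresses may be separated by commas and/or spaces.
                addresses.extend(head.replace(",", " ").split())
        depth += line.count("{") - line.count("}")
    return addresses


def duplicate_sites(configs):
    """Map each address defined in more than one file to those file names.

    `configs` is a dict of file name -> file contents.
    """
    seen = defaultdict(list)
    for name, text in configs.items():
        for address in set(site_addresses(text)):
            seen[address].append(name)
    return {a: sorted(names) for a, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical contents; in practice these would be read from disk.
    configs = {
        "weblog.caddyfile": "weblog.example.com {\n    file_server\n}\n",
        "weblog-gavin.caddyfile": "weblog.example.com {\n    file_server\n}\n",
    }
    for address, files in duplicate_sites(configs).items():
        print(f"{address} is defined in: {', '.join(files)}")
```

Running something like this before a deploy would have pointed straight at the leftover file instead of leaving me to find it in the journal.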
Debugging Tactics
- Read the journal before panicking.
- Test components individually: Caddy first, then each deploy script.
- Apply one fix at a time and verify before moving on. Future-me thanked me.
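The "one thing at a time" habit can be sketched as a tiny runner that executes checks in order and stops at the first failure, so a later failure never hides an earlier one. This is an illustration rather than the script I actually used. The paths and the deploy script name in STEPS are placeholders, and it assumes the unit is called caddy. The commands themselves (journalctl -u, caddy validate --config, bash -n) are standard, with bash -n only checking a script's syntax without running it.

```python
import subprocess


def run_steps(steps):
    """Run (label, argv) steps in order and stop at the first failure.

    Returns a list of (label, returncode, output) for each step that ran.
    A missing executable is reported as return code 127.
    """
    results = []
    for label, argv in steps:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
            code, output = proc.returncode, proc.stdout + proc.stderr
        except FileNotFoundError as exc:
            code, output = 127, str(exc)
        results.append((label, code, output))
        if code != 0:
            break  # fix this step before looking at anything later
    return results


# Hypothetical steps: journal first, then Caddy, then each deploy script.
STEPS = [
    ("journal", ["journalctl", "-u", "caddy", "-n", "50", "--no-pager"]),
    ("caddy config", ["caddy", "validate", "--config", "/etc/caddy/Caddyfile"]),
    ("deploy script", ["bash", "-n", "/opt/deploy/deploy-weblog.sh"]),
]

if __name__ == "__main__":
    for label, code, output in run_steps(STEPS):
        status = "ok" if code == 0 else f"FAILED ({code})"
        print(f"[{status}] {label}")
        if code != 0:
            print(output)
```

Stopping at the first failure is deliberate: when everything is broken at once, the earliest error is usually the one worth reading.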
Recovery Checklist
- Remove the fancy log format and let Caddy stick with JSON.
- Delete the stray weblog-gavin.caddyfile everywhere.
- Restart only Caddy during deploys.
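"Everywhere" is the part I would not trust myself to do by hand. A minimal sketch of the hunt, assuming the stray copies all share the same file name: it only lists matches and leaves the deleting to a human. The function name and the directories in the usage example are hypothetical; the real locations depend on how the server is laid out.

```python
from pathlib import Path


def find_stray_configs(roots, filename="weblog-gavin.caddyfile"):
    """Return every file under the given roots whose name equals `filename`.

    Roots that do not exist are skipped. Nothing is deleted here.
    """
    hits = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            hits.extend(p for p in root.rglob(filename) if p.is_file())
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical search roots; adjust to wherever configs actually live.
    for path in find_stray_configs(["/etc/caddy", "/opt/deploy"]):
        print(path)
```

Once the list is empty, the only restart a deploy should need is Caddy itself, for example `systemctl restart caddy` if the unit carries that name.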
Status
✅ Services recovered. Sites respond, journal behaves, nothing tries to eat the disk.
Lessons Learned
- Check version support before copy-pasting config blocks.
- When refactoring, delete the corpse of the old config.
- Restart the smallest thing that could possibly work.
Next Steps
- Add a compatible access log format.
- Improve error handling in deploy scripts.
- Eventually support rollbacks, because I'm not always this lucky.
- Build Nimbus to watch all of this for me.
"Every outage is free education, assuming you take notes."
The services survived and I got a checklist out of it. I'll call that a win.