The last post ended on a promise: the rebuild got tested sooner than I wanted. Here’s how.
I’d just finished rebuilding everything. New architecture, new docs, the agent workflow I’d spent weeks getting right. The server was back up, the VMs were running, and for the first time in a month the homelab looked boring again. Boring is the goal.
It lasted a day.
Three of five
The next afternoon, the new server’s disk array started throwing errors. Then more. Within a few hours, three of the five SSDs in the array were dead.
Three of five. A RAID survives losing one disk, sometimes two depending on the layout. It does not survive three. The array was gone, and the server with it.
The disks were almost certainly collateral from the original incident — the same heat and electrical stress that killed the first server had been sitting latent in these, waiting. And because of the hardware market at the time, I couldn’t just order replacements and have them next day. I was down a server again, with no quick way to buy myself out of it.
A month earlier, this would have broken me. It didn’t — and the reason why is the whole point of this series.
The difference between days and hours
The first disaster took days and nearly ended my IT career. This one took hours. Same severity — arguably worse, because I couldn’t source parts. What changed was everything sitting behind the failure.
By then I had three layers, and none of them depended on the others surviving:
1. Backups — the boring old kind. I’d stood up PBS (Proxmox Backup Server) during the rebuild, but here’s the catch: I hadn’t set it up for the host that died yet. The shiny new backup system didn’t cover the machine that actually failed. What did exist were old-style, cron-driven gzip dumps of my docker volumes — the unglamorous backup I’d been running for years — sitting redundantly on the NAS and in the cloud. The backup nobody brags about was the one that saved me.
2. Infrastructure as code — ten years deep. What did hold was the IaC. A decade of scripts, rebuilt from scratch after the first disaster into the agent-readable form I wrote about earlier. Every machine, every service, every mount and firewall rule: described, versioned, reproducible. A dead server stops being a catastrophe when the server is really a document you can re-run.
3. The workflow. And this was the genuinely new part: I didn’t face the recovery alone, and I didn’t face it as a panicked human typing commands into the dark. I told the agent what had happened. Reading the IaC and the documentation, it worked out where each piece belonged and flagged what was missing astonishingly fast — which dump went where, what the new backup hadn’t caught. We agreed on a plan, and then it did the heavy lifting while I reviewed and steered.
The salvage rack
There was no clean replacement hardware, so I improvised, the same way I had the first time: a gaming PC, an old HP all-in-one, every old laptop I could dig out. A rack of mismatched machines, held together by the fact that what ran on them was fully described and didn’t much care what it ran on.
The first time I did this, it was frantic and it took days — because the knowledge of how everything fit together lived in my head and nowhere else. This time the knowledge lived in the repo. I told the agent the situation, we made a plan, and we restored everything together. In hours, not days.
What actually saved me — and it wasn’t the AI
It would be easy to write this up as “AI saved my infrastructure.” It didn’t, and I want to be precise about that, because the precise version is the one that matters.
The AI was fast hands on a plan I directed, reading from documentation I had disciplined into shape, working out where the old volume dumps belonged and what the new backup hadn’t caught. Pull any one of those layers and the recovery gets slower or fails outright. The agent without the docs is a confident stranger. The docs without the dumps lose data. The dumps without the IaC restore you onto a machine you no longer understand.
The thing that turned a career-ending disaster into a long, annoying afternoon was defence in depth — boring, layered, deliberate engineering. The AI made it faster. It did not make it work.
That’s the distinction I’d want any engineer — or anyone nervous about how I work — to take from this. I don’t hand my infrastructure to an AI and hope. I build a system that survives failure on its own merits, and then I let the AI move fast inside it.
Next: the architecture that came out of all this — why every agent in my setup now gets its own machine, and how the whole thing is laid out to fail safely.
Part of The 2026 Rebuild.
