James Westbury

“It’s okay,” I thought to myself. “I have backups. It’s fine.” But it wasn’t fine.

I was a one-man shop - the whole IT department. That wasn’t unreasonable, because the business only had about 30 employees, only one of whom was a software developer, and only one of whom was a systems administrator. Fortunately, these two were not the same person, as they might have been at some other companies.

You have no doubt guessed that I was the systems administrator. I was not, strictly speaking, qualified for this role. My education had been in English literature; I had graduated in 2008, right into the Great Recession (what a name, that). Jobs were non-existent, so I thought I’d just stay at my college job for a bit while I found something new. It wasn’t a bad job, selling computers and cameras. I mean, I was bad at the job - I hated the upsell, and I got a stern talking-to from my manager once when I told a customer we didn’t have the high-gain wifi antenna he needed but that if he popped over to Rite-Aid he could buy a Pringles can and make one. I even showed him a website walking him through it. Apparently, the right approach would have been to tell him that the highest-power antenna we had was exactly what he needed. Still, the job itself wasn’t terrible - I got to help people choose the right technology, which could certainly be rewarding.

Ah, well - the point here is that I was working at Circuit City, which was not long for this world. We closed up shop in early 2009, and I found myself unemployed, a state I remained in for the next year, before I put my college extracurriculars - mostly, LAN parties - to work and landed a job doing tech support on point-of-sale systems. (“POS” systems, you may know, are aptly named; or, at least, aptly acronymed.) That was 2010, and by the end of 2012 I’d become the “IT guy” largely by default: Our previous “IT guy” had been assigned to a special project for a big contract we’d won, and wasn’t able to keep up with his on-site duties.

We had no budget, which makes for some fantastic learning opportunities. All our infrastructure ran on pizza boxes. Our phone system was a real, old-school PBX. Our file storage was a Linux box that a coworker had built a few years earlier. The POS systems we shipped out were imaged one-by-one using an old Windows PE disc with a basic imaging solution on it. Yikes.

I set to work immediately. Within a year, we’d bought two half-decent servers; I’d started an Active Directory rollout, leveraging our status as a Microsoft Dynamics reseller to secure us some free and/or cheap licenses. I’d replaced our imaging system with a new PXE-boot solution, and built a hacky restoration partition by cobbling together some Perl scripts and the ntfstools package.

I’d even virtualized most of our infrastructure with those new servers, using Hyper-V, and I’d learned some PowerShell so that I could write some management tools. One of the tools was a neat little backup utility. Because half of our VMs were running Linux, we couldn’t use VSS consistently; at the very least, we couldn’t use it while some of our VMs were running. So, the backup tool would shut down a VM, back up its volumes, and start it back up, logging its success.
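For a sense of what that looked like, here’s a minimal sketch - a reconstruction, not the original script. It assumes the Hyper-V PowerShell module; the NAS path and log file are made up, and Export-VM stands in for however the volume copy actually worked:

```powershell
# A minimal reconstruction of the backup tool, not the original script.
# Assumes the Hyper-V PowerShell module; the NAS path and log file are hypothetical.
$backupRoot = '\\nas\backups'

foreach ($vm in Get-VM) {
    $dest = Join-Path $backupRoot $vm.Name

    # Clear the previous export so Export-VM has a clean target
    # (the NAS itself was periodically copied off-site).
    if (Test-Path $dest) { Remove-Item -Path $dest -Recurse -Force }

    Stop-VM -Name $vm.Name                  # cold backup: sidesteps VSS entirely
    Export-VM -Name $vm.Name -Path $dest    # exports config, disks, and snapshots
    Start-VM -Name $vm.Name

    "$(Get-Date -Format s) backed up $($vm.Name)" |
        Add-Content -Path (Join-Path $backupRoot 'backup.log')
}
```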

Now, at this point, I think it may be useful to remind you that being a knowledge sponge only goes so far. When there’s nobody to absorb knowledge from, you make mistakes, but don’t realize you’re making them.

The thing is, I did notice my mistake: Sometimes, the backups didn’t happen. I wasn’t sure why, but even at this early stage of my career, I knew that the first step to dealing with this wasn’t actually solving the problem, but making the problem visible. So, I wrote a “backup checker.” I didn’t have time to actually try restoring all my backups, and this was the next best thing. My latest tool would traverse our NAS and look at the actual backup files. Had the directory been updated as expected? Great! And if it hadn’t, the new Nagios install I’d rolled out would receive a metric and I’d get alerted.
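Reconstructed from memory, the checker was roughly this shape (the NAS path is made up, and Submit-NagiosResult is a hypothetical stand-in for however the result actually reached Nagios):

```powershell
# Roughly the shape of the backup checker - a reconstruction, not the original.
# It checks whether each per-VM backup directory was touched recently.
$backupRoot = '\\nas\backups'
$cutoff = (Get-Date).AddDays(-1)

foreach ($dir in Get-ChildItem -Path $backupRoot -Directory) {
    # 0 = OK, 2 = CRITICAL, in Nagios terms
    $status = if ($dir.LastWriteTime -gt $cutoff) { 0 } else { 2 }

    # Hypothetical helper; the real setup pushed a result to the new Nagios install.
    Submit-NagiosResult -HostName $dir.Name -Service 'backup' -ReturnCode $status
}
```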

You may have spotted a potential problem here, but let me assure you, the NAS had RAID, and was occasionally backed up off-site. It did not fail. You may have spotted another problem, but keep it to yourself for the moment - we’ll be there soon. In the meantime, there are a few other wrinkles to this story. Let’s address them.

First, we had a peculiar Hyper-V behavior manifesting. Sometimes, after restarting the hypervisor, one or more VMs would just… not exist. Well, the management interface claimed they didn’t, at least. The volumes and manifests were still there, and I could easily remedy this by just re-adding the VM from its virtual hard drive.
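In the GUI that was a couple of clicks. The rough PowerShell equivalent (name and path made up) shows the sharp edge more clearly - the VM gets whichever file you point it at:

```powershell
# Re-register a "lost" VM from an existing virtual hard drive (name and path hypothetical).
# Whatever file you hand to -VHDPath becomes the VM's disk, base image or snapshot.
New-VM -Name 'QuickBooks' -MemoryStartupBytes 2GB -VHDPath 'D:\VMs\QuickBooks\QuickBooks.vhd'
```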

Second, Hyper-V volumes aren’t a single file - they’re stored as a base image plus a series of incremental snapshots. (Maybe this is no longer the case. I don’t know - I haven’t used Hyper-V in many years!)
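If memory serves, you could even walk that chain from PowerShell, since each differencing disk records its parent - something like this sketch (the leaf path is made up):

```powershell
# Walk a differencing-disk chain from the newest snapshot back to the base image.
# Requires the Hyper-V PowerShell module; the example leaf path is hypothetical.
function Get-VhdChain {
    param([Parameter(Mandatory)][string]$LeafPath)

    $current = $LeafPath
    while ($current) {
        $vhd = Get-VHD -Path $current    # resolves the VHD/VHDX/AVHDX metadata
        $vhd.Path                        # emit this link in the chain
        $current = $vhd.ParentPath       # empty once we reach the base image
    }
}

Get-VhdChain -LeafPath 'D:\VMs\QuickBooks\Snapshots\latest.avhdx'
```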

If you thought you spotted the right problem before, you’ll be sure of it now.

One morning, I got a message from our accountant. “The QuickBooks machine is gone.” We had a QuickBooks VM she used. I opened up the management interface, and… yep, it’s gone. Right, I know the process for this one. Open the right dialog, double-click the volume, start up the VM. Sweet. All good. “I restored it, you should be good to go!”

A couple of hours later, I got another message. My stomach dropped straight through the floor. “Hey, uh, we don’t seem to have any orders from the past three months. Do you know what’s going on?” Panic. Panic panic panic. I don’t know what’s going on. More panic.

Wait - yes, I have an idea. I must have restored the base, not the correct snapshot. Okay. Okay. Fine. Let’s just restore the latest snapshot, and… crap. I’ve modified the base volume. I’ve broken the chain. Lindsey Buckingham would not be happy. Nor would our accountant.

“I think there’s an issue with it, let me restore last night’s backup.” Open the backup directory.

It’s empty.

But… the directory modified time was just last night? This is when I learned that a failed backup would still update the directory modified time. The most recent backup I actually had was several weeks earlier. This would work, but I didn’t want our accountant to have to re-enter weeks’ worth of data from paper copies.

“Hey, there’s an issue with our backups. I can definitely get you data from a couple of weeks ago, but let me see what I can do.”

I did some quick Googling. I learned about force-merging snapshots, so I made some quick backups, then gave that a shot… and now the VM wouldn’t boot at all. Still more panic.

In the end, I did manage to save the day. I mounted the force-merged volume on a Linux VM, and was able to track down all of our QuickBooks data files. We lost a couple of hours of data, in the end. At a bigger company, that would be a business-ending event; for us, it was a few extra minutes of work for the accountant.

I fixed the backup checker the next day.


Back then, working for a tiny company, I didn’t do post-mortems, so I can’t tell you what would have come of one. But as I peer back into the past, I can see some excellent lessons. Let’s consider a few, in no particular order.

  • Avoid proxy metrics where possible. If I had validated anything other than the directory timestamp - backup size, file count, file size, even file names - I likely would have been fine. (There’s a sketch of a better check after this list.)
  • Know your tech. I had a loose understanding of VHD and VHDX files. If I had a stronger understanding, I might have taken more care in how I restored the VM.
  • Automate even the manual work. We often treat automation as valuable only when it saves time, but that’s not the whole story. Much of its value comes from having reviewable, repeatable processes. Even if you aren’t going to repeat a process, script it and get it reviewed if at all possible - it prevents manual error. If I had written a small script that performed the VM restoration for me, I wouldn’t have mistakenly restored from the wrong file, because the script would have chosen the right one.
  • Absence of errors isn’t proof of success. The backup script was “working,” and the backup checker was “validating,” but something was still wrong. Sometimes, systems fail in ways that don’t manifest in errors, user-facing or otherwise.
  • Mitigation isn’t a fix. Just because you have a mitigation in place - whether it’s a patch, a manual process, or something else - doesn’t mean you’ve fixed the problem, and the mitigation may have its own rough edges or failure modes that compound the original problem. If I had been able to stop the VMs from being “lost” by the control plane, I would never have had the opportunity to make this mistake.
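To make that first point concrete, here’s roughly what the checker should have been validating - the backup files themselves, not a directory timestamp. Same made-up NAS path and hypothetical Nagios helper as before:

```powershell
# What the checker should have looked at: the backup files themselves.
# Same hypothetical NAS path and Nagios helper as in the earlier sketch.
$backupRoot = '\\nas\backups'
$cutoff = (Get-Date).AddDays(-1)

foreach ($dir in Get-ChildItem -Path $backupRoot -Directory) {
    $disks = Get-ChildItem -Path $dir.FullName -Recurse -File `
        -Include '*.vhd','*.avhd','*.vhdx','*.avhdx'

    $fresh  = @($disks | Where-Object { $_.LastWriteTime -gt $cutoff })
    $sizeOk = ($disks | Measure-Object -Property Length -Sum).Sum -gt 0

    # OK only if recent disk files exist and they actually contain data.
    $status = if ($fresh.Count -gt 0 -and $sizeOk) { 0 } else { 2 }
    Submit-NagiosResult -HostName $dir.Name -Service 'backup' -ReturnCode $status
}
```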

Of course, making mistakes is one of the best ways to learn. Most of my best interview stories are of mistakes I’ve made over the years. I’ve gotten both hired and promoted because of mistakes I’ve made, and the ways I’ve dealt with them - taking ownership of the failure, and taking ownership of the solutions.

This is the first in what I expect to be a short series of posts about the mistakes I’ve made throughout my career. My intent is that others can learn from them; if not, then I hope you can at least enjoy them.