James Westbury

Have you ever had to rotate the certificates for a root authority? It’s not what you’d describe as a “fun time” - unlike your usual certificate rotation, a root CA rotation means ensuring all of your clients also trust the new root authority. If you have been oh so careful - careful beyond what might be expected of any human, let alone any group of humans - you might be able to simply update the system trust stores on all of your hosts. If you’re using Windows and Active Directory, all the better.

But, we should be honest with ourselves: Nobody is this careful. There’s always a system - always some service, some application - which leverages its own trust store. Consider, for instance, that random Python script that somebody wrote as a one-off, which slowly and silently became part of your foundational infrastructure. Consider the team who decided they wanted to use Java for their project, and defines their keystore in some config file that’s checked into a repo your networking and infrastructure security teams don’t have read access to. Consider any of the myriad ways someone can break your assumptions.

Consider, perhaps, wrapping a Python-derived CLI with a Java app. Such was the application I was concerned by. This application - name omitted to protect the innocent (and the guilty) - was what I learned to call a “canary” when I worked at AWS. Now, AWS uses a different definition of “canary” than the one that has become standard in the industry: an application which emulates user behavior, calling endpoints and validating that they behave as expected. This particular canary was entirely undocumented, and was written by an engineer who had long since left the company. I was on the team handling the certificate rotation, not the team who owned the canary, but - for complex organizational reasons - I was on the hook to get it updated.

At the outset, the team who owned this canary told me it was a Java application. “Well,” I thought, “I know how to update the Java keystore.” This tool was packaged up as an AMI - an EC2 image - which was bootstrapped at runtime. To get the new trust store installed, I just had to spin up a fresh base image, install the new CA cert to the location the Java app was already configured to use, and use this as the input for our image build system. No problem. Updated. The canary, annoyingly, was stateless - but built in such a way that only a single copy could run at a time. And did I mention there was no staging environment? Well, no matter - this is a simple change, let’s just YOLO it into production.

BEEP BEEP BEEP. BEEP BEEP BEEP. BEEP BEEP BEEP.

Ah. That’s a pager. You’ve heard one before? This was back in the physical pager days - the real deal.

Apparently the service the canary is testing is down. Or, well… the canary says it is. I’m being paged into an incident, where they’re already deciding how to communicate the outage to our customers. “Sorry, folks, the service is still up. This is just a bad canary deployment. Rolling back now. Give it five minutes.”

And, yeah - five minutes later, the alert resolves on its own.

This is when I knew I had to dig deeper. I tracked down the application code - fortunately, I did have read access - and discovered that this was, as we previously discussed, just a Java shim around a Python-based CLI. The JVM keystore was not in use at all - the real culprit was wherever the requests library was looking. Well, that’s another easy one to solve. Just push the new cert to the system store, and update REQUESTS_CA_BUNDLE in the userdata for the instance.
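
For anyone who hasn’t chased this down before: requests doesn’t look at the JVM keystore at all, and it doesn’t necessarily use the OS trust store either. Unless you pass verify= explicitly, it falls back to the REQUESTS_CA_BUNDLE (or CURL_CA_BUNDLE) environment variable, and only then to the certificates bundled with certifi. Here’s a minimal sketch of the idea - the bundle path and URL are placeholders, not the actual canary’s configuration:

```python
import os

import requests

# Placeholder path and URL for illustration; the real canary's config differed.
# requests picks its CA bundle roughly in this order:
#   1. an explicit verify= argument,
#   2. the REQUESTS_CA_BUNDLE (or CURL_CA_BUNDLE) environment variable,
#   3. the certifi bundle shipped with the library.
os.environ["REQUESTS_CA_BUNDLE"] = "/etc/pki/tls/certs/new-root-bundle.pem"

# With the variable exported (e.g. in the instance userdata), every verified
# request in this process will trust certs chained to the new root CA.
resp = requests.get("https://service-under-test.example.com/health")
resp.raise_for_status()
```

Exporting the variable in the userdata before the canary process starts has the same effect, which is roughly what the fix amounted to.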

But I knew better this time. Don’t deploy straight to production. I had my own AWS test account, so I spun up a fresh test environment where I could validate the behavior without interfering with the rest of our infrastructure. This included a mock copy of the service under test, which my new test image would call. So, I grabbed our existing base image, updated the environment variables, yada yada yada. I started up the canary.

BEEP BEEP BEEP. BEEP BEEP BEEP. BEEP BEEP BEEP.

Sigh. Paged into an incident again. But this is odd - how could my test be affecting the production canary? It can’t. I listened in for a bit, until a tenured engineer said, “This smells like we have two copies of the canary running.” I piped up. “Uh… I have a copy running in my test account, but it’s pointed to a mock service…?”

As you can surely guess, my test was not, in fact, pointed to a mock service. As it turns out, the person who built this base image made a perplexing - confounding, really - decision. They baked the credentials and endpoints into it. The bootstrapping process wasn’t overwriting these, so my test canary was impersonating the production canary, calling the production endpoints. Oh boy.

“Okay, I’m shutting down my test. Just a minute.” Lo and behold, the alerts resolved on their own a few minutes later.

On the bright side, we’d successfully validated this new approach to deploying the CA cert. I rebuilt from a base Amazon Linux image - ensuring that we didn’t bake the credentials and endpoints in! - and our next deployment was smooth sailing. I volunteered to write the post-mortem, and also to build out a staging environment for this canary, writing up extensive documentation on the canary in the process - and detailing how to build out a new environment, in case we ever needed regional versions, etc.

I’ve worked on a few root CA rotations since then, and I’ve always maintained an almost unhealthy level of caution, largely because of my experience with this canary when I was a junior engineer. As ever, it’s the mistakes that teach you the most.

Postscript: Four months later, we were entering our yearly review cycle, and my manager suggested it might be time to put me up for promotion. We worked together to prepare a promotion document, and a major component of it was the post-mortem I had written, along with the work I had done to mitigate future impact. In the years since, this has become a cornerstone of the stories I tell during job interviews: I made some unfortunate mistakes with high visibility at fairly senior levels, but owning the mistakes led to some of my earliest career growth. It’s a story I also tell the junior engineers I work with now, not only to teach them that mistakes are okay, but that the most important thing is how we react once we make them.