
Put applications and data through the same workflow: the CrowdStrike incident

Sunday, July 21, 2024

I like to think I keep on top of Hacker News, and that I know the world is on fire before most of my nontechnical peers do. So imagine my surprise when my mother sent me this text message at 8:27am:

What do you know about this big computer outage. Does it effect you?

I immediately checked my usual sources - Boston DevOps, the AWS health portal, the Azure health portal, the GCP health portal - and none of the major suspects were reporting anything wrong. So, on to Hacker News, where a thread, CrowdStrike Update: Windows Bluescreen and Boot Loops, already had over three hundred replies. By the time I sat down to begin writing this blog post, it had over three thousand replies, and all three major cloud providers’ status pages had information about how to recover from the CrowdStrike incident, in spite of it not being remotely their incident to manage.

What I should have realized was that my mother was looking at human factors, not engineer-y root causes. Most humans don’t go to Hacker News, or to the cloud providers, or to DownDetector. They’re merely trying to live their lives, with the tools that the software industry has provided them. They want to book a plane ticket, get medical care, or pay for parking.

My initial (incorrect) thoughts

CrowdStrike must have released a buggy kernel driver - right? A new package got deployed, and some auto-updater took things from there, for any clients that weren’t fortunate enough to have version-pinned the agent. This was eminently preventable by IT teams, but since most of them don’t have the resources to test every single update on its way out, most simply follow the latest revision.

In short, admins didn’t take steps to guarantee the compatibility of new software in a mission-critical environment, and they’re just as culpable as CrowdStrike themselves.

My understanding now

Disclaimer: I do not claim to have anywhere near the full picture of what happened; rather, I am drawing both on what I’ve read on X / Twitter and other sources, and on practical industry experience.

CrowdStrike, much like other vendors, distributes both an application (the CrowdStrike agent) and data (threat definitions). For a more approachable metaphor, think of old-school antivirus software from the 90s and 2000s: you purchased a copy of McAfee VirusScan (or your computer came with it), but you periodically connected to the Internet and downloaded new virus definitions to keep it up to date.

At some point, it becomes beneficial to break these two distributables up into their own pipelines. Users don’t expect to have to upgrade their software every few hours as new threats are discovered. At the same time, they need definition updates continuously, as threat actors change their tune - otherwise, the value of the application declines with every hour that it can’t detect an intrusion.

Most likely, code and data now take different paths to humans. They have different pipelines to prod, and one of those pipelines is optimized for correctness, whereas the other is optimized for speed of delivery.
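
To make that split concrete, here’s a minimal sketch in Python - the stage names and timings are entirely invented, not CrowdStrike’s actual process - of how the two paths tend to look once they’ve diverged:

    # Hypothetical sketch of the two divergent release paths. None of these
    # stage names come from CrowdStrike; the point is the shape of the split:
    # the application path is gated on correctness, the data path on speed.

    APPLICATION_PIPELINE = [
        "unit and integration tests (minutes)",
        "VM regression on Windows, macOS, and Linux (hours)",
        "staged rollout: canary -> early adopters -> everyone (days)",
    ]

    DATA_PIPELINE = [
        "validate file format (seconds)",
        "push to every endpoint, everywhere (minutes)",
    ]

    def release(artifact: str, pipeline: list[str]) -> None:
        for stage in pipeline:
            print(f"{artifact}: {stage}")

    if __name__ == "__main__":
        release("agent build", APPLICATION_PIPELINE)
        release("threat definition update", DATA_PIPELINE)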

My thoughts

I still have an old Opscode Chef poster from Velocity 2015 that talks about this brand new DevOps thing. What are some things you want to do? Well, among many others, you want to put your applications and infrastructure through the same workflow. At this point, this is a time-tested adage: our CI/CD pipelines are largely just that - CI/CD pipelines - and not separate CI and CD pipelines.

Applications and infrastructure through the same workflow

I’ll extend this nine-year-old poster to say: if your data is a distributable artifact, put applications and data through the same pipeline.

I imagine that the CrowdStrike agent has a credible test pipeline. Maybe it launches Windows, macOS, and Linux VMs and runs tests against each of the three major OSes it supports. Did the agent fail to pick up a known threat? Fail the tests. Did the agent fail to install or update? Fail the pipeline. Did the VM not even boot up? Big fail the pipeline.
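
If I were to sketch that gate - and this is pure speculation, with made-up names, not CrowdStrike’s real harness - it might be a handful of pass/fail checks run against a freshly provisioned VM, ordered so that the worst failure stops everything:

    # Speculative sketch of VM-level pipeline checks. The AgentTestRun class
    # and its fields are invented for illustration; a real harness would drive
    # real Windows, macOS, and Linux VMs.

    from dataclasses import dataclass

    @dataclass
    class AgentTestRun:
        vm_booted: bool              # did the VM come back up after the change?
        agent_installed: bool        # did the agent install or update cleanly?
        known_threat_detected: bool  # did it flag a known-bad test sample?

    def gate(run: AgentTestRun) -> None:
        # Ordered from worst failure to least: a VM that never boots should
        # stop the release before anything else is even considered.
        if not run.vm_booted:
            raise SystemExit("BIG FAIL: the VM did not boot - do not ship")
        if not run.agent_installed:
            raise SystemExit("FAIL: the agent failed to install or update")
        if not run.known_threat_detected:
            raise SystemExit("FAIL: the agent missed a known threat")
        print("PASS: safe to promote")

    if __name__ == "__main__":
        gate(AgentTestRun(vm_booted=True, agent_installed=True,
                          known_threat_detected=True))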

But, remember that the entire incentive structure behind separating the code from the data is that the data - those threat definitions - need to release faster than the application. Wouldn’t sending the data through the same workflow as the application just slow down those updates?

Maybe.

I feel that this is a problem that can be architected around. There are numerous good reasons to have a smoke test suite against an application, just as there are many reasons to maintain high test coverage in its long-running regression suite. Should the application undergo a full regression test before release? Certainly. Should a data update undergo the same scrutiny? Probably not.

Should a data update undergo a reasonable application-level smoke test? Almost definitely.
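
Here’s a minimal sketch of what that routing could look like, again with invented names: the smoke suite is shared by every artifact, and only the application pays for the long regression run.

    # Hypothetical release gate: both artifact types share the smoke suite;
    # only the application additionally runs the long regression suite.

    def smoke_suite(artifact: str) -> None:
        # Fast, application-level checks: apply the change to a VM, confirm
        # the machine still boots, confirm the agent still loads and detects
        # a known test sample. Minutes, not hours.
        print(f"{artifact}: smoke suite passed")

    def regression_suite(artifact: str) -> None:
        # Slow, exhaustive checks across OS versions and configurations. Hours.
        print(f"{artifact}: full regression suite passed")

    def release(artifact: str, kind: str) -> None:
        smoke_suite(artifact)              # every artifact pays this cost
        if kind == "application":
            regression_suite(artifact)     # only the application pays this one
        print(f"{artifact}: shipped")

    if __name__ == "__main__":
        release("agent build", kind="application")
        release("threat definition update", kind="data")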

This will slow down release of the data, for sure! And my ask here is that anyone in this position think, again, like my mother: consider the end result - not the technology.

The technology insists we must speed up the pipeline - faster tests mean a quicker pipeline to customers. An easy way to speed up a smoke test is to move some heavier tests to regression. As an industry, we have many tactics like this - shortening test cycles by eliminating valuable tests, eliminating VM-based tests to reduce cloud costs, doubling our story point estimates to increase velocity.

I instead ask that we consider human scales. What’s the value of an extra hour when dealing with threat actors? Quite high! What’s the value of ensuring that you don’t brick your clients’ PCs? Well… if you get that wrong, you may find that your company has lost a quarter of its value overnight, with potentially more to be lost as the dust settles and your customers - slowly - find alternatives due to your now-tarnished reputation.

Misarguments

With all that’s gone on, I’ve heard a number of statements that seem to miss the point. I’ve distilled, and possibly also dramatized, them here.

It’s Microsoft’s fault for making Windows so buggy!

Let’s be realistic: it’s not 2000 anymore. Trivial attacks like Code Red and Slammer are by and large no longer possible, and if they are, they’re almost always the result of either state-sponsored actors searching for or planting exploits, or genuine oversight and thankless maintainership.

The crash in question from CrowdStrike was almost certainly the result of a non-memory-safe language being used in the kernel to parse data that came in from user space. The eBPF folks had that concern, too, and developed an elegant solution to avoid the problem. Meanwhile, the Linux kernel is opening the door for Rust. These problems aren’t unknown, they aren’t unexpected, and they do have solutions.
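
To illustrate the class of bug - not CrowdStrike’s actual code, which I haven’t seen, and in Python rather than the C or C++ a kernel driver would actually be written in - imagine a made-up definitions format whose header claims how many entries follow. A parser that trusts that header reads past the end of the buffer when the header lies; a parser that checks first can simply reject the file.

    # Illustrative only: an invented "definitions" format where the first byte
    # claims how many 4-byte entries follow. A malformed file can claim more
    # entries than the buffer actually holds.

    def parse_trusting(buf: bytes) -> list[bytes]:
        count = buf[0]  # header byte: number of 4-byte entries that follow
        # Trusts the header. If count lies, the slices run past the end of the
        # data: in Python they silently come back short or empty, but the same
        # arithmetic in a C kernel driver is an out-of-bounds read and a crash.
        return [buf[1 + i * 4 : 5 + i * 4] for i in range(count)]

    def parse_checked(buf: bytes) -> list[bytes]:
        count = buf[0]
        if len(buf) < 1 + count * 4:
            raise ValueError("header claims more entries than the file contains")
        return [buf[1 + i * 4 : 5 + i * 4] for i in range(count)]

    if __name__ == "__main__":
        malformed = bytes([200]) + b"\x00" * 8   # claims 200 entries, holds 2
        print(len(parse_trusting(malformed)))    # 200 "entries", mostly empty
        try:
            parse_checked(malformed)
        except ValueError as err:
            print(f"rejected: {err}")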

And I’m sure of this: if you had hired me to write a kernel driver, I’d have done the same thing. I’d have brought down your users’ kernels with an out-of-bounds error.

This is not Microsoft’s problem. This is CrowdStrike’s problem. It happened, by unlucky coincidence, to pose a problem for Windows users. Unless you fault Microsoft for not going full-steam-ahead on a memory-safe language for drivers, this is not Microsoft’s problem.

Why does everybody only use Windows? Why is Windows a monoculture?

I am not a Microsoft sympathizer, by any means. My home has been a bastion of Linux and BSD (and the one-off Mac provided by my employer) for eight years. I specifically gave up writing C# - a language which I think has a number of great characteristics - for Java, Ruby, and Python. And I’m an avid Firefox user, if not out of some long-held allegiance to Netscape and its progeny, then at least because I despise the Chrome monoculture as much as I did the Internet Explorer 6 monoculture some 20 years ago.

At the same time, I begrudge no one taking a dependency on Windows.

Why would I? Windows is a commercially-supported operating system, developed by a software powerhouse, with some of the longest support cycles in the industry. It has the network effects of hardware manufacturers putting the most effort into writing drivers for it. It has a single, unified set of certifications for how to administer and develop for it, that software development houses can hire for. It has fully integrated offerings to get small firms started, and at the same time has composable, automatable tools just like its Linux-based competitors. All said, nobody ever gets fired for buying ~IBM~ Microsoft.

Would I deploy the same software - willingly - to both Windows and Linux? No. Would I design a system that uses both Windows and Linux concurrently? No. Would I deploy both Windows and Linux at the edge of my corporate infrastructure? Hell no.

Choose the best-in-class solution for the job. If that solution is Linux, I’m happy, and I’ll consider working for you! If that solution is Windows - which it probably is - so be it. I need to put food on the table, too.

Why does everybody only use CrowdStrike?

Here we’ve hit the greatest question of all, and the answer is quite similar to why everybody only uses Windows. Once there’s an established market player, the reasons to deviate are few and far between. Maybe they’re features, maybe they’re cost, but again, nobody will be fired for buying ~IBM~ CrowdStrike.

CrowdStrike, a defensive system, shares some superficial commonalities with Back Orifice, an offensive system designed close to thirty years ago. The key difference is that one is deployed by “hackers”, whereas the other is part of a comprehensive, holistic strategy approved by a CISO.

At the end of the day, though, this software exists to check a box: a node in the network is protected and audited. Frameworks like PCI, HIPAA, and the full list that CrowdStrike claims to have solutions for might be in scope - sometimes in combination. Why not buy into an industry-standard tool that hits on all of them?

And yet, engineering teams may push back. “Nodes, not pods, need endpoint security!” and “Shift left - scan containers at build time, or run SAST / DAST tools as part of your pipeline!” are common refrains here. Sadly, none of these technologies yet fit into regulatory frameworks the same way that running applications do.

Anecdata

One of my personal friends is a police officer for a local town. He relayed to me an anecdote where, following a violent arrest, he brought a person to a hospital for a psychological evaluation. He then proceeded to sit with this person, handcuffed, in the hospital waiting room for 45 minutes, while the hospital figured out how to admit him.

There’s a certain romantic belief that, absent technology, we fall back to cruder systems - paper and pencil, adding machines, and the like. This couldn’t be further from the truth. Many modern systems simply do not plan for a total failure of the technology underneath them.

We do not fall back to what we used to do before computers. We fall back to chaos.

Perhaps fifty years ago, bed availability in a hospital, or seating on an airplane, were managed on paper, using human beings and filing cabinets as semaphores. We don’t magically get transported back to the 1970s when modern systems fail us. Instead, we transition to a completely new state, where we’re expected to provide similar levels of service, and simultaneously have abandoned the tools that would have enabled us to provide 1970s-era standards of service.

It’s not reasonable for us to design paper-and-people DR plans. We should demand more resiliency from the systems that we’re tasked with designing. A single vendor failure should, just like a cache or database failure, result in graceful degradation. We should indeed not return to the 1970s, but neither should we expect to continue uninterrupted when a critical third-party vendor fails us.

Conclusion

I’m explicitly ending this post without a conclusion. I instead ask that readers think critically, consider the human factors at the receiving end of their systems, and determine how to degrade without failing entirely. Shy away from easy scapegoats like Microsoft and Windows - the true failure modes may lie with us, in the software we design, and we are ultimately responsible for maintaining continuity when our solutions fail us.

We have two primary tools for accomplishing this: ensuring that our tests adequately capture end results and not just the deliverable, and considering the impact of single-system failures before we deploy a new system.