At around 5 AM EST, I was violently awakened by my smoke detectors. The first thing that went through my mind was that the server I keep running had finally set the house on fire. Nothing that dramatic: just the ever-so-charming “get up on a ladder and replace my batteries” chirp. Now that I was awake, I looked over at my phone to find what appeared to be the end of humanity as we know it: outages everywhere, on a global level. Being a network guy, my first thought was, who screwed up a change, or which cloud provider is having an issue? However, this global outage was not caused by a network change.
Look at some of the bigger outages of the past few years: they have almost all been attributed to “a misconfiguration of components in the network” or “a routine update to our core caused instability in the following regions….” An outage of this magnitude would almost certainly point to the network, right?
This is bigger than someone messing up a routing policy, pushing a bad configuration, or inadvertently blackholing traffic. Once identified, those network issues can usually be fixed quickly with a workaround to restore service. This particular outage goes way beyond that and will take longer to remediate. So what caused it? For once, it was not a networking change; it was an update pushed out by a vendor.
Who is CrowdStrike?
CrowdStrike makes Endpoint Detection and Response (EDR) software that runs on enterprise workstations and servers. It constantly monitors the device and responds to cyber threats like ransomware and malware. I will let you insert your own jokes here, because there are plenty. Last night, a content update went out, which is routine for software vendors that push over-the-air updates.
This update was for their threat detection and response functionality. Part of it was a content file consumed by the kernel-level Windows driver CrowdStrike uses to secure Windows machines. This is nothing new; vendors push updates like this to fix bugs, introduce new features, or refresh the current code base. Nothing about the process was unique. However, something in that update caused Windows machines to say, “Nah, man, I’m going to take the weekend off,” which turned into a ripple effect of the world returning to the dark ages. Want to jump on that flight home? Nope, flights are grounded. Want to grab something to eat or drink while you wait in the airport? Nope, those organizations are having issues too.
Yes, there is a workaround, but….
A workaround has been published, and the steps are as follows (a scripted sketch follows the list):
- Boot Windows into Safe Mode or the Windows Recovery Environment.
- Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
- Locate the file matching “C-00000291*.sys” and delete it.
- Reboot the host.
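For anyone staring down a pile of machines, here is a minimal Python sketch of step three only. It assumes you have already booted the affected host into Safe Mode, have local admin rights, and have dealt with BitLocker separately; the directory and filename pattern are the ones from the published workaround, while the script itself is my illustration, not CrowdStrike tooling.

```python
# Minimal sketch of the "delete the bad channel file" step, assuming you are
# already in Safe Mode with admin rights and the drive is unlocked.
import glob
import os

# Directory and filename pattern come from the published workaround.
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
BAD_FILE_PATTERN = "C-00000291*.sys"

def remove_bad_channel_files(directory: str = CROWDSTRIKE_DIR,
                             pattern: str = BAD_FILE_PATTERN) -> int:
    """Delete files matching the known-bad pattern and return how many were removed."""
    removed = 0
    for path in glob.glob(os.path.join(directory, pattern)):
        try:
            os.remove(path)
            print(f"Deleted {path}")
            removed += 1
        except OSError as err:  # locked file, missing permissions, etc.
            print(f"Could not delete {path}: {err}")
    return removed

if __name__ == "__main__":
    count = remove_bad_channel_files()
    print(f"Removed {count} file(s); reboot the host to finish the workaround.")
```

Even with something like this, someone still has to get each box into Safe Mode and run it, which is exactly the problem the next list gets at.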
This does not look too bad until you realize…
- You are going to have to touch a lot of workstations to get them into Safe Mode.
- If you use BitLocker, you will need the recovery keys handy.
- Do you even have permission to delete the file?
I do not envy anyone working in the desktop or security space over the next few days. Many larger organizations will be feeling this outage for some time.
So, where do we go from here? Every organization will revisit its disaster recovery plans and try to account for something like this happening again. Unfortunately, that is far easier said than done because of the sheer reach of this outage. Going back to the network scenario, the organizations I have worked for had very detailed outage recovery plans, but nothing on this scale. An outage like this is almost impossible to prepare for.
If you were affected, how did you find out about devices dropping offline? Does this experience make you want to look elsewhere for an EDR platform? Let me know in the comments below or over on my socials; you can find them, along with my LinkedIn, on my linktr.ee.
We have been waiting a while for this one, network guys….