Global Chaos: Explainer, Fixes, and Takeaways from the CrowdStrike BSOD Outage

Published July 19, 2024

A faulty update from cybersecurity provider CrowdStrike is causing Windows machines worldwide to experience the Blue Screen of Death (BSOD) at boot, affecting numerous businesses including banks, airlines, TV broadcasters, and healthcare providers. Major airlines like Delta and United have grounded flights, while UPS and FedEx are unable to ship packages. 

The issue forces Windows PCs and servers with CrowdStrike Falcon installed into a continuous recovery boot loop, preventing them from starting properly. It affects not just office laptops and servers, but also cloud-based and remote systems, such as those running in Azure.

CrowdStrike identified the issue as a defective kernel-level driver update and has deployed a fix, but due to the nature of the bug, impacted machines require manual intervention to recover.

How to Fix

Fixing this issue is highly dependent on the environment and will likely be labor-intensive. Because affected machines are stuck at the boot stage, they have largely become inaccessible, and a fix cannot simply be "rolled out" to them.

On Individual Devices:

For devices like laptops and PCs, or other local devices that a technician can physically access (a command-prompt sketch of the key steps follows this list):

  1. Boot Windows into Safe Mode or the Windows Recovery Environment
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it.
  4. Boot the host normally.
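
For reference, here is a minimal sketch of steps 2 and 3 from an elevated command prompt in Safe Mode or the Recovery Environment. Note that in the Recovery Environment the system drive is not always mapped to C:, so adjust the drive letter if needed:

    REM Run from an elevated command prompt in Safe Mode or WinRE.
    cd /d C:\Windows\System32\drivers\CrowdStrike
    dir C-00000291*.sys
    del C-00000291*.sys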

These steps are complicated or impossible on remote and cloud-based systems: the crash takes those devices offline, and administrators have no direct, physical access to them.

As one user on Reddit put it: “I have 40% of the Windows Servers and 70% of client computers stuck in boot loop (totalling over 1,000 endpoints). I don't think CrowdStrike can fix it, right? Whatever new agent they push out won't be received by those endpoints coz they haven't even finished booting.”

Workarounds for Cloud-Based, Data Center, and Remote Systems:

For remote, data center, and cloud-based systems, the simplest solution is to roll back to a snapshot taken before 04:09 UTC on July 19, when the faulty update was pushed. If that option is available, it is the best way to go. Otherwise, there is guidance below for Azure, AWS, and other remote and virtual systems.
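
As one illustration of the snapshot approach, AWS can replace an instance's root EBS volume in place with a volume restored from an earlier snapshot; the instance and snapshot IDs below are placeholders, and other platforms offer their own equivalent restore operations:

    REM Placeholder IDs: restore the root volume of the impacted instance
    REM from a snapshot taken before the faulty update was pushed.
    aws ec2 create-replace-root-volume-task --instance-id i-0123456789abcdef0 --snapshot-id snap-0123456789abcdef0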

For AWS EC2 instances:

  1. Detach the EBS volume from the impacted EC2 instance
  2. Attach the EBS volume to a new (rescue) EC2 instance
  3. Fix the CrowdStrike driver folder (delete the faulty channel file, as in the steps above); a rough CLI sketch follows this list
  4. Detach the EBS volume from the rescue EC2 instance
  5. Reattach the EBS volume to the impacted EC2 instance
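
A rough sketch of these steps with the AWS CLI follows; the instance IDs, volume ID, and device names are placeholders, and the impacted instance must be stopped before its root volume can be detached:

    REM Stop the impacted instance so its root volume can be detached.
    aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa
    aws ec2 detach-volume --volume-id vol-0cccccccccccccccc
    REM Attach the volume to a healthy rescue instance as a secondary disk, then
    REM delete C-00000291*.sys from \Windows\System32\drivers\CrowdStrike on that disk.
    aws ec2 attach-volume --volume-id vol-0cccccccccccccccc --instance-id i-0bbbbbbbbbbbbbbbb --device xvdf
    REM Take the disk offline in the rescue OS, detach it, and reattach it
    REM as the impacted instance's root volume, then start the instance.
    aws ec2 detach-volume --volume-id vol-0cccccccccccccccc
    aws ec2 attach-volume --volume-id vol-0cccccccccccccccc --instance-id i-0aaaaaaaaaaaaaaaa --device /dev/sda1
    aws ec2 start-instances --instance-ids i-0aaaaaaaaaaaaaaaa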

For Azure (via serial):

  1. Log in to the Azure console
  2. Go to Virtual Machines
  3. Select the affected VM
  4. Click "Connect"
  5. Click "More ways to connect"
  6. Click "Serial Console"
  7. Once SAC has loaded, type cmd and press Enter
  8. Type: ch -si 1
  9. Press any key (e.g. the space bar), then enter Administrator credentials
  10. Set the VM to boot into Safe Mode with one of the following (see the note after this list for clearing the flag afterwards):
    1. bcdedit /set {current} safeboot minimal (Safe Mode)
    2. bcdedit /set {current} safeboot network (Safe Mode with Networking)
  11. Restart the VM
  12. Optional: Confirm the boot state
    1. Run command: wmic COMPUTERSYSTEM GET BootupState
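
Note that the serial console procedure only gets the VM into Safe Mode; once it is there, the faulty “C-00000291*.sys” file still needs to be deleted as described above, and the safeboot flag should then be cleared so the VM returns to a normal boot. A minimal sketch:

    REM In Safe Mode, after deleting the C-00000291*.sys file, clear the
    REM safeboot flag and restart so the VM boots normally again.
    bcdedit /deletevalue {current} safeboot
    shutdown /r /t 0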

For Cloud and Virtual Environments Generally, Including Azure:

  1. Detach the operating system disk volume from the impacted virtual server
  2. Create a snapshot or backup of the disk volume before proceeding further as a precaution against unintended changes
  3. Attach/mount the volume to a new (rescue) virtual server
  4. On the rescue server, navigate to the Windows\System32\drivers\CrowdStrike directory on the attached volume (note that %WINDIR% on the rescue server points to its own Windows installation, not the attached disk)
  5. Locate the file matching “C-00000291*.sys” and delete it (a short sketch of these two steps follows this list)
  6. Detach the volume from the new virtual server
  7. Reattach the fixed volume to the impacted virtual server
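
On the rescue server, the attached volume comes up under its own drive letter; E: below is a placeholder for whichever letter it receives, so steps 4 and 5 amount to roughly:

    REM E: is a placeholder for the drive letter assigned to the attached volume.
    dir E:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
    del E:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys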

Again, this will likely take significant, labor-intensive effort to resolve, and it may be some time before organizations can get all systems back online and operational. Keep a lookout for updates and alternative means of remediation.

Why is this so widespread?

CrowdStrike Falcon is a cloud-based cybersecurity platform that provides endpoint protection with antivirus, monitoring, and threat intelligence capabilities.

Organizations all over the world install the Falcon agent across their endpoints (devices and servers) for comprehensive protection against cyber threats and to meet security compliance standards. The appeal is convenience: CrowdStrike offers cloud-based deployment, management, and auto-update capabilities, which make it very easy for organizations to meet their compliance and risk goals.

However, this also makes it painfully clear how this faulty update spread worldwide so quickly. 

Once the agent is installed, it is usually left to update itself autonomously, rolling updates out across all devices so they stay current with the latest compliance standards and are protected against new threats.

This approach might seem convenient, but it places a significant amount of trust in the software’s ability to manage its updates without introducing vulnerabilities or causing system failures like this. When issues arise, it is easy to blame CrowdStrike. However, this perspective overlooks the broader context, including the organizational practices that led to such deployment strategies.

The Fundamental Problem

CrowdStrike sits in a critical path during computer boot and network connections. Organizations are typically concerned with ease of managing the overall system, mitigating their risk, and checking various compliance checkboxes. By handing all of these concerns to CrowdStrike, they introduced a severe system vulnerability: a single point of failure. However, because that risk now sits with a 3rd party (CrowdStrike), organizations have effectively offloaded their own risk and responsibility, which is often preferable.

Reliance on 3rd parties to “just take care of it” has made organizations feel less at risk, because they are no longer the ones responsible. Would you rather have an outage or cyberattack be due to you, or due to CrowdStrike? However, this only outsources perceived risk: the real risk remains, and it comes at the cost of the underlying systems’ robustness and resilience.

The Takeaway

What we are seeing today with CrowdStrike is not unique. In this case the issue lies with Windows and CrowdStrike, and it is significantly more severe because of how difficult remediation is, but we have seen numerous outages over the past year from other providers like AWS and Cloudflare.

Across various industries, organizations consistently prioritize outsourcing risk and management to 3rd parties, at the expense of real resiliency. While planes are grounded and packages cannot be delivered, no one at these corporations is “to blame”: they can simply point the finger at CrowdStrike, rather than at their own IT practices.

However, these organizations are the ones who created a single point of failure. Homogeneous networks and systems will always be fundamentally vulnerable to these risks. The simple answer is, don't put all of your IT eggs into one basket, whether it’s a cloud provider, network provider, software, or operating system. Don't standardize everything onto a single platform, and at the very least have a backup. 

Easier said than done. Heterogeneous and mixed systems are complicated to manage, but they offer resiliency in situations like this: a percentage of systems will be affected, but you will not be entirely offline. Solutions like Netmaker, which standardize access and connectivity across heterogeneous environments, make these complex systems easier to manage.


Get Secure Remote Access with Netmaker
Sign up for a 2-week free trial and experience seamless remote access for easy setup and full control with Netmaker.