Global Chaos: Explainer, Fixes, and Takeaways from the CrowdStrike BSOD Outage

Published July 19, 2024

A faulty update from cybersecurity provider CrowdStrike is causing Windows machines worldwide to experience the Blue Screen of Death (BSOD) at boot, affecting numerous businesses including banks, airlines, TV broadcasters, and healthcare providers. Major airlines like Delta and United have grounded flights, while UPS and FedEx are unable to ship packages. 

The issue forces Windows PCs and servers with CrowdStrike Falcon installed into a continuous recovery boot loop, preventing them from starting properly. It affects not just office laptops and servers, but also cloud-based and remote systems, such as those running in Azure.

CrowdStrike identified the issue as a defective kernel-level driver update and has deployed a fix, but due to the nature of the bug, impacted machines require manual intervention to recover.

How to Fix

Fixing this issue is highly dependent on the environment and will likely be labor-intensive. Because affected machines are stuck at the boot stage, they have largely become inaccessible, and a fix cannot simply be "rolled out" to them.

On Individual Devices:

For devices like laptops and PCs, or other local devices that a technician can physically access (a command-prompt sketch of the key steps follows this list):

  1. Boot Windows into Safe Mode or the Windows Recovery Environment
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys”, and delete it.
  4. Boot the host normally.
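
For reference, here is a minimal sketch of steps 2 and 3 from an elevated command prompt in Safe Mode or the Recovery Environment. Note that in the Recovery Environment the system drive is not always mapped to C:, so adjust the drive letter if needed:

    REM Run from an elevated command prompt in Safe Mode or WinRE.
    cd /d C:\Windows\System32\drivers\CrowdStrike
    dir C-00000291*.sys
    del C-00000291*.sys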

These steps are complicated or impossible on remote and cloud-based systems: the crash takes those devices offline, and administrators have no direct, physical access to them.

As one user on Reddit put it: “I have 40% of the Windows Servers and 70% of client computers stuck in boot loop (totalling over 1,000 endpoints). I don't think CrowdStrike can fix it, right? Whatever new agent they push out won't be received by those endpoints coz they haven't even finished booting.”

Workarounds for Cloud-Based, Data Center, and Remote Systems:

For remote, data center, and cloud-based systems, the simplest solution is to roll back to a snapshot taken before 04:09 UTC on July 19, when the faulty update was pushed. If that option is available, it is the best way to go. Otherwise, there is guidance below for Azure, AWS, and other remote and virtual systems.
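
As one illustration of the snapshot approach, AWS can replace an instance's root EBS volume in place with a volume restored from an earlier snapshot; the instance and snapshot IDs below are placeholders, and other platforms offer their own equivalent restore operations:

    REM Placeholder IDs: restore the root volume of the impacted instance
    REM from a snapshot taken before the faulty update was pushed.
    aws ec2 create-replace-root-volume-task --instance-id i-0123456789abcdef0 --snapshot-id snap-0123456789abcdef0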

For AWS EC2 instances:

  1. Detach the EBS volume from the impacted EC2 instance
  2. Attach the EBS volume to a new (rescue) EC2 instance
  3. Fix the CrowdStrike driver folder (delete the faulty channel file, as in the steps above); a rough CLI sketch follows this list
  4. Detach the EBS volume from the rescue EC2 instance
  5. Reattach the EBS volume to the impacted EC2 instance
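
A rough sketch of these steps with the AWS CLI follows; the instance IDs, volume ID, and device names are placeholders, and the impacted instance must be stopped before its root volume can be detached:

    REM Stop the impacted instance so its root volume can be detached.
    aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa
    aws ec2 detach-volume --volume-id vol-0cccccccccccccccc
    REM Attach the volume to a healthy rescue instance as a secondary disk, then
    REM delete C-00000291*.sys from \Windows\System32\drivers\CrowdStrike on that disk.
    aws ec2 attach-volume --volume-id vol-0cccccccccccccccc --instance-id i-0bbbbbbbbbbbbbbbb --device xvdf
    REM Take the disk offline in the rescue OS, detach it, and reattach it
    REM as the impacted instance's root volume, then start the instance.
    aws ec2 detach-volume --volume-id vol-0cccccccccccccccc
    aws ec2 attach-volume --volume-id vol-0cccccccccccccccc --instance-id i-0aaaaaaaaaaaaaaaa --device /dev/sda1
    aws ec2 start-instances --instance-ids i-0aaaaaaaaaaaaaaaa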

For Azure (via serial):

  1. Log in to the Azure console
  2. Go to Virtual Machines
  3. Select the affected VM
  4. Click "Connect"
  5. Click "More ways to connect"
  6. Click "Serial Console"
  7. Once SAC has loaded, type cmd and press Enter
  8. Type: ch -si 1
  9. Press any key (e.g. the space bar), then enter Administrator credentials
  10. Set the VM to boot into Safe Mode with one of the following (see the note after this list for clearing the flag afterwards):
    1. bcdedit /set {current} safeboot minimal (Safe Mode)
    2. bcdedit /set {current} safeboot network (Safe Mode with Networking)
  11. Restart the VM
  12. Optional: Confirm the boot state
    1. Run command: wmic COMPUTERSYSTEM GET BootupState
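
Note that the serial console procedure only gets the VM into Safe Mode; once it is there, the faulty “C-00000291*.sys” file still needs to be deleted as described above, and the safeboot flag should then be cleared so the VM returns to a normal boot. A minimal sketch:

    REM In Safe Mode, after deleting the C-00000291*.sys file, clear the
    REM safeboot flag and restart so the VM boots normally again.
    bcdedit /deletevalue {current} safeboot
    shutdown /r /t 0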

For Cloud and Virtual Environments Generally, Including Azure:

  1. Detach the operating system disk volume from the impacted virtual server
  2. Create a snapshot or backup of the disk volume before proceeding further as a precaution against unintended changes
  3. Attach/mount the volume to a new (rescue) virtual server
  4. On the rescue server, navigate to the Windows\System32\drivers\CrowdStrike directory on the attached volume (note that %WINDIR% on the rescue server points to its own Windows installation, not the attached disk)
  5. Locate the file matching “C-00000291*.sys” and delete it (a short sketch of these two steps follows this list)
  6. Detach the volume from the new virtual server
  7. Reattach the fixed volume to the impacted virtual server
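
On the rescue server, the attached volume comes up under its own drive letter; E: below is a placeholder for whichever letter it receives, so steps 4 and 5 amount to roughly:

    REM E: is a placeholder for the drive letter assigned to the attached volume.
    dir E:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
    del E:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys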

Again, this will likely take significant, labor-intensive effort to resolve, and it may be some time before organizations can get all systems back online and operational. Keep a lookout for updates and alternative means of remediation.

Why is this so widespread?

CrowdStrike Falcon is a cloud-based cybersecurity platform that provides endpoint protection with antivirus, monitoring, and threat intelligence capabilities.

Organizations all over the world install the Falcon agent across their endpoints (devices and servers) for comprehensive protection against cyber threats and to meet security compliance standards. The appeal is convenience: CrowdStrike offers cloud-based deployment, management, and auto-update capabilities, which make it very easy for organizations to meet their compliance and risk goals.

However, this also makes it painfully clear how this faulty update spread worldwide so quickly. 

Once the agent is installed, it is usually left to update itself autonomously, rolling updates out across all devices so they stay current with the latest compliance standards and are protected against new threats.

This approach might seem convenient, but it places a significant amount of trust in the software’s ability to manage its updates without introducing vulnerabilities or causing system failures like this. When issues arise, it is easy to blame CrowdStrike. However, this perspective overlooks the broader context, including the organizational practices that led to such deployment strategies.

The Fundamental Problem

CrowdStrike sits in a critical path during computer boot and network connections. Organizations are typically concerned with ease of managing the overall system, mitigating their risk, and checking various compliance checkboxes. By handing all of these concerns to CrowdStrike, they introduced a severe system vulnerability: a single point of failure. However, because that risk now sits with a 3rd party (CrowdStrike), organizations have effectively offloaded their own risk and responsibility, which is often preferable.

Reliance on 3rd parties to “just take care of it” has made organizations feel less at risk, because they are no longer the ones responsible. Would you rather have an outage or cyberattack be due to you, or due to CrowdStrike? However, this only outsources perceived risk: the real risk remains, and it comes at the cost of the underlying systems’ robustness and resilience.

The Takeaway

What we are seeing today with CrowdStrike is not unique. In this case the issue lies with Windows and CrowdStrike, and it is significantly more severe because of how difficult remediation is, but we have seen numerous outages over the past year from other providers like AWS and Cloudflare.

Across various industries, organizations consistently prioritize outsourcing risk and management to 3rd parties, at the expense of real resiliency. While planes are grounded and packages cannot be delivered, no one at these corporations is “to blame”: they can simply point the finger at CrowdStrike, rather than at their own IT practices.

However, these organizations are the ones who created a single point of failure. Homogeneous networks and systems will always be fundamentally vulnerable to these risks. The simple answer is, don't put all of your IT eggs into one basket, whether it’s a cloud provider, network provider, software, or operating system. Don't standardize everything onto a single platform, and at the very least have a backup. 

Easier said than done. Heterogeneous and mixed systems are complicated to manage, but they offer resiliency in situations like this: a percentage of systems will be affected, but you will not be entirely offline. Solutions like Netmaker, which standardize access and connectivity across heterogeneous environments, make these complex systems easier to manage.


Get Secure Remote Access with Netmaker
Sign up for a 2-week free trial and experience seamless remote access for easy setup and full control with Netmaker.