What you need to know
- Last week, the world woke up to a global digital pandemic that was caused by a faulty CrowdStrike driver that negatively impacted Microsoft services.
- Dave Plummer, a former Microsoft Windows developer, has shared a YouTube video breaking down intricate details about the global IT outage.
- Plummer believes the outage was caused by null bytes in the driver's dynamic data file.
Last week on Friday, the world woke up to what experts are now referring to as the "biggest IT outage the world has ever seen." The massive outage caused by a faulty CrowdStrike kernel driver impacted approximately 8.5 million Windows devices.
The digital pandemic affected Microsoft services, including networking and cloud computing. The dreadful Blue Screen of Death (BSoD) error characterized the issue. Microsoft and CrowdStrike consequently issued statements indicating that they'd fixed the problem, but it would take time for the services to be fully restored.
Microsoft recommended restarting devices up to 15 times to expedite the restoration of services. Since then, we've received differing accounts of the outage's root cause from Microsoft and CrowdStrike officials.
Dave Plummer (aka Dave's Garage) is a former Microsoft software engineer well known for his contributions across the Windows ecosystem, including adding ZIP file support. He recently shared a YouTube video explaining the digital pandemic caused by CrowdStrike.
For context, Plummer's work as a Windows Developer at Microsoft involved debugging BSoD errors, though he admits CrowdStrike's outage was different. Interestingly, Plummer added that he was traveling in New York amidst the chaos, leaving him stranded at the airport.
While Plummer states the process of debugging BSoD errors was straightforward, he refers to issues affecting kernel drivers as "the hardest to sort out." He further explained that the operating system uses a ring system to separate code into two distinct modes: kernel mode for the operating system itself and user mode for running software applications.
Kernel mode is more privileged and has access to the entire system memory map and what's in memory on any physical page. User mode, on the other hand, only has access to the memory map pages that the kernel wants you to see. It is also worth noting that when application code crashes, only that application goes down. When kernel-mode code crashes, the entire system crashes, which is why users encounter the dreadful BSoD error.
Plummer notes that this isn't specific to the Windows operating system, but a general protective measure across all modern systems, including Linux and Apple's macOS. This is where CrowdStrike's Falcon security service comes in. It helps keep malware at bay and offers robust protection for servers. It's worth noting that the service operates in kernel mode. Because it has access to system data structures and services, it can monitor and analyze how an application behaves to detect attacks.
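To make the user-mode half of that distinction concrete, here is a minimal Python sketch. It is only an illustration, not anything from Plummer's video: the `run_isolated` helper and the `CRASHING_APP` snippet are hypothetical names invented for this example. A child process is forced to abort, and the parent process that launched it carries on, because the operating system contains the failure within the crashed application.

```python
import subprocess
import sys

# Hypothetical stand-in for a buggy user-mode application: it aborts at once.
CRASHING_APP = "import os; os.abort()"


def run_isolated(source: str) -> int:
    """Run Python source in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    code = run_isolated(CRASHING_APP)
    # The child died abnormally, yet this parent process is unaffected.
    print(f"child exited with code {code}; the parent is still running")
```

There is no equivalent safety net for kernel-mode code, since nothing more privileged sits above it to contain the failure.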
Why was Windows part of the chaos?
Running code in kernel mode is no easy feat. For context, Microsoft offers the WHQL (Windows Hardware Quality Labs) certification as proof that a driver has undergone testing and has been certified to run on Windows. The tech giant only issues the digital certificate after running tests. However, it only remains valid as long as no alterations are made to the driver after testing.
CrowdStrike aims to keep Falcon updated with the latest security features to keep up with sophisticated attacks. In essence, this would require the company to create a new driver for each update, which would also mean new WHQL certification for the driver. Given how quickly new attacks emerge and spread, this wouldn't be feasible, as the turnaround time could take weeks.
To work around this issue, CrowdStrike uses dynamic definition files that the driver can process even though they aren't part of the certified driver package. This way, the company can avoid the long certification and testing processes and still be able to ship new updates to counter potential threats to the system.
However, the approach isn't entirely smooth. The dynamic files are complete programs written in code that the driver can execute, which means unsigned code ends up running in kernel mode. The driver itself remains the same and doesn't warrant fresh certification, but each new update changes how it operates, which in turn presents security threats.
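The reason any alteration voids the certificate can be sketched with a simple fingerprint comparison. This is only an analogy: real driver signing relies on certificate-based digital signatures rather than a bare hash, and the `fingerprint` and `still_certified` helpers below are hypothetical names made up for illustration.

```python
import hashlib


def fingerprint(driver_bytes: bytes) -> str:
    """Return a SHA-256 digest standing in for the certified identity."""
    return hashlib.sha256(driver_bytes).hexdigest()


def still_certified(driver_bytes: bytes, certified_fingerprint: str) -> bool:
    """A driver only matches its certification if it is byte-for-byte unchanged."""
    return fingerprint(driver_bytes) == certified_fingerprint


if __name__ == "__main__":
    tested_driver = b"driver build 1"  # made-up contents for the example
    certified = fingerprint(tested_driver)
    print(still_certified(tested_driver, certified))      # unchanged binary
    print(still_certified(b"driver build 2", certified))  # any edit breaks the match
```

Under this model, even a one-byte change produces a different fingerprint, so the changed driver would need to go back through testing.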
While reports indicate the service crashed because of an invalid memory reference, Plummer suggests the issue could've been caused by null bytes in the dynamic data file. During the massive global IT outage, concerned users threw Microsoft under the bus (though subsequent reports confirmed that the company wasn't behind the digital pandemic).
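If Plummer's null-byte theory holds, the missing safeguard would be a sanity check on the data file before anything tries to interpret it. The Python sketch below shows that general idea only. CrowdStrike's actual file format is not described here, so the `MAGIC` marker, the newline-separated layout, and the `load_definitions` function are all assumptions invented for illustration, and real kernel drivers are not written in Python.

```python
from typing import List

# Hypothetical file layout, invented for illustration only:
# a 4-byte magic marker followed by newline-separated rules.
MAGIC = b"DEFS"


class InvalidDefinitionFile(ValueError):
    """Raised when a definition file fails basic sanity checks."""


def load_definitions(data: bytes) -> List[bytes]:
    """Validate the raw file contents before handing back any rules."""
    if len(data) <= len(MAGIC):
        raise InvalidDefinitionFile("file is too short")
    if data.count(0) == len(data):
        # Every byte is zero: reject instead of trying to interpret it.
        raise InvalidDefinitionFile("file contains only null bytes")
    if not data.startswith(MAGIC):
        raise InvalidDefinitionFile("missing magic marker")
    body = data[len(MAGIC):]
    return [line for line in body.split(b"\n") if line]


if __name__ == "__main__":
    print(load_definitions(MAGIC + b"rule-a\nrule-b\n"))
    try:
        load_definitions(bytes(64))  # 64 null bytes
    except InvalidDefinitionFile as err:
        print(f"rejected: {err}")
```

The design choice here is to fail with a recoverable error in the loader rather than let malformed data reach the code that acts on it.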
The Windows operating system ships with many features designed to handle such issues, including booting with the last known good configuration. However, CrowdStrike's driver is marked as a boot driver. For context, boot drivers are treated as essential to starting the operating system, so Windows loads them on every startup. CrowdStrike's driver is likely designated as a boot driver so it can protect Windows devices from the earliest stage of startup, which would explain the persistent BSoD error reports.