After CrowdStrike Outage: Time to Rebuild Microsoft Windows?
Global Outage Triggers Calls for 'Less-Invasive Access' to Essential Functions
Before it was over - even while millions of computers in mid-July were still cycling endlessly into "blue screen of death" spirals of uselessness - the question arose: How could this happen?
The immediate cause was endpoint detection and response vendor CrowdStrike, which has direct access to the Microsoft operating system kernel. It made an update that went horribly wrong, leading to outages affecting airports, banks and hospitals worldwide.
Government agencies, security experts and vendors have flagged multiple areas for review, including the resiliency of Windows operating systems, deployment strategies for third-party software updates and the deep-level OS access many current security tools require.
The July 19 incident disrupted 8.5 million Windows machines. Estimates of the direct losses caused by the outages stand at over $5.4 billion.
CrowdStrike published a root cause analysis of the outage and said it's already making multiple changes to prevent a recurrence, including bolstering its internal testing practices and rolling out software updates in batches (see: CrowdStrike Debuts Safeguards, Seeks to Blunt Outage Impact).
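Rolling out updates in batches can be illustrated with a minimal sketch. It is not CrowdStrike's actual process: the function names, the ring fractions and the `deploy` and `healthy` callbacks are all hypothetical, chosen only to show the idea of pushing an update to a small share of machines first and halting if they report problems.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def plan_rings(hosts: Sequence[str], fractions: Iterable[float]) -> List[List[str]]:
    """Split hosts into consecutive rings; each fraction is the cumulative
    share of the fleet that should be covered once that ring completes."""
    rings: List[List[str]] = []
    start = 0
    for frac in fractions:
        end = min(len(hosts), max(start, round(len(hosts) * frac)))
        rings.append(list(hosts[start:end]))
        start = end
    return rings


def staged_rollout(
    hosts: Sequence[str],
    fractions: Iterable[float],
    deploy: Callable[[str], None],   # hypothetical: pushes the update to one host
    healthy: Callable[[str], bool],  # hypothetical: post-update health check
    max_failure_rate: float = 0.0,
) -> Dict[str, object]:
    """Deploy ring by ring and stop as soon as a ring exceeds the failure budget."""
    updated: List[str] = []
    for ring in plan_rings(hosts, fractions):
        if not ring:
            continue
        for host in ring:
            deploy(host)
        updated.extend(ring)
        failures = sum(1 for host in ring if not healthy(host))
        if failures / len(ring) > max_failure_rate:
            # Halt: hosts in later rings never receive the faulty update.
            return {"halted": True, "updated": updated}
    return {"halted": False, "updated": updated}
```

With 100 hosts and cumulative fractions of 1%, 10% and 100%, a faulty update that fails its health check stops after the first ring, so one machine is affected rather than the whole fleet.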
While attention has centered on CrowdStrike, experts say Microsoft shoulders blame for the incident, too. Windows failed to prevent a faulty software update from triggering an infinite reboot cycle.
Redmond has signaled plans to address that. In a July 25 blog post, Microsoft's John Cable, head of program management, said, "Windows must prioritize change and innovation in the area of end-to-end resilience."
On Tuesday, Microsoft plans to host a closed-door summit with government and industry representatives at its Redmond, Washington, headquarters to discuss safe deployment strategies and the design of more resilient approaches.
Expect the summit to "lead to next steps in both short- and long-term actions and initiatives to pursue, with improved security and resilience as our collective goal," said Aidan Marcuss, corporate vice president of Microsoft Windows and Devices, in a blog post.
Resilience can be easy to define but tough to achieve. As cybersecurity and risk management expert Dan Geer said in a 2014 keynote speech at the Security of Things Forum: "The root source of risk is dependence, especially dependence on the expectation of stable system state."
Geer said everything new added to a previously stable system - for example, via security software updates - involves trade-offs, heightening the tension between resilience and fragility.
Competition Agreement
For Windows, these issues are not just technical matters. They touch on long-standing competition and consumer protection concerns tied to Microsoft, including its browser and software. To resolve those issues, the company reached a 2009 agreement with the European Commission, pledging to give third-party products with which it competes - including on the security front - equal access to Windows internals.
Many types of security software, including extended detection and response - aka XDR - tools, use kernel-level access to get otherwise unavailable information about the system. "This information is incredibly valuable for tracking attacker activity," IT consultancy Forrester Research said in a recent report.
Third-party access to the Windows kernel remains a must-have for many enterprise-focused security tools to correctly function, and removing that capability would lead to "ultimately a much higher cost" due to the corresponding reduction in defensive capabilities, said J.J. Guy, CEO of Sevco Security, which builds IT asset management software.
CrowdStrike has emphasized the continuing need for kernel-level access and for loading its tools as early as possible in the Windows boot cycle. "Products like firmware analysis or device control would not be possible without this design," the company said in its post-incident technical analysis. "Microsoft directly supports and endorses such capabilities in security products, namely through the Early Launch Anti Malware - ELAM - architecture, which was specifically built in Windows 8.1 to enable such types of monitoring and enforcement."
But not all endpoint detection and response tools' kernel-level access is equal, said British cybersecurity expert Kevin Beaumont. "Some of these EDR vendors, including CrowdStrike, publish updates in a way which allows them to run detection code from the kernel in an unsafe way, which can trigger blue screens," he said in a blog post.
Equal Access Concerns
In the name of resilience, Microsoft could propose blocking direct, third-party access to the Windows kernel altogether and require any tools that want to use it to work instead with a Microsoft-architected intermediary software application such as Windows Defender. But experts see this approach as a nonstarter for regulators, on equal access grounds.
"Unless Microsoft is willing to pull Defender out of the kernel space, regulators have a justifiable position," Forrester said.
As highlighted by a number of outages over the years, most recently featuring CrowdStrike, kernel-level access carries risk. This is partially due to some types of XDR software loading very early in the boot process, to prevent attackers from deactivating it. If anything goes wrong at that stage, the OS may fail to finish loading and never reach the point where it has internet access and could be automatically or remotely fixed. This is what made the CrowdStrike-triggered outage so difficult for some organizations to quickly remediate. Many IT teams needed to physically access affected systems, sometimes in remote locations.
Following the CrowdStrike-triggered outages, Forrester now recommends IT teams limit in critical systems "the amount of endpoint management agents that have access to the kernel, opting instead for agentless forms of management - i.e., API-based Windows management - or software that does not have kernel components."
Can Windows Be Rearchitected?
Given such concerns, is it time to rebuild Windows so that no security software - Microsoft's included - requires kernel-level access?
Germany's cybersecurity agency plans to host a conference later this year aimed at securing commitments from security vendors to move in this direction. At a minimum, the Federal Office for Information Security, known as the BSI, wants Microsoft, CrowdStrike and anyone building comparable security software to ensure "that the respective operating system can always be started at least in safe mode, even in the event of serious malfunctions."
Longer term, the BSI wants to see changes to Windows "offering the same functionality and level of protection as before, but which require less invasive permissions to operating systems," which should better "minimize the impact of software errors," it said.
"It is not acceptable to run these tools in kernel mode with all the access you see today," Thomas Caspers, director general for technology strategy at the BSI, told The Wall Street Journal.
Overhauling Windows to remove the need for kernel-level access would be a major undertaking, but not without precedent. Even "mainframes and mini computers" in the early 1980s could "automatically and effectively" protect kernel memory by "trapping and handling the error," said cybersecurity adviser and former CISO Ken Stephens.
More recently, Linux offers a comparable capability via the Extended Berkeley Packet Filter, or eBPF, which lets code that has been verified in advance run inside the kernel in a sandbox, keeping it isolated in case anything goes wrong. When Apple moved to its own silicon, the company used it as an opportunity to wall off the macOS kernel.
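The verify-before-run idea behind eBPF can be sketched with a toy model. This is an analogy only, not eBPF's real interface: the `load` function, the `RejectedProgram` exception and the notion of a "program" as a list of input-field indexes are all assumptions made for illustration. The point is that a program that would read outside its inputs is refused at load time, before it can ever run.

```python
from typing import Callable, List, Sequence


class RejectedProgram(Exception):
    """Raised at load time; a rejected program is never executed."""


def load(field_indexes: Sequence[int], input_len: int) -> Callable[[Sequence[str]], List[str]]:
    """Check every read against the declared input size, then return a runner.

    Toy stand-in for a verifier: a 'program' here is just the list of
    input fields it wants to read (a hypothetical format).
    """
    indexes = tuple(field_indexes)
    for i in indexes:
        if not 0 <= i < input_len:
            raise RejectedProgram(
                f"read of field {i} is outside the {input_len} declared inputs"
            )

    def run(inputs: Sequence[str]) -> List[str]:
        # The runner only accepts inputs of the size it was verified against.
        if len(inputs) != input_len:
            raise ValueError(f"expected {input_len} inputs, got {len(inputs)}")
        return [inputs[i] for i in indexes]

    return run
```

A program that reads fields 0 and 2 of three inputs loads and runs normally, while one that asks for field 20 of 20 inputs is rejected by `load` and never executes.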
"It's somewhat painful, but it's a necessary evolution," Gartner cybersecurity analyst Neil MacDonald told The Wall Street Journal.
Multiple security vendors have signaled their willingness to leave kernel-level access behind.
Tomer Weingarten, CEO of SentinelOne, which competes with CrowdStrike and Microsoft, told Information Security Media Group that "we would be fully supportive to moving out of the kernel as soon as possible," if Microsoft can come up with a convincing alternative, including "the right interfaces to gain the same type of level of visibility."
For both Linux and macOS systems, he said, "we're already out the kernel."