In the wake of the widespread chaos we saw on Friday, one old adage perhaps feels even truer now than when it was first coined in the 1960s:
To err is human, but to really foul things up you need a computer.
As the world continues to assess the fallout of what has been called “the largest IT outage in history”, industry and government leaders will naturally be pondering how exactly this all could have happened.
Most ironically, the company at the heart of all this – cybersecurity firm CrowdStrike – is explicitly meant to protect the IT systems across our hyperconnected global economy. Is CrowdStrike to blame, or were they just unlucky? Could this happen again?
For businesses, these are risk management questions as much as they are technical IT questions. Risk is unavoidable in business and life. We can never completely escape it, but we can proactively manage it.
Many big companies hate thinking about and preparing for so-called “black swan” events – major catastrophes that are hard to predict. Friday’s events have shown just how important it is that they do.
Risk isn’t a choice
Businesses face many different types of risks. Of these, Friday’s IT outage was an example of an operational risk event. Operational risk is broadly defined as:
the risk of loss as a result of ineffective or failed internal processes, people, systems, or external events.
In simpler terms, it’s the risk that something goes wrong in the way a business runs.
Friday’s outage instantly wrought havoc on a wide range of technology-integrated businesses. It might feel like the kind of event that’s impossible to predict.
But was this operational risk event foreseeable? In general terms – yes! An event like this was inevitable. And it will happen again. Let’s explore some reasons why.
The networked economy
We benefit daily from our networked world, which enables our economy to function at a speed undreamed of decades ago. We depend now on technology for virtually every aspect of our lives.
But this connectivity and speed mean that when things go wrong, they can go wrong fast, and everywhere. It’s a trade-off decision. If we want the benefits of our data-driven, networked economy, we must accept some risk here.
The trade-off decision extends to the choices made by providers of the upstream software and services we rely upon. Some businesses learned this painful lesson last Friday: they had never heard of CrowdStrike, but soon found out that key software they used relied on it. Choosing upstream providers means accepting the risks of their trade-off decisions.
Competition is good, but so are network effects
A fundamental tenet of economics is that competition is good. Yet in technology markets, we often see only a few players dominate. This is in part due to what economists call network externalities.
Positive network externalities arise when increasing the number of users of a product or service increases its value.
Microsoft Windows, for example, is ubiquitous because it has a critical mass of users. Many people know how to use it, which attracts many developers to provide useful applications. Network externalities drive market dominance.
Friday’s events were so wide-reaching because Microsoft and CrowdStrike are dominant players in their respective markets.
Though it wasn’t a Microsoft incident, the company estimated that the outage affected about 8.5 million Windows devices around the world. This is less than 1% of all Windows machines. Microsoft said while this percentage may seem small:
the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services.
We have benefited tremendously from the network externalities of these companies’ dominance, at the price of exposing ourselves to the risk of such narrow dependencies.
How to think about risk
Such vulnerabilities don’t mean we can’t still manage these risks. Effective risk management entails the interplay between three factors:
- risk appetite – how much risk we are willing to accept
- understanding the risks we face – keeping an organisational risk register
- investing in risk treatments to keep risks within our appetite.
Risk appetite and understanding vary significantly across different businesses, and so does the extent of investment in treatments.
But the risk of an outage like Friday’s should have been on the risk register of the affected organisations. We can choose our risk appetite and accordingly invest in risk treatments to keep the identified risks within it.
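To make the interplay between appetite, register and treatments concrete, here is a minimal sketch of a risk register in Python. Everything in it is an assumption for illustration: the 1-to-5 scoring scales, the likelihood-times-impact score, the `Risk` class, the `risks_outside_appetite` helper, the example entries and the appetite threshold are all hypothetical, and real organisations use their own scoring schemes.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One entry in a hypothetical organisational risk register."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    treatment_reduction: int = 0  # impact points removed by treatments in place

    def residual_score(self) -> int:
        # Residual risk after treatment; impact cannot drop below 1.
        residual_impact = max(1, self.impact - self.treatment_reduction)
        return self.likelihood * residual_impact


def risks_outside_appetite(register, appetite):
    """Return the risks whose residual score exceeds the chosen appetite."""
    return [risk for risk in register if risk.residual_score() > appetite]


# Illustrative entries only; the scores are made up, not taken from any real register.
register = [
    Risk("Faulty update from an upstream security vendor", likelihood=2, impact=5),
    Risk("Office power cut", likelihood=3, impact=2),
]

RISK_APPETITE = 8  # hypothetical threshold: the highest residual score we accept

flagged = risks_outside_appetite(register, RISK_APPETITE)
print([risk.name for risk in flagged])

# Investing in a treatment (say, a manual fallback process) lowers the residual impact.
register[0].treatment_reduction = 2
print([risk.name for risk in risks_outside_appetite(register, RISK_APPETITE)])
```

In this sketch the vendor-update risk starts outside the appetite and falls back within it once a treatment is recorded. The point is not the particular numbers but the discipline: a risk has to be on the register before anyone can decide whether it is worth treating.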
For example, investing in fully redundant systems as a treatment could have limited some of the damage of Friday’s events. Many systems that weren’t using CrowdStrike weren’t directly impacted. Some organisations were able to revert to paper-based systems.
But redundancy in systems is very expensive, and there is always the risk that multiple systems will fail at once.
Risk management is complex. CrowdStrike itself is a risk treatment – for the risk of cyberattacks. Friday’s outage resulted in part from fast patching – a rapid roll out of an update to treat a specific cyberattack risk. In treating one risk, we can expose ourselves to new risks.
Given the consequences of black swan events, effective risk management for such possibilities would seem essential. But businesses can’t prepare for every contingency and so are reluctant to invest now to protect against a future risk event of unknown impact.
It’s a matter of perspective: we need to take a systemic view as we evaluate the trade-offs in our networked economy. Or as Nassim Taleb, author of “The Black Swan”, aptly said: “let’s not be turkeys”.
Michael J. Davern has received research grant funding in the past from the Australian Research Council (LP130100106 & LP100100068) in conjunction with National Australia Bank and Great Southern Bank for research in operational risk management practices.
This article was originally published on The Conversation.