The Internet has become the new backbone of corporate operations. Delivery services, banking, VPN connections from anywhere and everywhere: all of it now runs over the Internet, which has become the primary delivery mechanism for customer and employee applications and services.
Guaranteeing an always-on experience of those digital services is essential to any business. Predictive analytics and AI-powered intelligence let us build forecasting models that help optimize performance and minimize downtime, but outages on the Internet can manifest in a practically unlimited number of ways. And when those outages occur on external networks and within third-party providers that sit outside the owned IT perimeter, how do you even identify them, let alone predict and mitigate them?
No longer a finite number of events
An outage typically presents to the end user in a small set of familiar ways, such as slower load times or a complete inability to access an application or service. Often, there is a common underlying pattern, or chain of events, that led to the outage.
In isolation, each pattern is detectable and observable. Most IT teams conduct a post-incident analysis to map out the pattern or sequence of events that led to an outage. This helps the team understand the chain reaction in enough detail that, if the same pattern repeats in the future, it can be detected and an intervention made before it turns into a disruption that impacts users.
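To make the idea of "detecting a repeat of a known pattern" concrete, here is a minimal sketch, assuming a post-incident review has boiled an outage down to an ordered list of event types (the event names below are hypothetical, not taken from any real monitoring product). It checks whether that chain reappears, in order, within a stream of live events, allowing unrelated events in between.

```python
from typing import Sequence

# Hypothetical signature recorded during a post-incident review:
# the ordered chain of events that preceded a past outage.
KNOWN_SIGNATURE = ["bgp_route_withdrawn", "packet_loss_spike", "dns_timeout"]


def matches_signature(events: Sequence[str], signature: Sequence[str]) -> bool:
    """Return True if every step of `signature` appears in `events` in order.

    Unrelated events may occur between the steps (subsequence match).
    """
    it = iter(events)
    # `step in it` consumes the iterator up to and including the first match,
    # so each later step must occur after the previous one.
    return all(step in it for step in signature)


if __name__ == "__main__":
    live = ["cpu_high", "bgp_route_withdrawn", "latency_up",
            "packet_loss_spike", "dns_timeout"]
    print(matches_signature(live, KNOWN_SIGNATURE))
```

This works well precisely because the signature is fixed and known in advance; the rest of the article explains why that assumption no longer holds.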
The challenge facing operations teams today is that things are no longer this simple, and outages are no longer based on a finite number of isolated events.
The multi-layered, interdependent outage
Networks and applications have grown in complexity, and this has changed the characteristics of outages. In particular, the underlying patterns of system behavior that cause outages are not as repeatable or predictable as they once were. Outage causes today are far more intricate and harder to diagnose. For instance, a system or application no longer follows a linear client-network-server architecture; instead, it operates as a “mesh” of connectivity links, IT infrastructure and software components. The challenge for ops teams is that a mesh architecture dramatically increases the number of interconnected components, and therefore the number of combinations of conditions that can cause an outage. Compared to a more linear architecture, a mesh has far more connections between components, and the number of permutations or sequences that can form an outage pattern grows faster still, exponentially with the number of links involved.
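A quick back-of-the-envelope calculation shows why the mesh is so much harder to reason about. This is a deliberately simplified model, not one from the article: it assumes a worst-case full mesh (every component connected to every other) and treats each link as simply "healthy" or "degraded". Link counts grow roughly quadratically, while the number of possible combinations of link states grows exponentially.

```python
def chain_links(n: int) -> int:
    """Links in a linear chain (client -> network -> ... -> server) of n components."""
    return max(n - 1, 0)


def full_mesh_links(n: int) -> int:
    """Links in a full mesh where every component connects to every other one."""
    return n * (n - 1) // 2


def link_state_combinations(links: int) -> int:
    """Distinct health states if each link is independently healthy or degraded."""
    return 2 ** links


if __name__ == "__main__":
    for n in (3, 5, 10):
        c, m = chain_links(n), full_mesh_links(n)
        print(f"n={n}: chain {c} links / {link_state_combinations(c)} states, "
              f"mesh {m} links / {link_state_combinations(m)} states")
```

With just 10 components, the chain has 9 links and 512 possible link-state combinations, while the full mesh has 45 links and 2^45 (about 3.5 × 10^13) combinations. Real meshes are rarely fully connected, but the direction of the effect is the same.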
In addition, the number of components in the mesh is constantly changing. As features are added to an application, more components or third-party services are incorporated into the application’s end-to-end delivery chain, and into the mesh that supports it. The complexity of the application grows, and so does the range of potential causes that can bring part or all of the application down. And it’s not just the direct dependencies that are a concern; third-party infrastructure services and components come with their own interdependencies, on systems and services that are often several steps removed from view.
Is an unpredictable pattern even a pattern?
These outage patterns don’t manifest in predictable ways.
To have the best possible chance of accurately pattern-matching in this scenario, organizations need a reliable way to ‘read between the lines’: to understand the intricate interplay of the events and patterns being observed, and how they relate to the performance of their specific application or infrastructure.
That level of contextual insight across every domain, including those that sit outside enterprise visibility and control, demands a new approach to outage detection and mitigation. Managing such a vast, global network, much of which lies outside enterprise control, also changes the level of data and contextual insight that IT leaders now need to care about.
When data-driven insight goes beyond enterprise scale
When it comes to seeing, and also predicting, outages, it is access to high-fidelity data across all the environments that matter, including cloud and the Internet, that will ultimately allow us to identify and navigate this new world of patterns on patterns on patterns: surfacing where a performance problem exists, why it exists, and whether it matters. That requires visibility across the end-to-end service delivery chain, so teams can see, correlate, and triangulate every pattern that affects the always-on digital experience.
Outage patterns may keep changing indefinitely, but new technologies are also advancing beyond human scale. Done right, they will allow us to see outages within and beyond our perimeter, wherever they occur, and power a new level of intelligence that generates the automated insight needed to forecast the many different patterns, along with the recommended actions to avoid them.
This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro