How ML can solve root cause application failure mysteries for engineering and support teams

This article was contributed by Ajay Singh, founder and CEO of Zebrium.

Software sometimes breaks, whether in the cloud, in a hardware appliance, or in infrastructure like networking and security. That is an inevitable fact of life, driven mainly by frequent code updates combined with complexity and countless usage variables. A problem with an application is costly for companies and can lead to lost customers, abandoned shopping carts or a damaged reputation.

The six-hour Facebook outage in October 2021 resulted in losses of $164,000 per minute and cut the company’s market cap by some $40 billion. The December 2021 AWS outage wreaked havoc across the U.S. Banks, service companies and retailers suffer considerable losses when mobile apps or web applications fail. Outages and problems are extremely costly, so fixing them quickly is paramount. The pressure is on, and the clock is ticking. Unfortunately, finding the root cause of these failures is rarely straightforward and often involves considerable detective work.

In the case of the fall Facebook outage, Downdetector tweeted that it was “the largest outage we’ve ever seen on Downdetector with over 10.6 million problem reports from all over the globe.” The cause was finally identified as a configuration change. According to the Uptime Institute 2020 outage analysis report, outages are becoming more severe and costly. At the same time, remedying them is getting more complex as features grow and dependencies on things like software microservices and cloud infrastructure proliferate.

In an ideal world, engineers and support teams looking for the root cause would have continuous streams of logs, unlimited time to analyze them, and an understanding of the problem they are about to troubleshoot. This is rarely the case. Often, they receive a bundle of log files after the fact, without any other context or understanding of the problem. Then they are told to put their detective skills to work. Since these files are frequently just a snapshot from a period of a few hours on the day of the incident, establishing what went wrong can seem like a daunting task, an unsolvable mystery.

Thanks to machine learning (ML) techniques, however, even a static bundle of logs can quickly yield the answers. ML-driven root cause analysis can identify patterns and correlations that might not be obvious to the naked eye of a support engineer and uncover the cause of an incident much faster than manual analysis. Not only does this increase the speed of resolution, but it also improves team productivity and efficiency.

In most cases, the challenge of finding root cause is complicated by the sheer size and number of logs, their messy and unstructured nature and the lack of clarity over what one is trying to find. All of these factors favor ML, not because the task is impossible for trained personnel, but because ML works faster than human eyes and scales beyond the limits of available human resources.

When troubleshooting by analyzing logs, skilled engineers typically start by looking across the logs for rare and unexpected log events and correlating them with errors. The larger the volume of logs and data, the more difficult this is for humans and the greater the value proposition of using ML. The difficulty of the task grows as one moves from reviewing voluminous data to finding anomalies and then making correlations that provide meaningful insight. With ML, each step can be accomplished autonomously and can easily be scaled to almost any volume of data.
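To make the idea of "rare events correlated with errors" concrete, here is a minimal sketch in Python. It is not how any particular product works, and it uses simple counting rather than a trained model. It assumes plain-text log lines, masks variable parts (timestamps, numbers) so that similar lines share one template, treats a template that appears only once as "rare," and reports rare lines that are followed by an error within a few lines. The function names, the masking patterns, the thresholds and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical masks: replace variable fields so similar lines share a template.
_MASKS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TS>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(\.\d+)*\b"), "<NUM>"),
]
_ERROR = re.compile(r"\b(ERROR|FATAL)\b")


def template_of(line):
    """Reduce a log line to its event template by masking variable fields."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def rare_events_near_errors(lines, max_count=1, window=3):
    """Return (rare_index, rare_line, error_index) for each rare, non-error
    line that is followed by an error line within `window` lines."""
    templates = [template_of(line) for line in lines]
    counts = Counter(templates)
    error_idx = [i for i, line in enumerate(lines) if _ERROR.search(line)]
    error_set = set(error_idx)

    findings = []
    for i, tmpl in enumerate(templates):
        if counts[tmpl] > max_count or i in error_set:
            continue  # common event, or the error itself
        following = [e for e in error_idx if 0 < e - i <= window]
        if following:
            findings.append((i, lines[i], following[0]))
    return findings


# Made-up sample lines, for illustration only.
sample_lines = [
    "2024-01-01 10:00:01 INFO request 101 served in 12 ms",
    "2024-01-01 10:00:02 INFO request 102 served in 15 ms",
    "2024-01-01 10:00:03 WARN config reloaded from backup profile",
    "2024-01-01 10:00:04 INFO request 103 served in 11 ms",
    "2024-01-01 10:00:05 ERROR upstream connection refused",
    "2024-01-01 10:00:06 ERROR upstream connection refused",
    "2024-01-01 10:00:07 INFO request 104 served in 14 ms",
]

for idx, line, err in rare_events_near_errors(sample_lines):
    print(f"line {idx} is rare and precedes the error at line {err}: {line}")
```

On the made-up sample, the only line flagged is the one-off configuration reload that appears shortly before the errors begin, which is the kind of lead an engineer would otherwise have to spot by eye. Real logs would need far more robust templating and a notion of time rather than line distance.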

ML is also better suited for determining the real root cause of a problem. In a race against time and with team resource constraints, engineers and support personnel will frequently find a quick remedy or workaround rather than identify and address the problem's true root cause. This often means the same problem will occur again and can impact many other customers as well. However, when ML is used to uncover the root cause, engineers can use their limited time to work directly on addressing the source of the problem and prevent it from having an ongoing impact.

Of course, ML is not a panacea for the entirety of application support. Trained professionals still need to review the ML findings and conduct the proper remediation. While much of the overall process can now be automated, team members still apply their expertise to the most important task: the “last mile.” Using ML speeds the entire process, boosts team efficiency and leaves professionals with more time to work on important tasks.

With the complexity of applications and environments continually increasing and demands on support organizations mounting, introducing ML for logs to the application support process is quickly moving from a luxury to a necessity.

