What is AIops (artificial intelligence for IT operations)? Definition, use cases, benefits and landscape

Deploying software to support the work of an enterprise is an increasingly complex job that’s often referred to as 'devops.' When enterprise teams started using artificial intelligence (AI) algorithms to more efficiently and collaboratively run these operations, end users coined the term AIops for these tasks.

What is AIops (artificial intelligence for IT operations)?

AI can help large software installations by watching the software run and flagging any anomalies or instances of poor performance. The software can examine logs and track key metrics, like response time, to evaluate the speed and effectiveness of the code. When the values deviate from the expected range, the AI can suggest solutions and even implement some of them.
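The monitoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular AIops product works: it assumes response times arrive as a simple list of milliseconds, and the function name `find_anomalies`, the window size and the threshold are all hypothetical choices.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(response_times_ms, window=20, threshold=3.0):
    """Return (index, value) pairs that deviate sharply from the recent baseline."""
    recent = deque(maxlen=window)  # sliding window of values judged normal
    anomalies = []
    for i, value in enumerate(response_times_ms):
        if len(recent) == window:
            baseline = mean(recent)
            spread = stdev(recent)
            # Flag values more than `threshold` standard deviations from the baseline.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                anomalies.append((i, value))
                continue  # keep the outlier out of the baseline
        recent.append(value)
    return anomalies


# Hypothetical response times in milliseconds, with one slow request at index 30.
samples = [100 + (i % 5) for i in range(40)]
samples[30] = 450
print(find_anomalies(samples))  # [(30, 450)]
```

A production system would likely use a richer model of expected performance, but the shape is the same: learn what normal looks like, then flag what does not fit.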

There are several stages to the process:

AIops is growing in complexity as teams deploy algorithms to a variety of enterprises. One of the most valuable opportunities comes when organizations start to use other AI algorithms in daily operations. In these cases, AIops can help with deploying AI. This way, there can be synergy between the software layers.

Sometimes AIops teams use other subterms for their work. MLops, for example, deals specifically with using and deploying machine learning algorithms. DataOps can refer to the general problem of collecting data or the more specific problem of organizing the data that’s used to train and refresh an artificial intelligence model.

Benefits of AI in IT operations (AIops)

When AI scientists began to explore the best algorithms for AI, they worked with experimental computers in their labs. Now that AI is becoming regularly deployed in production environments, some are beginning to specialize in maintaining and running software.

The challenges of supplying services with AI algorithms are the same as those of maintaining regular software. There should be sufficient computational power to answer all requests, even those that arrive together in a moment of peak demand. There should be systems in place to deliver the right versions of the software to the front-line hardware. When developers and scientists make changes, there should be a mechanism for testing them and eventually replacing the software on the front-line machine with the newest version.
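One common way to test a change before it replaces the running version is a canary check: send a slice of traffic to the new version and compare its error rate with the current one. The sketch below is only an illustration of that idea. The function name `should_promote` and both thresholds are assumptions, not part of any specific deployment tool.

```python
def should_promote(current_errors, current_requests,
                   candidate_errors, candidate_requests,
                   min_requests=500, max_extra_error_rate=0.005):
    """Decide whether a candidate version may replace the current one."""
    if current_requests == 0 or candidate_requests < min_requests:
        return False  # not enough traffic to make a fair comparison
    current_rate = current_errors / current_requests
    candidate_rate = candidate_errors / candidate_requests
    # Promote only if the candidate is not meaningfully worse than the current version.
    return candidate_rate <= current_rate + max_extra_error_rate


print(should_promote(40, 10_000, 3, 1_000))   # True: 0.3% vs. 0.4% errors
print(should_promote(40, 10_000, 20, 1_000))  # False: 2.0% vs. 0.4% errors
```

For an AI model, the same gate could compare prediction quality instead of error counts.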

Much of the work is no different from standard devops. However, there are also concerns that are particular to AI and machine learning (ML). Some of these include:

All of these questions and strategies apply in some form to the subsets with names like DataOps, MLops, ModelOps, and PlatformOps because they focus on some of the particular parts of the work.

Is AIops about AI or IT?

Some companies focus on using AI to improve performance of their servers and databases. They use the term AIops to refer to using AI algorithms to watch for anomalies and, perhaps, predict outages or failures before they happen. The algorithms are good at creating models of expected performance and then creating alerts when the stack starts to perform differently.

The AI algorithms are particularly useful for noticing security failures. They can, for instance, flag large outflows of data caused by hackers. These stand out because users typically download only the small amount of data that fits their need. Unusual data flows are typically indicators of a breach.

Now that AI routines are becoming more common and integrated into all parts of the stack, some firms are asking how they can support the ongoing work specific to AI tools: juggling the datasets, constructing the models, deploying the models and then rotating them to maintain performance.

How can AIops help security?

While many areas of AIops are focused on practical issues of performance like how quickly a server is responding to a request, some are also using AI algorithms to watch for the kind of anomalies that indicate a leak or unauthorized intrusion.

One of the simplest ways that AIops can help with cybersecurity is to watch for large or uncharacteristic outflows of data. If the website is designed to offer small, quick answers with at most one user’s personal information, then a larger block could signal a mistake.
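A rough sketch of that check: learn the typical response size from past traffic, then flag any response far above it. The record layout (`path`, `bytes`), the function names and the multiplier of five are assumptions made for illustration.

```python
from statistics import quantiles


def size_limit(history_bytes, multiplier=5.0):
    """Set a limit at a multiple of the 99th percentile of past response sizes."""
    p99 = quantiles(history_bytes, n=100)[98]  # 99th of the 99 percentile cut points
    return multiplier * p99


def flag_large_responses(responses, limit):
    """Return the responses whose size exceeds the learned limit."""
    return [r for r in responses if r["bytes"] > limit]


# Hypothetical history: responses of roughly 2 KB each.
history = [1800 + 10 * (i % 50) for i in range(500)]
limit = size_limit(history)

new_traffic = [
    {"path": "/profile", "bytes": 2_100},
    {"path": "/export", "bytes": 5_000_000},
]
for response in flag_large_responses(new_traffic, limit):
    print("unusually large response:", response["path"])
```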

Some areas that AIops may watch are:

This approach can be especially useful because security breaches are usually quite rare and difficult for a human to spot. An algorithm can watch thousands of machines and spot the one where the load or the behavior is out of the ordinary.
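To illustrate how an algorithm might pick one odd machine out of a large fleet, the sketch below scores each host against the fleet's median using the median absolute deviation, a measure that a single extreme value cannot distort much. The host names, load figures and the 3.5 cutoff are hypothetical, and real platforms may use quite different models.

```python
from statistics import median


def odd_machines(load_by_host, threshold=3.5):
    """Return hosts whose load is far from the fleet median (robust z-score)."""
    values = list(load_by_host.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    if mad == 0:
        # Every typical host reports the same value, so any difference stands out.
        return sorted(h for h, v in load_by_host.items() if v != center)
    return sorted(
        h for h, v in load_by_host.items()
        if 0.6745 * abs(v - center) / mad > threshold
    )


# Hypothetical fleet of 1,000 servers with CPU load between 40% and 46%.
fleet = {f"web-{i:03d}": 40 + (i % 7) for i in range(1000)}
fleet["web-512"] = 97  # one machine behaving out of the ordinary
print(odd_machines(fleet))  # ['web-512']
```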

AIops algorithms will also adapt with time. The models can be trained and retrained as the workloads shift. This can be useful because some attacks rely upon reactivating older software that is no longer used. For instance, the models can spot that some access mechanisms aren’t in common use and flag them.
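Spotting rarely used access mechanisms can be as simple as counting. The sketch below assumes the access log has already been reduced to a list of mechanism names. The names themselves, and the 1% cutoff, are made up for the example.

```python
from collections import Counter


def rarely_used(access_log, max_share=0.01):
    """Return access mechanisms that account for at most `max_share` of all requests."""
    counts = Counter(access_log)
    total = sum(counts.values())
    return sorted(m for m, c in counts.items() if c / total <= max_share)


# Hypothetical log window: one entry per request, naming the mechanism used.
log = ["oauth"] * 900 + ["api_key"] * 95 + ["legacy_ftp"] * 5
print(rarely_used(log))  # ['legacy_ftp']
```

Rerunning the count on each new window of logs is one simple way to let the picture adapt as workloads shift.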

How are the major enterprises handling AIops?

The dominant cloud and service providers all have regular services for exploring and deploying AI. The services began simply, but as users have begun relying upon AI algorithms for production work, the companies have been expanding their services to also maintain datasets and models as necessary.

The dominant players are also adding special hardware configurations aimed at delivering AI solutions as cheaply as possible. Some are building custom hardware that can speed up processing, often dramatically.

Amazon, for example, developed a custom chip called Inferentia to speed up AI deployments. The chip is optimized for applying a model to the current set of data, a step that is often done many more times than training. The Inferentia is said to be 70% cheaper than using one of AWS’s regular GPU-enabled instances.

IBM has added AIops to its Cloud Pak for Watson, so the software supports continual delivery of AI-based decisions. The tool helps the team monitoring the AI watch for anomalies and adverse incidents. Intelligent Root Cause Analysis is designed so that the company can understand why decisions are being made, either correctly or incorrectly.

Google maintains a line of specialized chips for ML, called Tensor Processing Units (TPUs), that can offer faster speeds and lower costs for AIops. The company also created a platform called TensorFlow Enterprise to support teams that are using the TensorFlow open-source software in production work. The tool helps teams both explore the power of the algorithms and deploy work quickly to hardware in Google’s cloud.

Microsoft has integrated its AI solutions with many of its products. It’s not uncommon to find that the simplest way to work with AI is as a feature of one of its web tools like Dynamics 365, a business management platform. The company is also planning the best solutions for continual delivery of ML solutions with tools like Gandalf, a system that integrates testing with deployment so rollouts of new models and software are safe and curated.

Nvidia, the major manufacturer of graphics processing units, also supports many cloud options for training and deploying AI models through its CUDA architecture. The company continues to support all clouds that are using Nvidia hardware with a collection of tools like Launchpad.

What about AIops startups?

Many of the companies that specialize in devops and ITops also support AI algorithms. The same mechanisms that can detect a failed database or an overloaded server can also detect a problematic server that’s executing an AI routine. Good operations tools can solve many problems that confound AI.

Companies like New Relic, Datadog, Splunk, PagerDuty, BigPanda, Turbonomic and Dynatrace are just a few of the leading firms that help track the performance of servers and software. They create event logs from an IT stack and make them available in an easily accessible, often graphical, format. Their dashboards and other tools work well for tracking performance.

AIops D is a startup designed to roll out microservices that may rely on AI to automate some of their goals. The company, started by Deloitte, also offers consulting services to help create the right microservices to tackle business needs. The goal is to produce a set of largely automated services that handle all of the business processes.

Companies like Databricks and DataRobot are building clouds that gather data and then apply the best AI algorithms to create models. They began as data warehouses or data lakes and evolved to support sophisticated analysis.

Is there anything that AIops can’t do?

AIops platforms tackle a variety of problems but they are only as good as their data. If the data is noisy, inaccurate or full of gaps, the analysis will be less accurate and sometimes completely wrong. Much of the work begins before analysis, when the data is collected.

Analyzing events that are unusual can be a challenge. In some cases, AIops platforms are just tasked with flagging anomalous events. In these cases, strange patterns that don’t match the historical data are easy to identify.

But in other cases, the AIops platform is expected to create predictions about the future. In these cases, strange or unusual events can produce wrong results. If the AI model is built from the record and it learns how to behave by studying the past, then a new, unusual event will be something it can’t handle because it has no context or history for guidance.
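One modest safeguard, sketched below, is to check whether a new observation falls outside anything seen in the historical record before trusting a prediction built from that record. This is only an illustration of the limitation described above. The function name `outside_experience`, the traffic figures and the 10% margin are assumptions.

```python
def outside_experience(history, value, margin=0.1):
    """Return True if `value` lies beyond the range the model was built from."""
    low, high = min(history), max(history)
    slack = (high - low) * margin  # tolerate small excursions past the observed range
    return value < low - slack or value > high + slack


# Hypothetical history of requests per minute, between 800 and 1,200.
history = [800 + 10 * (i % 41) for i in range(1000)]
print(outside_experience(history, 1_000))  # False: well within past behavior
print(outside_experience(history, 9_000))  # True: nothing like this in the record
```

A check like this does not make the forecast any better. It only signals when the model is being asked about a situation it has no history for.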

When the AIops platform helps manage AI models and data gathering, the work of AIops can only support the AI algorithms by making it easier to create new models. It can’t make the algorithms more accurate. AIops can just handle the housekeeping chores.
