Deploying software to support the work of an enterprise is an increasingly complex job that’s often referred to as ‘devops.’ When enterprise teams started using artificial intelligence (AI) algorithms to run these operations more efficiently and collaboratively, the term ‘AIops’ emerged for the practice.
AI can help large software installations by watching the software run and flagging any anomalies or instances of poor performance. The software can examine logs and track key metrics, like response time, to evaluate the speed and effectiveness of the code. When the values deviate, the AI can suggest solutions and even implement some of them.
There are several stages to the process:
- Detection or observability: The software absorbs as many metrics and event logs as possible. The focus is generally on poor performance that can affect users directly, like a 404 error or an especially long database query run time. Some systems, though, may watch for other issues like a failed sensor or an overheated device.
- Predictive analytics: After collecting data for some time, AIops software can begin to identify precursors that can often signal an upcoming failure. The AI algorithms are optimized to look for correlations between values, especially those that are anomalies that may indicate upcoming problems.
- Proactive mitigation: Some AIops algorithms can be tuned to respond immediately to potential problems when the solution is straightforward. For example, a crashing service may be rebooted or reinitialized with more RAM. When these solutions work, they can eliminate much of the problem and save end users from encountering failures.
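The three stages above can be illustrated with a minimal sketch. The function and service names here are hypothetical, and a real platform would act on live telemetry rather than a fixed list, but the shape is the same: establish a statistical baseline, flag deviations, and trigger a simple mitigation when one appears.

```python
from statistics import mean, stdev

def detect_anomalies(latencies_ms, threshold=2.0):
    """Flag response times more than `threshold` standard deviations above the mean."""
    mu, sigma = mean(latencies_ms), stdev(latencies_ms)
    return [x for x in latencies_ms if sigma and x > mu + threshold * sigma]

def mitigate(service_name, anomalies):
    """Placeholder mitigation: a real system might restart the service or add RAM."""
    if anomalies:
        print(f"Restarting {service_name}: {len(anomalies)} slow responses detected")

# Example: a mostly normal workload with one very slow database query
latencies = [120, 131, 118, 125, 122, 890, 127, 119]
slow = detect_anomalies(latencies)
mitigate("orders-db", slow)
```

Production systems use far richer models than a z-score, but the detect-then-act loop is the core of proactive mitigation.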
AIops is growing in complexity as teams deploy algorithms to a variety of enterprises. One of the most valuable opportunities comes when organizations start to use other AI algorithms in daily operations. In these cases, AIops can help with deploying AI. This way, there can be synergy between the software layers.
Sometimes AIops teams use other subterms for their work. MLops, for example, deals specifically with using and deploying machine learning algorithms. DataOps can refer to the general problem of collecting data or the more specific problem of organizing the data that’s used to train and refresh an artificial intelligence model.
How can AIops support deployment of AI?
When AI scientists began to explore the best algorithms for AI, they worked with experimental computers in their labs. Now that AI is becoming regularly deployed in production environments, some are beginning to specialize in maintaining and running software.
The challenges of supplying services with AI algorithms are largely the same as those of maintaining regular software. There should be sufficient computational power to answer all requests, even those that arrive together in a moment of peak demand. There should be systems in place to deliver the right versions of the software to the front-line hardware. When developers and scientists make changes, there should be a mechanism for testing them and eventually replacing the software on the front-line machine with the newest version.
Much of the work is no different from standard devops. However, there are also concerns that are particular to AI and machine learning (ML). Some of these include:
- The model is like another piece of software with its own version number and history. The AIops team will juggle models, often independently of the software itself.
- Training the model is often a time-consuming process that requires an elaborate build process of its own.
- There are now different chips that are optimized in different ways for creating a model versus running it in production. AIops teams must choose the best available hardware for each task independently.
- The build process may involve much more experimentation than typical software development. It’s not uncommon for AIops teams to try different arrangements for neural networks and then evaluate how they perform.
- AIops teams may also have a third job of tracking the datasets that are used for training and evaluation. These datasets may also evolve with their own version numbers and history.
- Some applications deliberately feed data back into the training set over time, so the set grows and the results improve. AIops teams must also maintain the evolution of the training data over time.
- Some AI applications require screening results for potential bias. AIops teams can watch the working results for potential problems.
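Several of the bullets above come down to bookkeeping: tracking which model version was trained on which dataset version, along with its evaluation metrics. A hypothetical, minimal registry might look like this (the class and field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    """One deployed model paired with the exact dataset version it was trained on."""
    model_version: str
    dataset_version: str
    metrics: dict = field(default_factory=dict)

class ModelRegistry:
    def __init__(self):
        self.history = []  # full lineage, so either model or data can be rolled back

    def register(self, model_version, dataset_version, **metrics):
        self.history.append(ModelRecord(model_version, dataset_version, metrics))

    def latest(self):
        return self.history[-1] if self.history else None

registry = ModelRegistry()
registry.register("model-1.0", "data-2024.1", accuracy=0.91)
registry.register("model-1.1", "data-2024.2", accuracy=0.93)
print(registry.latest().model_version)  # prints "model-1.1"
```

Keeping model and dataset versions paired in one record is what makes it possible to reproduce a training run or roll back cleanly when a new model underperforms.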
All of these questions and strategies apply in some form to the subsets with names like DataOps, MLops, ModelOps and PlatformOps, because each focuses on a particular part of the work.
Is AIops about AI or IT?
Some companies focus on using AI to improve performance of their servers and databases. They use the term AIops to refer to using AI algorithms to watch for anomalies and, perhaps, predict outages or failures before they happen. The algorithms are good at creating models of expected performance and then creating alerts when the stack starts to perform differently.
The AI algorithms are particularly useful for noticing security failures. They can, for instance, flag large outflows of data that stand out because legitimate users typically download only the small amount of data that fits their needs. Unusual data flows are often indicators of a breach.
Now that AI routines are becoming more common and integrated into all parts of the stack, some firms are asking how they can support the ongoing work specific to AI tools. That is, juggling the datasets, constructing the models, deploying the models and then rotating them to maintain performance.
How can AIops help security?
While many areas of AIops are focused on practical issues of performance like how quickly a server is responding to a request, some are also using AI algorithms to watch for the kind of anomalies that indicate a leak or unauthorized intrusion.
One of the simplest ways that AIops can help with cybersecurity is to watch for large or uncharacteristic outflows of data. If a website is designed to offer small, quick answers containing at most one user’s personal information, then a larger block could signal a problem.
Some areas that AIops may watch are:
- Outflows from servers that don’t normally respond or send packets to machines outside the company.
- Atypical SQL queries that are new or rarely seen.
- Atypical requests for encryption keys.
- Responses that are encrypted even though they normally aren’t or vice versa.
- Unusual load at unusual times. For example, a heavy number of requests in the middle of the night when everyone is normally asleep.
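The first of the checks above, unusual outflows, can be sketched in a few lines. The server names and traffic figures here are invented for illustration; a real deployment would compute baselines from historical telemetry:

```python
def flag_outflows(baseline_bytes, observed_bytes, factor=10):
    """Return servers whose observed outflow exceeds `factor` times their baseline.

    Servers with no baseline at all (they don't normally send data) are
    flagged as soon as any outbound traffic appears.
    """
    return [
        server
        for server, observed in observed_bytes.items()
        if observed > factor * baseline_bytes.get(server, 0)
    ]

# Hypothetical hourly outbound traffic, in bytes
baseline = {"web-1": 50_000_000, "db-1": 2_000_000}
observed = {"web-1": 55_000_000, "db-1": 900_000_000}  # db-1 looks like a leak
print(flag_outflows(baseline, observed))  # prints "['db-1']"
```

The same compare-against-baseline pattern extends to the other checks in the list, such as counting atypical SQL queries or encryption-key requests per interval.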
This approach can be especially useful because security breaches are usually quite rare and difficult for a human to spot. An algorithm can watch thousands of machines and spot the one where the load or the behavior is out of the ordinary.
AIops algorithms will also adapt with time. The models can be trained and retrained as the workloads shift. This can be useful because some attacks rely upon reactivating older software that is no longer used. For instance, the models can spot that some access mechanisms aren’t in common use and flag them.
How are the major enterprises handling AIops?
The dominant cloud and service providers all have regular services for exploring and deploying AI. The services began simply, but as users have begun relying upon AI algorithms for production work, the companies have been expanding their services to also offer maintaining datasets and models as necessary.
The dominant players are also adding special hardware configurations aimed at delivering AI solutions as cheaply as possible. Some are building custom hardware that can speed up processing, often dramatically.
Amazon, for example, developed a custom chip called Inferentia to speed up AI deployments. The chip is optimized for applying a model to the current set of data, a step that is often done many more times than training. The Inferentia is said to be 70% cheaper than using one of AWS’s regular GPU-enabled instances.
IBM has added AIops to its Cloud Pak for Watson, so the software supports continual delivery of AI-based decisions. The tool helps the team monitoring the AI watch for anomalies and adverse incidents. Intelligent Root Cause Analysis is designed so that the company can understand why decisions are being made, either correctly or incorrectly.
Google maintains a line of specialized chips for ML that it calls TPUs, or Tensor Processing Units, which can offer faster speeds and lower costs for AIops. It also created a platform called TensorFlow Enterprise to support teams that are using the TensorFlow open-source software in production work. The tool helps teams both explore the power of the algorithms and deploy work quickly to hardware in Google’s cloud.
Microsoft has integrated its AI solutions with many of its products. It’s not uncommon to find that the simplest way to work with AI is as a feature of one of its web tools like Dynamics 365, a business management platform. The company is also planning the best solutions for continual delivery of ML solutions with tools like Gandalf, a system that integrates testing with deployment so rollouts of new models and software are safe and curated.
Nvidia, the major manufacturer of graphics processing units, also supports many cloud options for training and deploying AI models through its CUDA architecture. The company continues to support all clouds that are using Nvidia hardware with a collection of tools like Launchpad.
What about AIops startups?
Many of the companies that specialize in devops and ITops also support AI algorithms as well. The same mechanisms that can detect a failed database or an overloaded server can also detect a problematic server that’s executing an AI routine. Good operations tools can solve many problems that confound AI.
Companies like NewRelic, DataDog, Splunk, PagerDuty, BigPanda, Turbonomic and DynaTrace are just a few of the leading firms that help track the performance of servers and software. They create event logs from an IT stack and make them available in an easily accessible, often graphical, format. Their dashboards and other tools work well for tracking performance.
AIops D, a startup begun by Deloitte, is designed to roll out microservices that may rely on AI to automate some of their goals. The company also offers consulting services to help create the right microservices to tackle business needs. The goal is to produce a set of largely automated services that handle all of the business processes.
Companies like Databricks and DataRobot are building clouds that gather data and then apply the best AI algorithms to create models. They began as data warehouses or data lakes and evolved to support sophisticated analysis.
Is there anything that AIops can’t do?
AIops platforms tackle a variety of problems but they are only as good as their data. If the data is noisy, inaccurate or full of gaps, the analysis will be less accurate and sometimes completely wrong. Much of the work begins before analysis, when the data is collected.
Analyzing events that are unusual can be a challenge. In some cases, AIops platforms are just tasked with flagging anomalous events. In these cases, strange patterns that don’t match the historical data are easy to identify.
But in other cases, the AIops platform is expected to create predictions about the future. In these cases, strange or unusual events can produce wrong results. If the AI model is built from the historical record and learns how to behave by studying the past, then a new, unusual event is something it can’t handle because it has no context or history for guidance.
When an AIops platform helps manage AI models and data gathering, it can only support the AI algorithms by making it easier to create new models. It can’t make the algorithms more accurate. AIops just handles the housekeeping chores.