Harnessing the power of machine learning with MLOps

MLOps, a compound of "machine learning" and "information technology operations," is a newer discipline involving collaboration between data scientists and IT professionals with the aim of productizing machine learning algorithms. The market for such solutions could grow from a nascent $350 million to $4 billion by 2025, according to Cognilytica. But certain nuances can make implementing MLOps a challenge. A survey by NewVantage Partners found that only 15% of leading enterprises have deployed AI capabilities into production at any scale.

Still, the business value of MLOps can't be ignored. A robust data strategy enables enterprises to respond to changing circumstances, in part by frequently building and testing machine learning technologies and releasing them into production. MLOps essentially aims to build on established operational practices and extend them to manage the unique challenges of machine learning.

What is MLOps?

MLOps, which was born at the intersection of DevOps, data engineering, and machine learning, is similar to DevOps but differs in execution. MLOps combines different skill sets: those of data scientists, who specialize in algorithms, mathematics, simulations, and developer tools, and those of operations administrators, who focus on tasks like upgrades, production deployments, resource and data management, and security.

One goal of MLOps is to roll out new models and algorithms seamlessly, without incurring downtime. Because production data can change due to unexpected events, and machine learning models respond well only to scenarios they have seen before, frequent retraining -- or even continuous online training -- can make the difference between an optimal and suboptimal prediction.
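As a rough illustration of how a pipeline might decide that retraining is due, the sketch below compares recent production values of a single feature against the training baseline. The function name `needs_retraining`, the 0.5 standard-deviation threshold, and the sample numbers are all assumptions made for this example rather than part of any particular MLOps product. Real systems typically rely on richer statistical tests.

```python
from statistics import mean, stdev


def needs_retraining(training_values, production_values, threshold=0.5):
    """Flag drift when the production mean moves more than `threshold`
    training standard deviations away from the training mean.

    Hypothetical helper: the name and default threshold are illustrative.
    """
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)  # needs at least two values
    if baseline_std == 0:
        # A constant training feature: any change in the mean counts as drift.
        return mean(production_values) != baseline_mean
    shift = abs(mean(production_values) - baseline_mean) / baseline_std
    return shift > threshold


# Made-up feature values, for illustration only.
training = [10, 11, 9, 10, 12, 8]
print(needs_retraining(training, [10, 11, 9, 10]))   # production looks like training
print(needs_retraining(training, [14, 15, 13, 16]))  # production has shifted upward
```

A check like this would usually run on a schedule and trigger the retraining pipeline rather than retrain the model itself.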

A typical MLOps software stack might span data sources and the datasets created from them, as well as a repository of AI models tagged with their histories and attributes. Organizations with MLOps operations might also have automated pipelines that manage datasets, models, experiments, and software containers -- typically based on Kubernetes -- to make running these jobs simpler.

At Nvidia, developers running jobs on internal infrastructure must perform checks to guarantee they're adhering to MLOps best practices. First, everything must run in a container to consolidate the libraries and runtimes necessary for AI apps. Jobs must also launch containers with an approved mechanism, run across multiple servers, and show performance data to expose potential bottlenecks.

Another company embracing MLOps, software startup GreenStream, incorporates code dependency management and machine learning model testing into its development workflows. GreenStream automates model training and evaluation and leverages a consistent method of deploying and serving each model while keeping humans in the loop.

Implementing MLOps

Given all the elements involved with MLOps, it isn't surprising that companies adopting it often run into roadblocks. Data scientists have to tune many moving parts -- like hyperparameters, parameters, and models -- while managing the codebase for reproducible results. They also need to engage in model validation, in addition to conventional code tests, including unit testing and integration testing. And they have to use a multistep pipeline to retrain and deploy a model -- particularly if there's a risk of reduced performance.
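To make the distinction between code tests and model validation concrete, here is a minimal sketch of a validation gate that a pipeline could run before promoting a retrained model. The helper names, the 0.8 accuracy floor, and the 0.02 allowed regression are hypothetical values chosen for illustration. A real gate would likely use metrics suited to the task and a held-out dataset.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_validation(candidate_preds, production_preds, labels,
                      min_accuracy=0.8, max_regression=0.02):
    """Hypothetical gate: the candidate must clear an absolute accuracy floor
    and must not fall more than `max_regression` below the production model."""
    candidate_score = accuracy(candidate_preds, labels)
    production_score = accuracy(production_preds, labels)
    return (candidate_score >= min_accuracy
            and candidate_score >= production_score - max_regression)


# Made-up held-out labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
production_preds = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]  # two mistakes
candidate_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # one mistake
print(passes_validation(candidate_preds, production_preds, labels))
```

Unlike a unit test, this check depends on data as well as code, which is why it usually lives in the retraining pipeline rather than in the ordinary test suite.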

When formulating an MLOps strategy, it helps to begin by deriving machine learning objectives from business growth objectives. These objectives, which typically come in the form of KPIs, can have certain performance measures, budgets, technical requirements, and so on. From there, organizations can work toward identifying input data and the kinds of models to use for that data. This is followed by data preparation and processing, which includes tasks like cleaning data and selecting relevant features (i.e., the variables used by the model to make predictions).
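The data preparation step can be pictured with a small sketch. The example below drops records with missing required values and then discards candidate features that never vary, since a constant column cannot help a model distinguish one case from another. The function names, field names, and records are invented for illustration, and real feature selection is usually far more involved.

```python
def clean_rows(rows, required):
    """Keep only rows in which every required field is present and not None."""
    return [row for row in rows
            if all(row.get(field) is not None for field in required)]


def select_varying_features(rows, candidates):
    """Keep candidate features that take more than one distinct value."""
    return [feature for feature in candidates
            if len({row[feature] for row in rows}) > 1]


# Hypothetical customer records, for illustration only.
rows = [
    {"age": 34, "country": "US", "plan": "pro", "churned": 0},
    {"age": None, "country": "US", "plan": "free", "churned": 1},
    {"age": 51, "country": "US", "plan": "free", "churned": 1},
    {"age": 27, "country": "US", "plan": "pro", "churned": 0},
]

cleaned = clean_rows(rows, required=["age", "country", "plan"])
features = select_varying_features(cleaned, ["age", "country", "plan"])
print(len(cleaned), features)
```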

The importance of data selection and prep can't be overstated. In a recent Alation survey, a clear majority of employees pegged data quality issues as the reason their organizations failed to successfully implement AI and machine learning. Eighty-seven percent of professionals said inherent biases in the data being used in their AI systems produce discriminatory results that create compliance risks for their organizations.

At this stage, MLOps extends to model training and experimentation. Capabilities like version control can help keep track of data and model qualities as they change throughout testing, and can help scale models across distributed architectures. Once machine learning pipelines are built and automated, deployment into production can proceed, followed by the monitoring, optimization, and maintenance of models.
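One simple way to picture version tracking for experiments is to fingerprint the dataset and the hyperparameters and store both alongside the resulting metrics. The sketch below does this with content hashes held in an in-memory list. The `fingerprint` and `record_experiment` helpers and the registry structure are assumptions made for this example. Dedicated experiment-tracking and data-versioning tools offer much more.

```python
import hashlib
import json


def fingerprint(obj):
    """Short, deterministic hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_experiment(registry, dataset_rows, hyperparameters, metrics):
    """Append one experiment entry (hypothetical schema) to the registry."""
    entry = {
        "data_version": fingerprint(dataset_rows),
        "params_version": fingerprint(hyperparameters),
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }
    registry.append(entry)
    return entry


# Made-up data, settings, and metrics, for illustration only.
registry = []
data = [{"x": 1.0, "y": 0}, {"x": 2.5, "y": 1}]
first = record_experiment(registry, data, {"learning_rate": 0.1}, {"accuracy": 0.81})
second = record_experiment(registry, data, {"learning_rate": 0.01}, {"accuracy": 0.84})

# Same data, different settings: the data version matches, the params version differs.
print(first["data_version"] == second["data_version"])
print(first["params_version"] == second["params_version"])
```

Because the identifiers are derived from content, two runs can be compared later without relying on anyone remembering which file or setting was used.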

A critical part of monitoring models is governance, which here means adding control measures to ensure the models deliver on their responsibilities. A study by Capgemini found that customers and employees will reward organizations that practice ethical AI with greater loyalty, more business, and even a willingness to advocate for them -- and will punish those that don't. The study suggests companies that don't approach the issue thoughtfully can incur both reputational risk and a direct hit to their bottom line.

The benefits of MLOps

In sum, MLOps applies to the entire machine learning lifecycle, including data gathering, model creation, orchestration, deployment, health, diagnostics, governance, and business metrics. If successfully executed, MLOps can bring business interest to the fore of AI projects while allowing data scientists to work with clear direction and measurable benchmarks.

Enterprises that ignore MLOps do so at their own peril. There's a shortage of data scientists skilled at developing apps, and it's hard to keep up with evolving business objectives -- a challenge exacerbated by communication gaps. According to a 2019 IDC survey, skills shortages and unrealistic expectations from the C-suite are the top reasons for failure in machine learning projects. In 2018, Element AI estimated that of the 22,000 Ph.D.-educated researchers working globally on AI development and research, only 25% are "well-versed enough in the technology to work with teams to take it from research to application."

There's also the fact that models frequently drift away from what they were intended to accomplish. Assessing the risk of these failures as a part of MLOps is a key step, not only for regulatory purposes but also to protect against business impacts. For example, the cost of an inaccurate video recommendation on YouTube would be much lower than the cost of flagging an innocent person for fraud and blocking their account or declining their loan application.

The advantage of MLOps is that it puts operations teams at the forefront of best practices within an organization. The bottleneck created by putting machine learning algorithms into production eases with a smarter division of expertise and closer collaboration between operations and data teams, and MLOps tightens that loop.
