Run:AI integrates GPU optimization tool with MLOps platforms

Run:AI today announced it has added support for both MLflow, an open source tool for managing the lifecycle of machine learning algorithms, and Kubeflow, an open source framework for machine learning operations (MLOps) deployed on Kubernetes clusters, to its namesake tool for graphics processing unit (GPU) resource optimization. The company also revealed that it has added support for Apache Airflow, open source software that can be employed to programmatically create, schedule, and monitor workflows.

The overall goal is to enable both GPU optimization and the training of AI models from within an MLOps platform, Run:AI CEO Omri Geller told VentureBeat. "It can be managed more end-to-end," he said.

While some organizations have standardized on a single MLOps platform, others have multiple data science teams that have decided to employ different MLOps platforms. But all of the data science projects usually still share access to a limited number of GPU resources that today are among the most expensive infrastructure resources being consumed within an enterprise IT environment.

GPU optimization is just the start

IT teams have been optimizing infrastructure resources for decades. GPUs are simply the latest in a series of infrastructure resources that need to be shared by multiple applications and projects. The issue is that enterprise IT teams have in place plenty of tools to manage CPUs, but those tools were not designed to manage GPUs.

Previously, Run:AI gave IT teams two ways to manage GPU resources: a graphical user interface dubbed ResearcherUI or a command line interface (CLI). Now either an enterprise IT team or the data science team itself can manage GPU resources directly from within the platforms they are also employing to manage MLOps, Geller added.

Run:AI dynamically allocates limited GPU resources to multiple data science jobs based on policies defined by an organization. These policies create quotas for different projects in a way that maximizes utilization of GPUs. Organizations can also create logical fractions of GPUs or execute jobs across multiple GPUs or nodes. The Run:AI platform itself uses Kubernetes to orchestrate the running of jobs across multiple GPUs.
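
The quota idea described above can be illustrated with a small sketch. This is not Run:AI's scheduler or API; it is a hypothetical two-pass allocator, written in Python, that assumes each project has a guaranteed quota and that idle capacity is lent to projects asking for more than their quota. The names `Project` and `allocate`, the project names, and the numbers are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical project record; not part of any Run:AI API."""
    name: str
    quota: float   # GPUs guaranteed by policy (fractions allowed)
    demand: float  # GPUs the project's jobs are currently requesting


def allocate(projects, total_gpus):
    """Return {project name: GPUs granted} for one scheduling round."""
    if sum(p.quota for p in projects) > total_gpus:
        raise ValueError("quotas exceed the GPUs available")

    # Pass 1: every project receives its demand, capped at its quota.
    granted = {p.name: min(p.demand, p.quota) for p in projects}
    spare = total_gpus - sum(granted.values())

    # Pass 2: lend idle GPUs to projects that want more than their quota,
    # in list order (an assumed, deliberately simple priority rule).
    for p in projects:
        extra = min(p.demand - granted[p.name], spare)
        granted[p.name] += extra
        spare -= extra
    return granted


projects = [
    Project("vision", quota=4.0, demand=6.0),
    Project("nlp", quota=3.0, demand=1.0),
    Project("notebooks", quota=1.0, demand=0.5),  # a fractional GPU
]
print(allocate(projects, total_gpus=8.0))
```

In this made-up round, the "vision" project exceeds its quota by borrowing GPUs the other two projects are not using, which is the sense in which quotas can raise overall utilization rather than cap it. A production scheduler would also have to reclaim borrowed GPUs when their owners need them back, which this sketch omits.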

IT infrastructure optimization

It's not clear to what degree data science teams are managing IT infrastructure themselves versus relying on IT teams to manage those resources on their behalf. However, as the number of AI projects within enterprise IT environments continues to multiply, contention for GPU resources will only increase. Organizations will need to be able to dynamically prioritize which projects will have access to GPU resources based on both availability and cost.

In the meantime, two distinct data science and IT operations cultures are starting to converge. The hope is that if data science teams spend less time on tasks like data engineering and managing infrastructure, they will be able to increase the rate at which AI models are created and successfully deployed in production environments. Achieving that goal requires relying more on IT operations teams to handle many of the lower-level tasks that many data science teams currently perform. The challenge is that the culture of the average data science team tends to differ from the culture of IT operations teams, which are usually focused on efficiency.

One way or another, however, it's only a matter of time before traditional IT operations teams start to exercise more control over MLOps. Most data scientists would ultimately prefer to see that happen, given their general lack of IT expertise. The issue they will need to come to terms with is that IT operations teams tend to ruthlessly implement best practices in a way that doesn't always leave much room for exceptions to an established rule.
