Ray 2.2 boosts machine learning observability and scalability performance

Ray, the popular open-source machine learning (ML) framework, has released version 2.2 with improved performance and observability capabilities, as well as features that help enable reproducibility.

The Ray technology is widely used by organizations to scale ML models across clusters of hardware, for both training and inference. Among Ray's many users is generative AI pioneer OpenAI, which uses Ray to scale and enable a variety of workloads, including supporting ChatGPT. The lead commercial sponsor behind the Ray open-source technology is San Francisco-based Anyscale, which has raised $259 million in funding to date.

The new Ray 2.2 release continues to build out a series of capabilities first introduced in the Ray 2.0 update in August 2022, including the Ray AI Runtime (AIR), which is designed to serve as a runtime layer for executing ML services. With the new release, the Ray Jobs feature is moving from beta to general availability, giving users the ability to more easily schedule and repeat ML workloads.

Ray 2.2 also provides a series of capabilities intended to improve observability of ML workloads, so that data scientists can ensure efficient use of hardware computing resources.

"One of the most common and challenging things about scaling machine learning applications is debugging, which is basically figuring out what went wrong," Robert Nishihara, cofounder and CEO of Anyscale, told VentureBeat. "One of the most important things we can do with Ray is to improve the tooling around observability."

Where observability matters for scaling AI/ML workloads

Ray fits into a number of common use cases for helping organizations scale artificial intelligence (AI) and ML workloads.

Nishihara explained that Ray is commonly used to help scale up and run training workloads for ML models. He noted that Ray is also used for AI inference workloads, including computer vision and natural language processing (NLP), where lots of images or text are being identified.

Increasingly, organizations are using Ray for multiple workloads at the same time, which is where Ray AIR fits in, providing a common layer for ML services. Nishihara said that with Ray 2.2, AIR benefits from performance improvements that will help accelerate training and inference.

Ray 2.2 also has a strong focus on improving observability for all types of running workloads. The observability enhancements in Ray 2.2 are all about making sure that every workload has the right amount of resources to run. Nishihara said that one of the biggest classes of errors that ML workloads encounter is running out of resources, such as CPU or GPU memory. Among the ways that Ray 2.2 improves observability into resource-related issues is with new visualizations on the Ray Dashboard that help operators better understand resource utilization and capacity limits.

How Ray Jobs will give AI reproducibility and explainability a boost

The Ray 2.2 release also includes the general availability of the Ray Jobs feature, which helps users deploy workloads in a consistent and repeatable way.

Nishihara explained that a Ray Job includes both the actual application code for the workload and a manifest file that describes the required environment. The manifest lists all the details needed to run a workload, such as the application code and the dependencies the environment needs in order to execute the training or inference operation.
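
To make the idea concrete, here is a minimal Python sketch of what such a job description might look like. In Ray's Python SDK, the environment description is passed as a runtime_env dictionary alongside the entrypoint command. The script name train.py, the package list, the helper names build_job_spec and submit, and the cluster address are assumptions chosen for illustration and do not come from Anyscale.

```python
from typing import Any, Dict, List


def build_job_spec(entrypoint: str, working_dir: str, pip_packages: List[str]) -> Dict[str, Any]:
    """Bundle the command to run with a description of its environment."""
    return {
        "entrypoint": entrypoint,  # shell command that starts the workload
        "runtime_env": {
            "working_dir": working_dir,  # directory holding the application code
            "pip": list(pip_packages),   # Python dependencies to install for the job
        },
    }


def submit(spec: Dict[str, Any], address: str = "http://127.0.0.1:8265") -> str:
    """Submit the spec to a running Ray cluster and return the job ID."""
    # Imported lazily so the spec can be built and inspected without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(
        entrypoint=spec["entrypoint"],
        runtime_env=spec["runtime_env"],
    )


# Hypothetical workload: a training script with two dependencies.
spec = build_job_spec(
    entrypoint="python train.py",
    working_dir="./",
    pip_packages=["torch", "pandas"],
)
```

Calling submit(spec) against a running cluster would launch the job. Because the spec is plain data, it can be kept in version control and resubmitted unchanged, which is the property that makes a run repeatable.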

The ability to easily define the requirements for how an AI/ML workload should run is a key part of enabling reproducibility, which is what Ray Jobs is supporting. Reproducibility is also a foundational element of enabling explainability, according to Nishihara.

"You need reproducibility to be able to do anything meaningful with explainability," Nishihara said.

He noted that generally, when people talk about explainability, they're talking about being able to interpret what an ML model is actually doing, such as why a model reached a certain decision.

"You need a strong experimental setup to be able to start to ask these questions, and that includes reproducibility," he said.
