Open-source Ray 2.4 upgrade speeds up generative AI model deployment

The open source Ray machine learning (ML) technology for deploying and scaling AI workloads is taking a big step forward today with the release of version 2.4. The new release takes specific aim at accelerating generative AI workloads.

Ray, which benefits from a broad community of open-source contributions, as well as the support of lead commercial vendor Anyscale, is among the most widely used technologies in the ML space. OpenAI, the vendor behind GPT-4 and ChatGPT, relies on Ray to help scale up its machine learning training workloads and technology. Ray isn't just for training; it's also broadly deployed for AI inference.

The Ray 2.x branch first debuted in August 2022 and has been steadily improved in the months since, including the Ray 2.2 release, which focused on observability.

With Ray 2.4, the focus is squarely on generative AI workloads, with new capabilities that provide a faster path for users to get started building and deploying models. The new release also integrates with models from Hugging Face, including GPT-J for text and Stable Diffusion for image generation.

"Ray is basically providing the open-source infrastructure for managing the LLM [large language model] and generative AI life cycle for training, batch inference, deployment and the productization of these workloads," Robert Nishihara, cofounder and CEO of Anyscale, told VentureBeat. "If you want everyone in every business to be able to integrate AI into their products, it's about lowering the barrier to entry, reducing the level of expertise that you need to build all the infrastructure."

How Ray 2.4 is generating new workflows for generative AI

Ray 2.4 lowers the barrier to building and deploying generative AI with a new set of prebuilt scripts and configurations.

Rather than configuring and scripting each type of generative AI deployment manually, Nishihara said, Ray 2.4 users will be able to get up and running out of the box.

"This is providing a very simple starting point for people to get started," he said. "They're still going to want to modify it and bring their own data, but they will have a working starting point that is already getting good performance."

Nishihara was quick to note that what Ray 2.4 provides is more than just configuration management. A common way to deploy many types of technologies today is with infrastructure-as-code tooling such as Terraform or Ansible. He explained that the goal is not just to configure and set up the cluster so that a generative AI model can run; with Ray 2.4, the goal is to provide runnable code for training and deploying an LLM. Functionally, Ray 2.4 provides a set of Python scripts that a user would otherwise have needed to write on their own in order to deploy a generative AI model.

"The experience you want developers to have is, it's like one click and then you have an LLM behind some endpoint and it works," he said.

The Ray 2.4 release is targeting a specific set of generative AI integrations using open-source models on Hugging Face. Among the integrated model use cases is GPT-J, which is a small-scale text-generation model. There is also an integration for fine-tuning the DreamBooth image-generation model, as well as supporting inference for the Stable Diffusion image model. Additionally, Ray 2.4 provides integration with the increasingly popular LangChain tool, which is used to help build complex AI applications that use multiple models.

A main feature of Ray is the Ray AI Runtime (AIR), which helps users scale ML workflows. Among the AIR components is one called a trainer, which (not surprisingly) is designed for training. With Ray 2.4, there is a series of new integrated trainers for ML-training frameworks, including ones for Hugging Face Accelerate and DeepSpeed, as well as PyTorch Lightning.

Performance optimizations in Ray 2.4 accelerate training and inference

A series of code optimizations were made in Ray 2.4 that help boost performance. One of these is the handling of array data, which is a way that data is stored and processed. Nishihara explained that the common approach for handling data for AI training or inference is to have multiple, disparate stages where data is first processed, and then operations such as training or inference are executed. The challenge is that the pipeline for executing those stages can introduce some latency where compute and GPU resources are not being fully utilized.

With Ray 2.4, instead of processing data in stages, Nishihara said the technology now streams and pipelines the data such that it all fits into memory at the same time. In addition, to keep the overall utilization as high as possible, there are optimizations for preloading some data onto GPUs.
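To make the contrast between staged and streamed processing concrete, here is a minimal sketch in plain Python. It does not use Ray's API; the `preprocess` and `infer` functions, the batch size and the prefetch depth are all hypothetical stand-ins chosen for illustration. The staged version preprocesses the whole dataset before any inference starts, while the streamed version uses a producer thread and a bounded queue so that preprocessing of the next batch overlaps with inference on the current one.

```python
import queue
import threading


def preprocess(record):
    # Hypothetical stand-in for CPU-side preprocessing.
    return record * 2


def infer(batch):
    # Hypothetical stand-in for a GPU-side model call.
    return [x + 1 for x in batch]


def staged(records, batch_size):
    # Staged approach: finish all preprocessing, then start inference.
    processed = [preprocess(r) for r in records]  # whole dataset held at once
    results = []
    for i in range(0, len(processed), batch_size):
        results.extend(infer(processed[i:i + batch_size]))
    return results


def streamed(records, batch_size, prefetch=2):
    # Streamed approach: a producer thread fills a bounded queue with
    # preprocessed batches while the consumer runs inference on them.
    batches = queue.Queue(maxsize=prefetch)  # bounds how far ahead we prepare
    done = object()  # sentinel marking the end of the stream

    def producer():
        batch = []
        for r in records:
            batch.append(preprocess(r))
            if len(batch) == batch_size:
                batches.put(batch)  # blocks when the queue is full
                batch = []
        if batch:
            batches.put(batch)  # final partial batch
        batches.put(done)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = batches.get()
        if batch is done:
            break
        results.extend(infer(batch))
    return results
```

Both functions return the same results. The difference is that the streamed version holds at most a few batches at any moment, and the consumer does not wait for the full dataset to be prepared before it starts work.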

It's not just about keeping GPUs busy; it's also about keeping CPUs busy.

"The processing you're doing should run on CPUs and some of the processing you're doing should run on GPUs," Nishihara said. "You want to keep everything busy and scaled on both dimensions. That's something that Ray is uniquely good at and that is hard to do otherwise."
