IBM propels PyTorch beyond model training into AI inference

The open source PyTorch machine learning (ML) framework is widely used today for AI training, but that's not all it can do. IBM sees broader applicability for PyTorch and is working on a series of development initiatives that will see PyTorch used for inferencing.

In an exclusive interview with VentureBeat, Raghu Ganti, principal research staff member at IBM, detailed new research efforts that enable PyTorch to become a more viable enterprise option for inference. The market for inference software today has multiple players, with perhaps none larger than Nvidia's Triton Inference Server. IBM's goal with its PyTorch research is not necessarily to displace other technologies, but to provide a new open source alternative for inference that will run on multiple vendor technologies, as well as on both GPUs and CPUs.

PyTorch is an open source project originally started by Meta (formerly Facebook) that moved to an open governance model at the Linux Foundation with the launch of the PyTorch Foundation in September 2022. IBM is an active member of the PyTorch Foundation and, with its new research, is looking to help advance enterprise deployment of the open source technology.

"Much of the community has been looking at PyTorch as a way to train models," Ganti told VentureBeat. "Training is only one part of the problem, right? I trained a model, hurray I have the best model there, but how do you actually put it in the hands of clients and that's a long journey and every millisecond that you can shave off on these large models is going to accumulate in terms of the cost for putting these models in production."

How IBM is helping to accelerate PyTorch for inference

The requirements of inference differ somewhat from those of training: inference needs more speed and lower latency to enable rapid responses.

"Typically, when you're measuring inference, the metric that you use is median latency for a given prompt sequence length," Ganti said.
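
As a rough illustration of the metric Ganti describes, the sketch below times several generation runs at a fixed prompt length and reports the median latency per generated token. It uses only the Python standard library; `fake_generate` is a hypothetical stand-in for a real model call, and the run count and token counts are arbitrary assumptions rather than anything IBM has published.

```python
import statistics
import time


def fake_generate(prompt_tokens, new_tokens):
    # Hypothetical stand-in for a model call: returns `new_tokens` dummy token ids.
    return [0] * new_tokens


def median_latency_ms_per_token(generate, prompt_len, new_tokens=16, runs=5):
    """Median per-token latency in milliseconds for one prompt sequence length."""
    prompt = [1] * prompt_len  # dummy prompt of the requested length
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        output = generate(prompt, new_tokens)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        samples.append(elapsed_ms / len(output))
    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(samples)


latency = median_latency_ms_per_token(fake_generate, prompt_len=128)
print(f"median latency: {latency:.6f} ms/token")
```

In a real benchmark, `generate` would wrap the model under test, and the measurement would be repeated for each prompt length of interest.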

The IBM team is combining three techniques within PyTorch - graph fusion, kernel optimizations, and parallel tensors - to achieve faster inference speeds. Using these combined optimizations on PyTorch nightly builds, the IBM researchers were able to achieve inference speeds of 29 milliseconds per token on a 100 GPU system for a large language model with 70 billion parameters.

The three techniques that IBM is using to accelerate inference are all about removing process bottlenecks and improving access to memory. Ganti noted that a common performance slowdown for AI occurs when a process needs to go back and forth from a CPU to a GPU. Graph fusion is a capability that reduces the volume of communications needed between the CPU and GPU to help accelerate inference. Ganti explained that kernel optimization in PyTorch is all about streamlining attention computation by optimizing memory access for inference, which helps to provide better performance.
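
The idea behind fusion can be shown with a toy model in plain Python. This is not PyTorch's compiler or IBM's code: the "launch log" is a hypothetical device that counts how many separate operations a host would have to dispatch. The unfused version dispatches three operations and builds two intermediate lists, while the fused version does the same arithmetic in one pass.

```python
def run_unfused(xs, launch_log):
    # Three separate operations, each dispatched on its own.
    launch_log.append("scale")
    scaled = [x * 2.0 for x in xs]
    launch_log.append("shift")
    shifted = [x + 1.0 for x in scaled]
    launch_log.append("relu")
    return [max(x, 0.0) for x in shifted]


def run_fused(xs, launch_log):
    # One combined operation: same math, one dispatch, no intermediates.
    launch_log.append("scale_shift_relu")
    return [max(x * 2.0 + 1.0, 0.0) for x in xs]


unfused_log, fused_log = [], []
data = [-2.0, -0.5, 0.0, 1.5]
assert run_unfused(data, unfused_log) == run_fused(data, fused_log)
print(len(unfused_log), "dispatches unfused vs", len(fused_log), "fused")
```

On real hardware the saving comes from fewer round trips between the CPU and GPU and fewer reads and writes of intermediate results, which is the bottleneck Ganti describes.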

The third technique that IBM is using to improve PyTorch inference is known as parallel tensors, which is also about memory improvement. Ganti said that today's large language models (LLMs) are typically too large to fit on a single GPU, so they run across multiple GPUs. Parallel tensors work with the graph fusion and kernel optimizations to help accelerate inference.
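
One common way to spread a model across GPUs is to split a layer's weight matrix so that each device holds and computes only a slice. The sketch below shows that idea in plain Python, assuming a column-wise split and using ordinary lists in place of GPU tensors; the function names are hypothetical and the article does not say which sharding scheme IBM uses.

```python
def matvec(x, weight_cols):
    # x: input vector; weight_cols: list of weight columns, one per output value.
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in weight_cols]


def shard_columns(weight_cols, num_devices):
    # Give each "device" a contiguous slice of the output columns.
    size = (len(weight_cols) + num_devices - 1) // num_devices
    return [weight_cols[i:i + size] for i in range(0, len(weight_cols), size)]


def tensor_parallel_matvec(x, weight_cols, num_devices):
    shards = shard_columns(weight_cols, num_devices)
    # On real hardware each partial result would be computed on its own GPU.
    partials = [matvec(x, shard) for shard in shards]
    # Concatenating the partial outputs stands in for the gather step.
    return [y for part in partials for y in part]


x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
assert tensor_parallel_matvec(x, weights, num_devices=2) == matvec(x, weights)
```

Each device stores only its own slice of the weights, which is what allows a model that does not fit on one GPU to run at all.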

PyTorch 2.1 is coming

Ganti emphasized that IBM's efforts to accelerate PyTorch for inferencing are not yet ready for production deployment.

Some of the optimizations that IBM is using to improve inference are based on capabilities in the current PyTorch nightly releases that will become more widely available in the upcoming PyTorch 2.1 update that is set to debut later this month. IBM also has a lot of new code that isn't yet part of the open source project, though Ganti said that IBM’s goal is to contribute the inference optimization capabilities and get the code merged into the mainline project.

Looking forward, IBM is also working on another capability, known as dynamic batching, to help scale out PyTorch's inference capabilities for enterprise deployments. Ganti explained that dynamic batching is a technique for improving GPU utilization for model inference. It involves dynamically grouping together multiple inference requests or "prompts" that come in concurrently and processing them as a batch on the GPU, rather than individually. This allows the GPU to be utilized more efficiently, since a single user's inference requests typically place only a light load on it.
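
A minimal sketch of the batching idea, in plain Python, might look like the following. The `DynamicBatcher` class, its `max_batch_size` parameter and the `run_batch` callback are all hypothetical names chosen for illustration; production batchers usually also wait a short time for more requests to arrive, which is left out here.

```python
from collections import deque


class DynamicBatcher:
    def __init__(self, run_batch, max_batch_size=4):
        self.run_batch = run_batch          # callable: list of prompts -> list of outputs
        self.max_batch_size = max_batch_size
        self.pending = deque()              # requests waiting to be processed

    def submit(self, prompt):
        self.pending.append(prompt)

    def step(self):
        # Pull up to max_batch_size waiting prompts and run them in one model call.
        batch = []
        while self.pending and len(batch) < self.max_batch_size:
            batch.append(self.pending.popleft())
        if not batch:
            return []
        return list(zip(batch, self.run_batch(batch)))


batch_sizes = []

def fake_model(prompts):
    # Hypothetical stand-in for a batched model call.
    batch_sizes.append(len(prompts))
    return [p.upper() for p in prompts]

batcher = DynamicBatcher(fake_model, max_batch_size=2)
for prompt in ["a", "b", "c"]:
    batcher.submit(prompt)
print(batcher.step(), batcher.step(), batch_sizes)
```

Three concurrent prompts are served with two model calls instead of three, which is the utilization gain the technique is after.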

"From our perspective making PyTorch really enterprise ready is key," Ganti said.
