Foundation models have the potential to change the way organizations build artificial intelligence (AI) and train with machine learning (ML).

A key challenge for building foundation models is that, to date, they have generally required the use of specific types of networking and infrastructure hardware to run efficiently. There has also been limited support for developers wanting to build a foundation model with an entirely open-source stack. It’s a challenge that IBM Research is looking to help solve in a number of ways.

“Our question was, can we train foundation models but train it in such a way that we are doing it on commodity hardware? And make it more accessible rather than just be in the hands of a few select researchers,” Raghu Ganti, principal research staff member at IBM, told VentureBeat.


To that end, IBM announced today that it has developed and contributed code to the open-source PyTorch machine learning project to enable the technology to work more efficiently with commodity ethernet-based networking. IBM has also built an open-source operator that helps to optimize the deployment of PyTorch on the Red Hat OpenShift platform, which is based on the open-source Kubernetes cloud container orchestration project.

To infinity and beyond: how IBM helped to extend PyTorch 

To date, many foundation models have been trained on hardware that supports the InfiniBand networking stack, which is typically found only in high-performance computing (HPC) systems.

GPUs are the workhorses of AI training, but getting multiple GPUs to work together requires high-performance networking. Ganti explained that it is possible to train large models without InfiniBand networking, but it is inefficient in a number of ways.

For example, he said that with the default PyTorch technology, training an 11-billion-parameter model over an ethernet-based network achieved only 20% GPU efficiency. Improving that efficiency is what IBM did alongside the PyTorch community.

“This is a very complex problem and there are many knobs to tune,” Ganti said. 

The knobs that need to be tuned all affect how well the GPUs and the network are utilized. Ganti said that the goal is to keep both the network and the GPU busy at the same time to accelerate the overall training process.
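To make the idea of keeping the network and the GPU busy at the same time concrete, here is a minimal toy model in Python. It is not IBM's or PyTorch's code: the function names, the `overlap` parameter and the timings are hypothetical, chosen only to show how hiding communication behind computation raises GPU efficiency.

```python
def step_time(compute_s: float, comm_s: float, overlap: float) -> float:
    """Seconds for one training step in a toy model.

    overlap = 0.0 means the GPU sits idle during all network traffic;
    overlap = 1.0 means as much traffic as possible is hidden behind compute.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    # At most min(compute, comm) seconds of traffic can run concurrently.
    hidden = overlap * min(compute_s, comm_s)
    return compute_s + comm_s - hidden


def gpu_efficiency(compute_s: float, comm_s: float, overlap: float) -> float:
    """Fraction of each step the GPU spends doing useful compute."""
    return compute_s / step_time(compute_s, comm_s, overlap)


if __name__ == "__main__":
    # Hypothetical timings: 1 s of compute, 4 s of network traffic per step.
    print(f"no overlap:   {gpu_efficiency(1.0, 4.0, 0.0):.0%}")
    # Hypothetical tuned case: traffic cut to 1 s and fully overlapped.
    print(f"full overlap: {gpu_efficiency(1.0, 1.0, 1.0):.0%}")
```

In this sketch, efficiency climbs only when communication time shrinks, when it is hidden behind computation, or both, which is why both the network and the GPU settings matter.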

The code that optimizes PyTorch for ethernet was merged into PyTorch 1.13, which became generally available on Oct. 28.

“We were able to go from 20% GPU utilization all the way to 90%, and that’s like a 4.5x improvement in terms of training speeds,” Ganti said.
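The 4.5x figure follows directly from the two utilization numbers, assuming training throughput scales linearly with GPU utilization. The short Python sketch below checks that arithmetic; the function name and the 100-hour baseline job are hypothetical, used only for illustration.

```python
def speedup_from_utilization(before: float, after: float) -> float:
    """Speedup implied by a utilization change, assuming throughput
    is proportional to GPU utilization (a simplifying assumption)."""
    if before <= 0 or after <= 0:
        raise ValueError("utilization values must be positive")
    return after / before


if __name__ == "__main__":
    speedup = speedup_from_utilization(0.20, 0.90)
    print(f"{speedup:.1f}x")  # prints 4.5x

    # Hypothetical job that takes 100 hours at 20% utilization.
    baseline_hours = 100.0
    print(f"{baseline_hours / speedup:.1f} hours")  # prints 22.2 hours
```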

Shifting PyTorch into high gear for faster training

In addition to the code improvements in PyTorch, IBM has also worked to enable the open-source Red Hat OpenShift Kubernetes platform to support the development of foundation models.

Ganti said part of what they’ve done is ensure that whatever maximum bandwidth the ethernet network can provide is exposed at the pod level in OpenShift. 

The use of Kubernetes to train foundation models isn’t a new idea. OpenAI, the organization behind some of the most widely used models, including GPT-3 and DALL-E, has publicly discussed how it uses Kubernetes. What IBM claims is new is that the technology to do so is now available as open source. IBM has open-sourced a Kubernetes operator that provides the necessary configuration to help organizations scale a cluster to support large model training.

With the PyTorch Foundation, more open-source innovation is now possible

Until September, PyTorch had been operated as an open-source project managed by Meta. That changed on Sept. 12, when the PyTorch Foundation was announced as a new organizing body run by the Linux Foundation.

Ganti said the IBM effort to contribute code to PyTorch began before the announcement of the new PyTorch Foundation. He explained that under Meta’s governance, IBM couldn’t directly commit code to the project. Instead, the code had to be committed by Meta staffers who had commit access.

Ganti expects that under the Linux Foundation’s guidance, PyTorch will become more collaborative and open. “I think it [PyTorch Foundation] will improve open-source collaboration,” Ganti said.
