How IBM’s new supercomputer is making AI foundation models more enterprise-budget friendly

Foundation models are changing how artificial intelligence (AI) and machine learning (ML) can be used. All that power comes at a cost, though: building AI foundation models is a resource-intensive task.

IBM announced today that it has built its own AI supercomputer to serve as the literal foundation for its foundation model–training research and development initiatives. Named Vela, it’s been designed as a cloud-native system that uses industry-standard hardware, including x86 silicon, Nvidia GPUs and Ethernet-based networking.

The software stack that enables the foundation model training makes use of a series of open-source technologies including Kubernetes, PyTorch and Ray. While IBM is only now officially revealing the existence of the Vela system, it has actually been online in various capacities since May 2022.

"We really think this technology concept around foundation models has huge, tremendous disruptive potential," Talia Gershon, director of hybrid cloud infrastructure research at IBM, told VentureBeat. "So, as a division and as a company, we're investing heavily in this technology."

The AI- and budget-friendly foundation inside Vela

IBM is no stranger to the world of high-performance computing (HPC) and supercomputers. One of the fastest supercomputers on the planet today is Summit, which IBM built and which is currently deployed at Oak Ridge National Laboratory.

The Vela system, however, isn’t like other supercomputer systems that IBM has built to date. For starters, the Vela system is optimized for AI and uses x86 commodity hardware, as opposed to the more exotic (and expensive) equipment typically found in HPC systems.

Unlike Summit, which uses the IBM Power processor, each Vela node has a pair of Intel Xeon Scalable processors. IBM is also loading up on Nvidia GPUs, with each node in the supercomputer packed with eight 80GB A100 GPUs. In terms of connectivity, each compute node is connected via multiple 100-gigabit-per-second Ethernet network interfaces.
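The per-node figures lend themselves to some back-of-the-envelope arithmetic. The Python sketch below is only an illustration: the eight GPUs, 80GB per GPU and 100 gigabits per second per interface come from IBM's description, while the number of network interfaces per node (`nics_per_node`) and any node count passed to `cluster_gpu_memory_gb` are hypothetical placeholders, since neither figure is given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    gpus_per_node: int = 8    # from the article: eight A100 GPUs per node
    gpu_memory_gb: int = 80   # from the article: 80GB per GPU
    nic_gbps: int = 100       # from the article: 100 Gbps per interface
    nics_per_node: int = 2    # ASSUMPTION: the article says only "multiple"

    @property
    def gpu_memory_per_node_gb(self) -> int:
        """Total GPU memory on one node, in GB."""
        return self.gpus_per_node * self.gpu_memory_gb

    @property
    def network_gbps_per_node(self) -> int:
        """Aggregate Ethernet bandwidth on one node, in Gbps."""
        return self.nic_gbps * self.nics_per_node


def cluster_gpu_memory_gb(spec: NodeSpec, node_count: int) -> int:
    """Total GPU memory across a hypothetical number of nodes, in GB."""
    return spec.gpu_memory_per_node_gb * node_count


node = NodeSpec()
print(node.gpu_memory_per_node_gb)  # 640
```

Under these numbers, a single node carries 640GB of GPU memory; the aggregate network figure depends entirely on the assumed interface count.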

Vela has also been purpose-built to be cloud native, meaning it runs application workloads in containers orchestrated by Kubernetes. More specifically, Vela relies on Red Hat OpenShift, which is Red Hat's Kubernetes platform. Vela has also been optimized to run PyTorch for ML training and uses Ray to help scale workloads.

IBM has also built a new workload-scheduling system for its cloud-native supercomputer. For many of its HPC systems, IBM has long used its own Spectrum LSF (load-sharing facility) for scheduling, but Vela does not use it. Instead, IBM has developed a new scheduler called MCAD (multicluster app dispatcher) to handle cloud-native job scheduling for foundation model AI training.
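To give a rough sense of what a job dispatcher for GPU training does, here is a toy Python sketch. It is not MCAD's API or algorithm, neither of which is described here; the `Job` and `ToyDispatcher` names and the strict first-in, first-out policy are assumptions made purely for illustration. The idea shown is that a training job waits in a queue until every GPU it requests is free, and only then starts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int  # total GPUs the job needs before it can start


class ToyDispatcher:
    """Hypothetical FIFO queue that starts a job only when all its GPUs are free."""

    def __init__(self, total_gpus: int) -> None:
        self.free_gpus = total_gpus
        self.pending: deque[Job] = deque()
        self.running: dict[str, Job] = {}

    def submit(self, job: Job) -> list[str]:
        """Queue a job and return the names of any jobs started as a result."""
        self.pending.append(job)
        return self._dispatch()

    def finish(self, name: str) -> list[str]:
        """Release a finished job's GPUs and start whatever now fits."""
        job = self.running.pop(name)
        self.free_gpus += job.gpus
        return self._dispatch()

    def _dispatch(self) -> list[str]:
        started = []
        # Strict FIFO: stop at the first queued job that does not fit,
        # so a large job at the head is never starved by smaller ones.
        while self.pending and self.pending[0].gpus <= self.free_gpus:
            job = self.pending.popleft()
            self.free_gpus -= job.gpus
            self.running[job.name] = job
            started.append(job.name)
        return started
```

With 16 GPUs, submitting an 8-GPU job starts it immediately, while a 16-GPU job submitted next waits until the first one finishes. A production scheduler would also weigh priorities, quotas and multiple clusters, none of which this sketch attempts.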

IBM's growing foundation model portfolio

All that hardware and software that IBM put together for Vela is already being used to support IBM's foundation model efforts.

"All of our foundation models’ research and development are all running cloud native on that stack on the Vela system and IBM Cloud," Gershon said.

Just last week, IBM announced a partnership with NASA to help build out foundation models for climate science. IBM is also working on a foundation model called MoLFormer-XL for life sciences that can help create new molecules in the future.

The foundation model work also extends to enterprise IT with the Project Wisdom effort that was announced in October 2022. Project Wisdom is being developed in support of the Red Hat Ansible IT configuration technology. Typically, IT system configuration can be a complicated exercise that requires domain knowledge to do properly. Project Wisdom aims to bring a natural language interface to Ansible, whereby users will simply type in what they want and the foundation model will understand and then help execute the desired task.

Gershon also hinted at a new IBM foundation model for cybersecurity that has not yet been publicly detailed and is being developed using the Vela supercomputer.

"We haven't said much about it externally, I think on purpose," Gershon said about the foundation model for cybersecurity. "We do believe this technology is going to be transformational in terms of detecting threats."

While IBM is building out a portfolio of foundation models, it does not intend to compete directly with well-known general-purpose foundation models such as OpenAI's GPT-3.

"We are not focused on necessarily building general AI, whereas maybe some other players kind of state that more as the goal," Gershon said. "We're interested in foundation models because we think that it has tremendous business value for enterprise use cases."