Domino accelerates MLOps with new Nvidia integrations

Domino Data Lab announced new integrations with Nvidia this week to make it easier to adopt AI infrastructure, scale GPU clusters, run more virtual workloads on high-end GPUs, and package AI apps into container infrastructure.

Domino's tools streamline the grunt work associated with building out AI and ML applications. Domino automatically spins up workspaces or models on shared infrastructure so that many people can use the same hardware. When someone is finished with a workload, Domino spins down that workspace to free up the resources for someone else. Domino also tracks usage, letting IT administrators see consumption and make informed decisions about when to increase computing power.

Gartner considers AI orchestration tooling that includes MLOps to be a key trend in 2021.

Easier GPU clustering

Domino currently supports ephemeral clusters built on Apache Spark and Ray, and the company plans to add support for Dask this fall. Domino strategic partnerships VP Thomas Robinson told VentureBeat that Spark has traditionally excelled at large-scale data processing and transformations, Ray has simplified distributed training and hyperparameter optimization, and Dask has excellent integration with the commonly used Pandas and NumPy libraries.

Domino also made it easier to provision the GPU clusters needed for AI training jobs that use more than one Nvidia GPU. Traditionally, it has been difficult and time-consuming to set up machines, ensure network connectivity, and install the proper libraries. In addition, it is uncommon for enterprises to give data scientists access and permission to manipulate infrastructure directly. As a result, teams often leave clusters idle between larger projects rather than reallocate the individual machines for smaller projects.

To improve utilization rates, Domino makes it possible to spin up and spin down interactive sessions, batch jobs, or models hosted on Nvidia DGX infrastructure, allowing multiple concurrent and consecutive sessions. Previously, users depended on email and spreadsheets to coordinate workloads, which was inefficient.

Domino will add support for Nvidia's multi-instance GPU (MIG) technology in September. MIG allows a single GPU to be sliced up into smaller portions (7 slices per GPU for each of the 8 GPUs in a DGX A100 -- a total of 56 slices). This will make it possible to divide the capacity of a larger GPU server or cluster into multiple instances or partitions to host many more predictive models on smaller GPU instances. While many deep learning training workloads require a whole machine or multiple machines in a cluster, research or inference (prediction) workloads are much less GPU-intensive.
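The slice arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 7 x 8 = 56 figure and of handing one slice to each notebook session; the names `Slice`, `enumerate_slices`, and `assign_sessions` are hypothetical and do not come from Domino's or Nvidia's software.

```python
from dataclasses import dataclass

# Figures quoted in the article for a DGX A100.
SLICES_PER_GPU = 7
GPUS_PER_DGX_A100 = 8


@dataclass
class Slice:
    """One MIG partition, identified by its GPU and its position on that GPU."""
    gpu_index: int
    slice_index: int


def enumerate_slices(gpus=GPUS_PER_DGX_A100, slices_per_gpu=SLICES_PER_GPU):
    """List every slice available on one server."""
    return [Slice(g, s) for g in range(gpus) for s in range(slices_per_gpu)]


def assign_sessions(session_names, slices):
    """Give each session one slice, in order; sessions beyond capacity wait."""
    placed = dict(zip(session_names, slices))
    waiting = session_names[len(slices):]
    return placed, waiting


slices = enumerate_slices()
print(len(slices))  # 56

# Hypothetical workload: 60 notebook sessions competing for 56 slices.
placed, waiting = assign_sessions([f"nb-{i}" for i in range(60)], slices)
print(len(placed), len(waiting))  # 56 4
```

In this toy model, a server that would otherwise host 8 whole-GPU sessions can host 56 smaller ones, which is the utilization gain the article describes for notebook and inference work.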

"By allowing the GPU to be portioned into pieces, you can have more researchers doing discovery work in notebooks on smaller GPU slices," Robinson said.

Added container support

Domino also announced immediate support for Nvidia's new NGC container registry service. This makes it easier to package vetted application and configuration settings into container instances that bake in best practices. This means a data scientist doesn't have to spend time figuring out how to set up and install all the drivers and tools they need. It also allows organizations to standardize on these containers.

NGC currently supports RAPIDS, TensorFlow, PyTorch, and CUDA. Domino additionally supports containers for SAS, MATLAB, Amazon SageMaker, and private container repositories.

Finally, Domino worked with Nvidia and NetApp to develop a preconfigured hardware/software package called the ONTAP AI Integration Solution. "This is a specced, tested, and verified packaging of everything you need to accelerate your data science work -- so there's no guesswork and no setup needed for an IT department," Robinson said.
