VAST Data allies with Nvidia on reference storage architecture for AI

VAST Data and Nvidia today unveiled a reference architecture that promises to accelerate storage performance when servers based on graphics processing units (GPUs) access petabytes of data.

The reference architecture is intended to make it simpler to integrate VAST Data's all-flash storage system, dubbed Lightspeed, with DGX A100 servers from Nvidia. That effort should yield more than 170GB/s of throughput for both GPU-intensive and storage-intensive AI workloads, the companies claim.
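To put the claimed figure in perspective, the short Python sketch below estimates how long one full pass over a dataset would take at a sustained 170GB/s. It is a back-of-the-envelope illustration, not a benchmark: the function name `scan_time_seconds`, the 1PB dataset size, and the use of decimal units (1GB = 10^9 bytes) are assumptions made here, and real workloads dominated by small files may not sustain the headline rate.

```python
# Rough estimate only: assumes the claimed throughput is sustained for the
# whole read and that sizes use decimal units (an assumption, not a vendor spec).

GB = 10**9   # bytes in a decimal gigabyte
PB = 10**15  # bytes in a decimal petabyte

CLAIMED_THROUGHPUT = 170 * GB  # bytes per second, the figure cited by the companies


def scan_time_seconds(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the seconds needed to read dataset_bytes once at a constant rate."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    seconds = scan_time_seconds(1 * PB, CLAIMED_THROUGHPUT)
    print(f"One pass over 1PB: about {seconds / 60:.0f} minutes")
```

Under those assumptions, a single pass over a petabyte takes a little over an hour and a half, which is why sustained throughput matters when GPU servers repeatedly read training data at that scale.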

Network-attached storage (NAS) systems from VAST Data can be connected to Nvidia servers over NFS-over-RDMA, NFS Multipath, or Nvidia GPUDirect Storage interfaces. The systems can also be incorporated within a larger converged storage fabric that might be based on multiple storage protocols.

VAST Data makes use of proprietary erasure coding, containers, Intel Optane memory, data deduplication, compression, and an extension it developed to the Network File System (NFS) that organizations have used for decades to access distributed storage systems. That approach eliminates the need for IT teams to acquire and deploy a storage system based on a parallel file system to run AI workloads. NFS is both easier to deploy and a familiar file system for most existing storage administrators.

The company has thus far raised $180 million to fuel an effort to replace hard disk drives (HDDs) in environments that need to access large amounts of data with sub-millisecond latency. The AI workloads running on Nvidia servers are typically trying to access a lot of small files residing within a storage environment that can easily reach petabyte scale. Those servers become more efficient because all the responsibility for processing storage is offloaded to the Lightspeed platform.

It's not clear how many organizations will deploy servers to train AI models in on-premises IT environments. The bulk of AI models are trained in the cloud because many data science teams don't want to invest in IT infrastructure.

However, there are still plenty of organizations that prefer to retain control of their data for compliance and security reasons. In addition, organizations that have invested in high-performance computing (HPC) systems are looking to run AI workloads that tend to have very different I/O requirements than legacy HPC applications. In fact, one of the reasons Nvidia agreed to acquire Mellanox for $6.9 billion last year was to gain control over the switches required to create storage fabrics spanning multiple servers.

IT organizations will be looking for storage systems capable of simultaneously supporting their existing applications and the AI workloads that are starting to be deployed more frequently in production environments, VAST Data cofounder and CMO Jeff Denworth said.

"Once you build an AI model, you then have to operationalize it," Denworth said.

Competition among providers of storage systems optimized for AI workloads is heating up as the number of AI workloads being deployed steadily increases. Over the course of the next few years, just about every application will be augmented by AI models to one degree or another. The challenge now is determining how best to manage and store all the massive amounts of data those AI models need to consume.
