University of Pisa leans into the I/O challenge AI applications create

At a time when workloads that employ machine and deep learning algorithms are being built and deployed more frequently, organizations need to optimize I/O throughput in a way that enables those workloads to cost-effectively share the expensive GPU resources used to train AI models. Case in point: the University of Pisa, which has been steadily expanding the number of GPUs it makes accessible to AI researchers in a green datacenter optimized for high-performance computing (HPC) applications.

The challenge the university has encountered as it deploys AI is that machine learning and deep learning algorithms tend to make more frequent I/O requests to a larger number of smaller files than traditional HPC applications, University of Pisa CTO Maurizio Davini said. To accommodate that, the university has deployed NVMesh software from Excelero that can access more than 140,000 small files per second on Nvidia DGX A100 GPU servers.
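To make the "many small files" access pattern concrete, here is a minimal Python sketch that writes a batch of small files to a temporary directory and measures how many can be read per second. It is only an illustration, not the university's or Excelero's benchmark. The function names, file count, and 4 KB file size are assumptions chosen for the example, and freshly written files are usually served from the operating system's cache, so the result will not reflect real storage performance.

```python
import os
import tempfile
import time


def write_small_files(directory, count, size_bytes=4096):
    """Create `count` files of `size_bytes` random bytes each (sizes are illustrative)."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"sample_{i:06d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths


def read_files_per_second(paths):
    """Read every file once and return (files_read, files_per_second)."""
    start = time.perf_counter()
    files_read = 0
    for path in paths:
        # Each small file costs its own open/read/close round trip.
        with open(path, "rb") as f:
            f.read()
        files_read += 1
    elapsed = time.perf_counter() - start
    rate = files_read / elapsed if elapsed > 0 else float("inf")
    return files_read, rate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_small_files(tmp, count=2000)
        n, rate = read_files_per_second(paths)
        print(f"read {n} files at {rate:,.0f} files/s")
```

Because every file requires a separate open, read, and close, the per-request overhead of the storage layer tends to dominate this kind of workload rather than raw bandwidth.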

While Davini said he generally views AI applications as just another type of HPC workload, the way AI workloads access compute and storage resources requires a specialized approach. The NVMesh software addresses that approach by offloading the increasingly frequent I/O requests, freeing up additional compute resources on the Nvidia servers for training AI models, Davini said.

"We wanted to provide our AI researchers with a better experience," he said.

Excelero is among a bevy of companies that are moving to address the I/O challenges IT teams will encounter when trying to make massive amounts of data available to AI models. As the number of AI models that organizations build and maintain starts to grow, legacy storage systems can't keep pace. The University of Pisa deployed Excelero's NVMesh software to make sure the overall IT experience of its AI researchers remains satisfactory, Davini said.

Of course, more efficient approaches to managing I/O only begin to solve the data management issues that organizations building their own AI models will encounter. IT teams have tended to manage data as an extension of the application employed to create it. That approach is the primary reason there are so many data silos strewn across the enterprise.

Even more problematic is the fact that much of the data in those silos conflicts, because different applications might have rendered a company name differently or might not have been updated with the most recent transaction data. Having a single source of truth about a customer or event at any specific moment in time remains elusive.
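The company-name problem can be illustrated with a short Python sketch that collapses different renderings of the same name to one key so that records from separate silos can be grouped. This is a simplified assumption about how such normalization might work, not a method described by anyone quoted here. The suffix list, the function names, and the `company` field are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "ltd", "limited", "llc",
}


def normalize_company_name(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_by_normalized_name(records):
    """Group records (dicts with a 'company' field) by normalized name."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_company_name(record["company"])].append(record)
    return dict(groups)


if __name__ == "__main__":
    records = [
        {"source": "crm", "company": "Acme, Inc."},
        {"source": "billing", "company": "ACME Corporation"},
        {"source": "support", "company": "Acme Corp."},
    ]
    for key, rows in group_by_normalized_name(records).items():
        print(key, [r["source"] for r in rows])
```

Real data would need more care, such as handling abbreviations and misspellings, but the sketch shows why the same customer can appear as several distinct entities until the data is normalized.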

AI models, however, require massive amounts of accurate data to be trained properly. Otherwise, the models will generate recommendations based on inaccurate assumptions because the data the machine learning algorithms were exposed to was either inconsistent or unreliable. IT organizations are addressing that issue by first investing heavily in massive data lakes to normalize all their data and then applying DataOps best practices, as outlined in a manifesto that describes how to automate as many data preparation and management tasks as possible.

Legacy approaches to managing data that rely on manual copy-and-paste processes are one of the primary reasons it takes so long to build an AI model. Data science teams are lucky if they can roll out two AI models a year. Cloud service providers like Amazon Web Services (AWS) offer products such as Amazon SageMaker to automate the construction of AI models, which should increase the rate at which AI models are created in the months ahead.

But not every organization will commit to building AI models in the cloud. Doing so requires storing data on an external platform, which creates a range of potential compliance issues organizations might rather avoid. The University of Pisa, for example, finds it easier to convince officials to allocate budget to a local datacenter than to give permission to access an external cloud, Davini noted.

Ultimately, the goal is to eliminate the data management friction that has long been a plague on IT by adopting a set of DataOps processes that are similar in nature to the DevOps best practices widely employed to streamline application development and deployment. However, all the best practices in the world won't make much of a difference if the underlying storage platform is simply too slow to keep up.
