This article is part of the Technology Insight series, made possible with funding from Intel.
Most discussions of AI infrastructure start and end with compute hardware — the GPUs, general-purpose CPUs, FPGAs, and tensor processing units responsible for training complex algorithms and making predictions based on those models. But AI also demands a lot from your storage. Keeping a potent compute engine well-utilized requires feeding it with vast amounts of information as fast as possible. Anything less and you clog the works and create bottlenecks.
Optimizing an AI solution for capacity and cost, while scaling for growth, means taking a fresh look at its data pipeline. Are you ready to ingest petabytes worth of legacy, IoT, and sensor data? Do your servers have the read/write bandwidth for data preparation? Are they ready for the randomized access patterns involved in training?
Answering those questions now will help determine your organization’s AI-readiness. So, let’s break down the various stages of an AI workload and explain the role your data pipeline plays along the way.
- The volume, velocity, and variety of data coursing through the AI pipeline changes at every stage.
- Building a storage infrastructure able to satisfy the pipeline’s capacity and performance requirements is difficult.
- Lean on modern interfaces (like NVMe), flash, and other non-volatile memory technologies, and disaggregated architectures to scale effectively.
It begins with lots of data and ends with predictions
AI is driven by data — lots and lots of data. The average factory creates 1TB of the stuff every day, but analyzes and acts upon less than 1% of it. Right out of the gate, then, an AI infrastructure must be structured to take in massive amounts of data, even if it’s not all used for training neural networks. “Data sets can arrive in the pipeline as petabytes, move into training as gigabytes of structured and semi-structured data, and complete their journey as trained models in the kilobyte size,” noted Roger Corell, storage marketing manager at Intel.
The first stage of an AI workload, ingestion, involves collecting data from a variety of sources, typically at the edge. Sometimes that information is pulled onto a centralized high-capacity data lake for preparation. Or it might be routed to a high-performance storage tier with an eye to real-time analytics. Either way, the task is characterized by a high volume of large and small files written sequentially.
The next step, data preparation, involves processing and formatting raw information in a way that makes it useful for subsequent stages. Maximizing data quality is the preparation phase’s primary purpose. Capacity is still critical. However, the workload evolves to become a mix of random reads and writes, making I/O performance an important consideration as well.
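To make the preparation stage concrete, here is a minimal Python sketch of the kind of cleaning and normalization that turns raw records into training-ready data. This is an illustration, not anything from Intel; the field names `id` and `reading` and the min-max scaling are invented for the example:

```python
def prepare(raw_rows):
    """Clean raw sensor rows: drop incomplete records, then
    min-max normalize the readings to [0, 1] for training."""
    cleaned = [(str(r["id"]), float(r["reading"]))
               for r in raw_rows
               if r.get("id") and r.get("reading") not in (None, "")]
    if not cleaned:
        return []
    lo = min(v for _, v in cleaned)
    hi = max(v for _, v in cleaned)
    span = (hi - lo) or 1.0          # avoid divide-by-zero on flat data
    return [(i, (v - lo) / span) for i, v in cleaned]

raw = [{"id": "a", "reading": "4.0"},
       {"id": "",  "reading": "9.9"},   # incomplete record -> dropped
       {"id": "b", "reading": "8.0"}]
print(prepare(raw))  # [('a', 0.0), ('b', 1.0)]
```

Real pipelines do this at scale with frameworks like Spark, but the storage behavior is the same: many reads of raw records interleaved with writes of cleaned output, which is what makes the workload a random read/write mix.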
Structured data is then fed into a neural network for the purpose of creating a trained model. A training dataset might contain millions of examples of whatever it is the model is learning to identify. The process is iterative, too. A model can be tested for accuracy and then retrained to improve its performance. Once a neural network is trained, it can be deployed to make predictions based on data it has never seen before—a process referred to as inferencing.
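The iterative train-then-infer cycle can be sketched in a few lines. This toy example (a one-feature logistic model; the dataset and hyperparameters are invented for illustration) fits a model on labeled samples, then makes predictions on values it has never seen:

```python
import math

def train(data, epochs=200, lr=0.5):
    """Iteratively fit a one-feature logistic model sigmoid(w*x + b)
    by stochastic gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x    # gradient step for the weight
            b -= lr * (p - y)        # gradient step for the bias
    return w, b

def infer(model, x):
    """Predict a label for data the model has never seen."""
    w, b = model
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

# Toy training set: label 1 when the reading exceeds roughly 0.5.
train_set = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
model = train(train_set)
print(infer(model, 0.15), infer(model, 0.85))  # expect: 0 1
```

If accuracy on a held-out test set is too low, the same `train` loop simply runs again with more data or more epochs, which is the retraining cycle described above.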
Training and inferencing are compute-intensive tasks that beg for massively parallel processors. Keeping those resources fed requires streams of small files read from storage. Access latency, response time, throughput, and data caching all come into play.
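One common way to keep a parallel processor fed is to overlap many small reads instead of fetching files one at a time. A minimal Python sketch, assuming the training samples live as individual small files on fast storage (the file layout here is fabricated for the demo):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    with open(path, "rb") as f:
        return f.read()

def stream_samples(paths, workers=8):
    """Overlap many small reads so the compute engine is never
    idle waiting on a single synchronous file fetch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(read_file, paths)

# Demo: build a directory of 16 small "training sample" files.
tmp = tempfile.mkdtemp()
paths = []
for i in range(16):
    p = os.path.join(tmp, f"sample_{i}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 64)
    paths.append(p)

data = list(stream_samples(paths))
print(len(data), len(data[0]))  # 16 64
```

Production loaders (for example, framework-native data-loading pipelines) add prefetching and caching on top of the same idea, which is why low access latency per small read matters so much.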
Be flexible to support AI’s novel requirements at each stage
At each stage of the AI pipeline, your storage infrastructure is asked to do something different. There is no one-size-fits-all recipe for success, so your best bet is to lean on storage technologies and interfaces with the right performance today, a roadmap into the future, and an ability to scale as your needs change.
For instance, hard disks might seem like an inexpensive answer to the ingestion stage’s capacity requirements. But they aren’t ideal for scaling performance or reliability. Even Serial ATA (SATA) SSDs are bottlenecked by their storage interface. Drives based on the Non-Volatile Memory Express (NVMe) interface, which are attached to the PCI Express (PCIe) bus, deliver much higher throughput and lower latency.
NVMe storage can take many shapes. Add-in cards are popular, as is the familiar 2.5” form factor. Increasingly, though, the Enterprise & Datacenter SSD Form Factor (EDSFF) makes it possible to build dense storage servers filled with fast flash memory for just this purpose.
Standardizing on PCIe-attached storage makes sense at other points along the AI pipeline, too. The data preparation stage’s need for high throughput, random I/O, and lots of capacity is satisfied by all-flash arrays that balance cost and performance. Meanwhile, the training and inference phases require low latency and excellent random I/O. Enterprise-oriented flash or Optane SSDs would be best for keeping compute resources fully utilized.
Growing with your data
An AI infrastructure built for today’s needs will inevitably have to grow as data volumes swell and models become more complex. Beyond using modern devices and protocols, the right architecture helps ensure performance and capacity scale together.
In a traditional aggregated configuration, scaling is achieved by adding identical compute servers, each with its own local flash storage. Keeping storage close to the processors is meant to prevent bottlenecks caused by mechanical disks and older interfaces. But because each server is limited to its own storage, it must reach out to wherever the prepared data lives once the training dataset outgrows local capacity. As a result, it takes longer to serve trained models and start inferencing.
Efficient protocols like NVMe make it possible to disaggregate, or separate, storage and still maintain the low latencies needed by AI. At the 2019 Storage Developer Conference, Dr. Sanhita Sarkar, global director of analytics software development at Western Digital, gave multiple examples of disaggregated data pipelines for AI, which included pools of GPU compute, shared pools of NVMe-based flash storage, and object storage for source data or archival, any of which could be expanded independently.
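On Linux, attaching a shared NVMe-over-Fabrics flash pool to a compute node can look like the following `nvme-cli` sketch. This is a config fragment, not a recipe from the article: the address, port, and subsystem NQN are placeholders, and the exact steps depend on the fabric (TCP, RDMA, or Fibre Channel) and how the storage target is configured.

```shell
# Discover NVMe-oF subsystems exported by a remote storage target
# (192.0.2.10 and the NQN below are placeholder values).
nvme discover -t tcp -a 192.0.2.10 -s 4420

# Attach one subsystem; it appears locally as /dev/nvmeXnY.
nvme connect -t tcp -a 192.0.2.10 -s 4420 \
     -n nqn.2019-09.example.com:shared-flash-pool

# Verify the remote namespace is now visible as a local block device.
nvme list
```

Because the remote namespace behaves like a local NVMe drive, the storage pool and the compute pool can each be expanded independently, which is the point of the disaggregated designs described above.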
There’s not a moment to lose
If you aren’t already evaluating your AI readiness, it’s time to play catch-up. McKinsey’s latest global survey indicated a 25% year-over-year increase in the number of companies using AI for at least one process or product. Forty-four percent of respondents said AI has already helped reduce costs. “If you are a CIO and your organization doesn’t use AI, chances are high that your competitors do, and this should be a concern,” added Chris Howard, Gartner VP.