Explosive growth in AI compute shows enterprises must get smart about strategy

Artificial intelligence research organization OpenAI recently released a report showing that the amount of compute used in the largest machine learning training runs has grown 300,000-fold since 2012. Because machine learning results tend to improve with additional computing resources, demand for silicon infrastructure to drive better results will likely keep climbing.
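
To put that figure in perspective, here is a rough back-of-the-envelope sketch in Python that converts a 300,000-fold increase into a number of doublings and an implied doubling time. The 66-month span is an assumption made for illustration (roughly 2012 to early 2018), not a number taken from the report, and the function names are hypothetical.

```python
import math


def doublings(growth_factor: float) -> float:
    """Return how many times a quantity must double to grow by growth_factor."""
    return math.log2(growth_factor)


def implied_doubling_time(growth_factor: float, span_months: float) -> float:
    """Return the average months per doubling over the given span."""
    return span_months / doublings(growth_factor)


GROWTH_FACTOR = 300_000  # figure cited from the OpenAI report
SPAN_MONTHS = 66         # ASSUMPTION: about 5.5 years, chosen for illustration

if __name__ == "__main__":
    n = doublings(GROWTH_FACTOR)
    months = implied_doubling_time(GROWTH_FACTOR, SPAN_MONTHS)
    print(f"{n:.1f} doublings, or about one every {months:.1f} months")
```

Under that assumed span, compute demand doubles every few months, which is far faster than the hardware refresh cycle most enterprises plan around.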

Enterprises are increasingly using machine learning to tackle complex problems and automate analytical tasks. But OpenAI’s research points to a key challenge ahead: How can enterprises build the infrastructure they need to produce the business results they want when the technical requirements keep changing?

Keep it simple

First off, enterprises should try to find the least complicated algorithm necessary to solve the business problem at hand. While massively complicated neural networks are the rage in the machine learning field right now, plenty of problems are easily addressed by less-involved techniques. For example, data science competitions on Kaggle often are won by gradient-boosted decision trees rather than deep neural networks.

Sure, some companies have problems that are best solved by more complex models, but the key is to minimize complexity where possible. That provides several benefits, as simpler models are easier to troubleshoot, more understandable, and less costly to train.
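
One way to make "minimize complexity where possible" concrete is to set an accuracy target first and then pick the simplest candidate that clears it. The sketch below is illustrative only: the `Candidate` class, the complexity ranks, and the accuracy numbers are all hypothetical placeholders that a team would replace with its own validation results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    complexity: int   # hypothetical rank: 1 = simplest to train and debug
    accuracy: float   # validation accuracy measured by the team


def simplest_adequate(
    candidates: Sequence[Candidate], target_accuracy: float
) -> Optional[Candidate]:
    """Return the least complex candidate that meets the target, if any."""
    adequate = [c for c in candidates if c.accuracy >= target_accuracy]
    if not adequate:
        return None
    return min(adequate, key=lambda c: c.complexity)


# ASSUMPTION: made-up numbers for illustration, not benchmark results.
CANDIDATES = [
    Candidate("logistic regression", complexity=1, accuracy=0.86),
    Candidate("gradient-boosted trees", complexity=2, accuracy=0.91),
    Candidate("deep neural network", complexity=3, accuracy=0.92),
]

if __name__ == "__main__":
    choice = simplest_adequate(CANDIDATES, target_accuracy=0.90)
    print(choice.name if choice else "no candidate meets the target")
```

With these placeholder numbers, the slightly more accurate neural network loses out because the simpler model already satisfies the business requirement.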

Get smart about physical infrastructure

Second, enterprises should purchase only the hardware they need to fulfill the requirements of their machine learning workloads. This may seem obvious, but it's especially important in this case because silicon vendors are rapidly changing what they sell to address the needs of this relatively new market. Consider Nvidia's new Volta architecture, which includes dedicated acceleration for machine learning tasks.

Where applicable, companies should leverage cloud platforms that simplify the provisioning for fleets of AI hardware, especially for workloads without settled needs. It's far easier to deploy a large GPU cluster on the fly in one of the major clouds than to procure and set up such an environment in a private datacenter. If that cluster is no longer needed, enterprises can shut it down.

Right now, OpenAI estimates that the top algorithms run on clusters that cost "single digit millions of dollars" to purchase. But provisioning a cluster of that magnitude doesn't make sense unless it will meet a business need now or in the immediate future.
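
The rent-versus-buy question can be framed as simple break-even arithmetic. The following sketch assumes a hypothetical purchase price, hourly running cost for owned hardware, and hourly cloud rate; none of these figures come from OpenAI or any vendor, and real comparisons would also need to account for depreciation, staffing, and utilization.

```python
from typing import Optional


def break_even_hours(
    purchase_cost: float, hourly_owned_cost: float, hourly_cloud_rate: float
) -> Optional[float]:
    """Return the hours of use at which buying becomes cheaper than renting.

    Returns None when renting is never more expensive per hour, in which
    case the purchase never pays for itself.
    """
    saving_per_hour = hourly_cloud_rate - hourly_owned_cost
    if saving_per_hour <= 0:
        return None
    return purchase_cost / saving_per_hour


# ASSUMPTION: placeholder prices chosen only to illustrate the calculation.
PURCHASE_COST = 3_000_000   # dollars, a "single digit millions" cluster
HOURLY_OWNED_COST = 50      # dollars per hour for power, space, and upkeep
HOURLY_CLOUD_RATE = 400     # dollars per hour for a comparable cloud cluster

if __name__ == "__main__":
    hours = break_even_hours(PURCHASE_COST, HOURLY_OWNED_COST, HOURLY_CLOUD_RATE)
    if hours is None:
        print("Renting is always cheaper under these assumptions")
    else:
        print(f"Buying pays off after about {hours:,.0f} hours of use")
```

If the expected usage falls well short of the break-even point, or the workload's needs are still unsettled, renting capacity is the safer choice.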

Infrastructure needs are a moving target because of potential improvements to the algorithms underpinning these machine learning systems. A recent contest held by the Data Analytics for What's Next (DAWN) group at Stanford University showed how both software and hardware optimization can create highly accurate models at low cost.

Use established foundations

Before even looking to provision large amounts of silicon, companies should standardize on one or two of the popular open source frameworks that help with the development and deployment of machine learning models. Google's TensorFlow appears to be a safe bet, given its popularity in the open source space, but there are several other options out there with high-profile backers, such as Microsoft's Cognitive Toolkit (CNTK), PyTorch (used by Facebook), and Apache MXNet (Amazon's preferred framework).

When companies like Nvidia and Intel provide optimized execution environments for machine learning algorithms running on their hardware, they often focus their efforts on a handful of frameworks like the ones listed above. These sorts of software optimizations can help boost performance without requiring companies to procure additional hardware. In addition, these frameworks make it easier to create machine learning systems, and companies can use that agility to take advantage of new techniques that arise in the field, or simply of improved methods for handling already-known problems.

The bottom line is that machine learning is a rapidly changing field. Enterprises need to think critically about their infrastructure investments and build an agile environment that can adjust to shifts in the state of the art, some of which may demand massive computing power.

Blair Hanley Frank is a technology analyst at ISG covering cloud computing, application development modernization, AI, and the modern workplace. He was previously a staff writer at VentureBeat.
