


Enterprises increasingly leverage infrastructure-as-code (IaC) to systematically provision cloud resources and containerized workloads. IaC is a critical element of modern software development pipelines that ensures consistency and helps enterprises respond to problems or experiment with new business ideas. 
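The core idea of IaC is that infrastructure is declared as data, and tooling computes what must change to reach that declared state. The sketch below illustrates that reconciliation pattern in miniature; all resource names and the `plan` helper are hypothetical, not any real cloud API or tool.

```python
# Minimal illustration of the infrastructure-as-code idea: the desired
# infrastructure is declared as data, and a reconciler diffs it against the
# live state to decide what to create, update or delete. All names here are
# hypothetical stand-ins, not a real provisioning API.

desired = {
    "web-1": {"type": "vm", "cpus": 4},
    "web-2": {"type": "vm", "cpus": 4},
    "db-1":  {"type": "vm", "cpus": 8},
}

actual = {
    "web-1": {"type": "vm", "cpus": 4},
    "db-1":  {"type": "vm", "cpus": 4},  # drifted from the declared spec
}

def plan(desired, actual):
    """Diff the declared state against the live state."""
    create = sorted(set(desired) - set(actual))
    update = sorted(k for k in desired if k in actual and desired[k] != actual[k])
    delete = sorted(set(actual) - set(desired))
    return {"create": create, "update": update, "delete": delete}

print(plan(desired, actual))
# {'create': ['web-2'], 'update': ['db-1'], 'delete': []}
```

Because the declaration is versioned alongside application code, the same plan can be replayed consistently across environments, which is the consistency benefit described above.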

At the Nvidia GTC 2022 conference, Nvidia engineers described their work to build a digital twin of data center infrastructure. This work promises to extend IaC and continuous integration/continuous deployment practices all the way into physical data center design. 

Nvidia has been using these new tools internally to improve its own data center design and is now starting to integrate them into Nvidia Air. This complements other digital twin offerings like Nvidia Drive for autonomous vehicles, Isaac for robotics and Clara for healthcare. 

Nvidia Air will allow enterprises to build a full digital twin of a data center’s physical and logical layout before installing the first switch in the data center. They can then continue to use those same simulations, visualizations and AI tools once the data center is in production. Today, most of the design assets are essentially filed away and forgotten once a data center goes live, which in many respects mirrors the old waterfall style of test and development before Agile came along. 

Lost assets

These challenges are only growing in complexity with the need for new AI infrastructure that stretches the limits of compute, networking, storage, power and thermal management. “Many classic supercomputers cost millions of dollars and take months or even years to deploy,” said Marc Hamilton, vice president of solutions architecture and engineering at Nvidia.

Designing a data center is a complex team sport that demands diverse skills. The data center building itself and the layout of racks and other components might be done in Autodesk tools. The cables, servers, switches and storage are designed with various 3D CAD tools. 

Teams often turn to other tools for modeling airflow and heat using computational fluid dynamics simulations from Ansys. These kinds of simulations are usually done during design, but once the data center goes into production, the operations team never sees them. If a problem arises, the operations team needs to start over again to figure out how to improve airflow or address an overheating issue. 

Nvidia has worked with design tools from many vendors in the past, and the resulting files were often incompatible across engineering teams. Transferring files across tools was generally time-consuming, and, in some cases, the formats were not compatible at all. If an engineer changed the layout to improve the thermal properties, the change wasn’t always propagated back to the team designing heat sinks or cable routing. 

Design for reuse

So Nvidia turned to the Omniverse to see if there was a better way to connect these workflows. Omniverse is built on top of a common database called Nucleus, which allows all engineering tools to stage their data in a shared format across tools and teams. The Omniverse helps teams go back and forth between the photorealistic rendering of the data center as-built, overlaid with live thermal data, to analyze the predicted impact of various changes, such as moving two busy servers further apart. 

Most engineering simulation is done with high-performance workstations. The Omniverse allows teams to move more of the complex engineering and simulation workloads to tens of thousands of GPUs in the cloud and then share the result across the enterprise and partners. 

Another advantage of connecting back to the Omniverse is that new simulations can take advantage of improvements in the core algorithms. One of the biggest aspects of data center design is computational fluid dynamics, used to understand the system’s airflow, heating and cooling. Hamilton’s team worked with Nvidia Modulus, a software development kit that uses AI to build surrogate models for physics. This allows them to simulate far more scenarios, such as minor differences in temperature settings or physical placement, in the same amount of time. 
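The surrogate-model idea can be illustrated with a toy example: run the expensive simulation at a handful of points, fit a cheap model to those results, then sweep many scenarios against the cheap model. This is only a sketch of the pattern; the "simulation" and all values below are invented stand-ins, not Modulus itself (which trains neural networks rather than interpolators).

```python
import bisect
import math

def expensive_simulation(inlet_temp_c):
    """Stand-in for a costly CFD run: hot-spot temperature vs. inlet temperature.
    The formula is invented for illustration only."""
    return 35.0 + 0.9 * inlet_temp_c + 2.0 * math.sin(inlet_temp_c / 4.0)

# Run the costly model at a handful of sample points...
xs = [float(t) for t in range(15, 31, 3)]      # 15, 18, 21, 24, 27, 30 degrees C
ys = [expensive_simulation(x) for x in xs]

def surrogate(x):
    """Piecewise-linear interpolation: a trivial surrogate for the simulation."""
    i = min(max(bisect.bisect_left(xs, x), 1), len(xs) - 1)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# ...then evaluate many scenarios cheaply against the surrogate.
sweep = [surrogate(15.0 + 0.01 * k) for k in range(1500)]
print(f"max predicted hot-spot: {max(sweep):.1f} C")
```

The payoff is the same as in the article: once the surrogate is fit, thousands of what-if scenarios cost almost nothing compared with rerunning the full physics solver each time.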

Now Nvidia is extending these modeling capabilities into its data center management tool, Base Command, which provides a set of tools to monitor and manage services. Today, if conditions change in the data center, such as a temperature spike, teams only have a rough idea of what might have caused it. 

Now Nvidia is exploring ways to extend Omniverse simulation capabilities to support logical infrastructure as well. This will make it easier to develop and test best practices for setting up networks, running power lines and more. This was one of the reasons Nvidia acquired Mellanox. “We started thinking about how to apply tools like Omniverse to simulating, predicting, and monitoring before you make changes to the network,” Hamilton said. 

DevOps for hardware

Amit Katz, vice president of the Nvidia Spectrum Platform, said the use of digital twins in data center design is akin to the adoption of automation in the data center at the turn of the century. In the 1990s, engineers would typically type CLI commands into live data center environments. And sometimes, they would type the wrong commands. 

Then, around the turn of the century, developers started provisioning infrastructure as code and developing against test environments that mimicked the real thing. Tools like service virtualization and test harnesses allowed teams to simulate API calls to enterprise and third-party services before pushing things into production. Now in 2022, he believes the world is going through a similar transition to simulate physical infrastructure as well. 
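The service-virtualization pattern described above can be sketched in a few lines: application code is written against an interface, and a simulated service stands in for the real third-party dependency during testing. Every class, endpoint and field name below is invented for illustration.

```python
# Sketch of the service-virtualization pattern: the application codes against
# an interface, and a deterministic simulation stands in for the real
# third-party service before anything touches production. All names are
# hypothetical, not a real payment API.

class PaymentGateway:
    """Interface the application is written against."""
    def charge(self, account: str, cents: int) -> dict:
        raise NotImplementedError

class VirtualPaymentGateway(PaymentGateway):
    """Simulated service: deterministic responses, no network, no side effects."""
    def __init__(self):
        self.calls = []  # recorded for later inspection in tests

    def charge(self, account, cents):
        self.calls.append((account, cents))
        if cents <= 0:
            return {"status": "rejected", "reason": "invalid amount"}
        return {"status": "ok", "txn_id": f"sim-{len(self.calls)}"}

def checkout(gateway: PaymentGateway, account: str, cents: int) -> bool:
    """Application logic under test, unaware it is talking to a simulation."""
    return gateway.charge(account, cents)["status"] == "ok"

gw = VirtualPaymentGateway()
print(checkout(gw, "acct-42", 1999))   # True
print(checkout(gw, "acct-42", 0))      # False
```

The digital-twin analogy Katz draws is that the same swap, a faithful simulation standing in for the live system, is now becoming possible for switches, cables and racks, not just APIs.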

Katz said, “We are seeing digital twins for end-to-end data center validation, not only for switches but also for the entire data center.” Down the road, Nvidia Air could work as a recommendation engine for suggesting and prioritizing fixes and changes to data center designs and layout. 

This could also simplify the exchange of assets and configurations across teams, in the same way that IaC ensured that developer, test and operations teams were working with the same code. It would extend those same benefits across the developers, network operators and data scientists who use this infrastructure. 

The vision is that the digital twin helps teams lay out the data center down to each cable run. Then, as teams start to install systems, the digital twin makes it easier to verify that each cable is run correctly and, if not, what needs to change. And if something goes wrong, such as an outage or a failed power supply, the digital twin could let teams test various fixes beforehand and make changes with higher confidence of success. 
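That remedy-testing loop can be sketched as: copy the twin's state, apply a candidate fix to the copy, and score the result, promoting only the best change to the real data center. The model below is a deliberately tiny, hypothetical stand-in; real twins like Nvidia Air simulate far richer network and physical state.

```python
# Hypothetical sketch of testing remedies against a digital twin before
# touching the live data center: each candidate fix runs against a copy of
# the twin's state and is scored. Racks, loads and remedies are invented.

import copy

twin = {
    "rack-1": {"psu_ok": True,  "load": 0.9},
    "rack-2": {"psu_ok": False, "load": 0.7},  # simulated power-supply failure
    "rack-3": {"psu_ok": True,  "load": 0.2},
}

def served_load(state):
    """Stand-in health metric: total load served by racks with working power."""
    return sum(r["load"] for r in state.values() if r["psu_ok"])

def migrate_load(state, src, dst):
    state[dst]["load"] += state[src]["load"]
    state[src]["load"] = 0.0

def swap_psu(state, rack):
    state[rack]["psu_ok"] = True

remedies = {
    "migrate rack-2 load to rack-3": lambda s: migrate_load(s, "rack-2", "rack-3"),
    "hot-swap rack-2 PSU":           lambda s: swap_psu(s, "rack-2"),
}

print(f"baseline served load: {served_load(twin):.1f}")
for name, fix in remedies.items():
    trial = copy.deepcopy(twin)  # the twin itself is never mutated
    fix(trial)
    print(f"{name}: served load {served_load(trial):.1f}")
```

Because every candidate runs against a copy, operators can compare outcomes side by side without risking the production environment, which is the confidence gain the article describes.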

This would help close the loop between the greater flexibility available in the cloud and the better economics available for on-premises deployments. 

“You can think of it as cloud agility with on-prem economics,” Katz said.
