Computer makers are unveiling a total of 50 servers with Nvidia’s A100 graphics processing units (GPUs) to power AI, data science, and scientific computing applications. The first GPU based on the Nvidia Ampere architecture, the A100 is the company’s largest leap in GPU performance to date, with features such as the ability to partition one GPU into as many as seven separate GPUs as needed, Nvidia said. The company made the announcement ahead of the ISC High Performance online event dedicated to high-performance computing. Nvidia said its technology is now in eight of the world’s 10 fastest supercomputers, as measured by ISC.
Unveiled in May, the A100 GPU has 54 billion transistors (the on-off switches that are the building blocks of all things electronic). A server with eight A100 GPUs, such as the Nvidia DGX A100, delivers 5 petaflops of performance, or about 20 times more than the previous-generation Volta chip. This means central processing unit (CPU) servers that cost $20 million and take up 22 racks can be replaced by GPU-based servers that cost $3 million and take up just four racks, said Nvidia product marketing director Paresh Kharya in a press briefing.
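The savings implied by those figures can be worked out with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable and function names are illustrative, and the per-GPU figure assumes the 5 petaflops are split evenly across the eight GPUs.

```python
# Figures as quoted by Nvidia in the briefing (the company's own comparison).
CPU_COST_MUSD, CPU_RACKS = 20, 22   # CPU-only servers: cost in $ millions, racks
GPU_COST_MUSD, GPU_RACKS = 3, 4     # A100-based servers: cost in $ millions, racks
DGX_PETAFLOPS, DGX_GPUS = 5, 8      # one eight-GPU server such as the DGX A100


def ratio(before: float, after: float) -> float:
    """Return how many times larger `before` is than `after`."""
    return before / after


cost_ratio = ratio(CPU_COST_MUSD, GPU_COST_MUSD)
rack_ratio = ratio(CPU_RACKS, GPU_RACKS)
# Assumption: performance divides evenly across the GPUs in the server.
per_gpu_petaflops = DGX_PETAFLOPS / DGX_GPUS

print(f"Cost reduction: about {cost_ratio:.1f}x")
print(f"Rack reduction: {rack_ratio:.1f}x")
print(f"Per-GPU share: {per_gpu_petaflops:.3f} petaflops")
```

By these numbers, the GPU-based setup costs roughly one-seventh as much and occupies fewer than one-fifth of the racks.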
The systems are coming from computer makers that include Asus, Atos, Cisco, Dell, Fujitsu, Gigabyte, Hewlett Packard Enterprise, Inspur, Lenovo, One Stop Systems, Quanta/QCT, and Supermicro. Server availability varies, with 30 systems expected this summer and over 20 more by the end of the year, Kharya said.
The latest machines include new InfiniBand interconnect technology from Mellanox, which Nvidia agreed to acquire for $7 billion in 2019. Nvidia integrated Mellanox technology with the A100 to create Selene, which Nvidia bills as a top 10 supercomputer and one of the world’s most energy-efficient computers. Selene was designed in less than a month and provides over 1 exaflop of AI processing. Kharya said supercomputers like Selene will help Nvidia extend its presence among the world’s top supercomputers.
“While the A100 PCIe was to be expected, and the wins for A100 in high-performance computing are impressive, the in-house Selene supercomputer forms a competitive moat that will be tough for competitors to cross,” said Karl Freund, analyst at Moor Insights & Strategy, in an email.
Last year, Nvidia’s GPUs were part of 125 of the top 500 supercomputers in the world, according to ISC. Counting the supercomputers that use Mellanox InfiniBand technology brings that number to more than 300, and the list is expected to grow even larger in 2020.
“If you look at the top 500 list, the reason why Nvidia is so successful in supercomputing is because scientific computing has changed,” Kharya said. “We’ve entered a new era, one that has expanded beyond traditional modeling and simulation workloads to include AI, data analytics, edge screening, and big data visualization.”
Kharya said Mellanox interconnect chips power the world’s leading weather forecast supercomputers. Weather and climate models are both compute- and data-intensive. Forecast quality depends on the model’s complexity and level of resolution. And supercomputer performance depends on interconnect technology to move data quickly across different computers.
“It’s exciting to have the best compute on one side and the best network on the other, and now we can start to combine those technologies together and start building amazing things,” said Nvidia senior VP Gilad Shainer in a press briefing.
Customers using Mellanox include the Spanish Meteorological Agency, the China Meteorological Administration, the Finnish Meteorological Institute, NASA, and the Royal Netherlands Meteorological Institute.
The Beijing Meteorological Service has selected 200 Gigabit HDR InfiniBand interconnect technology to accelerate its new supercomputing platform, which will be used to enhance weather forecasting, improve climate and environmental research, and serve the weather forecasting information needs of the 2022 Winter Olympics in Beijing.
Nvidia said it has completed a benchmark test using the RAPIDS suite of open source data science software in just 14.5 minutes, beating the previous performance record by a factor of 19.5. (A rival CPU system does the same task in 4.7 hours.) Nvidia attributes the gains to its new Nvidia DGX A100 systems, which use the Nvidia A100 GPU. The 16 Nvidia DGX A100 systems used in the benchmark test had a total of 128 Nvidia A100 GPUs with Mellanox interconnects.
The company also unveiled the Nvidia Mellanox UFM Cyber-AI platform, which aims to minimize downtime in InfiniBand datacenters by using AI-powered analytics to detect security threats and operational issues.
This extension of the UFM platform product portfolio, which has managed InfiniBand systems for nearly a decade, applies AI to learn a datacenter’s operational cadence and network workload patterns. It draws on both real-time and historical telemetry and workload data. Against this baseline, it tracks the system’s health and network modifications and detects performance problems.
The new platform issues alerts about abnormal system and application behavior and potential system failures and threats, and it can perform corrective actions. It also delivers security alerts in cases of attempted system hacking, such as hijacking systems to mine cryptocurrency. The result is reduced datacenter downtime, which typically costs more than $300,000 an hour, according to the ITIC 2020 report.
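The general idea of learning a baseline from historical telemetry and flagging departures from it can be shown with a minimal Python sketch. This is not Nvidia’s implementation, whose internals are not described here; it swaps in a simple standard-deviation test, and the function name, threshold, and sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(history, live, threshold=3.0):
    """Return (index, value) pairs from `live` that sit more than
    `threshold` standard deviations away from the `history` baseline."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value counts as abnormal.
        return [(i, v) for i, v in enumerate(live) if v != baseline]
    return [
        (i, v)
        for i, v in enumerate(live)
        if abs(v - baseline) / spread > threshold
    ]


# Hypothetical link-utilization samples, in percent.
history = [40, 42, 41, 39, 40, 43, 41, 40]   # past telemetry: the learned baseline
live = [41, 40, 95, 42]                      # new readings to check

alerts = find_anomalies(history, live)
print(alerts)  # prints [(2, 95)]
```

A production system would presumably model many signals at once and account for daily or weekly workload cycles; a single fixed threshold is used here only to keep the example short.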
Senior adviser Steve Conway of Hyperion Research said in an email, “It’s very impressive that Nvidia keeps innovating at a fast pace. The most noteworthy innovation in my opinion is the integration of Tensor processing cores into the GPUs. The Tensor cores, now in their third generation, address some AI problems that the GPU cores don’t handle well. That’s important, because AI chip startups are starting to challenge Nvidia GPUs with Tensor processors and other technologies that are designed to address certain classes of AI problems very effectively.”
Fighting the coronavirus
Kharya said Nvidia’s scientific computing platform has been enlisted in the fight against COVID-19. In genomics, Oxford Nanopore Technologies was able to sequence the virus genome in just seven hours using Nvidia GPUs. For infection analysis and prediction, the Nvidia RAPIDS team has helped create a GPU-accelerated version of Plotly’s Dash, a data visualization tool that enables clearer insights into real-time infection rate analysis.
Nvidia’s tools can be used to predict the availability of hospital resources across the U.S. In structural biology, the U.S. National Institutes of Health and the University of Texas at Austin are using the GPU-accelerated CryoSPARC software to reconstruct the first 3D structure of the virus protein using cryogenic electron microscopy.
In treatment, Nvidia worked with the National Institutes of Health to build AI that accurately classifies COVID-19 infection based on lung scans so doctors can devise efficient treatment plans. In drug discovery, Oak Ridge National Laboratory ran the Scripps Research Institute’s AutoDock on the GPU-accelerated Summit supercomputer to screen a billion potential drug combinations in just 12 hours.
In robotics, startup Kiwi is building robots to deliver medical supplies autonomously. In edge detection, Whiteboard Coordinator built an AI system that automatically measures elevated body temperatures, screening well over 2,000 health care workers per hour. In total, Nvidia accelerates more than 700 high-performance computing applications.
“Nvidia’s vision is that accelerated computing would impact every part of our lives and they’ve certainly delivered on it. The ISC announcements were about bringing the benefits of accelerated computing to supercomputers,” said Zeus Kerravala, principal analyst at ZK Research, in an email. “The fact that Nvidia is used almost exclusively by the top 500 systems is a testament to the benefits they bring over other GPU manufacturers. One of the differentiators for Nvidia is they’ve extended the concept of supercomputing to be outside the box. Supercomputing now includes connectivity to the edge, cloud, AI systems, and other areas.”
He added, “These get tied together with the network. In such an environment, the network essentially becomes the backplane of a distributed computer and needs to be fast, ultra low latency, lossless, and take on the characteristics of what was once directly connected. This is why they bought Mellanox. Although Ethernet speeds are faster than InfiniBand, InfiniBand is still superior as a computing connectivity protocol. It’s this combination of networking plus computing that creates an ‘Nvidia system’ that can deliver the compute power to do things we could never do before.”
Updated: 8:12 a.m. Pacific Time 6/22/20 with top 500 list and analyst quotes.