AMD claims its Frontier supercomputer will hit 1.5 exaflops, making it the world's fastest

AMD today announced that it will partner with Cray to build Frontier, a supercomputer capable of "exascale" performance -- one that can complete at least a quintillion floating point computations ("flops") per second, where a flop equals two 15-digit numbers multiplied together -- for weather system simulation, subatomic particle modeling, and more. The two companies expect it will be the world's fastest supercomputer when it's delivered in 2021, with more than 1.5 exaflops of theoretical performance -- roughly 50 times the speed of today's top supercomputers and faster than the top 160 combined. Frontier will be built at Oak Ridge National Laboratory in Oak Ridge, Tennessee.

"AMD is proud to partner with Cray and [Oak Ridge National Laboratory] to deliver what is expected to be the world's most powerful supercomputer," said Forrest Norrod, SVP and GM of AMD's datacenter and embedded systems group. "Frontier will feature custom CPU and GPU technology from AMD and represents the latest achievement on a long list of technology innovations AMD has contributed to the Department of Energy exascale programs."

During a dial-in conference with members of the press, AMD threw around some eye-popping figures. Frontier's network bandwidth will be 24 million times greater than the average home internet connection, or speedy enough to download 100,000 HD movies in a second. The system will have a physical footprint spanning 7,300 square feet, the equivalent of two basketball courts. And Frontier's internal cable and wiring would run all the way from Philadelphia to New York City (about 90 miles) if laid out flat end-to-end.

Driving Frontier's breakthrough compute is what AMD claims is the first "fully optimized" GPU and CPU design for supercomputing. It features a custom AMD Epyc processor packing a future Zen core architecture designed for high-performance computing (HPC) and AI workloads, along with a graphics processing unit (GPU) in AMD's Radeon Instinct product lineup of server accelerators. The GPUs feature HPC engines, "extensive" mixed precision operations, and high-bandwidth memory, and they're linked together -- one Epyc processor to four Instinct graphics cards -- by AMD's Infinity Fabric and Cray Slingshot high-bandwidth system interconnect architectures.

Beyond AMD's bespoke graphics and processor combo, Frontier will incorporate Cray's containerized Shasta software for monitoring, orchestration, adaptive routing, quality-of-service, and congestion management. Moreover, Cray says it will architect a high-efficiency direct liquid cooling solution for Frontier and, under a separate joint contract, will pursue "new ... technologies," including a high-density compute infrastructure and enhancements to its HPC developer tools for GPU scaling and AI.

Cray says these forthcoming tools, which will be codeveloped with AMD, will take advantage of AMD's Radeon Open Compute Platform (ROCm) to enable direct communication between Frontier's network interface cards and GPU memory. And Cray says that Cray Programming Environment (PE), the company's eponymous software development suite for HPC apps, will be integrated with a machine learning software stack that will offer support for "the most popular tools and frameworks."

"We are excited to work with the team at AMD to deliver the Frontier system to Oak Ridge National Laboratory," said Cray SVP and CTO Steve Scott. "Cray’s Shasta supercomputers are designed to support leading-edge processor technologies and high-performance storage, all tightly interconnected by Cray's ... Slingshot network. The combination of Cray and AMD technology in the Frontier system will dramatically enhance performance at scale for AI, analytics, and simulation, enabling DOE to further push the boundaries of scientific discovery."

As work progresses on Frontier, Cray and Oak Ridge will establish a Center of Excellence at the lab to "drive collaboration and innovation" and assist in the porting and tuning of Department of Energy apps and libraries. Chiefly, the center will be responsible for modernizing new and legacy code and providing training and hands-on workshops.

"Frontier represents the state of the art in high-performance computing. Designing and standing up a machine of its scope requires working closely with industry, partnerships that not only enable breakthrough science but also ensure American scientific and economic competitiveness on the global stage," said Oak Ridge's associate laboratory director for computing and computational sciences, Jeff Nichols. "We are delighted to work with AMD to integrate the CPU and GPU technologies that enable this extremely capable accelerated node architecture."

Frontier is an outgrowth of the Energy Department's Exascale Computing Project (ECP), a grant program within its long-running PathForward initiative, which seeks to accelerate research necessary to develop exascale supercomputers in the U.S. Nearly $258 million in funding has been allocated over a three-year contract period starting 2017, and the companies selected to participate -- Cray, Intel, Hewlett Packard Enterprise, IBM, and Nvidia, in addition to AMD -- were required to supply supplementary financing amounting to at least 40% of their total project cost.
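To put the cost-sharing requirement in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes, purely for illustration, that the 40% floor applies to the aggregate of federal funding plus company contributions. The function name `min_company_share` is hypothetical and not taken from any Energy Department document.

```python
# Illustrative arithmetic only. Assumption: "total project cost" means
# federal funding plus the companies' own contributions, taken in aggregate.

def min_company_share(federal_funding: float, min_fraction: float) -> float:
    """Smallest contribution C such that C >= min_fraction * (federal_funding + C)."""
    if not 0 <= min_fraction < 1:
        raise ValueError("min_fraction must be in [0, 1)")
    # Solving C = f * (F + C) for C gives C = F * f / (1 - f).
    return federal_funding * min_fraction / (1 - min_fraction)

federal = 258e6  # dollars, the funding figure quoted above
company = min_company_share(federal, 0.40)
total = federal + company

print(f"Minimum company share: ${company / 1e6:.0f} million")
print(f"Minimum combined total: ${total / 1e6:.0f} million")
```

Under that assumption, the companies' combined share comes to at least about $172 million, for a total of at least about $430 million.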

More recently, in April the Energy Department opened requests for two exascale systems as part of its CORAL-2 procurement, with a budget ranging from $800 million to $1.2 billion. Frontier is one of three follow-on machines that are part of CORAL-2, the others being El Capitan at Lawrence Livermore National Laboratory in Livermore, California, and Intel's Aurora at Argonne National Laboratory near Chicago. AMD says the contract award for Frontier is valued at more than $600 million.

The Department of Energy previously awarded $425 million in federal funding to IBM, Nvidia, and other companies to build two supercomputers: one at Oak Ridge and another at Livermore. The current Oak Ridge system -- Summit, which will be replaced by Frontier -- delivers between 143 and 200 peak petaflops, according to the TOP500 ranking of supercomputer performance (based on LINPACK score), while Livermore's Sequoia cluster tops out at about 20 petaflops. Both Summit and Sierra, Livermore's newer system, were built by IBM and pack IBM Power9 processors and Nvidia Tesla V100 accelerator chips, and both consume enormous amounts of power -- up to 13 MW, in Summit's case.
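For a sense of scale, the short Python sketch below converts Frontier's quoted 1.5 exaflops into petaflops and sets it against the 143 and 200 petaflops quoted for Summit. It compares theoretical figures only, and real application speedups can differ substantially from these ratios. The variable and function names are illustrative.

```python
# Unit conversion sketch using only figures quoted in this article.
PETA = 10**15  # one petaflop = 10^15 floating point operations per second
EXA = 10**18   # one exaflop = 10^18, i.e. a quintillion

frontier_peak = 1.5 * EXA    # "more than 1.5 exaflops" of theoretical performance
summit_low = 143 * PETA      # lower figure quoted for Summit
summit_high = 200 * PETA     # upper figure quoted for Summit

def ratio(new_flops: float, old_flops: float) -> float:
    """How many times faster new_flops is than old_flops."""
    return new_flops / old_flops

print(f"Frontier: {frontier_peak / PETA:.0f} petaflops")
print(f"vs. Summit at 200 petaflops: {ratio(frontier_peak, summit_high):.1f}x")
print(f"vs. Summit at 143 petaflops: {ratio(frontier_peak, summit_low):.1f}x")
```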

Assuming AMD delivers on its promise, Frontier will be the crown jewel of the U.S. supercomputer portfolio, but it might not be the most powerful in the world. Three teams in China -- in Tianjin (prototype), Jinan, and Beijing -- are actively competing to build China's first exascale system in the next seven months, and Japan's Post-K exascale computer has a target deployment date of 2020.

Currently, the U.S. hosts five of the 10 fastest computers in the world, with China's best -- the TaihuLight at the National Supercomputing Center in Wuxi (built on Sunway’s SW26010 processor architecture) and the Tianhe-2A in Guangzhou -- ranking third and fourth, respectively, at roughly 125 peak petaflops and 100 peak petaflops. Cray's Piz Daint sits in fifth, ahead of Trinity at Los Alamos National Laboratory, Fujitsu's AI Bridging Cloud Infrastructure in Japan, and Lenovo's SuperMUC-NG in Germany.

The race between China and the U.S. is fierce. In TOP500 rankings, China two years ago surpassed the United States in total number of ranked supercomputers for the first time, with 202 to 143. That trend accelerated the following year; according to the TOP500 fall 2018 report, the number of ranked U.S. supercomputers fell to 108 as China's total climbed to 229.

After China and the U.S., the countries with the most ranked supercomputers are Japan, with 31 systems; the U.K., with 20; France, with 18; Germany, with 17; and Ireland, with 12.