AMD and Cray to Create 1.5 Exaflops of AI and HPC Processing Performance in New Frontier Supercomputer
AMD joined the U.S. Department of Energy (DOE), Oak Ridge National Laboratory (ORNL) and Cray Inc. in announcing what is expected to be the world’s fastest exascale-class supercomputer, scheduled to be delivered to ORNL in 2021.
The Frontier system will be based on Cray’s new Shasta architecture and Cray Slingshot interconnect and will feature future generation AMD EPYC CPU and Radeon Instinct GPU technology.
To deliver what is expected to be more than 1.5 exaflops of processing performance, the Frontier system is designed to use future generation High Performance Computing (HPC) and Artificial Intelligence (AI) optimized, custom AMD EPYC CPU, and AMD Radeon Instinct GPU processors. Researchers at ORNL will use the Frontier system’s computing power and AI techniques to simulate, model and advance understanding of the interactions underlying the science of weather, sub-atomic structures, genomics, physics, and other important scientific fields.
When installed sometime in 2021, Frontier is expected to offer slightly higher performance than Aurora, the first U.S. exascale system, which is being built by Intel and Cray.
The Frontier system will use "future-generation" High Performance Computing (HPC) and Artificial Intelligence (AI) optimized, custom AMD EPYC CPUs, and Radeon Instinct GPU processors supported by High Bandwidth Memory (HBM) and extensive mixed precision ops for optimum deep learning performance.
The system will also take advantage of a custom high-bandwidth, low-latency coherent Infinity Fabric, connecting four AMD Radeon Instinct GPUs to one AMD EPYC CPU per node.
AMD declined to say how many chips Frontier will use or what process the chips are made in, leaving analysts to speculate it could be TSMC’s 7+, 6 or even 5nm nodes.
The AMD processors will likely have improved double- and single-precision floating-point performance and will also support 16-bit floating point for deep learning jobs. The ability to coherently attach four Radeon GPUs to one EPYC CPU is a key design feature, as it gives each node tremendous computational performance.
Cray is designing a new AMD EPYC CPU and Radeon Instinct GPU powered blade for the Shasta high-density cabinet. The company will also engineer new high-efficiency power delivery and integrated direct liquid cooling capabilities for key server components.
Frontier will consist of more than 100 Cray Shasta cabinets, consuming about 40 MW of power.
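As a rough back-of-the-envelope illustration, the figures quoted above (more than 1.5 exaflops, about 40 MW, more than 100 cabinets) imply an approximate energy efficiency and a per-cabinet power budget. The sketch below is only arithmetic on those announced numbers; the function and constant names are hypothetical, and because the announcement gives "more than" and "about" values, the results are ballpark estimates rather than specifications.

```python
# Back-of-the-envelope estimates from the announced Frontier figures.
# All inputs are approximate ("more than 1.5 exaflops", "about 40 MW",
# "more than 100 cabinets"), so the outputs are rough bounds only.

PEAK_FLOPS = 1.5e18    # announced lower bound on performance, in FLOPS
POWER_WATTS = 40e6     # announced approximate power draw, in watts
CABINETS = 100         # announced lower bound on cabinet count


def gflops_per_watt(flops: float, watts: float) -> float:
    """Return energy efficiency in GFLOPS per watt."""
    return flops / watts / 1e9


def kw_per_cabinet(watts: float, cabinets: int) -> float:
    """Return average power per cabinet in kilowatts."""
    return watts / cabinets / 1e3


if __name__ == "__main__":
    print(f"Efficiency: ~{gflops_per_watt(PEAK_FLOPS, POWER_WATTS):.1f} GFLOPS/W")
    print(f"Power per cabinet: at most ~{kw_per_cabinet(POWER_WATTS, CABINETS):.0f} kW")
```

With these inputs the estimate works out to roughly 37.5 GFLOPS per watt and no more than about 400 kW per cabinet, since a larger cabinet count would lower the per-cabinet average.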
To enable developer productivity, users will require a high-level software development environment with tightly coupled compilers, tools, and libraries that abstract away system complexity. The Cray Programming Environment (Cray PE) will see a number of enhancements for increased functionality and scale. This will start with Cray working with AMD to enhance these tools for optimized GPU scaling with extensions for Radeon Open Compute Platform (ROCm). These software enhancements will leverage low-level integrations of AMD ROCmRDMA technology with Cray Slingshot, allowing the Slingshot NIC to read and write data directly to GPU memory for higher application performance. Slingshot is Cray’s top-of-rack switch that can support up to 250,000 nodes on a three-hop network running at 12.8 Tbits/second. Each switch packs 64 200-Gbit/s ports in a dragonfly topology and is compatible with Ethernet. Finally, Cray PE will be integrated with a full machine learning software stack with support for the most popular tools and frameworks.
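The Slingshot figures quoted above are consistent with each other: 64 ports at 200 Gbit/s each add up to the stated 12.8 Tbit/s per switch. A minimal sketch of that arithmetic follows; the names are hypothetical, and it assumes the 12.8 Tbit/s figure refers to the aggregate bandwidth of a single switch, which the text does not state explicitly.

```python
# Aggregate switch bandwidth from the per-port figures in the article.
# Assumption: the quoted 12.8 Tbit/s is the sum over all ports of one switch.

PORTS_PER_SWITCH = 64   # ports per Slingshot switch, as quoted
PORT_GBITS = 200        # per-port speed in Gbit/s, as quoted


def switch_tbits(ports: int, gbits_per_port: float) -> float:
    """Return aggregate switch bandwidth in Tbit/s."""
    return ports * gbits_per_port / 1000


if __name__ == "__main__":
    total = switch_tbits(PORTS_PER_SWITCH, PORT_GBITS)
    print(f"Aggregate bandwidth: {total:.1f} Tbit/s")
```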
Frontier will also utilize Cray’s new Shasta system software for monitoring, orchestration, and application development to provide a single developer interface across the system.
AMD has a long-standing engagement with DOE, starting with the Jaguar supercomputer in 2005 and Titan supercomputer in 2012. The contract award includes technology development funding, a center of excellence, several early-delivery systems, the main Frontier system and multi-year systems support. The new deal is a landmark for AMD’s renewed focus on high-performance chips. To date, Intel has held as much as 95% of the CPU sockets in top supercomputers, with IBM’s Power chips taking a significant slice of what remained.
The Frontier system is expected to be delivered in 2021 and acceptance is anticipated in 2022. Another system, dubbed El Capitan, is expected to be awarded to the team of IBM and Nvidia who built Summit and Sierra, the current leading supercomputers.
The U.S. Department of Energy will spend just short of $2 billion to commission the three exascale systems – Aurora, Frontier and the El Capitan system coming to Lawrence Livermore Lab.
In recent years, China has led the Top 500 list several times, and it now has more Top 500 supercomputers than the U.S.