Google used reinforcement learning to design next-gen AI accelerator chips

In a preprint paper published a year ago, scientists at Google Research including Google AI lead Jeff Dean described an AI-based approach to chip design that could learn from past experience and improve over time, becoming better at generating architectures for unseen components. They claimed it completed designs in under six hours on average, which is significantly faster than the weeks it takes human experts in the loop.

While the work wasn't entirely novel -- it built upon a technique Google engineers proposed in a paper published in March 2020 -- it advanced the state of the art in that it implied the placement of on-chip transistors can be largely automated. Now, in a paper published in the journal Nature, the original team of Google researchers claim they've fine-tuned the technique to design an upcoming, previously unannounced generation of Google's tensor processing units (TPU), application-specific integrated circuits (ASICs) developed specifically to accelerate AI.

If made publicly available, the Google researchers' technique could enable cash-strapped startups to develop their own chips for AI and other specialized purposes. Moreover, it could help to shorten the chip design cycle to allow hardware to better adapt to rapidly evolving research.

"Basically, right now in the design process, you have design tools that can help do some layout, but you have human placement and routing experts work with those design tools to kind of iterate many, many times over," Dean told VentureBeat in a previous interview. "It's a multi-week process to actually go from the design you want to actually having it physically laid out on a chip with the right constraints in area and power and wire length and meeting all the design roles or whatever fabrication process you're doing. We can essentially have a machine learning model that learns to play the game of [component] placement for a particular chip."

AI chip design

A computer chip is divided into dozens of blocks, each of which is an individual module, such as a memory subsystem, compute unit, or control logic system. These wire-connected blocks can be described by a netlist, a graph of circuit components like memory components and standard cells including logic gates (e.g., NAND, NOR, and XOR). Chip "floorplanning" involves placing netlists onto two-dimensional grids called canvases so that performance metrics like power consumption, timing, area, and wirelength are optimized while adhering to constraints on density and routing congestion.

Since the 1960s, many automated approaches to chip floorplanning have been proposed, but none has achieved human-level performance. Moreover, the exponential growth in chip complexity has rendered these techniques unusable on modern chips. Human chip designers must instead iterate for months with electronic design automation (EDA) tools, taking a register transfer level (RTL) description of the chip netlist and generating a manual placement of that netlist onto the chip canvas. On the basis of this feedback, which can take up to 72 hours, the designer either concludes that the design criteria have been achieved or provides feedback to upstream RTL designers, who then modify low-level code to make the placement task easier.

The Google team's solution is a reinforcement learning method capable of generalizing across chips, meaning that it can learn from experience to become both better and faster at placing new chips.

Gaming the system

Training AI-driven design systems that generalize across chips is challenging because it requires learning to optimize the placement of all possible chip netlists onto all possible canvases. In point of fact, chip floorplanning is analogous to a game with various pieces (e.g., netlist topologies, macro counts, macro sizes and aspect ratios), boards (canvas sizes and aspect ratios), and win conditions (the relative importance of different evaluation metrics or different density and routing congestion constraints). Even one instance of this "game" -- placing a particular netlist onto a particular canvas -- has more possible moves than the Chinese board game Go.

The researchers' system aims to place a "netlist" graph of logic gates, memory, and more onto a chip canvas, such that the design optimizes power, performance, and area (PPA) while adhering to constraints on placement density and routing congestion. The graphs range in size from millions to billions of nodes grouped in thousands of clusters, and typically, evaluating the target metrics takes from hours to over a day.

Starting with an empty chip, the Google team's system places components sequentially until it completes the netlist. To guide the system in selecting which components to place first, components are sorted by descending size; placing larger components first reduces the chance there's no feasible placement for it later.

Training the system required creating a dataset of 10,000 chip placements, where the input is the state associated with the given placement and the label is the reward for the placement (i.e., wirelength and congestion). The researchers built it by first picking five different chip netlists, to which an AI algorithm was applied to create 2,000 diverse placements for each netlist.

The system took 48 hours to "pre-train" on an Nvidia Volta graphics card and 10 CPUs, each with 2GB of RAM. Fine-tuning initially took up to 6 hours, but applying the pre-trained system to a new netlist without fine-tuning generated placement in less than a second on a single GPU in later benchmarks.

In one test, the Google researchers compared their system's recommendations with a manual baseline: the production design of a previous-generation TPU chip created by Google's TPU physical design team. Both the system and the human experts consistently generated viable placements that met timing and congestion requirements, but the AI system also outperformed or matched manual placements in area, power, and wirelength while taking far less time to meet design criteria.

Future work

Google says that its system's ability to generalize and generate "high-quality" solutions has "major implications," unlocking opportunities for co-optimization with earlier stages of the chip design process. Large-scale architectural explorations were previously impossible because it took months of effort to evaluate a given architectural candidate. However, modifying a chip's design can have an outsized impact on performance, the Google team notes, and might lay the groundwork for full automation of the chip design process.

Moreover, because the Google team's system simply learns to map the nodes of a graph onto a set of resources, it might be applicable to range of applications including city planning, vaccine testing and distribution, and cerebral cortex mapping. "[While] our method has been used in production to design the next generation of Google TPU ... [we] believe that [it] can be applied to impactful placement problems beyond chip design," the researchers wrote in the paper.