Google’s fourth-generation tensor processing units (TPUs), the existence of which wasn’t publicly revealed until today, can complete AI and machine learning training workloads in close-to-record wall clock time. That’s according to the latest set of metrics released by MLPerf, the consortium of over 70 companies and academic institutions behind the MLPerf suite for AI performance benchmarking. The results show clusters of fourth-gen TPUs surpassing the capabilities of third-generation TPUs, and even those of Nvidia’s recently released A100, on object detection, image classification, natural language processing, machine translation, and recommendation benchmarks.

Google says its fourth-generation TPU offers more than double the matrix multiplication TFLOPs of a third-generation TPU, where a single TFLOP is equivalent to 1 trillion floating-point operations per second. (Matrices are often used to represent the data that feeds into AI models.) It also offers a “significant” boost in memory bandwidth while benefiting from unspecified advances in interconnect technology. Google says that overall, at an identical scale of 64 chips and not accounting for improvement attributable to software, the fourth-generation TPU demonstrates an average improvement of 2.7 times over third-generation TPU performance in last year’s MLPerf benchmark.

Google’s TPUs are application-specific integrated circuits (ASICs) developed specifically to accelerate AI. They’re liquid-cooled and designed to slot into server racks; deliver up to 100 petaflops of compute; and power Google products like Google Search, Google Photos, Google Translate, Google Assistant, Gmail, and Google Cloud AI APIs. Google announced the third generation in 2018 at its annual I/O developer conference and this morning took the wraps off the successor, which is in the research stages.

“This demonstrates our commitment to advancing machine learning research and engineering at scale and delivering those advances to users through open-source software, Google’s products, and Google Cloud,” Google AI software engineer Naveen Kumar wrote in a blog post. “Fast training of machine learning models is critical for research and engineering teams that deliver new products, services, and research breakthroughs that were previously out of reach.”

This year’s MLPerf results suggest Google’s fourth-generation TPUs are nothing to scoff at. On an image classification task that involved training an algorithm (ResNet-50 v1.5) to at least 75.90% accuracy with the ImageNet data set, 256 fourth-gen TPUs finished in 1.82 minutes. That’s nearly as fast as 768 Nvidia A100 graphics cards combined with 192 AMD Epyc 7742 CPU cores (1.06 minutes) and 512 of Huawei’s AI-optimized Ascend910 chips paired with 128 Intel Xeon Platinum 8168 cores (1.56 minutes). Third-gen TPUs beat the fourth-gen result with a training time of 0.48 minutes, but that run used 4,096 third-gen TPUs in tandem, 16 times as many chips.
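Because these submissions used very different numbers of accelerators, one rough way to put them on a common footing is to multiply chip count by wall clock minutes. The sketch below does that for the ResNet-50 figures quoted above. The “chip-minutes” measure is an assumption made here for illustration, not an MLPerf metric; it ignores host CPUs and the fact that training rarely scales linearly with chip count. The function and variable names are hypothetical.

```python
# Rough "chip-minutes" normalization for the ResNet-50 results quoted above.
# Assumption: accelerator count x wall clock minutes is a crude cost proxy.
# It is NOT an MLPerf metric and ignores host CPUs and scaling efficiency.

# system label -> (accelerator count, training time in minutes)
RESNET50_RESULTS = {
    "4th-gen TPU x256": (256, 1.82),
    "Nvidia A100 x768": (768, 1.06),
    "Huawei Ascend910 x512": (512, 1.56),
    "3rd-gen TPU x4096": (4096, 0.48),
}


def chip_minutes(chips: int, minutes: float) -> float:
    """Return accelerator count multiplied by wall clock minutes."""
    return chips * minutes


def rank_by_chip_minutes(results):
    """Return (label, chip-minutes) pairs, lowest chip-minutes first."""
    scored = [(name, chip_minutes(c, m)) for name, (c, m) in results.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, cost in rank_by_chip_minutes(RESNET50_RESULTS):
        print(f"{name:>24}: {cost:8.1f} chip-minutes")
```

By this crude measure the 256-chip fourth-gen cluster comes out lowest, which is consistent with the article’s point that the third-gen lead in raw time owes a lot to cluster size.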


Above: A chart showing improvements from Google’s third-gen to fourth-gen tensor processing units (TPUs).

Image Credit: Google

In MLPerf’s “heavy-weight” object detection category, the fourth-gen TPUs again held their own. A reference model (Mask R-CNN) trained with the COCO corpus in 9.95 minutes on 256 fourth-gen TPUs, coming within striking distance of 512 third-gen TPUs (8.13 minutes). And on a natural language processing workload entailing training a Transformer model on the WMT English-German data set, 256 fourth-gen TPUs finished in 0.78 minutes. It took 4,096 third-gen TPUs 0.35 minutes and 480 Nvidia A100 cards (plus 256 AMD Epyc 7742 CPU cores) 0.62 minutes.

The fourth-gen TPUs also scored well when tasked with training a BERT model on a large Wikipedia corpus. Training took 1.82 minutes with 256 fourth-gen TPUs, compared with 0.39 minutes on 4,096 third-gen TPUs, a cluster 16 times larger. Meanwhile, achieving a 0.81-minute training time with Nvidia hardware required 2,048 A100 cards and 512 AMD Epyc 7742 CPU cores.

This latest MLPerf included new and modified benchmarks — Recommendation and Reinforcement Learning — and results were mixed for the TPUs. A cluster of 64 fourth-gen TPUs performed well on the Recommendation task, taking 1.12 minutes to train a model on 1TB of logs from Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) data set. (Eight Nvidia A100 cards and two AMD Epyc 7742 CPU cores finished training in 3.33 minutes.) But Nvidia pulled ahead in Reinforcement Learning, managing to train a model to a 50% win rate in a simplified version of the board game Go in 29.7 minutes with 256 A100 cards and 64 AMD Epyc 7742 CPU cores. It took 256 fourth-gen TPUs 150.95 minutes.
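To make the mixed picture concrete, the short sketch below computes two simple ratios from the figures in the paragraph above: how much longer one run took than another, and how many more accelerators it used. The `Run` type and helper names are hypothetical, introduced only for illustration, and the ratios say nothing about framework or software differences between the submissions.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One benchmark submission, using figures quoted in the article."""
    system: str
    chips: int       # accelerator count (host CPU cores are ignored here)
    minutes: float   # wall clock training time


def wall_clock_ratio(a: Run, b: Run) -> float:
    """How many times longer run `a` took than run `b`."""
    return a.minutes / b.minutes


def chip_ratio(a: Run, b: Run) -> float:
    """How many times more accelerators run `a` used than run `b`."""
    return a.chips / b.chips


# Recommendation benchmark (Criteo Terabyte CTR data set)
rec_tpu = Run("4th-gen TPU", 64, 1.12)
rec_a100 = Run("Nvidia A100", 8, 3.33)

# Reinforcement Learning benchmark (simplified Go)
rl_tpu = Run("4th-gen TPU", 256, 150.95)
rl_a100 = Run("Nvidia A100", 256, 29.7)

if __name__ == "__main__":
    print("Recommendation: A100 run took "
          f"{wall_clock_ratio(rec_a100, rec_tpu):.2f}x as long, "
          f"with TPUs using {chip_ratio(rec_tpu, rec_a100):.0f}x the chips")
    print("Reinforcement Learning: TPU run took "
          f"{wall_clock_ratio(rl_tpu, rl_a100):.2f}x as long, "
          f"at {chip_ratio(rl_tpu, rl_a100):.0f}x the chips")
```

The Reinforcement Learning comparison is the cleaner of the two, since both runs used 256 accelerators and the TPU run took roughly five times as long. The Recommendation result is harder to read because the TPU cluster had eight times as many chips.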

One point to note is that Nvidia hardware was benchmarked on Facebook’s PyTorch framework and Nvidia’s own frameworks as opposed to Google’s TensorFlow; both third- and fourth-gen TPUs used TensorFlow, JAX, and Lingvo. That might have influenced the results somewhat, but even allowing for the possibility, the benchmarks make clear the fourth-gen TPU’s performance strengths.