Broadcom turbocharges AI and ML with Tomahawk 5

Artificial intelligence (AI) and machine learning (ML) are about more than algorithms: The right hardware to turbocharge your AI and ML computations is key.

To speed up job completion, AI and ML training clusters need high bandwidth and dependable transport with predictable, low tail latency (tail latency is the slowest 1 or 2% of responses, which trail the rest). A high-performance interconnect can optimize data center and high-performance computing (HPC) workloads across a portfolio of hyperconverged AI and ML training clusters, resulting in lower latency for better model training, increased data packet utilization and lower operational costs.
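To make the idea of tail latency concrete, here is a minimal Python sketch. The latency samples and the `percentile` helper are hypothetical and chosen only for illustration; they are not measurements from any Broadcom device. The sketch uses the nearest-rank method to compare the median response time with the 99th percentile.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in microseconds: 98 fast responses and 2 stragglers.
latencies_us = [10] * 98 + [250, 400]

median_us = percentile(latencies_us, 50)  # typical response
p99_us = percentile(latencies_us, 99)     # tail response
slowest_us = max(latencies_us)            # a synchronized step waits for this one

print(f"median={median_us}us p99={p99_us}us slowest={slowest_us}us")
```

In this made-up sample the median is 10 microseconds while the 99th percentile is 250 microseconds. If a training step has to wait for every transfer to finish, the stragglers rather than the typical response set the pace, which is why predictable tail latency matters for job completion time.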

Today, San Jose-based Broadcom announced its contribution to the need for high-performance interconnections: the StrataXGS Tomahawk 5 switch series, which offers 51.2 Tbps of Ethernet switching capacity in a single, monolithic device – more than double the bandwidth of its contemporaries, the company claims.

"Tomahawk 5 has twice the capacity of Tomahawk 4. As a result, it is one of the world's fastest-switching chips," said Ram Velaga, senior vice president and general manager of Broadcom's core switching group. "The newly added specific features and capabilities to optimize performance for AI and ML networks make [the] Tomahawk 5 twice as fast as the previous version."
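The capacity figures above reduce to simple arithmetic, sketched below in Python. The Tomahawk 4 figure is derived only from the "twice the capacity" statement, and the port speeds are assumptions chosen for illustration, not a list of configurations Broadcom supports.

```python
# Aggregate switching capacity in Gbps (51.2 Tbps, as stated in the announcement).
TOMAHAWK_5_GBPS = 51_200

# Derived from the "twice the capacity of Tomahawk 4" statement.
TOMAHAWK_4_GBPS = TOMAHAWK_5_GBPS // 2

def ports_at_full_rate(total_gbps, port_gbps):
    """How many ports of a given speed the aggregate capacity could feed at line rate."""
    return total_gbps // port_gbps

# Illustrative port speeds in Gbps (assumed for this example).
port_counts = {
    speed: ports_at_full_rate(TOMAHAWK_5_GBPS, speed)
    for speed in (800, 400, 200, 100)
}

for speed, count in port_counts.items():
    print(f"{count} ports at {speed} Gbps")
```

The division shows why capacity and radix go together: the same 51.2 Tbps can be spent on fewer fast ports or on many slower ones.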

Ethernet switching for performance optimization

While network bandwidth requirements in data centers continue to rise dramatically, there is also a strong push to combine general compute and storage infrastructure with optimized AI and ML training processors. As a result, AI and ML training clusters, in which a training job is spread across multiple machines, are driving the demand for fabrics with high-bandwidth connectivity, high radix and faster job completion while operating at high network utilization.

To speed up job completion, it’s critical to have effective load balancing to achieve high network utilization, as well as congestion-control mechanisms to achieve predictable tail latency. Virtualized and efficient data infrastructures, combined with capable hardware, can also improve CPU offloads and assist network accelerators in improving neural network training.

Ethernet-based infrastructures currently offer the best solution for a unified network. They combine low power with high bandwidth and radix, and the fastest serializer and deserializer (SerDes) speeds, with a predictable doubling of bandwidth every 18 to 24 months. With these advantages, as well as its large ecosystem, Ethernet can provide the highest performance interconnect per watt and dollar for AI and ML and cloud-scale infrastructure.

New Broadcom chip doubles speed and capacity

According to IDC, the global Ethernet switch market grew 12.7% year-on-year to $7.6 billion in the first quarter of 2022 (1Q22). Broadcom offers the Tomahawk family of Ethernet switches to enable the next generation of unified networks.

The Tomahawk 5 switch chips are designed to help data centers and HPC environments accelerate AI and ML capabilities. The switch chip uses a Broadcom approach known as cognitive routing, along with advanced shared-packet buffering, programmable in-band telemetry and hardware-based link failover built into the chip.

Cognitive routing optimizes network link utilization by automatically selecting the system's least heavily loaded links for each flow that passes through the switch. This is especially important for AI and ML workloads, which frequently combine short- and long-lived high-bandwidth flows with low entropy.
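The effect of low entropy can be shown with a toy Python comparison. This is only an illustration of load-aware link selection in general, not Broadcom's algorithm: the flow keys, the modulo "hash" and both placement functions are hypothetical.

```python
def static_hash_placement(flows, num_links):
    """Place each flow on a link chosen purely from its key, ignoring current load."""
    loads = [0] * num_links
    for flow_key, gbps in flows:
        loads[flow_key % num_links] += gbps
    return loads

def least_loaded_placement(flows, num_links):
    """Place each flow on whichever link currently carries the least traffic."""
    loads = [0] * num_links
    for _flow_key, gbps in flows:
        target = min(range(num_links), key=loads.__getitem__)
        loads[target] += gbps
    return loads

# Eight hypothetical 100 Gbps flows whose keys barely differ (low entropy):
# every key is a multiple of 4, so a modulo-4 hash sends them all to link 0.
flows = [(key, 100) for key in (0, 4, 8, 12, 16, 20, 24, 28)]

static_loads = static_hash_placement(flows, num_links=4)
balanced_loads = least_loaded_placement(flows, num_links=4)

print("static hash :", static_loads)
print("least loaded:", balanced_loads)
```

With these made-up flows, static hashing piles all 800 Gbps onto one link and leaves three idle, while picking the least loaded link spreads the traffic evenly at 200 Gbps per link.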

"Cognitive routing is a step beyond adaptive routing," Velaga said. "When using adaptive routing, you are only aware of data congestion between two points but are unaware of the other ends."

Cognitive routing, he added, can make the system aware of conditions apart from the next neighbor, rerouting for an optimal path that provides better load balance while avoiding congestion.

Tomahawk 5 includes real-time dynamic load balancing, which monitors the use of all links at the switch and downstream in the network to determine the best path for each flow. It also monitors the status of hardware links and automatically redirects traffic away from failed connections. These features improve network utilization and reduce congestion, resulting in a shorter job completion time.
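The difference between looking only at the local link and looking further downstream, along with steering around failed links, can be sketched in a few lines of Python. The `Path` type, the utilization numbers and both selection functions are hypothetical; this is a simplified model of the idea, not a description of how the chip implements it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    name: str
    local_util: float       # utilization of the switch's own egress link, 0.0 to 1.0
    downstream_util: float  # utilization of the busiest link further along the path
    link_up: bool = True

def live_paths(paths):
    """Drop paths whose local link has failed."""
    live = [p for p in paths if p.link_up]
    if not live:
        raise RuntimeError("no usable path")
    return live

def pick_local_only(paths):
    """Choose using local congestion alone."""
    return min(live_paths(paths), key=lambda p: p.local_util)

def pick_end_to_end(paths):
    """Choose by the most congested hop, local or downstream."""
    return min(live_paths(paths), key=lambda p: max(p.local_util, p.downstream_util))

paths = [
    Path("A", local_util=0.20, downstream_util=0.95),
    Path("B", local_util=0.40, downstream_util=0.30),
    Path("C", local_util=0.10, downstream_util=0.10, link_up=False),  # failed link
]

local_choice = pick_local_only(paths).name
end_to_end_choice = pick_end_to_end(paths).name

print("local-only choice:", local_choice)
print("end-to-end choice:", end_to_end_choice)
```

In this example the local-only view prefers path A because its first hop looks quiet, even though A runs into a 95% utilized link downstream. The end-to-end view picks B instead, and both skip C because its link is down.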

The future of Ethernet for AI and ML infrastructures

Ethernet has the characteristics required for high-performance AI and ML training clusters: high bandwidth, end-to-end congestion management, load balancing and fabric management at a lower cost than its contemporaries, such as InfiniBand.

Ethernet also has a robust ecosystem that continues to innovate rapidly. "Ethernet is relentless, and I would expect it to continue encroaching on areas like AI/ML," Craig Matsumoto, senior research analyst at 451 Research, told VentureBeat. "The reward is homogeneity – if I can run every workload on Ethernet, assuming the performance is good enough, I can have one homogenous network that all workloads can share. It's simpler, and it buys me more redundant paths for forwarding traffic."
