Google is making it faster for cloud customers to process data for analysis with a forthcoming feature called Cloud Dataflow Shuffle. It’s designed to make processing streaming and batch data up to five times faster than before by applying technology the tech giant developed in-house.
The feature is built for Google’s Cloud Dataflow service, which helps customers process data before feeding it into databases, machine learning applications, and other systems. Customers set up processing tasks in Cloud Dataflow using pipelines written with the Apache Beam SDK, then Google handles the provisioning and scaling of compute resources necessary to handle those tasks.
Cloud Dataflow Shuffle accelerates those pipelines by using a Google-made system to manage shuffle operations, which sort and group data spread across multiple compute nodes. When the feature launches, customers will get the benefits at no extra cost. That’s possible because Google manages the Cloud Dataflow service and can swap in new features and components as they become available.
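For readers unfamiliar with the term, a shuffle can be pictured with a small, single-process sketch. This is not Google’s implementation or the Beam API. The function name `shuffle`, the CRC32-based partitioner, and the sample records are all assumptions chosen for illustration. It shows only the general idea: records produced on several workers are routed so that every value for a given key lands in one place, sorted by key.

```python
import zlib
from collections import defaultdict


def shuffle(worker_outputs, num_partitions):
    """Group (key, value) records from many workers by key.

    Hypothetical helper for illustration only. Each key is assigned to
    exactly one partition, so all of its values end up together.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for records in worker_outputs:
        for key, value in records:
            # Stable partitioner (an assumption): CRC32 of the key's bytes.
            index = zlib.crc32(key.encode("utf-8")) % num_partitions
            partitions[index][key].append(value)
    # Sort each partition by key so downstream steps see ordered groups.
    return [dict(sorted(p.items())) for p in partitions]


# Three workers, each holding a slice of the data (made-up sample records).
worker_outputs = [
    [("apple", 1), ("pear", 2)],
    [("apple", 3), ("fig", 4)],
    [("pear", 5)],
]
grouped = shuffle(worker_outputs, num_partitions=2)
```

In a real cluster, the routing step moves data over the network between machines, which is why shuffle-heavy pipelines tend to be slow and why a faster shuffle service can matter.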
The feature may also help attract and retain customers who might otherwise choose to run Beam pipelines elsewhere. While Google created the SDK, it’s possible for users to deploy pipelines on Apache Flink, Spark, Apex, and Gearpump clusters running in other locations as well.
The value of Cloud Dataflow Shuffle depends on the extent to which a Beam pipeline relies on shuffle operations, according to William Vambenepe, a group product manager on the Google Cloud Platform team.
“You have pipelines that do barely any shuffle,” he said. “There is only so much a shuffle accelerator is going to do if you don’t shuffle.”
However, he said that many of the pipelines that took the longest to run required extensive use of shuffle operations. Customers in those situations will get a free speed boost.
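Vambenepe’s point can be made concrete with a back-of-the-envelope calculation in the style of Amdahl’s law. The sketch below is an illustration, not a Google benchmark. The function name `overall_speedup` is hypothetical, and the assumption that the shuffle portion alone runs five times faster is made up for the example.

```python
def overall_speedup(shuffle_fraction, shuffle_speedup):
    """Estimate whole-pipeline speedup when only the shuffle part gets faster.

    shuffle_fraction: share of total run time spent shuffling (0.0 to 1.0).
    shuffle_speedup: how many times faster the shuffle part becomes.
    """
    if not 0.0 <= shuffle_fraction <= 1.0:
        raise ValueError("shuffle_fraction must be between 0 and 1")
    if shuffle_speedup <= 0:
        raise ValueError("shuffle_speedup must be positive")
    # Non-shuffle time is unchanged; shuffle time shrinks by the given factor.
    new_time = (1.0 - shuffle_fraction) + shuffle_fraction / shuffle_speedup
    return 1.0 / new_time


# Assumed 5x faster shuffle step, applied to three hypothetical pipelines.
for fraction in (0.05, 0.5, 0.9):
    speedup = overall_speedup(fraction, 5.0)
    print(f"{fraction:.0%} of time in shuffle: {speedup:.2f}x overall")
```

Under these assumptions, a pipeline that spends 5 percent of its time shuffling barely changes, while one that spends 90 percent of its time shuffling runs roughly 3.6 times faster.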
Google knows a thing or two about shuffle operations. As a test, the company’s engineers once ran a 50PB shuffle (1PB is 1,000TB) on the servers inside a freshly minted Google data center before it came online.