Google unveils a big-data pipeline for batch and stream processing in its cloud

SAN FRANCISCO -- Google made a big contribution to the big data world 10 years ago, when it released a paper on MapReduce, a programming model for doing big computing jobs on hefty data sets. But it turns out that, all this time, Google has been working on something far more advanced.

At Google's annual I/O shindig today, the tech giant announced a service that can do much, much more than MapReduce: Google Cloud Dataflow. It can either run a series of computing jobs, batch-style, or do constant work as data flows in. Engineers can start using the service in Google's burgeoning public cloud. Google takes care of managing the thing.

"We handle all the infrastructure and the back-end work required to scale up and scale down, depending on the kind of data needs that you have," Brian Goldfarb, head of marketing for the Google Cloud Platform, told VentureBeat ahead of Google I/O.

Google Cloud Dataflow is Google's response to Kinesis, the stream-processing service from public-cloud market leader Amazon Web Services, which was first announced in November 2013. The service is one more tool that helps flesh out Google's cloud offering in a highly competitive business.

The new service draws from technologies Google has developed in recent years, including the FlumeJava library for running data pipelines in parallel and the MillWheel stream-processing framework.

What's interesting is that Google, like some other companies, has gotten over, or moved on from, the MapReduce technology it pioneered.

"It's funny -- we've been doing massive data for a long time here," Goldfarb said. "We've learned a few things, and one of the things we've learned is we don't want to use MapReduce anymore."

With MapReduce, ingesting data before it gets transformed or analyzed can be tough and time-consuming. And with more connected devices offering up data for immediate analysis, MapReduce is no longer the best fit. Time is better spent figuring out the best way to analyze data than kicking off long-running MapReduce jobs while separately tinkering with different code to do stream processing with open-source tools like Storm. Hence the development of new, hybrid tools.
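To make the "hybrid" idea concrete, here is a minimal sketch in plain Python of one piece of processing logic serving both a batch job and a stream. It is not the Cloud Dataflow API, which this article does not describe; the names `word_counts`, `run_batch` and `run_streaming` are hypothetical, invented purely for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def word_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Shared logic: yield (word, running_count) as each word arrives."""
    counts: Counter = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


def run_batch(lines: Iterable[str]) -> Dict[str, int]:
    """Batch style: consume a bounded input and keep only the final totals."""
    final: Dict[str, int] = {}
    for word, count in word_counts(lines):
        final[word] = count
    return final


def run_streaming(source: Iterable[str], max_updates: int) -> List[Tuple[str, int]]:
    """Stream style: report each update as it happens.

    `max_updates` is an assumption that lets the demo stop; a real stream
    would keep running for as long as data flows in.
    """
    return list(islice(word_counts(source), max_updates))


if __name__ == "__main__":
    data = ["big data", "big jobs"]
    print(run_batch(data))                # final totals per word
    print(run_streaming(iter(data), 3))   # first three incremental updates
```

The point of the sketch is only that `word_counts` is written once; the batch and streaming wrappers differ in how they consume its output, not in the analysis itself.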

And Cloud Dataflow is already showing value at Google.

"It's the way we do all of our internal analysis," Goldfarb said.
