Google unveils a big-data pipeline for batch and stream processing in its cloud


SAN FRANCISCO — Google made a big contribution to the big data world 10 years ago, when it released a paper on MapReduce, a programming model for doing big computing jobs on hefty data sets. But it turns out that, all this time, Google has been working on something far more advanced.
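For readers unfamiliar with the model, here is a minimal sketch of the MapReduce idea in plain Python. It is an illustration only, not Google's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real system would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, chosen only for illustration.
counts = word_count(["big data big jobs", "big pipelines"])
```

The whole job has to read its full input before any result is available, which is the batch-style behavior the article contrasts with stream processing.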

At Google’s annual I/O shindig today, the tech giant announced a service that can do much, much more than MapReduce: Google Cloud Dataflow. It can either run a series of computing jobs, batch-style, or do constant work as data flows in. Engineers can start using the service in Google’s burgeoning public cloud. Google takes care of managing the thing.

“We handle all the infrastructure and the back-end work required to scale up and scale down, depending on the kind of data needs that you have,” Brian Goldfarb, head of marketing for the Google Cloud Platform, told VentureBeat ahead of Google I/O.

Google Cloud Dataflow is Google’s response to Kinesis, the stream-processing service from public-cloud market leader Amazon Web Services, which was announced in November 2013. The service is one more tool that helps flesh out Google’s cloud offering in a highly competitive business.

The new service draws from technologies Google has developed in recent years, including the FlumeJava library for running data pipelines in parallel and the MillWheel stream-processing framework.

What’s interesting is that Google, like some other companies, has moved on from the MapReduce technology it pioneered.

“It’s funny — we’ve been doing massive data for a long time here,” Goldfarb said. “We’ve learned a few things, and one of the things we’ve learned is we don’t want to use MapReduce anymore.”

With MapReduce, ingesting data before it gets transformed or analyzed can be tough and time-consuming. And with more connected devices offering up data for immediate analysis, MapReduce is no longer the best fit. Time is better spent figuring out the best way to analyze data than kicking off long-running MapReduce jobs while separately maintaining different code for stream processing with open-source tools like Storm. Hence the development of new, hybrid tools.
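To make the "hybrid" idea concrete, here is a rough sketch in plain Python of one piece of analysis code serving both a stored data set and a live feed. It does not use the Cloud Dataflow API, which the article does not describe; `count_events`, `live_feed`, and the sample events are hypothetical names and data invented for illustration.

```python
from collections import Counter


def count_events(events, window_size):
    """Yield per-window counts of event types from any iterable of events.

    The same function accepts a list (batch) or a generator (stream),
    because it only ever asks for the next event.
    """
    window = []
    for event in events:
        window.append(event)
        if len(window) == window_size:
            yield Counter(window)
            window = []
    if window:
        # Flush the final, partially filled window.
        yield Counter(window)


# Batch: the full data set already exists.
stored = ["click", "view", "click", "view", "view"]
batch_results = list(count_events(stored, window_size=2))


def live_feed():
    """Stand-in for a live source; a real feed would not end."""
    for event in ["click", "click", "view"]:
        yield event


# Stream: events are consumed one at a time as they arrive.
stream_results = list(count_events(live_feed(), window_size=2))
```

The point of the sketch is the design choice, not the counting: the analysis logic is written once, and only the source of events changes between the batch and streaming cases.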

And Cloud Dataflow is already showing value at Google.

“It’s the way we do all of our internal analysis,” Goldfarb said.
