Google unveils a big-data pipeline for batch and stream processing in its cloud

SAN FRANCISCO — Google made a big contribution to the big data world 10 years ago, when it released a paper on MapReduce, a programming model for doing big computing jobs on hefty data sets. But it turns out that, all this time, Google has been working on something far more advanced.

At Google’s annual I/O shindig today, the tech giant announced a service that can do much, much more than MapReduce: Google Cloud Dataflow. It can either run a series of computing jobs, batch-style, or do constant work as data flows in. Engineers can start using the service in Google’s burgeoning public cloud. Google takes care of managing the thing.

“We handle all the infrastructure and the back-end work required to scale up and scale down, depending on the kind of data needs that you have,” Brian Goldfarb, head of marketing for the Google Cloud Platform, told VentureBeat ahead of Google I/O.

Google Cloud Dataflow is Google’s response to Kinesis, the stream-processing service from public-cloud market leader Amazon Web Services, which was first announced in November 2013. The service is one more tool that helps flesh out Google’s cloud offering in a highly competitive business.

The new service draws from technologies Google has developed in recent years, including the FlumeJava library for running data pipelines in parallel and the MillWheel stream-processing framework.

What’s interesting is that Google, like some other companies, has moved on from the MapReduce technology it pioneered.

“It’s funny — we’ve been doing massive data for a long time here,” Goldfarb said. “We’ve learned a few things, and one of the things we’ve learned is we don’t want to use MapReduce anymore.”

With MapReduce, ingesting data, before it gets transformed or analyzed, can be tough and time-consuming. And with more connected devices offering up data for immediate analysis, MapReduce is no longer the best fit. Time is better spent figuring out the best way to analyze data than kicking off long-running MapReduce jobs while separately tinkering with different code to do stream processing with open-source tools like Storm. Hence the development of new, hybrid tools.
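
The hybrid idea, one piece of logic serving both batch and streaming work, can be sketched in a few lines of plain Python. This is only an illustration and not the Cloud Dataflow API: the names count_events, run_batch, and run_stream are hypothetical, and a real service would also handle distribution and scaling.

```python
from collections import Counter
from typing import Iterable, Iterator


def count_events(counts: Counter, event: str) -> Counter:
    """Shared per-record logic (hypothetical): fold one event into the counts."""
    counts[event] += 1
    return counts


def run_batch(events: Iterable[str]) -> Counter:
    """Batch mode: consume a finite data set, then return one final result."""
    counts = Counter()
    for event in events:
        count_events(counts, event)
    return counts


def run_stream(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming mode: yield an updated snapshot after every incoming event."""
    counts = Counter()
    for event in events:
        # Copy the running counts so each snapshot stays independent.
        yield Counter(count_events(counts, event))


if __name__ == "__main__":
    clicks = ["home", "search", "home"]
    print(run_batch(clicks))
    for snapshot in run_stream(iter(clicks)):
        print(snapshot)
```

Both modes call the same count_events function, so the analysis code is written once. Only the driver differs: one waits for all the data, the other reports as data arrives.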

And Cloud Dataflow is already showing value at Google.

“It’s the way we do all of our internal analysis,” Goldfarb said.
