Google's Dataflow pipeline tool can now run on Spark, thanks to Cloudera

If you wish to process huge piles of data very, very quickly, you're in luck.

From the comfort of your own data center, you can now use Google's recently announced Dataflow programming model for processing data in batches or as it comes in, on top of the fast Spark open-source engine.

Cloudera, one of the companies selling a distribution of Hadoop open-source software for storing and analyzing large quantities of different kinds of data, has been working with Google to make that possible. The results of their efforts are now available for free under an open-source license, the two companies announced today.

The technology could benefit the burgeoning Spark ecosystem, as well as Google, which wants programmers to adopt its Dataflow model. If that happens, developers might well feel more comfortable storing and crunching data on Google's cloud.

Google last year sent shockwaves through the big data world it helped create when Urs Hölzle, Google's senior vice president of technical infrastructure, announced that Googlers "don't really use MapReduce anymore." In lieu of MapReduce, which Google first developed more than 10 years ago and which still lies at the heart of Hadoop, Google has largely switched to a new programming model for processing data in streaming or batch format.

Google has brought out a commercial service for running Dataflow on the Google public cloud. And late last year it went further and issued a Java software-development kit for Dataflow.

All the while, outside of Google, engineers have been making progress. Spark in recent years has emerged as a potential MapReduce successor.

Now there's a solid way to use the latest system from Google on top of Spark. And that could be great news from a technical standpoint.

"[Dataflow's] streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data," Josh Wills, Cloudera's senior director of data science, wrote in a blog post on the news.
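For readers unfamiliar with the term, "windowing" roughly means grouping records into time buckets before aggregating them. The sketch below is plain Python, not the Dataflow SDK or the Spark Streaming API; the function name `fixed_window_counts` and the sample events are hypothetical and exist only to illustrate fixed, non-overlapping windows keyed by event timestamp.

```python
from collections import defaultdict


def fixed_window_counts(events, window_seconds):
    """Count events per key inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    This is an illustrative stand-in, not a real Dataflow or Spark call.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Snap each timestamp down to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical sample data: (event time in seconds, key).
events = [(0, "a"), (30, "a"), (61, "b"), (65, "a"), (119, "b")]
result = fixed_window_counts(events, 60)
```

The same function works whether the events arrive as a finished batch or one at a time, which is the general idea behind a model that covers both batch and streaming data.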
