The Transform Technology Summits start October 13th with Low-Code/No Code: Enabling Enterprise Agility. Register now!
If you wish to process huge piles of data very, very quickly, you’re in luck.
From the comfort of your own data center, you can now use Google’s recently announced Dataflow programming model for processing data in batches or as it comes in, on top of the fast Spark open-source engine.
Cloudera, one company selling a distribution of Hadoop open-source software for storing and analyzing large quantities of different kinds of data, has been working with Google to make that possible, and the results of their efforts are now available for free under an open-source license, the two companies announced today.
The technology could benefit the burgeoning Spark ecosystem, as well as Google, which wants programmers to adopt its Dataflow model. If that happens, developers might well feel more comfortable storing and crunching data on Google’s cloud.
Google last year sent shockwaves through the big data world it helped create when Urs Hölzle, Google’s senior vice president of technical infrastructure, announced that Googlers “don’t really use MapReduce anymore.” In lieu of MapReduce, which Google first developed more than 10 years ago and still lies at the heart of Hadoop, Google has largely switched to a new programming model for processing data in streaming or batch format.
Google has brought out a commercial service for running Dataflow on the Google public cloud. And late last year it went further and issued a Java software-development kit for Dataflow.
All the while, outside of Google, engineers have been making progress. Spark in recent years has emerged as a potential MapReduce successor.
Now there’s a solid way to use the latest system from Google on top of Spark. And that could be great news from a technical standpoint.
“[Dataflow’s] streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data,” Josh Wills, Cloudera’s senior director of data science, wrote in a blog post on the news.
VentureBeatVentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact. Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:
- up-to-date information on the subjects of interest to you
- our newsletters
- gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
- networking features, and more