Google today announced that it has proposed submitting its Dataflow data processing technology to the Apache Software Foundation (ASF) as an Apache incubator project, a move intended to bring broader governance and transparency to the software.
Dataflow is interesting because it can handle both batch and stream processing of large data sets. It goes far beyond MapReduce, the technology that Google first documented in a 2004 paper and that sits at the core of the Hadoop open source big data software. Google Cloud Dataflow is a managed implementation of Dataflow on Google’s public cloud that developers can incorporate into their applications.
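To make the batch-versus-stream distinction concrete, here is a toy sketch in plain Python. It is not the Dataflow SDK and uses none of its APIs; the function names `count_words` and `endless_lines` are hypothetical and exist only for illustration. The idea it tries to show is that the same processing logic can be applied to a bounded input (batch) and to an unbounded one (stream).

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, Tuple


def count_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, running_count) each time a word is seen.

    The logic does not care whether `lines` ever ends.
    """
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


# Batch: the input is bounded, so the results can be collected in full.
# Building a dict keeps the last (highest) running count per word.
batch_input = ["to be or not to be"]
batch_result = dict(count_words(batch_input))


def endless_lines() -> Iterator[str]:
    """Hypothetical unbounded source, standing in for a live feed."""
    while True:
        yield "to be"


# Stream: the input never ends, so results are consumed incrementally.
# Here only the first four updates are taken.
stream_first_four = list(islice(count_words(endless_lines()), 4))

print(batch_result)
print(stream_first_four)
```

A real system also has to deal with concerns this sketch ignores, such as distributing the work across machines and handling data that arrives late or out of order.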
The Dataflow Java software development kit (SDK), which first appeared in December 2014, is already available under an open source Apache license. The SDK would come under the jurisdiction of the Apache project, along with the Apache Spark and Apache Flink runners, the Dataflow programming model, and the forthcoming Dataflow Python SDK, Google software engineer Frances Perry and product manager James Malone wrote in a blog post.
The full proposal offers an explanation of what Google is seeking from the move:
As a project under incubation, we are committed to expanding our effort to build an environment which supports a meritocracy. We are focused on engaging the community and other related projects for support and contributions. Moreover, we are committed to ensure contributors and committers to Dataflow come from a broad mix of organizations through a merit-based decision process during incubation. We believe strongly in the Dataflow model and are committed to growing an inclusive community of Dataflow contributors.
Being an Apache project can also lend more legitimacy to open source software than just putting it up on GitHub. Cloudera, which worked on making Dataflow support the Spark data processing engine, has itself submitted a proposal to make its Kudu storage engine an Apache incubator project.
“We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally,” Perry and Malone wrote.
More companies could well adopt Dataflow in their own data centers, and when they want to run it in the cloud, Google can handle that with Cloud Dataflow. That’s important for Google’s continuing effort to challenge Amazon Web Services and Microsoft Azure in the public cloud business.
Update on February 1: The proposal to make Dataflow into an Apache incubator project has been accepted, under the new name Apache Beam, according to an email today from project champion Jean-Baptiste Onofré. “I will now work with infra to get the resources for the project,” he wrote.