Google today is making one of its cloud services, the Cloud Dataflow data-processing system, available in beta for engineers to try out.
Announced in June, Cloud Dataflow can handle batch and streaming workloads alike. It could come in handy for the dirty work of wrangling complex data — often referred to as extract, transform, load (ETL) — before analysts can query it.
Google has been accepting applications for a private alpha of the service for the past few months, but now anyone can try it.
“Today, nothing stands between you and the satisfaction of seeing your processing logic, applied in streaming or batch mode (your choice), via a fully managed processing service,” Google product manager William Vambenepe wrote in a blog post on the new feature. “Just write a program, submit it and Cloud Dataflow will do the rest. No clusters to manage, Cloud Dataflow will start the needed resources, autoscale them (within the bounds you choose) and terminate them as soon as the work is done.”
Cloud Dataflow relies on Google Compute Engine for raw computing power, as well as Google Cloud Storage and BigQuery for storing and accessing data. In other words, it’s an abstraction of several components of the Google Cloud Platform, a portfolio of cloud infrastructure tools that competes with Microsoft Azure and Amazon Web Services, among others.
Perhaps it’s not surprising that Amazon, the market leader, has also been building helpful abstractions of its underlying cloud-infrastructure services — Amazon’s Lambda event-driven computing service comes to mind. But Google is catching up quickly, and Cloud Dataflow could be an appealing service for developers looking to build data-rich programs.
Google has been preparing for this rollout. In December, Google released an open-source Java software-development kit (SDK) for Cloud Dataflow. And in January the company announced that it was collaborating with Hadoop distribution vendor Cloudera to make the Dataflow programming model compatible with Spark.
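The SDK expresses a job as a pipeline: a chain of transforms applied to a collection of data, which Dataflow then runs in batch or streaming mode on managed infrastructure. As a rough illustration of that programming model — plain Python, not the actual SDK (the open-source SDK is in Java), with all class and function names here being illustrative inventions — a word-count job might be composed like this:

```python
# Toy model of a Dataflow-style pipeline: a linear chain of transforms
# applied to a collection. The names (Pipeline, apply, flat_map) are
# illustrative only; the real open-source SDK is in Java and differs.

class Pipeline:
    def __init__(self, data):
        self.data = list(data)

    def apply(self, transform):
        # Each transform consumes the whole collection and yields a new one,
        # so stages can be chained fluently.
        return Pipeline(transform(self.data))

def flat_map(fn):
    # Expand each element into zero or more output elements.
    return lambda rows: [out for row in rows for out in fn(row)]

def count_per_key(rows):
    # Group identical elements and count occurrences.
    counts = {}
    for key in rows:
        counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items())

# A word-count job: split lines into words, then count occurrences.
lines = ["big data", "cloud data"]
result = (Pipeline(lines)
          .apply(flat_map(str.split))
          .apply(count_per_key)
          .data)
# result == [("big", 1), ("cloud", 1), ("data", 2)]
```

In the managed service, the same chain-of-transforms structure is what lets Dataflow provision workers, autoscale them, and tear them down when the job finishes, as Vambenepe describes.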
Also today, Google introduced new features for BigQuery, its managed data-analysis service. The service now lets users set access permissions on specific rows of data in tables. Google has bumped up BigQuery’s default ingestion limit to 100,000 rows per second for each table. And it’s now possible for BigQuery users to keep data in Google data centers in specific regions around the world, including the company’s three zones in Europe.