Google announces new tool for data storage and integration

How big is the data under Google’s umbrella? Gerrit Kazmaier, the general manager of its databases and analytics program in the cloud, estimates that its BigQuery database processes more than 110 terabytes of information every second. Capturing the information is one thing. The challenge is how to turn all of these raw bits into something useful enough to pay for all of its upkeep.

Google’s vision now is to create a sophisticated system for data analysis so all of those 110 terabytes can be turned into useful insight for a business or an end user. During this week's Data Cloud Summit, the company announced a unified tool for organizing data wherever it may be found and then helping companies plumb its depths for understanding.

One major part of this vision is the BigLake storage engine, a tool designed to integrate all the various data sources like warehouses or databases so the information inside them can be analyzed easily. The second major part of the vision is the Vertex AI Workbench, a web interface designed to make it easy for average users to explore their data with all of the most sophisticated algorithms from artificial intelligence, machine learning and statistics. Together, the two could be a potent option for justifying all of this data gathering.

“BigLake allows companies to unify the data warehouse and lakes to analyze data without worrying about the underlying storage format or systems,” explained Sudhir Hasbe, senior director of product management at Google Cloud. “The biggest advantage is that you don't have to duplicate your data across two different environments and create data silos.”

The BigLake tool is designed to integrate Google’s vast data storage resources with many other sources, whether they sit in clouds run by competitors or on a customer’s own servers. At the same time, it offers more sophisticated options for data governance so that enterprises can control access and comply with the growing body of privacy and security regulations.

Redefining software curation and data storage

Throughout the past decade, several metaphors and labels have evolved to describe the software that curates the data storage. Some were called warehouses; they generally offered stronger structure and compliance, but they were often unable to manage the larger volumes of information from modern web applications. Another term, the “data lake,” referred to less structured collections that were engineered to scale easily, in part because they enforced fewer rules. Google wants BigLake to offer the control of the best data warehouses with the seemingly endless availability of cloud storage.

“All of these organizations who try to innovate on top of the data lake found it to be, at the end of the day, just a data swamp,” said Kazmaier. “Our innovation at Google Cloud is that we take BigQuery and its unique architecture, its unique Serverless model, its unique storage architecture and a unique compute architecture and [integrate it] with open-source file formats and open-source processing engines.”

The open-source architecture is intended to allow customers to adopt Google’s tools gradually by integrating them with existing data infrastructure. The open formats also simplify sharing information, which makes the platform more welcoming to new users.

A big part of the offering is the Vertex AI Workbench because simply gathering and formatting the bits is not enough. Google has invested heavily in artificial intelligence and machine learning research over the years and Vertex AI Workbench allows customers to apply these algorithms to their data.

The Workbench first appeared in 2021. This week, the company is moving the product out of beta into general availability. It is also adding a mechanism for users to share machine learning models through the Vertex AI Model Registry. Users from within the same organization or perhaps outsiders will be able to deploy pre-built predictive models on their data.

The company is also pushing these capabilities to a more general audience by enhancing the integration with Sheets, the spreadsheet tool often used by frontline business executives. This drag-and-drop interface and the registry start to simplify the path from data scientists in their virtual lab to decision-makers.

“Vertex AI is a managed platform that provides every ML tool that customers need to be able to build, deploy and scale models,” said June Yang, VP of cloud AI and innovation at Google. “We call this MLOps for folks who are familiar with DevOps.”

Heated competition among cloud vendors

Google faces stiff competition from many of the other cloud vendors. Amazon, for instance, offers products like SageMaker that are also designed to gather data and feed it into AI algorithms. The Azure AI Platform from Microsoft does something similar for its customers.

At the same time, traditional database vendors are integrating these same algorithms directly into the database itself. Both Oracle and IBM, for instance, talk of “in database learning” thanks to this integration. They’ve also integrated some of these routines directly with SQL to simplify adoption by the database administrators who often prefer that language.

Google Cloud’s goal is to have a bigger reach, with broader integration that can find data wherever it might live and make it easy to analyze. Then, once the data arrives, the company wants to deliver the best models for making sense of it and justifying the entire operation. The company is fond of targeting the gap between the dreams of the AI scientists and the bottom line.

“We believe in limitless data and that means that you need to have a data cloud which is fully elastic and scales up to virtually any workload at any point in time,” summarized Kazmaier. “We believe in limitless workloads that are where the best machine learning, the best databases and the best data processing frameworks come together. And we believe in limitless reach, which means that we need to provide API frameworks and the means for developers to develop data-rich experiences, as well as tools and services, so that decision-makers have access to the freshest data at any point in time.”
