From data chaos to data products: How enterprises can unlock the power of generative AI

Many large enterprises are eager to experiment with generative AI and the large language models (LLMs) that power it, hoping to gain a competitive edge in a range of fields from customer service to product design, marketing and entertainment.

But before they can unleash generative AI's full potential, they need to address a fundamental challenge: data quality. If enterprises deploy LLMs that access unreliable, incomplete or inconsistent data, they risk producing inaccurate or misleading results that could badly damage their reputation or violate regulations.

That was the main message of Bruno Aziza, an Alphabet executive who led a roundtable discussion at VB Transform last week. The roundtable focused on providing a playbook for how enterprises can prepare their data and analytics infrastructure to leverage large language models.

Aziza, who was until recently the head of data and analytics for Google Cloud and who just joined Alphabet’s growth-stage fund, CapitalG, shared his insights from conversations with hundreds of customers seeking to use AI.

The 3 steps of data maturity

He outlined the three steps of data maturity he has witnessed enterprises go through to develop generative AI application competence.

First, create a data ocean, an open repository with data sharing as a key design principle. Data oceans should manage data of all types (structured, unstructured and semi-structured), whether it is stored in proprietary formats or in open-source formats like Iceberg, Delta or Hudi. Data oceans should also support both transactional and analytical data processing. All of this lets large language models access any relevant data with high levels of performance and reliability. Examples of data oceans are Google’s BigLake and Microsoft’s new OneLake. Most industry practitioners call the practice of pooling and storing data a “data lake,” but that concept has been butchered by vendors who promise to store data in a single place and then fail to deliver, Aziza said. Large enterprises also often acquire other companies, and those acquired companies store data in disparate data lakes across multiple clouds.

Second, organizations mature to a data mesh, a way to enable teams across an enterprise to innovate with distributed data while adhering to centralized policies, so that people work with information that is clean, complete and trusted. In this phase, data fabric capabilities are essential because they let teams discover, catalog and manage data at scale early on. Aziza’s advice is to use artificial intelligence for this work, since discovering data is difficult and error-prone when done manually. When data is streamed into a data ocean at large scale and in real time, it becomes difficult to manage without the help of AI.

Third, they build intelligent data-rich applications. These can be LLM-driven apps that generate content or insights based on the data in the ocean and governed by the mesh. These applications should solve real problems for customers or users, and be constantly monitored and evaluated for their performance and impact. These data products, as Aziza calls them, can also be optimized to work with real-time data.

Aziza said that these steps might not be easy or quick to implement, but they are essential for enterprises that want to avoid generative AI disasters. “If you approach poor data practices, this technology will expose bad data in bigger and broader ways,” he said.

Examples such as the lawyer who was fined after citing a fake case while using ChatGPT demonstrate the phenomenon of generative AI applications hallucinating when not directed to precise, secure and sound sources of data.

While Aziza shared some key elements of Google Cloud’s playbook for enterprises that want to get ready for LLMs, the lessons apply to any enterprise, regardless of which cloud service it uses.

Large language models and data integrity

The roundtable attracted several enterprise executives from companies like Kaiser Permanente, IBM and Accenture, who asked Aziza about some of the technical challenges and opportunities of using large language models. The topics they discussed included:

The role of vector databases: This is a new type of database that stores data as high-dimensional vectors, which are numerical representations of features or attributes. Vector databases allow large language models to find similar or relevant data more efficiently than traditional databases, using semantic search techniques. Aziza said that vector databases are “really useful” for generative AI applications. Participants mentioned Pinecone as an example of a company that offers this technology.

The role of SQL: SQL is a standard query language for accessing and manipulating data in databases. Aziza said that SQL has become the universal language for data analysis, and that it can now be used to trigger machine learning and other sophisticated workloads using cloud-based analytics platforms like Google BigQuery. He also said that natural language interfaces can now translate user requests into SQL commands, making it easier for non-technical users to interact with LLMs. However, he added that the main skill that enterprises will need is not SQL itself, but the ability to ask the right questions.
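To make the vector database idea discussed above more concrete, here is a minimal Python sketch of similarity search over stored vectors. It is an illustration only, not how Pinecone or any other product is implemented: the document names, the three-dimensional vectors and the helper functions `cosine_similarity` and `top_k` are all hypothetical, and real systems use model-generated embeddings with hundreds or thousands of dimensions plus approximate indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy index: made-up document ids mapped to made-up vectors.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.2, 0.1],
}

if __name__ == "__main__":
    # A query vector that points in roughly the same direction as the
    # refund and returns documents.
    print(top_k([1.0, 0.0, 0.0], INDEX, k=2))
```

The point of the sketch is that documents are retrieved by closeness in vector space rather than by exact keyword match, which is what lets an LLM-driven application pull in semantically related material.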

The importance of data integrity as the key starting point for generative AI was a recurring theme at VB Transform.

Google’s VP of data and analytics, Gerrit Kazmaier, said a company’s success at leveraging generative AI flows directly from ensuring data is accurate, complete and consistent. “The data that you have, how you curate it and how you manage that, interconnected with large language models, is, I think, the true leverage function in this entire journey,” he said. “As a data guy, this is just a fantastic moment because it will allow us to activate way more data in many more business processes.”

Separately, Desirée Gosby, VP of emerging technology at Walmart, credited the retailer's success at using generative AI for conversational experiences to its multi-year effort to clean up its data layer. “At the end of the day, having a capability in place that allows you to really leverage your data … and packages [these large language model applications] in a way that unleashes the innovation across your company is key,” she said. Walmart serves 50 million customers with AI-driven conversational experiences, she said.

To help enterprise executives learn more about how to manage their data for generative AI applications, VentureBeat is hosting its Data Summit 2023 on November 15. The event will feature networking opportunities and sessions on topics such as data lakes, data fabrics, data governance and data ethics. Pre-registration for a 50% discount is open now.
