Snorkel AI looks beyond data labeling for generative AI

Data labeling has long been a critical component of helping data scientists to prepare data for machine learning (ML) and artificial intelligence (AI). In the modern era of generative AI, the role of data labeling is changing.

Today Snorkel AI is announcing new capabilities that extend beyond data labeling, to help organizations curate and prepare data for generative AI. Snorkel AI has been developing a data platform that helps organizations with the data side of AI. Back in November 2022, the company's Snorkel Flow technology was updated with features that enable organizations to accelerate the often labor-intensive process of data labeling, by using large language models (LLMs) to jumpstart the process.

Now Snorkel is going a step further with its new GenFlow service for building generative AI applications, and the Snorkel Foundry that helps organizations build customized LLMs.

"How you curate, sample, filter and clean data ends up having a tremendous impact on the resulting foundation model that you get out," Alex Ratner, CEO and cofounder at Snorkel AI, told VentureBeat in an exclusive interview. "In other words, you can't just dump in a random mix of garbage data, and expect these models to turn out well."

Getting generative AI to work without good data is a hallucination

A common risk with general-purpose generative AI tools is hallucination, where a model produces responses that are not accurate.

"Hallucinations are just another kind of error that is a result of not training the model to do a specific task in the first place," Ratner said. "These models are trained out of the box to say statistically plausible-sounding things given an input prompt."

Ratner added that fundamentally, hallucinations occur as a result of a model not being trained for a specific task, or more importantly, not having all the right information in order to be accurate. One approach to solving this issue, one that multiple vendors are pursuing, is retrieval-augmented generation (RAG), in which the model retrieves relevant source documents and grounds its response in them, so that sources for the generated results can be cited. But what happens when there are no sources? That's a data problem, and it's an issue that Snorkel is looking to solve with its Snorkel Foundry.

What Snorkel Foundry does is data curation. Organizations can point the service at a data repository as part of a pre-training phase, to help data scientists get the right mix of data to meet business objectives and reduce bias and the risk of hallucination.

While some of an organization's data will have structure, such as in a database, Ratner expects that the majority of data will likely be unstructured. The Snorkel Foundry enables users to make use of all the unstructured data and also helps them to pick the right mix of data to get the best results for an LLM.

Ratner explained that Snorkel Foundry has a data sampling function that lets users identify how relevant each piece of data is, either with heuristics or with a model-driven approach, and then use that to determine the right balance of content to put into an ML training routine.
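To make the heuristic side of that idea concrete, here is a minimal sketch in Python. It is not Snorkel Foundry's API, which the article does not describe; the keyword list, the relevance() and sample_mixture() functions, the threshold and the source weights are all hypothetical. It scores each document by how many domain keywords it contains, drops documents below a threshold, and then samples across sources according to a chosen mixture.

```python
import random

# Hypothetical domain keywords; an illustration, not Snorkel Foundry's API.
KEYWORDS = {"invoice", "refund", "payment"}


def relevance(doc: str) -> float:
    """Heuristic score: fraction of words that match a domain keyword."""
    words = doc.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,!?") in KEYWORDS for w in words)
    return hits / len(words)


def sample_mixture(docs_by_source, weights, n, threshold=0.1, seed=0):
    """Filter out low-relevance docs, then sample sources by mixture weight."""
    rng = random.Random(seed)
    kept = {
        src: [d for d in docs if relevance(d) >= threshold]
        for src, docs in docs_by_source.items()
    }
    # Only sources that still have documents and a positive weight are eligible.
    sources = [s for s in kept if kept[s] and weights.get(s, 0) > 0]
    if not sources:
        return []
    picks = rng.choices(sources, weights=[weights[s] for s in sources], k=n)
    return [(src, rng.choice(kept[src])) for src in picks]


corpus = {
    "tickets": ["Please refund my payment.", "hello world"],
    "wiki": ["The weather is nice today."],
}
batch = sample_mixture(corpus, {"tickets": 0.7, "wiki": 0.3}, n=5)
```

A model-driven variant would swap relevance() for a classifier's score while leaving the filtering and sampling steps unchanged.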

"Most enterprises don't have perfectly curated data," Ratner said. "So we're helping them do that programmatically, so they can organize, and curate and optimize the mixture of data."

Beyond data labeling with GenFlow

After pre-training an LLM, a common next step is additional instruction tuning, with approaches that include RLHF (reinforcement learning from human feedback).

"Once you pre-train the model on a big unlabeled corpus of data, you get to teach it or fine-tune it to make better summaries or answer questions and have better dialogue," Ratner said.

With Snorkel Flow for non-generative AI use cases, Ratner said his company helps to classify data with tags so that it is properly labeled. But for generative AI outputs, that type of labeling is not what's needed, and that is where the new GenFlow service fits in.

GenFlow provides tooling and management capabilities for giving feedback and filtering out poor-quality data points, with the goal of helping generative AI produce an optimal output.

Why data labeling isn't dead

For all the hype around generative AI in recent months, Ratner argued that in the long run, most of the enterprise value from AI will come from more traditional predictive AI.

Ratner emphasized that data labeling remains important for predictive AI tasks, such as classifying fraud. Fundamentally, data labeling is a type of feedback that is given to help improve a model.

With generative AI there is still a need for feedback, but it takes a different form than it does for predictive AI. Rather than labeling something as one type or another, the feedback is more that an individual prefers one summary or response to another.
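The difference between the two kinds of feedback can be sketched as two record types. This is an illustration only, written in Python; the class names, field names and the win_rate() helper are hypothetical and do not come from Snorkel's products. A predictive label attaches a class to one input, while a preference record compares two candidate responses to the same prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """Predictive AI feedback: one input, one class label."""
    text: str
    label: str


@dataclass(frozen=True)
class PreferencePair:
    """Generative AI feedback: which of two responses a reviewer preferred."""
    prompt: str
    chosen: str
    rejected: str


def win_rate(pairs, response):
    """Share of comparisons involving `response` in which it was preferred."""
    wins = sum(p.chosen == response for p in pairs)
    total = sum(response in (p.chosen, p.rejected) for p in pairs)
    return wins / total if total else None


fraud_label = LabeledExample(text="Wire transfer at 3 a.m.", label="fraud")
pairs = [
    PreferencePair("Summarize the report", "Summary A", "Summary B"),
    PreferencePair("Summarize the report", "Summary A", "Summary C"),
    PreferencePair("Summarize the report", "Summary B", "Summary A"),
]
```

Here win_rate(pairs, "Summary A") aggregates three pairwise judgments into a single score, which is one simple way such preferences could be summarized before they are used for tuning.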

"As you go through that process of assembling, curating and developing over time, this feedback, whether it's labels or long-form answers ratings, we're trying to make that more programmatic, accelerated and better managed," Ratner said.
