Snorkel dives into data labeling and foundation AI models

Data labeling is a critical, though often time-consuming and complex, element of modern machine learning (ML) operations.

Data labeling could also be the key to unlocking the wider enterprise potential of foundation models. While foundation models such as GPT-3 and DALL-E have tremendous utility for generating text and images, they often lack the context needed for specific enterprise use cases. In order to optimize a foundation model, tuning and additional training are needed, and that often requires labeled data.

But what if a foundation model could be used to jumpstart a data labeling process to make a smaller model useful for specific enterprise use cases? That's the challenge that data labeling vendor Snorkel AI is now claiming to help solve.

"It's one thing if you're trying to do creative generative tasks where you're generating some copy text or some creative images, but there's a big gulf between that and a complex production use case where you need to perform at a high accuracy bar for very specialized data and tasks," Alex Ratner, CEO and cofounder at Snorkel AI, told VentureBeat.

To help solve that challenge, Snorkel AI today announced a preview of its new Data-centric Foundation Model Development capabilities. The goal is to help users of the company's Snorkel Flow platform to adapt foundation models for enterprise use cases. Ratner explained that Snorkel's core research and ideas are all about finding more efficient ways to label data to train models or fine-tune them.

Going with the flow to build a new foundation for enterprise AI

There are other vendors that are also trying to build out technology to help more easily fine-tune foundation models. Among them is Nvidia, which in September announced its NeMo LLM (large language model) Service.

One of the core components of the Nvidia service is the ability for users to train large models for specific use cases with an approach known as prompt learning. With Nvidia's prompt learning approach, a companion model is trained to provide context to the pretrained LLM, using a prompt token.

Snorkel is also making use of prompts as part of its Enterprise Foundation Model Management Suite with the Foundation Model Prompt Builder feature. Ratner emphasized, however, that prompts are only one part of a larger set of tools needed to optimize foundation models for enterprise use cases.

Another tool Snorkel offers is the Foundation Model Warm Start capability, which uses an existing foundation model to help provide data labeling.

"So basically, when you upload a dataset to label in Snorkel Flow, you can now just get a push-button kind of first-pass auto labeling using the power of foundation models," Ratner said.

Ratner noted that the Warm Start is not a solution for all data labeling, but it will get the "low-hanging fruit." He suggested that users will likely use the Warm Start in combination with the prompt builder, as well as Snorkel's Foundation Model Fine-Tuning feature, to optimize models. The fine-tuning feature enables organizations to distill the foundation model into a domain-specific training set.
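As a rough illustration of the "low-hanging fruit" idea, the sketch below auto-labels only the items a model is confident about and sets the rest aside for further work. It is not Snorkel Flow's API: `toy_zero_shot`, `warm_start`, the keyword rules, the confidence scores and the 0.8 threshold are all hypothetical stand-ins, with the toy scorer taking the place of a real foundation model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a foundation model's zero-shot classifier.
# A real system would call a model here; these keyword rules and
# confidence values are invented purely for illustration.
def toy_zero_shot(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    if "refund" in lowered:
        return "billing", 0.95
    if "password" in lowered:
        return "account", 0.90
    return "other", 0.40


def warm_start(
    texts: List[str],
    scorer: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,  # assumed cutoff, not a documented default
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split texts into confidently auto-labeled items and items needing review."""
    auto_labeled: List[Tuple[str, str]] = []
    needs_review: List[str] = []
    for text in texts:
        label, confidence = scorer(text)
        if confidence >= threshold:
            auto_labeled.append((text, label))
        else:
            needs_review.append(text)
    return auto_labeled, needs_review


docs = [
    "I want a refund",
    "Reset my password please",
    "Where is your office?",
]
auto_labeled, needs_review = warm_start(docs, toy_zero_shot)
print(auto_labeled)
print(needs_review)
```

In this toy run, the first two documents receive first-pass labels and the third is held back, which mirrors the idea that a warm start covers the easy cases while prompts and fine-tuning handle the remainder.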

Generative vs. predictive AI enterprise use cases

Using foundation models for real enterprise use cases is the goal for Snorkel AI.

For better or for worse, Ratner said, people are likely more familiar today with generative AI, which uses foundation models. He distinguished generative models from the predictive AI models that enterprises commonly use today to predict a result.

As an anecdote, Ratner said that he was trying to generate some Snorkel AI logos using Stable Diffusion because, "... it was a ton of fun." He said that he went through about 30 samples and never got exactly what he wanted: an octopus wearing a snorkel underwater, which is the actual corporate logo.

"I guess that's too odd of a nonsensical image, but I got some pretty cool logos after about 30 samples as a generative, creative human loop process," Ratner said. "If you think of it from a predictive automation perspective, though, 30 tries to get one successful outcome is a 3.3% hit rate and you can never ship something with that poor a result."
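The arithmetic behind the quoted figure is simple to check. The short sketch below computes the hit rate for one success in 30 attempts; the `hit_rate` helper is a hypothetical name used only for this illustration.

```python
def hit_rate(successes: int, attempts: int) -> float:
    """Return the fraction of attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


# One acceptable result out of 30 generated samples, as in the anecdote.
rate = hit_rate(1, 30)
print(f"{rate:.1%}")  # prints 3.3%
```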

One of Snorkel's customers is online video advertising optimization vendor Pixability. Ratner explained that Pixability has millions of data points from YouTube videos that need to be classified for ML. Using the foundation model capabilities within Snorkel Flow, they are able to get the classification done quickly with an accuracy level above 90%.

Ratner said that a large U.S. bank that is a Snorkel customer was able to improve accuracy for text extraction from complex legal documents using the foundation model approach.

"We see this technology applying to the whole universe of applications where you're trying to tag, classify, extract or label something at very high accuracy for some kind of predictive-automation task across text, PDF, image and video," Ratner said. "We think it's gonna accelerate all the use cases that we currently support, along with adding new ones that wouldn't have been feasible with our existing approaches before, so we're quite excited."
