Galileo looks to improve unstructured data for machine learning (ML), raises $18M

Machine learning (ML) requires data on which to train and iterate. Making use of data for ML also requires a basic understanding of what is in the training data, and gaining that understanding isn't always easy.

Notably, there is a real challenge with unstructured data, which by definition has no structure to help organize the data so that it can be useful for ML and business operations. It's a dilemma that Vikram Chatterji saw, time and again, during his tenure working as a project management lead for cloud artificial intelligence (AI) at Google.

In large companies across multiple sectors, including financial services and retail, Chatterji and his colleagues kept seeing vast volumes of unstructured data, including text, images and audio, that were just lying around. The companies kept asking him how they could leverage that unstructured data to get insights. The answer that Chatterji gave was that they could just use ML, but the simple answer was never really that simple.

"We realized very quickly that the ML model itself was something we just picked up off the shelf and it was very easy," Chatterji told VentureBeat. "But the hardest part, comprising 80 to 90% of my data scientist job, was basically to kind of go in and look at the data and try to figure out what the erroneous data points are, how to clean it, how to make sure that it's better the next time."

That realization led Chatterji and his cofounders, Yash Sheth and Atindriyo Sanyal, to form a new startup in late 2021 they called Galileo to bring data intelligence to unstructured data for ML.

Today, Galileo announced that it has raised $18 million in a series A round of funding as the company continues to scale up its technology.

Data intelligence vs. data labeling

All data, be it structured or unstructured, tends to go through a data labeling process before it is used to train an ML model. Chatterji doesn't see his firm's technology as replacing data labeling; rather, he sees Galileo as providing a layer of intelligence on top of existing ML tools.

Chatterji said that at Google and at Uber, data labeling is widely employed, but that still isn't enough to solve the challenge of effectively making sense of unstructured data. There are issues before data is labeled, including understanding the data's quality, accuracy and duplication. After data is labeled and in production, there are also areas of concern.

"After you label the data and you've trained a model, how do you figure out what the mislabeled samples are?" Chatterji said. "It's a needle in the haystack problem."

Galileo has developed a series of algorithms to rapidly identify potentially mislabeled samples. The Galileo platform also provides a range of metrics that can help data scientists identify data issues for ML models. One such metric is the data error potential score, a number that can help an organization understand the potential incidence of data errors and their impact on a model.

Overall, the approach that Galileo is taking is an attempt to 'debug' data, finding potential errors and remediating them.
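
To make the idea of surfacing potentially mislabeled samples concrete, here is a minimal sketch in Python. It is not Galileo's algorithm and not its data error potential score, neither of which is described in detail here. It assumes a generic heuristic instead: a sample is suspicious when a trained model assigns low probability to the label the sample was given. The function names, the 0.5 threshold and the sample data are all hypothetical.

```python
from typing import List, Sequence, Tuple


def label_suspicion_scores(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[float]:
    """Return 1 - p(assigned label) per sample; higher means more suspicious.

    Assumption: `probs[i][k]` is a trained model's predicted probability
    that sample i belongs to class k. This is a generic heuristic, not
    any vendor's actual metric.
    """
    return [1.0 - p[y] for p, y in zip(probs, labels)]


def flag_suspects(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.5,  # hypothetical cutoff, tune per dataset
) -> List[Tuple[int, float]]:
    """Return (sample index, score) pairs above the threshold, worst first."""
    scores = label_suspicion_scores(probs, labels)
    flagged = [(i, s) for i, s in enumerate(scores) if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical model outputs for four samples over classes [0, 1].
probs = [[0.9, 0.1], [0.2, 0.8], [0.05, 0.95], [0.6, 0.4]]
labels = [0, 1, 0, 1]

suspects = flag_suspects(probs, labels)
for index, score in suspects:
    print(f"sample {index}: suspicion {score:.2f}")
```

In this toy data, the samples at index 2 and 3 are flagged because the model disagrees with their assigned labels. A human reviewer would still need to confirm whether the label or the model is wrong.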

"The different kinds of data errors that people are looking for are just so varied, and the problem is, sometimes you don't even know what you're trying to find, but you know that a model just isn't performing well," he said.

ML data intelligence helps solve the challenge of bias and explainability

Helping to reduce potential bias in AI models is another area where Galileo can play a role.

Chatterji said that Galileo has created a variety of tools within its platform that let organizations slice data in different ways, grouping entities so they can better understand diversity across categories such as gender or geography.

"We've definitely seen people adopt these data slices to try to incorporate bias detection in their organizations," he said.
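
As a rough illustration of what slicing data for bias detection can look like, the sketch below computes model accuracy separately for each value of a grouping attribute. It is an assumption-laden example and not Galileo's tooling: the record layout, the `region` field and the function name are hypothetical, and accuracy is only one of many metrics a team might compare across slices.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_slice(records: List[dict], slice_key: str) -> Dict[str, float]:
    """Compute accuracy per group, where groups are the values of `slice_key`.

    Assumption: each record is a dict holding the grouping attribute plus
    a ground-truth `label` and a model `prediction`.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        correct[group] += int(record["label"] == record["prediction"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation records sliced by geography.
records = [
    {"region": "EU", "label": 1, "prediction": 1},
    {"region": "EU", "label": 0, "prediction": 0},
    {"region": "APAC", "label": 1, "prediction": 0},
    {"region": "APAC", "label": 0, "prediction": 0},
]

per_region = accuracy_by_slice(records, "region")
for region, accuracy in per_region.items():
    print(f"{region}: accuracy {accuracy:.2f}")
```

A large gap between slices, such as the one between the two hypothetical regions here, is a signal to inspect the underlying data for that group rather than proof of bias on its own.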

When attempting to mitigate bias in AI models, it's also critical to be able to explain how a given model reached a specific result, which is what AI explainability is all about. To that end, Galileo can show its users which words were indexed most often in arriving at a specific prediction.

To date, Galileo has focused on unstructured text data and natural language processing (NLP). Now with its new funding, the company will look to expand its platform to other use cases, including computer vision for image recognition.

"We're bullish on the idea of ML data intelligence and in the next few years we're going to see this becoming more commonplace as a core part of the stack for ML data practitioners," Chatterji said.
