To optimize data curation for AI, Lightly turns to self-supervised learning

All machine learning models are bound by a critical factor: the quality of the data on which they are trained.

The challenge of curating data to improve the quality of machine learning and AI models is well understood. A 2021 MIT research study found systemic issues in how training data was labeled, leading to inaccurate outcomes in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 prior investigations into data labeling found that 41% of models were using datasets that had been labeled by humans.

Among the vendors trying to tackle the challenge of optimizing data curation for AI is a Swiss startup, Lightly. Founded in 2019, the company announced this week that it has raised $3 million in a seed round of funding. Lightly isn't looking to be a data-labeling vendor, however. Instead, the company wants to help curate data using a self-supervised machine learning model that could one day reduce the need for data labeling operations altogether.

"I continue to be surprised at how much of the work in machine learning is manual, very tedious and not automated at all," Matthias Heller, cofounder of Lightly, told VentureBeat. "People always believe that with machine learning everything is so advanced, but machine learning and deep learning, in particular, is such a young technology and a lot of the tooling and infrastructure is just now being made available."

A growing market for data curation and data labeling

There's no shortage of money or vendors in the market to help optimize data for machine learning, be it data curation or data labeling.

For example, Defined.ai, which was known as DefinedCrowd before rebranding in 2021, has raised $78 million to date to help advance its data curation vision.

And Grand View Research has forecast that the data labeling market will reach $8.2 billion by 2028, with a projected compound annual growth rate of 24.6% between 2021 and 2028. VentureBeat's own list of the top data labeling software vendors includes Appen's Figure Eight, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop and V7's Darwin.

Other popular vendors include Labelbox and the open-source Label Studio, both of which can be integrated with Lightly's technology. In general, Lightly plans an open approach, so users can use the company's technology with any labeling vendor.

How the self-supervised model works

Three years ago, Heller and his cofounder Igor Susmelj were working on a machine learning project that required them to label their data.

"We were always wondering whether the data which we were labeling actually helps improve the model," Heller said.

That led to Lightly, which includes a series of open-source projects. The primary project is the Lightly library, which provides a self-supervised approach to machine learning on images.

There are multiple approaches to training data for machine learning, Heller explained. In a supervised approach, such as with computer vision, there is an image and an associated label used in combination to teach a model, with a human doing the labeling.

Unsupervised learning, on the other hand, is the opposite – there is no need for human interaction. The self-supervised model that Lightly enables falls somewhere in the middle, requiring minimal human interaction.

"You can use the self-supervised model to curate data because the model learns certain information, certain similarities, what belongs to each other and what's different," Heller said.
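To make the idea concrete, here is a minimal sketch of how learned similarities could drive curation. It assumes each image has already been turned into an embedding vector by some self-supervised model, and it uses greedy farthest-point sampling to pick a small, varied subset to send for labeling. The function name `select_diverse`, the toy two-dimensional embeddings and the choice of farthest-point sampling are illustrative assumptions, not a description of Lightly's actual algorithm or API.

```python
import math

def select_diverse(embeddings, k, start=0):
    """Greedy farthest-point sampling (hypothetical helper).

    Returns k indices into `embeddings`, each chosen to be as far as
    possible from everything already selected, so near-duplicates are skipped.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(e, embeddings[start]) for e in embeddings]
    while len(selected) < k:
        # Pick the point that is least similar to the current selection.
        nxt = max(range(len(embeddings)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected

# Toy stand-ins for image embeddings: three tight groups of similar images.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # group 0
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # group 1
    (10.0, 0.0), (10.1, 0.0),             # group 2
]
group_of = [0, 0, 0, 1, 1, 1, 2, 2]

picked = select_diverse(embeddings, k=3)
picked_groups = sorted(group_of[i] for i in picked)
```

With a labeling budget of three, the selection takes one image from each group rather than three near-identical ones, which is the intuition behind labeling only a small, representative fraction of a dataset.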

From open source to commercial solution

While Lightly can be used for free as an open-source technology, it still requires users to do much of the work to set up the right environment and manage configuration.

Lightly's commercial service provides a managed offering with the infrastructure, tuned algorithms and learning framework all configured for users.

"Our main competition today is in-house tooling," Heller said. "We use self-supervised learning to tell you which 1% of the data you should label and use for model training."

Looking ahead, Heller provocatively forecasts that the day may come when data labeling is no longer needed, as unsupervised machine learning continues to improve.

"I think that the need for labels will be reduced significantly in the next few years," Heller said. "Maybe in the future, we won't need labels anymore."
