How to build AI data engines that use the right data at the right time

Machine learning (ML) has broad applications — and supervised ML, particularly, has taken off in recent years.

That makes it critical for organizations to build data engines that use the right data at the right stage of their projects’ lifecycles, Manu Sharma told the audience at VentureBeat’s Transform 2022 event.

The founder and CEO of Labelbox explained that the “fundamental premise” of supervised ML is creating annotated, or labeled, data. This involves applying semantic annotations to unstructured information such as text and video. The key is to do this accurately, so that the annotations or labels reflect an understanding of the business logic or business application, explained Sharma.

The labeled data is then fed into neural networks, with the intention that those networks learn to emulate the behavior captured in the data.

Labelbox’s platform enables data labeling in any modality – images, video or text – and in any configuration. The company’s Catalog offering brings all unstructured data into a single place and allows teams to “segment, slice and dice the data for a variety of applications,” said Sharma. The company’s tools also prepare data for model training, as well as for model testing and evaluation.

Iteration cycle bottleneck

Sharma described a “fundamental bottleneck” when it comes to iteration cycles for developing artificial intelligence (AI) systems. Across 90% of enterprises, it can take months for each iteration — and time-to-deployment becomes significant when you consider that each model can go through 50 to 100 iterations, he said.

“It’s really hard to convert labeled data into production AI models,” said Sharma. “It’s easy to create prototypes, but it’s very hard to convert those models into production.”

Some Labelbox customers have been able to deploy models in 3 to 6 months, although he pointed out that not all use cases are the same. “Some of the use cases are really hard, amazing longtail edge cases that teams continue to chase,” he said.

However, generally speaking, companies are thinking at higher levels and gaining an understanding of how to use the right technologies and products to more quickly iterate their models and get them into production.

“All spectrums of engineering over the years have benefited from faster iteration,” said Sharma. As examples, he mentioned biotechnology, self-driving cars and rocketry. “The best companies in these segments are the ones that have been able to rapidly integrate their products and bring them to market — especially (those companies) that are highly innovative.”

Still, while speed-to-implementation can be critical, it must be thoughtfully balanced with customer needs and with general safety and privacy concerns, particularly in areas such as self-driving cars or banking applications.

“There certainly needs to be checks and balances put into place where teams are ensuring they can test their models before they go into production,” said Sharma.

Accelerating the data engine flywheel

Sharma described four “major steps” in the workflow of the modern data engine.

The first is data creation and the identification of the “right data” to increase model performance.

The second is data labeling, which includes both human and programmatic labeling. Depending on their use case, teams have to decide which strategies to exploit, he said.

The third and fourth steps, respectively, are training, then testing and evaluating. Engineering teams work to improve data quality, that is, to establish what’s referred to as “the ground truth”; to identify the “right data” in the unlabeled space that should be labeled; and to perform required “surgery” such as changing parameters or hyperparameters.

“The power of this data engine is that once you get it set up in an organized way, there’s no stopping it,” said Sharma. The application produces data, the data gets labeled and models are retrained, all of which builds a “flywheel” whose value compounds over time.

And many companies want to build this flywheel as quickly as possible, he said — which means using the best possible labeled data, not necessarily training models on all available data.
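To make the idea of picking the “right data” more concrete, here is a minimal Python sketch of one common way to do it: uncertainty sampling, which sends to labelers the unlabeled items a model is least sure about. Sharma did not name this technique or describe how Labelbox implements selection, so treat it purely as an illustration. The item ids, the probabilities and the function names (`entropy`, `select_for_labeling`) are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability list (higher means less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` items with the most uncertain predictions.

    `predictions` maps a hypothetical item id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Made-up model outputs on four unlabeled items (two classes each).
unlabeled_predictions = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.80, 0.20],
    "img_004": [0.50, 0.50],
}

# Label only the two items the model is least sure about,
# rather than everything that is available.
to_label = select_for_labeling(unlabeled_predictions, budget=2)
print(to_label)
```

In a flywheel of the kind described above, a selection step like this would sit between data creation and labeling: the newly labeled items go back into training, the retrained model scores the remaining unlabeled pool, and the cycle repeats.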

The future of AI is still supervised

One of the most interesting things going on now in the AI space is the “reinvention” of natural language processing (NLP), said Sharma.

Chatbots had a hype-and-bust cycle, but now with the emergence of GPT-3 and BERT, more organizations are embedding NLP models into everyday internal experiences or customer engagements. These models can mimic human behaviors very quickly with much less data than before.

“The limit is endless here for sure,” said Sharma.

Meanwhile, supervision is here to stay, he said.

He described supervision as any act that has humans intervening with or instructing a computer during the modeling process. This can include engineers selecting the right data and feeding it to a model, performing any type of labeling, or determining edge cases.

“We always want to make sure that models are making the right decisions for us, that they are always aligned with a company’s interest and they’re reflecting a company’s values,” said Sharma. “From that perspective, [supervised learning] is going to be here for a long time.”
