Getting to production AI faster with a data-centric approach

The production of AI systems that power the products we use has undergone a rapid transformation over the past decade. Companies that once poured resources into teams developing new algorithms are now more likely to use existing systems to build models that improve continuously.

As a result, the focus has shifted to data.

"Training data is really the new code," Manu Sharma, the CEO of training data platform Labelbox, said at VentureBeat's Transform 2021 virtual conference on Monday. "It is essentially what makes AI systems understand what we want the AI to do. [It's] the medium through which we tell a computer about our real world and how to make decisions."

What is a data engine?

A data engine is a closed-loop system in which a product or service produces data in a form that can be used to continuously train an AI system, Sharma explained. Models are trained periodically and deployed back into applications, where they generate new kinds of data. This continuous loop makes an AI system better over time.

"Data engines are very critical for nearly every AI team that hopes to go into production," Sharma said.

How to build a robust data engine

There are three keys to building a strong data engine, Sharma said: embracing automation, identifying the right data, and rapid iteration.

The process of building a data engine can be very cumbersome, often requiring a lot of people to manually label and categorize information. This could range from workers labeling office text and receipts to medical professionals hand-labeling portions of medical images to identify tumors. This is where automation comes in.

With automation, AI teams can use models that select and send data to humans for correction. Correcting data often costs less than creating data from scratch, Sharma said.
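As a rough illustration of how such a workflow might be wired up, the Python sketch below routes model pre-labels either to an accepted set or to a human review queue. The confidence threshold, the `Prediction` record, and both function names are assumptions made for this example; they are not taken from Sharma's talk or from any Labelbox API.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str          # the model's proposed label (the "pre-label")
    confidence: float   # model probability for `label`, between 0 and 1


def route_for_review(predictions, threshold=0.9):
    """Split pre-labels into auto-accepted items and a human review queue.

    The threshold is a hypothetical policy: items the model is unsure
    about go to people, who correct the label instead of creating it.
    """
    accepted, review_queue = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            review_queue.append(p)
    # Show reviewers the least confident items first.
    review_queue.sort(key=lambda p: p.confidence)
    return accepted, review_queue


def apply_corrections(review_queue, corrections):
    """Merge human corrections into the queued pre-labels.

    `corrections` maps item_id to the corrected label. An item with no
    entry is treated as a pre-label the reviewer confirmed unchanged.
    """
    return {p.item_id: corrections.get(p.item_id, p.label) for p in review_queue}
```

In this sketch a reviewer only touches the items below the threshold, and for each one either confirms the model's guess or overrides it, which is the cheaper "correct rather than create" step described above.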

One of Labelbox's largest agricultural customers uses this method of model-assisted labeling.

The company has hundreds of tractors with sensors that can stream images of crops on the farm. The sensors can automatically label vegetation and ground due to their composition. As the next step, experts classify the vegetation by species. The subsequent model automates that task of classification.

"It becomes a very iterative closed-loop approach, where models and humans are working together, ultimately enabling AI teams to label data faster," Sharma said.

The second major part of building a robust data engine is identifying the smallest set of data to label that can improve model performance across the data domain.
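One common way to approach this selection problem is uncertainty sampling, a basic active learning technique. The talk did not name a specific method, so the Python sketch below is only one possible reading: it ranks unlabeled items by the entropy of the model's predicted class probabilities and keeps the top few within a labeling budget. The function names and the `budget` parameter are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(unlabeled_probs, budget):
    """Pick the items the model is least sure about.

    `unlabeled_probs` maps an item id to the model's class probabilities
    for that item. Higher entropy means a less certain prediction, so
    those items are assumed to be the most informative ones to label.
    """
    ranked = sorted(
        unlabeled_probs,
        key=lambda item_id: entropy(unlabeled_probs[item_id]),
        reverse=True,
    )
    return ranked[:budget]
```

Items on which the model is already confident are left unlabeled, which is how a small labeled subset can stand in for the whole pool.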

Sharma used this analogy: To understand a concept, humans don't have to see every single example. We generally understand an idea and how it works after just a few instances.

AI systems can operate the same way, Sharma said.

"If your machine and teams are working smartly and have the right tools and workflows that enable them to choose the right data that is going to make the difference in the performance of the AI model, what we see is that most machine learning teams that are in production … they realize that they actually need less than 5% of labeled examples in the domain," Sharma said.

Labelbox has introduced a new tool called “model diagnostics” that can do just that.

The product, Sharma said, helps machine learning teams understand model performance in depth. They can enter model predictions at every iteration that they do, and the tool allows them to visualize these model predictions, analyze them, and form a hypothesis.
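The kind of analysis described here can be approximated with very little code. The Python sketch below is not Labelbox's tool; it is a minimal stand-in that assumes each iteration yields a list of ground-truth labels and a matching list of model predictions, and it reports which classes the model handles worst.

```python
from collections import defaultdict


def per_class_accuracy(truths, preds):
    """Return accuracy for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(truths, preds):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


def weakest_classes(truths, preds, k=1):
    """Return the k classes with the lowest accuracy, worst first."""
    acc = per_class_accuracy(truths, preds)
    return sorted(acc, key=acc.get)[:k]
```

Running this after every training iteration gives a team a concrete starting hypothesis, for example that one class needs more labeled examples, before the next round of labeling begins.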

What follows is the third key to creating a powerful data engine: rapid iteration.

Sharma said machine learning is much slower than software development, which usually involves a developer writing code and testing it within minutes. Machine learning can take weeks, if not months.

To increase the chances of a successful AI program, teams must shrink the length of the iteration cycle and be able to conduct as many experiments as possible.

"This is how we are seeing some of the best machine learning teams out there accelerating their paths to production AI systems," Sharma said.
