Presented by Labelbox

Iterating on training data is key to building performant models, but tightening that loop remains a challenge for even the most advanced teams. For practical insights on how to get models to production-level performance quickly with high-quality training data, don’t miss this VB Live event.

Register here for free.

The greatest challenge machine learning engineers face today is the number of time-consuming steps between gathering data and having a high-performing model. These steps can be incredibly laborious, and many enterprise ML teams lack the infrastructure or tools to move through them quickly enough.

“One of the biggest learnings we’ve had over the last few decades as a community is that the cornerstone for success in technology and engineering is faster iterations,” says Manu Sharma, CEO & cofounder of Labelbox. “The reason leading AI companies are successful is they’re iterating fast. They learn from each cycle and they improve rapidly.”

Most teams, however, don’t have the streamlined workflows or the right tools to move quickly enough to get their models into production on the timeline they want.

The biggest challenges for ML teams

Almost every enterprise-sized company now has goals to integrate AI into some aspects of their business, from finance to marketing to customer service — enabling more automation, smoother processes, and new products and services that were previously impossible. Getting to high-performing AI, however, is often hindered by several challenges.

Companies building AI-based products that must work across many different geographical regions or environments need models that are extremely accurate and robust. Building them means training and testing models repeatedly, which in turn requires a vast amount of training data spanning a wide variety of scenarios, since each model needs to be tested successfully against each one.

Even teams with AI models in production need to constantly retrain and refresh them with new data. Because these models are so hungry for data, the number-one bottleneck for iterating on them is data labeling. The most common way to handle it is outsourcing, which is a valid choice, but the process can still be improved. Data labeling can be optimized with a training data platform: software that enables transparent communication and collaboration between machine learning engineers, domain experts, and outsourced labeling teams, so that problems are uncovered and fixed right away in an iterative process.

The other big challenge for ML teams is the process of identifying and adjusting labels and training data for edge cases. Depending on the use case, data sources, and other variables, the number of edge cases can be large. To identify them quickly during the training process, it’s important for training datasets to be diverse and represent as many real-life situations as possible.

Teams can use automation to help discover these edge cases, determine which ones matter and which don’t, and then work precisely to solve those problems. “Problems are solved by labeling more data that resembles those edge cases, because the model needs to see more examples,” says Sharma.
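One common way to automate this kind of edge-case discovery is uncertainty sampling: score unlabeled examples by how unsure the current model is about them, then send the most ambiguous examples to labelers first. The sketch below assumes a classifier that outputs per-class probabilities; the margin-based scoring and the function names are illustrative, not taken from any particular platform.

```python
import numpy as np

def uncertainty_scores(probs):
    """Margin-based uncertainty: a small gap between the top two
    class probabilities means the model is unsure about the example."""
    sorted_probs = np.sort(probs, axis=1)
    return 1.0 - (sorted_probs[:, -1] - sorted_probs[:, -2])

def select_for_labeling(probs, budget):
    """Return indices of the `budget` most uncertain unlabeled examples,
    i.e. the candidates most likely to be edge cases worth labeling next."""
    scores = uncertainty_scores(np.asarray(probs))
    return np.argsort(scores)[::-1][:budget].tolist()

# Example: three unlabeled examples scored by a model.
probs = [
    [0.95, 0.03, 0.02],  # confident prediction -> low priority
    [0.40, 0.35, 0.25],  # ambiguous -> likely edge case
    [0.70, 0.20, 0.10],  # somewhat uncertain
]
print(select_for_labeling(probs, 2))  # -> [1, 2]
```

Feeding labelers from the top of this ranking, retraining, and re-scoring closes the iteration loop the article describes: each cycle spends the labeling budget on the examples the model currently handles worst.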

Take self-driving AI models, for instance. A human driver can instantly make decisions about most unexpected situations while driving, from a child running across the street to pavement left wet by rainfall. An AI model facing the same situations needs to be trained on data that represents every possible scenario a driver can encounter.

Or consider home rental organizations that need to verify that all listings are legitimate. Having a person verify all the photos that users upload can be expensive and unwieldy, so some companies have developed AI models to automatically judge whether a photo’s description matches the picture and flag misinformation. But again, the number of edge cases can dramatically affect how the algorithm performs.

Tackling the challenge

If an AI model can make decisions on a company’s behalf through products and services, that model is essentially the company’s competitive edge, and its performance depends entirely on the quality of the labeled data used to train it. Business leaders should treat training data as a competitive advantage and prioritize its quality and cultivation.

There is no silver bullet, however: the primary way for ML teams to break through bottlenecks and speed up innovation is to invest in infrastructure — including the tools and the workflows that enable ML teams to turn datasets into labeled data and make use of it. These tools should make it easy for teams to bring together every part of their labeling pipeline into a seamless process, including sending datasets to labelers, training labelers on the ontology and use case, quality management and feedback processes, model performance metrics that identify edge cases, and more.

“Choosing the right technology inherently brings the stakeholders together and streamlines their workflows and processes,” Sharma says. “By virtue of that, business leaders should be asking their teams to choose the right technologies to foster collaboration and transparency.”

To learn more about how to speed up the iteration cycle, label data quickly and effectively, improve your competitive advantage, and choose the right tools and technology, join this VB Live event.

Register here for free.

You’ll learn how to:

  • Visualize model errors and better understand where performance is weak so you can more effectively guide training data efforts
  • Identify trends in model performance and quickly find edge cases in your data
  • Reduce costs by prioritizing data labeling efforts that will most dramatically improve model performance
  • Improve collaboration between domain experts, data scientists, and labelers

Speakers:

  • Matthew McAuley, Senior Data Scientist, Allstate
  • Manu Sharma, CEO & Cofounder, Labelbox
  • Kyle Wiggers (moderator), AI Staff Writer, VentureBeat