Presented by Labelbox

How much time is your machine learning team spending on labeling data — and how much of that data is actually improving model performance? Creating effective training data is a challenge that many ML teams today struggle with. It affects nearly every aspect of the ML process.

  • Time: Today, ML teams spend up to 80% of their time on curating, creating, and managing data. This includes time spent labeling, maintaining infrastructure, preparing data, training labeling teams, and other administrative tasks. This leaves very little time for ML engineers to engineer their models.
  • Quality: A model can only become as good as the data it trains on, so producing high-quality training data is imperative for advanced ML teams. Ensuring that every asset in a large dataset is labeled accurately takes even more time and resources, from getting input from domain experts to creating review processes for training data.
  • The iterative cycle: Machine learning, like software development, requires an iterative process to produce successful results. While software developers can iterate on an application multiple times a day, the iterative cycle for ML teams can take weeks or months. This is mostly due to the amount of training data required to get an algorithm up to the required level of accuracy.
  • Data: Usually, ML teams simply label all the data they have available to train their model — which not only takes time and resources to label well, but also requires more complicated labeling infrastructure to support higher volumes of data. As their slow cycles progress, ML teams also typically experience diminishing performance gains, so that even larger amounts of training data are required for small improvements in performance.

Above: While the number of annotations and costs increase over time as a model is trained, its performance sees diminishing returns.

Teams that struggle to speed up their iteration cycle, or to balance their resources between producing training data and evaluating and debugging model performance, can benefit from active learning workflows that train models faster and more efficiently.

Benefits of active learning

Active learning is an ML method in which models “ask” for the information they need to perform better. This method ensures that a model is trained only on the data most likely to increase its performance. It can help ML teams make significant improvements in speed and efficiency. Teams that embrace this method:

  • Generate less training data, which saves labeling time and costs, makes it easier to produce high-quality labels, and reduces the time between iterations
  • Have a better understanding of how their models perform, so that engineers can make data-driven decisions when developing their algorithm
  • Curate training datasets more easily based on model performance
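To make the idea of a model "asking" for data concrete, here is a minimal sketch of an active learning loop. It is a toy under stated assumptions, not a production workflow: the 1-D threshold model, the `oracle` labeling function, and every name in the block are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Labeled = List[Tuple[float, int]]


def train_threshold(labeled: Labeled) -> float:
    """Toy 1-D 'model': place the decision threshold midway between classes."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    if not negatives or not positives:
        # Only one class seen so far: fall back to the mean of labeled points.
        return sum(x for x, _ in labeled) / len(labeled)
    return (max(negatives) + min(positives)) / 2


def uncertainty(threshold: float, x: float) -> float:
    """Points nearest the threshold are the ones the toy model is least sure of."""
    return -abs(x - threshold)


def active_learning_loop(
    pool: List[float],
    oracle: Callable[[float], int],  # stands in for a human labeling team
    batch_size: int,
    rounds: int,
) -> Tuple[float, Labeled]:
    unlabeled = sorted(pool)
    # Cold start: label a few evenly spaced items so the model has something to learn from.
    step = max(1, len(unlabeled) // batch_size)
    seed = set(unlabeled[::step][:batch_size])
    labeled = [(x, oracle(x)) for x in sorted(seed)]
    unlabeled = [x for x in unlabeled if x not in seed]
    threshold = train_threshold(labeled)

    for _ in range(rounds):
        if not unlabeled:
            break
        # Rank the remaining pool so the most uncertain items come first.
        unlabeled.sort(key=lambda x: uncertainty(threshold, x), reverse=True)
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((x, oracle(x)) for x in batch)  # label only the selected batch
        threshold = train_threshold(labeled)           # retrain on the larger labeled set

    return threshold, labeled


# Hypothetical usage: 100 unlabeled points whose true class boundary is 0.3.
pool = [i / 100 for i in range(100)]
threshold, labeled = active_learning_loop(pool, lambda x: int(x >= 0.3), 4, 3)
print(f"labeled {len(labeled)} of {len(pool)} points; learned threshold {threshold:.3f}")
```

In this toy setting the loop labels only a small fraction of the pool, because each round spends the labeling budget near the current decision boundary rather than on points the model already handles.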

Better data, not more data

Active learning shifts focus from the quantity of training data to the quality of training data. A data-centric approach to ML has been lauded as a necessary pivot in AI by leaders in the space, including Andrew Ng. If the model is only as good as the data it’s trained on, the key to a highly performant model is high-quality training data. And while the quality of a labeled asset depends partly on how well it has been labeled and how well that labeling fits the specific use case or problem the model is being created to solve, it also depends on whether the labeled asset will actually improve model performance.

Employing active learning requires that teams curate their training datasets based on where the model is least confident after its latest training cycle. According to my experience at Labelbox and recent research from Stanford University, this practice can lead to equivalent model performance with 10% to 50% less training data, depending on your previous data selection methods. With less data to label for each iteration, the resources required to label training data drop significantly. These resources can then be allocated to ensure that the labels created are of high quality.
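One common way to find where a model is least confident is least-confidence sampling: score each unlabeled asset by its highest predicted class probability and queue the lowest scorers for labeling. The sketch below assumes the model outputs a probability per class for each asset; the function name and asset IDs are hypothetical, and this is one selection strategy among several, not necessarily the one any particular platform uses.

```python
from typing import Dict, List, Sequence


def least_confident(predictions: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the IDs of the k assets whose top class probability is lowest."""
    # The model's confidence in an asset is the probability of its most likely class.
    confidence = {asset_id: max(probs) for asset_id, probs in predictions.items()}
    # Sort ascending so the least confident assets come first.
    return sorted(confidence, key=confidence.get)[:k]


# Hypothetical predictions for four unlabeled assets in a two-class problem.
predictions = {
    "img_001": [0.90, 0.10],
    "img_002": [0.55, 0.45],
    "img_003": [0.60, 0.40],
    "img_004": [0.99, 0.01],
}
next_batch = least_confident(predictions, k=2)
print(next_batch)
```

Assets the model already classifies with high confidence are left out of the batch, which is where the savings in labeling volume come from.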

A smaller dataset will also take less time to label, reducing the time between iterations and enabling teams to train their models at a much faster pace. Teams will also realize more significant time savings from ensuring that each dataset boosts model performance, getting the model to production-level performance much faster than with other data selection methods.

Understanding model performance

A vital aspect of active learning is evaluating and understanding model performance after every iteration. It’s impossible to effectively curate the next training dataset without first finding areas of low confidence and edge cases. ML teams dedicated to an active learning process will need to track all performance metrics in one place to better monitor progress. They’ll also benefit from visually comparing model predictions with ground truth, particularly for computer vision and text use cases.

Above: The Model Diagnostics tool from Labelbox enables ML teams to visualize model performance and easily find errors.

Once the team has systems in place that enable fast and easy model error analysis, they can make informed decisions when putting together the next batch of training data and prioritize assets that exemplify the classes and edge cases the model needs to improve on. This process helps models reach high levels of confidence much faster than a typical procedure that relies on large datasets, datasets created through random sampling, or both.
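As a rough illustration of this kind of error analysis, the sketch below compares predictions with ground truth, computes recall per class, and ranks the classes the model handles worst so they can be prioritized in the next batch. The function names and the example labels are hypothetical, and real diagnostics would typically track more metrics than recall.

```python
from collections import Counter
from typing import Dict, List, Sequence


def per_class_recall(ground_truth: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Fraction of each ground-truth class that the model predicted correctly."""
    totals = Counter(ground_truth)
    correct = Counter(t for t, p in zip(ground_truth, predicted) if t == p)
    return {cls: correct[cls] / totals[cls] for cls in totals}


def weakest_classes(ground_truth: Sequence[str], predicted: Sequence[str], n: int) -> List[str]:
    """Return the n classes with the lowest recall, worst first."""
    recall = per_class_recall(ground_truth, predicted)
    return sorted(recall, key=recall.get)[:n]


# Hypothetical evaluation set: the model confuses bikes and trucks with cars.
ground_truth = ["car", "car", "truck", "truck", "bike", "bike"]
predicted = ["car", "car", "car", "truck", "car", "truck"]

print(per_class_recall(ground_truth, predicted))
print(weakest_classes(ground_truth, predicted, n=2))
```

The classes returned by `weakest_classes` are candidates for extra representation in the next training dataset.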

Challenges of active learning

While active learning provides many benefits, it requires specific infrastructure to ensure a smooth, repeatable process over multiple iterations and models. ML teams need one place to monitor model performance metrics and drill down into the data for specific information, rather than the patchwork of tools and analysis methods that are typically used. For those working on computer vision or text use cases, a way to visualize model predictions and compare them to ground truth data can be helpful in identifying errors and prioritizing assets for the next training dataset.

“When you have millions, maybe tens of millions, of unstructured pieces of data, you need a way of sampling those, finding which ones you’re going to queue for labeling,” said Matthew McAuley, Senior Data Scientist at Allstate, during a recent webinar with Labelbox and VentureBeat.

Teams will also need a training data pipeline that gives them complete visibility and control over their assets to produce high-quality training data for their models.

“You need tooling around that [annotation], and you need that tooling integrated with your unstructured data store,” said McAuley.

ML teams that use Labelbox have access to the aforementioned infrastructure, all within one training data platform. Watch this short demo to see how it works.

Gareth Jones is Head of Model Diagnostics & Catalog at Labelbox.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. Content produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact