Researchers on Microsoft’s Bing team have developed a novel way of generating high-quality data for training machine learning models. In a blog post and paper published ahead of the Computer Vision and Pattern Recognition Conference (CVPR) in Salt Lake City, they describe a system that can discriminate between accurately labeled data and poorly labeled data with impressive consistency.

“Getting enough high-quality training data is often the most challenging piece of building an AI-based service,” the researchers wrote. “Typically, data labeled by humans is of high quality (has relatively few mistakes) but comes at high cost — both in terms of money and time. On the other hand, automatic approaches allow for cheaper data generation in large quantities but result in more labeling errors (‘label noise’).”

As the Bing team explained, training algorithms requires gathering hundreds of thousands or even millions of data samples and sorting those samples into categories, an arduous undertaking when performed manually by data scientists. One oft-used shortcut involves “scraping” data from search engines by putting together a list of categories, performing a web search for each item in the list, and collecting the results. (For example, in the course of building a corpus for a computer vision algorithm that can distinguish between different kinds of food, you might perform an image search for “sushi.”)
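The scraping shortcut can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Bing team's pipeline: `build_noisy_corpus` and `fake_image_search` are hypothetical names, and the search results are hard-coded stand-ins for a real image search client.

```python
from typing import Callable, List, Tuple


def build_noisy_corpus(
    categories: List[str],
    search: Callable[[str], List[str]],
    per_category: int = 100,
) -> List[Tuple[str, str]]:
    """Label every search result with the query that produced it."""
    corpus = []
    for category in categories:
        for image in search(category)[:per_category]:
            # The query string becomes the label, whether or not it fits.
            corpus.append((image, category))
    return corpus


def fake_image_search(query: str) -> List[str]:
    """Hypothetical stand-in for a search client (assumption, not a real API)."""
    canned = {
        "sushi": ["sushi_roll.jpg", "nigiri.jpg", "sushi_restaurant_sign.jpg"],
        "pizza": ["margherita.jpg", "pizza_box_logo.jpg"],
    }
    return canned.get(query, [])


corpus = build_noisy_corpus(["sushi", "pizza"], fake_image_search)
for image, label in corpus:
    print(label, image)
```

Note that `sushi_restaurant_sign.jpg` ends up labeled “sushi” simply because the query returned it, which is exactly the kind of label noise the article goes on to discuss.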

Above: The Microsoft Bing team’s model cleans “noisy” data from the corpus. (Image credit: Bing)

But not every result is relevant to the searched-for category, and mistakes in training data can lead to biases and inaccuracies in the machine learning model. One way to mitigate the mislabeling problem is by training a second algorithm that finds mismatches and corrects them, but it’s a processing-intensive solution; a model must be trained for each category.

The Bing team’s method employs an AI model that can correct for errors in real time. During training, one part of the system (the class embedding vector) learns to select the images that best represent each category. Meanwhile, another part of the model (the query embedding vector) learns to embed example images into the same vector space. As training progresses, the class embedding vector and the query embedding vector grow more similar to one another if the image is a member of the category, and move further apart if it isn’t.
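The pull-together, push-apart behavior described above can be shown with a toy example. This is a minimal sketch under loud assumptions, not the team's model: the two-dimensional vectors, the logistic loss on a dot product, the learning rate, and the known membership flags are all made up for illustration, whereas the real system uses a neural network and has to cope with noisy labels.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def step(query: List[float], cls: List[float], is_member: bool, lr: float = 0.5) -> None:
    """One gradient step on a logistic loss over the dot product.

    Updates both vectors in place: members are pulled toward the class
    vector, non-members are pushed away from it.
    """
    err = sigmoid(dot(query, cls)) - (1.0 if is_member else 0.0)
    for i in range(len(query)):
        q_i, c_i = query[i], cls[i]
        query[i] = q_i - lr * err * c_i
        cls[i] = c_i - lr * err * q_i


# Toy vectors (assumed values, chosen only for illustration).
class_vec = [0.5, 0.1]   # class embedding for, say, "sushi"
member = [0.4, 0.2]      # embedding of an image that belongs to the class
outsider = [0.3, -0.2]   # embedding of an image that does not

before_member = dot(member, class_vec)
before_outsider = dot(outsider, class_vec)

for _ in range(20):
    step(member, class_vec, is_member=True)
    step(outsider, class_vec, is_member=False)

after_member = dot(member, class_vec)
after_outsider = dot(outsider, class_vec)
print(before_member, "->", after_member)
print(before_outsider, "->", after_outsider)
```

After a few updates the member's similarity to the class vector rises while the outsider's falls, which is the geometric effect the article attributes to training.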

The system eventually identifies patterns that it uses to find highly representative images for each category. It even works reliably without human-verified labels, the team said.
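One plausible way to turn learned embeddings into a cleaned dataset is to rank each scraped image by its similarity to the class vector and keep only the closest matches. The sketch below assumes that approach; the cosine measure, the `0.8` threshold, the `keep_representative` helper, and the vectors are illustrative choices, not details taken from the team's paper.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def keep_representative(
    samples: Dict[str, List[float]],
    class_vec: List[float],
    threshold: float = 0.8,  # hypothetical cutoff, chosen for illustration
) -> List[str]:
    """Return sample names whose embedding is close to the class vector,
    most similar first."""
    scored = [(cosine(vec, class_vec), name) for name, vec in samples.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# Assumed embeddings for three images scraped with the query "sushi".
sushi_class = [1.0, 0.2]
scraped = {
    "nigiri.jpg": [0.9, 0.1],
    "sushi_roll.jpg": [0.8, 0.3],
    "restaurant_sign.jpg": [-0.2, 0.9],
}

clean = keep_representative(scraped, sushi_class)
print(clean)
```

Here the restaurant sign points in a very different direction from the class vector, so it falls below the cutoff and is dropped from the cleaned set.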

“The approach discussed … is already proving very effective in producing clean training data for image-related tasks,” the team wrote. “We believe it will be equally useful … applied to video, text, or speech.”