MIT and IBM's ObjectNet shows that AI struggles at object detection in the real world

Object recognition models have improved by leaps and bounds over the past decade, but they've got a long way to go where accuracy is concerned. That's the conclusion of a joint team from the Massachusetts Institute of Technology and IBM, which recently released a data set -- ObjectNet -- designed to illustrate the performance gap between machine learning algorithms and humans.

Unlike many existing data sets, which feature photos taken from Flickr and other social media sites, ObjectNet's data samples were captured by paid freelancers. Depicted objects like oranges, bananas, and clothing are tipped on their sides, shot at odd angles, and displayed in clutter-strewn rooms -- scenarios with which even state-of-the-art algorithms have trouble contending. In fact, when "leading" object-detection models were tested on ObjectNet, their accuracy rates fell from a high of 97% on the publicly available ImageNet corpus to just 50% to 55%.

The work builds on a study published by Facebook AI researchers earlier this year, which found that computer vision for recognizing household objects generally works better for people in high-income households. The results showed that six popular systems worked between 10% and 20% better for the wealthiest households than they did for the poorest households, and that they were more likely to identify items in homes in North America and Europe than in Asia and Africa.

"We created this dataset to tell people the object-recognition problem continues to be a hard problem," said Boris Katz, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory and Center for Brains, Minds and Machines (CBMM), in a statement. "We need better, smarter algorithms."

It took three years to conceive of ObjectNet and design an app that would standardize the data-gathering process, according to Katz and team. The researchers hired photographers through Amazon Mechanical Turk. The photographers received photo assignments in the app, with animated instructions telling them how to orient the assigned object, what angle to shoot from, and whether to pose the object in the kitchen, bathroom, bedroom, or living room.

After a year of data gathering, during which half of all the photos the freelancers submitted had to be discarded for failing to meet basic requirements, the scientists tested a range of computer vision models against the completed ObjectNet. They allowed the models to train on half of the data before testing them on the remaining half, a practice that tends to improve performance. Even so, the detectors often struggled to understand that the object samples were three-dimensional and could be rotated and moved into new contexts, suggesting that the models have yet to fully comprehend how objects exist in the real world.
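
For readers who want to see what a train-on-half, test-on-half evaluation looks like in practice, here is a minimal Python sketch. It is not the researchers' code: the function names (split_half, accuracy, majority_baseline), the toy data, and the trivial "model" are all hypothetical stand-ins chosen for illustration, and a real experiment would use an actual image classifier and the ObjectNet photos.

```python
import random
from collections import Counter


def split_half(samples, seed=0):
    """Shuffle the samples and split them into two equal-sized halves."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def accuracy(predict, samples):
    """Fraction of (features, label) pairs where predict(features) == label."""
    if not samples:
        return 0.0
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def majority_baseline(train):
    """Hypothetical stand-in for a real model: always predicts the
    most common label seen in the training half."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common_label


# Toy data (an assumption, not ObjectNet): 75 "banana" and 25 "orange" samples.
samples = [((i,), "banana" if i % 4 else "orange") for i in range(100)]

train, test = split_half(samples)
model = majority_baseline(train)             # "train" on the first half
held_out_accuracy = accuracy(model, test)    # score only on the unseen half
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The point of the design is that the score is computed only on photos the model never saw during training, so a high number reflects generalization rather than memorization.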

"People feed these detectors huge amounts of data, but there are diminishing returns," added Katz. "You can't view an object from every angle and in every context. Our hope is that this new data set will result in robust computer vision without surprising failures in the real world."

The team intends to present their work at NeurIPS 2019 in Vancouver this week.
