Roboflow: Popular autonomous vehicle data set contains critical flaws

A machine learning model's performance is only as good as the quality of the data set on which it's trained, and in the domain of self-driving vehicles, it's critical this performance isn't adversely impacted by errors. A troubling report from computer vision startup Roboflow alleges that exactly this scenario occurred -- according to founder Brad Dwyer, crucial bits of data were omitted from a corpus used to train self-driving car models.

Dwyer writes that Udacity Dataset 2, which contains 15,000 images captured while driving in Mountain View and neighboring cities during daylight, has omissions. Thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists are present in roughly 5,000 of the samples, or 33% (217 lack any annotations at all but actually contain cars, trucks, street lights, or pedestrians). Worse are the instances of phantom annotations and duplicated bounding boxes (where a "bounding box" is the rectangle drawn around an object of interest), in addition to "drastically" oversized bounding boxes.
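To make those categories of error concrete, here is a minimal Python sketch of an automated first pass over a set of annotations. It is not Roboflow's method. The data layout, the function name `audit_annotations`, and the 50% area threshold are all assumptions chosen for illustration. A script like this can only flag candidates for human review: it can spot an image with no labels, but it cannot tell whether that image actually contains a car or a pedestrian.

```python
from collections import Counter


def audit_annotations(images, max_area_fraction=0.5):
    """Flag images whose labels look suspicious.

    `images` is a hypothetical layout: a dict mapping a file name to
    {"width": int, "height": int, "boxes": [(label, xmin, ymin, xmax, ymax), ...]}.
    """
    report = {"unannotated": [], "duplicates": [], "oversized": []}
    for name, info in images.items():
        boxes = info["boxes"]

        # No labels at all: may be a genuinely empty scene or a missed one.
        if not boxes:
            report["unannotated"].append(name)
            continue

        # The exact same label and coordinates recorded more than once.
        if any(count > 1 for count in Counter(boxes).values()):
            report["duplicates"].append(name)

        # A box covering more than the chosen fraction of the frame.
        image_area = info["width"] * info["height"]
        for _label, xmin, ymin, xmax, ymax in boxes:
            box_area = max(0, xmax - xmin) * max(0, ymax - ymin)
            if box_area > max_area_fraction * image_area:
                report["oversized"].append(name)
                break

    return report
```

Every image the report names would still need a person to look at it. The threshold in particular is arbitrary, and a truck filling the frame at close range would be flagged even if its box is correct.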

It's problematic considering that labels are what allow an AI system to understand the implications of patterns (like when a person steps in front of a car) and evaluate future events based on that knowledge. Mislabeled or unlabeled items could lead to low accuracy and poor decision-making in turn, which in a self-driving car could be a recipe for disaster.

"Open source datasets are great, but if the public is going to trust our community with their safety we need to do a better job of ensuring the data we're sharing is complete and accurate," wrote Dwyer, who noted that thousands of students in Udacity's self-driving engineering course use Udacity Dataset 2 in conjunction with an open-source self-driving car project. "If you're using public datasets in your projects, please do your due diligence and check their integrity before using them in the wild."

It's well understood that AI is prone to bias problems stemming from incomplete or skewed data sets. For instance, word embedding, a common technique that represents words as numerical vectors, unavoidably picks up -- and at worst amplifies -- prejudices implicit in source text and dialogue. Many facial recognition systems misidentify people of color more often than white people. And Google Photos once infamously labeled pictures of darker-skinned people as "gorillas."

But underperforming AI could inflict far more harm if it's put behind the wheel of a vehicle, so to speak. There hasn't been a documented instance of a self-driving car causing a collision, but they're on public roads only in small numbers. That's likely to change -- as many as 8 million driverless cars will be added to the road in 2025, according to market research firm ABI Research, and Research and Markets anticipates there will be some 20 million autonomous cars in operation in the U.S. by 2030.

If those millions of cars run flawed AI models, the impact could be devastating, and it would make a public already wary of driverless vehicles even more skeptical. Two studies -- one published by the Brookings Institution and another by the Advocates for Highway and Auto Safety (AHAS) -- found that a majority of Americans aren't convinced of driverless cars' safety. More than 60% of respondents to the Brookings poll said that they weren't inclined to ride in self-driving cars, and almost 70% of those surveyed by the AHAS expressed concerns about sharing the road with them.

A solution to the data set problem might lie in better labeling practices. According to Udacity Dataset 2's GitHub page, crowd-sourced corpus annotation firm Autti handled the labeling, using a combination of machine learning and human labelers. It's unclear whether this approach might have contributed to the errors -- we've reached out to Autti for more information -- but a stringent validation step might've helped to spotlight them.
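What such a validation step would look like isn't specified, and nothing here describes Autti's actual process. As one generic illustration, a reviewer could independently label a small sample of images and compare those boxes against the delivered labels using intersection over union (IoU), a standard measure of how closely two boxes overlap. The sketch below assumes boxes are `(xmin, ymin, xmax, ymax)` tuples. The function names and the 0.5 threshold are hypothetical choices.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    overlap_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def unmatched_boxes(reference, candidate, threshold=0.5):
    """Return the reference boxes that no candidate box overlaps well enough.

    With the reviewer's boxes as `reference` and the delivered labels as
    `candidate`, the result lists objects the delivered labels appear to miss.
    """
    return [
        ref for ref in reference
        if not any(iou(ref, cand) >= threshold for cand in candidate)
    ]
```

Swapping the two arguments gives the opposite check: delivered boxes with no counterpart in the reviewer's pass, which is one way phantom or badly oversized annotations might surface.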

For its part, Roboflow tells Sophos' Naked Security that it plans to run experiments with the original data set and the company's fixed version of the data set, which it's made available in open source, to see how much of a problem it would have been for training various model architectures. "Of the datasets I've looked at in other domains (e.g. medicine, animals, games), this one stood out as being of particularly poor quality," Dwyer told the publication. "I would hope that the big companies who are actually putting cars on the road are being much more rigorous with their data labeling, cleaning, and verification processes."

In a statement, Udacity noted that it created the data set "as a tool purely for educational purposes" and that it never suggested the data set was fully labeled or complete. It also claims that its self-driving car -- which currently operates for educational purposes only on a closed test track -- hasn't operated on public streets for several years.

"At the time [we released the data set,] it was helpful to the researchers and engineers who were transitioning into the autonomous vehicle community," a spokesperson told VentureBeat via email. "In the intervening years, companies like Waymo, nuTonomy, and Voyage have published newer, better data sets intended for real-world scenarios. As a result, our project hasn't been active for three years ... Any attempts to show this educational data set as an actual dataset are both misleading and unhelpful."

Updated 7:52 a.m. Pacific: We've added an official statement from Udacity.