How computer vision works -- and why it's plagued by bias

It's no secret that AI is everywhere, yet it's not always clear when we're interacting with it, let alone which specific techniques are at play. But one subset is easy to recognize: If the experience is intelligent and involves photos or videos, or is visual in any way, computer vision is likely working behind the scenes.

Computer vision is a subfield of AI, specifically of machine learning. If AI allows machines to "think," then computer vision is what allows them to "see." More technically, it enables machines to recognize, make sense of, and respond to visual information like photos, videos, and other visual inputs.

Over the last few years, computer vision has become a major driver of AI. The technique is used widely in industries like manufacturing, ecommerce, agriculture, automotive, and medicine, to name a few. It powers everything from interactive Snapchat lenses to sports broadcasts, AR-powered shopping, medical analysis, and autonomous driving capabilities. And by 2022, the global market for the subfield is projected to reach $48.6 billion annually, up from just $6.6 billion in 2015.

The computer vision story follows that of AI overall. A slow rise full of technical hurdles. A big boom enabled by massive amounts of data. Rapid proliferation. And then growing concern over bias and how the technology is being used. To understand computer vision, it's important to understand how it works, how it's being used, and both the challenges it overcame and the ones it still faces today.

How computer vision works

Computer vision allows computers to accomplish a variety of tasks. There's image segmentation (divides an image into parts and examines them individually) and pattern recognition (recognizes the repetition of visual stimuli between images). There's also object classification (classifies objects found in an image), object tracking (finds and tracks moving objects in a video), and object detection (looks for and identifies specific objects in an image). Additionally, there's facial recognition, an advanced form of object detection that can detect and identify human faces.

As mentioned, computer vision is a subset of machine learning, and it similarly uses neural networks to sort through massive amounts of data until it understands what it's looking at. In fact, the example in our machine learning explainer about how deep learning could be used to separate photos of ice cream and pepperoni pizza is more specifically a computer vision use case. You provide the AI system with a lot of photos depicting both foods. The computer then puts the photos through several layers of processing -- which make up the neural network -- to distinguish the ice cream from the pepperoni pizza one step at a time. Earlier layers look at basic properties like lines or edges between light and dark parts of the images, while subsequent layers identify more complex features like shapes or even faces.
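To make the idea of an early "edge-finding" layer more concrete, here is a minimal sketch in plain Python. It treats a tiny grayscale image as a grid of brightness numbers and slides a small filter across it. The image, the filter values, and the function name convolve2d are all illustrative assumptions; in a real neural network the filter weights are learned from data rather than written by hand, and there are many filters stacked across many layers.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the weighted sums.

    Note: like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x6 grayscale image: dark on the left (0), bright on the right (9).
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-picked filter that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = convolve2d(image, vertical_edge)
for row in edges:
    print(row)
# [0, 27, 27, 0]
# [0, 27, 27, 0]
```

The output is zero over the flat dark and flat bright regions and large only where the filter straddles the boundary between them, which is the sense in which an early layer "sees" an edge. Later layers would combine many such responses into shapes and, eventually, whole objects.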

This works because computer vision systems interpret an image (or video) as a series of pixels, each of which is tagged with a color value. These tags serve as the inputs the system processes as it moves the image through the neural network.
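As a rough illustration of what "pixels tagged with a color value" looks like to a program, the sketch below represents a tiny image as a grid of red, green, and blue values and turns it into a flat list of numbers that could be fed to a network. The 2x3 image and the helper names to_grayscale and flatten are made up for this example, and averaging the three channels is only one simple way to get a brightness value.

```python
# Hypothetical 2x3 image: each pixel is a (red, green, blue) triple, 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]


def to_grayscale(img):
    """Collapse each RGB pixel to one brightness value (simple channel average)."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]


def flatten(img):
    """Unroll the grid row by row into one flat list, scaled to the range 0..1."""
    return [value / 255 for row in img for value in row]


gray = to_grayscale(image)
inputs = flatten(gray)

print(gray)         # [[85, 85, 85], [0, 128, 255]]
print(len(inputs))  # 6
```

Real systems usually keep all three color channels and work with much larger grids, but the principle is the same: by the time an image reaches the network, it is just numbers.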

Rise of computer vision

Like machine learning overall, computer vision dates back to the 1950s. Without our current computing power and data access, the technique was originally very manual and prone to error. But it did still resemble computer vision as we know it today; the effectiveness of first processing according to basic properties like lines or edges, for example, was discovered in 1959. That same year also saw the invention of a technology that made it possible to transform images into grids of numbers, translating them into the binary language machines could understand.

Throughout the next few decades, more technical breakthroughs helped pave the way for computer vision. First, there was the development of computer scanning technology, which for the first time enabled computers to digitize images. Then came the ability to turn two-dimensional images into three-dimensional forms. Object recognition technology that could recognize text arrived in 1974, and by 1982, computer vision really started to take shape. In that same year, one researcher further developed the processing hierarchy, just as another developed an early neural network.

By the early 2000s, object recognition specifically was garnering a lot of interest. But it was the release of ImageNet, a dataset containing millions of tagged images, in 2010 that helped propel computer vision's rise. Suddenly, a vast amount of labeled, ready-to-go data was available for anyone who wanted it. ImageNet was used widely, and most of the computer vision systems built to date have relied on it. But while computer vision systems were popular at this point, they were still turning up a lot of errors. That changed in 2012 when a model called AlexNet, which used ImageNet, significantly reduced the error rate for image recognition, ushering in today's field of computer vision.

Computer vision's bias and challenges

The availability of ImageNet was transformative for the growth and adoption of computer vision. It quite literally became the basis for the industry. But it also scarred the technology in ways that are having a real impact today.

The story of ImageNet reflects a popular saying in data science and AI: "garbage in, garbage out." In jumping to take advantage of the dataset, researchers and data scientists didn't pause to consider where the images came from, who chose them, who labeled them, why they were labeled as they were, what images or labels may have been omitted, and the effect all of this might have on how their technology would function, let alone the impact it would have on society and people's lives. Years later, in 2019, a study on ImageNet revealed the prevalence of bias and problematic labels throughout the dataset.

"Many truly offensive and harmful categories hid in the depth of ImageNet's Person categories. Some classifications were misogynist, racist, ageist, and ableist. ... Insults, racist slurs, and moral judgments abound," wrote AI researcher Kate Crawford in her book Atlas of AI. And even besides these explicit harms (some of which have been removed -- ImageNet is reportedly working to address various sources of bias), curious choices in terms of categories, hierarchy, and labeling have been found throughout the dataset. It's now widely criticized for privacy violations as well, as people whose photos were used in the dataset didn't consent to being included or labeled.

Data and algorithmic bias is one of the core issues of AI overall, but its impact is especially easy to see in some computer vision applications. Facial recognition technology, for example, is known to misidentify Black people at disproportionately high rates, yet its use is surging in retail stores. It's also already common in policing, which has prompted protests and regulations in several U.S. cities and states.

Regulation overall is an emerging challenge for computer vision (and AI in general). It's clear more of it is coming (especially if more of the world follows the European Union's path), but it's not yet known exactly what such regulations will look like, making it difficult for researchers and companies to navigate at this moment. "There's no standardization and it's uncertain. For these types of things, having clarification would be helpful," said Haniyeh Mahmoudian, DataRobot's global AI ethicist and a winner of VentureBeat's Women in AI responsibility and ethics award.

Computer vision has some technical challenges as well. It's limited by hardware, including cameras and sensors. Additionally, computer vision systems are very complex to scale. And like all types of AI, they require massive amounts of computing power (which is expensive) and data. And as the entire history of computer vision makes clear, good data that is representative, unbiased, and ethically collected is hard to come by -- and incredibly tedious to tag.