The process of identifying objects and understanding the world through the images collected from digital cameras is often referred to as “computer vision” or “machine vision.” It remains one of the most complicated and challenging areas of artificial intelligence (AI), in part because of the complexity of many scenes captured from the real world.
The field relies upon a mixture of geometry, statistics, optics, machine learning and sometimes controlled lighting to construct a digital version of the scene in front of the camera. Many algorithms deliberately pursue a narrow goal, such as identifying and reading license plates.
Key areas of computer vision
AI scientists often focus on particular goals, and these challenges have evolved into important subdisciplines. A narrow focus often leads to better performance because the algorithms have a more clearly defined task. The general goal of machine vision may be insurmountable, but it may be feasible to solve a simple problem like, say, reading every license plate going past a toll booth.
Some important areas are:
- Face recognition: Locating faces in images and identifying the people using ratios of the distances between facial features can help organize collections of photos and videos. In some cases, it can provide an accurate enough identification to provide security.
- Object recognition: Finding the boundaries between objects helps segment images, inventory the world, and guide automation. Sometimes the algorithms are strong enough to accurately identify objects, animals or plants, a talent that forms the foundation for applications in industrial plants, farms and other areas.
- Structured recognition: When the setting is predictable and easily simplified, something that often happens on an assembly line or an industrial plant, the algorithms can be more accurate. Computer vision algorithms provide a good way to ensure quality control and improve safety, especially for repetitive tasks.
- Structured lighting: Some algorithms use special patterns of light, often generated by lasers, to simplify the work and provide more precise answers than can be generated from a scene with diffuse lighting from many, often unpredictable, sources.
- Statistical analysis: In some cases, statistics about the scene can help track objects or people. For example, tracking the speed and length of a person’s steps can help identify the person.
- Color analysis: A careful analysis of the colors in an image can answer questions. For instance, a person’s heart rate can be measured by tracking the slightly redder wave that sweeps across the skin with each beat. Many bird species can be identified by the distribution of colors. Some algorithms rely upon sensors that can detect light frequencies outside the range of human vision.
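The heart-rate example in the color analysis item can be sketched in a few lines. This is a minimal illustration, not a clinical method: it assumes some earlier step has already located a patch of skin in each video frame and averaged its red channel, and it simply looks for the pulse rate whose frequency carries the most energy in that series. The function names, the 42 to 180 beats-per-minute search range and the synthetic test signal are all assumptions made for the example.

```python
import math


def estimate_bpm(red_means, fps, lo_bpm=42, hi_bpm=180):
    """Return the candidate pulse rate (beats per minute) with the most
    energy in a per-frame series of mean red-channel values."""
    n = len(red_means)
    baseline = sum(red_means) / n
    # Remove the constant skin tone so only the frame-to-frame change remains.
    centered = [v - baseline for v in red_means]

    best_bpm, best_power = lo_bpm, -1.0
    for bpm in range(lo_bpm, hi_bpm + 1):
        freq_hz = bpm / 60.0
        # Correlate the signal with a cosine and a sine at this frequency.
        re = sum(v * math.cos(2 * math.pi * freq_hz * i / fps)
                 for i, v in enumerate(centered))
        im = sum(v * math.sin(2 * math.pi * freq_hz * i / fps)
                 for i, v in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


def synthetic_red_means(bpm, fps, seconds):
    """Made-up test data: a steady skin tone of 150 with a faint pulse ripple."""
    freq_hz = bpm / 60.0
    return [150.0 + 0.5 * math.sin(2 * math.pi * freq_hz * i / fps)
            for i in range(int(fps * seconds))]


signal = synthetic_red_means(bpm=72, fps=30, seconds=10)
print(estimate_bpm(signal, fps=30))
```

Real footage is far noisier than this synthetic ripple. Motion, changing light and camera compression all disturb the signal, so a practical system would need filtering and a longer recording.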
Best applications for computer vision
While the challenge of teaching computers to see the world remains large, some narrow applications are understood well enough to be deployed. They may not offer perfect answers, but they are accurate and trustworthy enough to be useful to their users.
- Facial recognition: Many websites and software packages for organizing photos offer some mechanism for sorting images by the people inside them. They might, say, make it possible to find all images with a particular face. The algorithms are accurate enough for this task, in part because the users don’t require perfect accuracy and misclassified photos have little consequence. The algorithms are finding some application in areas of law enforcement and security, but many worry that their accuracy is not certain enough to support criminal prosecution.
- 3D object reconstruction: Scanning objects to create three-dimensional models is a common practice for manufacturers, game designers and artists. When the lighting is controlled, often by using a laser, the results are precise enough to accurately reproduce many smooth objects. Some feed the model into a 3D printer, sometimes with some editing, to effectively create a three-dimensional reproduction. The results from reconstructions without controlled lighting vary widely.
- Mapping and modeling: Some companies use images from planes, drones and automobiles to construct accurate models of roads, buildings and other parts of the world. The precision depends upon the accuracy of the camera sensors and the lighting on the day the images were captured. Digital maps are already precise enough for planning travel and they are continually refined, but they often require human editing for complex scenes. The building models are often accurate enough to support construction and remodeling. Roofers, for example, often bid jobs based on measurements from automatically constructed digital models.
- Autonomous vehicles: Cars that can follow lanes and maintain a good following distance are common. Capturing enough detail to accurately track all objects in the shifting and unpredictable lighting of the streets, though, has led many to use structured lighting, which is more expensive, bigger and more elaborate.
- Automated retail: Store owners and mall operators commonly use machine vision algorithms to track shopping patterns. Some are experimenting with automatically charging customers who pick up an item and don’t put it back. Robots with mounted scanners also track inventory to measure loss.
How established players are tackling computer vision
The large technology companies all offer products with some machine vision algorithms, but these are largely focused on narrow and very applied tasks like sorting collections of photos or moderating social media posts. Some, like Microsoft, maintain a large research staff that is exploring new topics.
Google, Microsoft and Apple, for example, offer photography websites for their customers that store and catalog the users’ photos. Using facial recognition software to sort collections is a valuable feature that makes finding particular photos easier.
Some of these features are sold directly as APIs for other companies to implement. Microsoft also offers a database of celebrity facial features that can be used for organizing images collected by the news media over the years. People looking for their “celebrity twin” can also find the closest match in the collection.
Some of these tools offer more elaborate details. Microsoft’s API, for instance, offers a “describe image” feature that will search multiple databases for recognizable details in the image like the appearance of a major landmark. The algorithm will also return descriptions of the objects as well as a confidence score measuring how accurate the description might be.
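A confidence score is only useful if the calling application acts on it. The sketch below shows one common pattern: keep the descriptions that clear a threshold and rank them from most to least confident. The dictionary layout, the `confident_captions` helper, the 0.5 threshold and the sample values are all invented for illustration; they are not the actual response format of Microsoft's or any other vendor's API.

```python
def confident_captions(response, threshold=0.5):
    """Return caption texts whose confidence meets the threshold, best first.

    `response` is a hypothetical dictionary with a "captions" list, where
    each entry has a "text" string and a "confidence" between 0 and 1.
    """
    captions = response.get("captions", [])
    kept = [c for c in captions if c["confidence"] >= threshold]
    kept.sort(key=lambda c: c["confidence"], reverse=True)
    return [c["text"] for c in kept]


# Made-up sample data standing in for a service's answer about one image.
sample_response = {
    "captions": [
        {"text": "a bridge over water", "confidence": 0.62},
        {"text": "the Golden Gate Bridge at sunset", "confidence": 0.91},
        {"text": "a red bicycle", "confidence": 0.18},
    ]
}

for caption in confident_captions(sample_response):
    print(caption)
```

Where to set the threshold depends on the cost of a mistake. A photo-sorting tool can tolerate a low bar, while a moderation or security tool would likely want a higher one plus human review.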
Google’s Cloud Platform offers users the option of either training their own models or relying on a large collection of pretrained models. There’s also a prebuilt system focused on delivering visual product search for companies organizing their catalog.
The Rekognition service from AWS is focused on classifying images with facial metrics and trained object models. It also offers celebrity tagging and content moderation options for social media applications. One prebuilt application is designed to enforce workplace safety rules by watching video footage to ensure that every visible employee is wearing personal protective equipment (PPE).
The major computing companies are also heavily involved in exploring autonomous travel, a challenge that relies upon several AI algorithms, but especially machine vision algorithms. Google and Apple, for instance, are widely reported to be developing cars that use multiple cameras to plan a route and avoid obstacles. They rely on a mixture of traditional cameras as well as some that use structured lighting such as lasers.
Machine vision startup scene
Many machine vision startups are concentrating on autonomous vehicles. Waymo, Pony AI, Wayve, Aeye, Cruise Automation and Argo are a few of the companies with significant funding that are building the software and sensor systems that will allow cars and other platforms to navigate themselves through the streets.
Others are applying the algorithms to help manufacturers improve their production lines by guiding robotic assembly or scrutinizing parts for errors. Saccade Vision, for instance, creates three-dimensional scans of products to look for defects. Veo Robotics created a visual system for monitoring “workcells” to watch for dangerous interactions between humans and robotic apparatuses.
Tracking humans as they move through the world is a big opportunity, whether for reasons of safety, security or compliance. VergeSense, for instance, is building a “workplace analytics” solution that aims to optimize how companies use shared offices and hot desks. Kairos builds privacy-savvy facial recognition tools that help companies know their customers and enhance the experience with options like more aware kiosks. AiCure identifies patients by their faces, dispenses the correct drugs and watches them to make sure they take the drug. Trueface watches customers and employees to detect high temperatures and enforce mask requirements.
Other machine vision companies are focusing on smaller chores. Remini, for example, offers an “AI Photo Enhancer” as an online service that will add detail to enhance images by increasing their apparent resolution.
What machine vision can’t do
The gap between AI and human ability is, perhaps, greater for machine vision algorithms than for some other areas like voice recognition. The algorithms succeed when they are asked to recognize objects that are largely unchanging. People’s faces, for instance, are largely fixed, and the ratios of distances between major features like the nose and the corners of the eyes rarely change very much. So image recognition algorithms are adept at searching vast collections of photos for faces that display the same ratios.
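The idea of matching faces by ratios can be made concrete with a toy example. The sketch below assumes that an earlier step has already found a handful of landmark points in each photo. It divides several landmark-to-landmark distances by the distance between the eyes, which makes the result independent of how large the face appears in the frame, and then compares two faces by how far apart their ratio lists are. The landmark names, the chosen pairs and the coordinates are hypothetical, and this is not a description of how any particular product works.

```python
import math

# Hypothetical landmark pairs whose distances make up the signature.
PAIRS = [
    ("left_eye", "nose"),
    ("right_eye", "nose"),
    ("nose", "mouth"),
    ("left_eye", "mouth"),
    ("right_eye", "mouth"),
]


def ratio_signature(landmarks):
    """Convert named (x, y) landmarks into distances divided by the
    eye-to-eye distance, so image size and position do not matter."""
    eye_span = math.dist(landmarks["left_eye"], landmarks["right_eye"])
    return [math.dist(landmarks[a], landmarks[b]) / eye_span for a, b in PAIRS]


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signatures; smaller means more alike."""
    return math.dist(sig_a, sig_b)


# Made-up coordinates in pixels.
face_a = {"left_eye": (30, 40), "right_eye": (70, 40),
          "nose": (50, 60), "mouth": (50, 80)}
# The same face photographed twice as large and shifted in the frame.
face_a_zoomed = {name: (2 * x + 10, 2 * y + 5) for name, (x, y) in face_a.items()}
# A different face with a longer nose and a lower mouth.
face_b = {"left_eye": (30, 40), "right_eye": (70, 40),
          "nose": (50, 70), "mouth": (50, 95)}

same = signature_distance(ratio_signature(face_a), ratio_signature(face_a_zoomed))
different = signature_distance(ratio_signature(face_a), ratio_signature(face_b))
print(same, different)
```

Ratios like these survive zooming and shifting, but not a turned head, because the flat distances in the photo change when the face rotates away from the camera. That is one reason the approach works best on collections of roughly front-facing photos.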
But even a basic task like recognizing what counts as a chair is confounded by variation. There are thousands of different types of objects where people might sit, and maybe even millions of examples. Some are building databases that look for exact replicas of known objects, but it is often difficult for machines to correctly classify new objects.
A particular challenge comes from the quality of sensors. The human eye can work in an expansive range of light, but digital cameras have trouble matching that performance in low light. On the other hand, some sensors can detect colors outside the range of the rods and cones in human eyes. An active area of research is exploiting this wider ability to allow machine vision algorithms to detect things that are literally invisible to the human eye.