Google today launched an updated version of Voice Access, its service that lets users control Android devices with voice commands. The update uses a machine learning model to automatically detect icons on the screen from UI screenshots, which helps it determine whether elements like images and icons have accessibility labels, that is, labels provided to Android’s accessibility services.

Accessibility labels allow Android’s accessibility services to refer to exactly one on-screen element at a time, letting users know when they’ve cycled through the UI. Unfortunately, some elements lack labels, a challenge the new version of Voice Access aims to address.

The new Voice Access (version 5.0) includes a vision-based object detection model called IconNet that can detect 31 different icon types, soon to be extended to more than 70. As Google explains in a blog post, IconNet is based on the novel CenterNet architecture, which extracts features from input images and then predicts the locations and sizes of the icons in them. Using Voice Access, users can refer to icons detected by IconNet by their names, e.g., “Tap ‘menu’.”

To train IconNet, Google engineers collected and labeled more than 700,000 app screenshots, streamlining the process by using heuristics, auxiliary models, and data augmentation techniques to identify rarer icons and enrich existing screenshots with infrequent icons. “IconNet is optimized to run on-device for mobile environments, with a compact size and fast inference time to enable a seamless user experience,” Google Research software engineers Gilles Baechler and Srinivas Sunkara wrote in their blog post.
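The idea of enriching existing screenshots with infrequent icons can be illustrated with a small sketch. This is not Google's pipeline, which the article does not describe in detail. It assumes images are plain 2D lists of grayscale pixel values, and the names `paste_icon`, `screenshot`, and `rare_icon` are hypothetical. The function pastes a rare icon at a random position and returns the bounding box, which can serve as a free training label.

```python
import random
from typing import List, Tuple

Image = List[List[int]]           # grayscale pixels, row-major (assumption)
Box = Tuple[int, int, int, int]   # (top, left, height, width)


def paste_icon(screenshot: Image, icon: Image, rng: random.Random) -> Tuple[Image, Box]:
    """Return a copy of `screenshot` with `icon` pasted at a random position.

    The returned box records where the icon landed, so the augmented
    image comes with a ground-truth label at no extra labeling cost.
    """
    s_h, s_w = len(screenshot), len(screenshot[0])
    i_h, i_w = len(icon), len(icon[0])
    if i_h > s_h or i_w > s_w:
        raise ValueError("icon does not fit inside the screenshot")

    # Pick a top-left corner so the whole icon stays on screen.
    top = rng.randint(0, s_h - i_h)
    left = rng.randint(0, s_w - i_w)

    # Copy the rows so the original screenshot is left untouched.
    augmented = [row[:] for row in screenshot]
    for dr in range(i_h):
        for dc in range(i_w):
            augmented[top + dr][left + dc] = icon[dr][dc]
    return augmented, (top, left, i_h, i_w)


# Toy example: a blank 4x5 "screenshot" and a 2x2 "rare icon".
screenshot = [[0] * 5 for _ in range(4)]
rare_icon = [[255, 255], [255, 255]]
augmented, box = paste_icon(screenshot, rare_icon, random.Random(0))
```

A real pipeline would more likely work on color images and blend the icon into plausible UI regions. The sketch only shows why pasted icons are cheap to label: the position is known at paste time.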

Google says that in the future, it plans to expand the range of elements supported by IconNet to generic images, text, and buttons. It also plans to extend IconNet to differentiate between similar-looking icons by identifying their functionality. Meanwhile, on the developer side, Google hopes to increase the number of apps with valid content descriptions by improving tools to suggest content descriptions for different elements when building applications.

Above: IconNet analyzes the pixels of the screen and identifies the centers of icons by generating heatmaps, which provide precise information about the positions and types of the icons present on the screen.
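The heatmap step described in this caption can be sketched in a few lines. This is a simplified illustration, not IconNet's actual code. It assumes the model outputs one heatmap per icon type as a 2D list of scores between 0 and 1, and that an icon center is a cell that exceeds a threshold and is strictly larger than its neighbors. The names `find_peaks` and `detect_icons` and the 0.5 threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Heatmap = List[List[float]]              # one score per cell (assumption)
Detection = Tuple[str, int, int, float]  # (icon type, row, col, score)


def find_peaks(heatmap: Heatmap, threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for each cell that is a strict local maximum
    of its 3x3 neighborhood and scores at least `threshold`."""
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbors = [
                heatmap[nr][nc]
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            ]
            # Strict comparison: flat plateaus produce no peak in this sketch.
            if all(score > n for n in neighbors):
                peaks.append((r, c, score))
    return peaks


def detect_icons(heatmaps: Dict[str, Heatmap], threshold: float = 0.5) -> List[Detection]:
    """Collect peaks from every per-type heatmap, highest score first."""
    detections = [
        (icon_type, r, c, score)
        for icon_type, heatmap in heatmaps.items()
        for r, c, score in find_peaks(heatmap, threshold)
    ]
    detections.sort(key=lambda d: d[3], reverse=True)
    return detections


# Toy heatmaps for two icon types on a 3x3 grid.
menu_map = [[0.0, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.1, 0.0]]
back_map = [[0.7, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.3]]
detections = detect_icons({"menu": menu_map, "back": back_map})
# detections == [("menu", 1, 1, 0.9), ("back", 0, 0, 0.7)]
```

In the toy example, the 0.3 cell in the "back" heatmap is dropped because it falls below the threshold. A full CenterNet-style detector would also predict a size for each center, which this sketch leaves out.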

“A significant challenge in the development of an on-device UI element detector for Voice Access is that it must be able to run on a wide variety of phones with a range of performance capabilities, while preserving the user’s privacy,” the authors wrote. “We are constantly working on improving IconNet.”

Voice Access, which launched in beta in 2016, dovetails with Google’s other mobile accessibility efforts. The company is continuing to develop Lookout, an accessibility-focused app that can identify packaged foods using computer vision, scan documents to make it easier to review letters and mail, and more. There’s also Project Euphonia, which aims to help people with speech impairments communicate more easily; Live Relay, which uses on-device speech recognition and text-to-speech to let phones listen and speak on a person’s behalf; and Project Diva, which helps people give the Google Assistant commands without using their voice.
