Google's AI-powered app usability testing promises human-level accuracy

Smartphone users do a lot of tapping -- it's how they launch apps, enter text, multitask, and more. The problem is that tappable buttons aren't always easy to distinguish from non-tappable elements. Usability studies keep some confusing UI elements from making their way out into the open, but the tests' findings are necessarily limited to the specific apps and designs studied.

Researchers at Google's AI research division propose an alternative in a new paper ("Modeling Mobile Interface Tappability Using Crowdsourcing and Deep Learning") and accompanying blog post -- one that uses a crowdsourced labeling task to measure the "perceived tappability" of elements across a range of apps. They say that, in experiments, the AI model's predictions were consistent with the human baseline at the 90 percent level, which they believe demonstrates that it might obviate the need for manual tests.

The paper's authors began by analyzing potential visual properties, or signifiers, affecting tappability in apps (like element type, location, size, color, and words), and then recruited crowdsourced volunteers to label the "clickability" of roughly 20,000 unique elements from about 3,500 apps. They used these samples to train a neural network -- layers of mathematical functions modeled after biological neurons -- that took into account features including location, words, type, and size. The model also included a convolutional neural network that extracted features from raw pixels, and it used learned embeddings (vectors of real numbers) to represent text content and element properties. These features, fed into a fully connected network layer, produced a binary output indicating a given element's tappability.
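To make the final step concrete, here is a minimal sketch in Python of how several feature groups might be concatenated and scored by a single fully connected unit. It is an illustration only, not the researchers' implementation: the function name `predict_tappability`, the feature values, the weights, the bias, and the 0.5 threshold are all assumptions, and a real model would learn its weights from the labeled samples.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_tappability(structured, pixel_features, text_embedding,
                        weights, bias, threshold=0.5):
    """Score one UI element with a single dense unit (hypothetical sketch).

    structured      -- numbers describing location, size, and element type
    pixel_features  -- numbers a convolutional network might extract from pixels
    text_embedding  -- numbers representing the element's text
    Returns (probability, tappable) where tappable is the binary decision.
    """
    # Concatenate the three feature groups into one input vector.
    features = list(structured) + list(pixel_features) + list(text_embedding)
    if len(features) != len(weights):
        raise ValueError("expected one weight per feature")
    # Fully connected unit: weighted sum plus bias, then a sigmoid.
    score = sum(x * w for x, w in zip(features, weights)) + bias
    probability = sigmoid(score)
    return probability, probability >= threshold


# Made-up inputs for illustration; none of these values come from the paper.
structured = [0.5, 0.9, 0.2, 1.0]
pixel_features = [0.3, 0.7]
text_embedding = [0.1, -0.4, 0.6]
weights = [0.8, -0.2, 0.5, 1.1, 0.4, 0.9, -0.3, 0.2, 0.7]
bias = -1.0

probability, tappable = predict_tappability(
    structured, pixel_features, text_embedding, weights, bias
)
print(f"p(tappable) = {probability:.3f}, tappable = {tappable}")
```

A probability near 1 or 0 corresponds to a confident "tappable" or "not tappable" call, while values near the threshold signal an ambiguous element.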

To validate the model, the research team compiled a data set in which 290 volunteers labeled 2,000 elements for perceived tappability, with each element labeled independently by five different users. They found that more than 40 percent of the elements in the sample were labeled inconsistently by volunteers, a rate that was on par with the AI system's. Moreover, they report that their approach resulted in more definitive answers (a probability close to "1" for tappable and "0" for not tappable) and achieved a mean precision of 90.2 percent and a recall of 87 percent, matching human perception.
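The quantities in this validation step can be illustrated with a short Python sketch. It assumes that an element counts as "inconsistently labeled" when its five raters do not all agree, and that the majority vote serves as the reference label; the article does not spell out either definition, and the toy ratings and predictions below are invented for the example.

```python
from collections import Counter


def is_inconsistent(labels):
    """True when the raters did not all give the same label (assumed definition)."""
    return len(set(labels)) > 1


def majority_label(labels):
    """Most common label among the raters; five binary votes cannot tie."""
    return Counter(labels).most_common(1)[0][0]


def precision_recall(predicted, actual):
    """Precision and recall for binary labels (1 = tappable, 0 = not tappable)."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented ratings: four elements, five independent labels each.
ratings = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
# Invented model predictions for the same four elements.
predictions = [1, 0, 0, 1]

inconsistent_share = sum(is_inconsistent(r) for r in ratings) / len(ratings)
consensus = [majority_label(r) for r in ratings]
precision, recall = precision_recall(predictions, consensus)

print(f"inconsistently labeled: {inconsistent_share:.0%}")
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision here is the share of elements predicted tappable that the raters also considered tappable, and recall is the share of rater-tappable elements the model caught.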

"Tapping is the most commonly used gesture on mobile interfaces, and is used to trigger all kinds of actions ranging from launching an app to entering text ... [but] predicting tappability is merely one example of what we can do with machine learning to solve usability issues in user interfaces," Google AI research scientist Yang Li wrote. "There are many other challenges in interaction design and user experience research where deep learning models can offer a vehicle to distill large, diverse user experience datasets and advance scientific understandings about interaction behaviors."