Assistants like Samsung’s Bixby are fairly versatile where task automation is concerned — they’ll perform multistep commands like “Open Facebook and share the three most recent photos” without complaint — but they’re not perfect. It’s incumbent upon developers to program these tasks, which means that for users, support for any new task is more or less a waiting game.
Perhaps that’s why researchers at Samsung’s AI Research Center in Toronto developed what they call VASTA, a language-assisted “programming by demonstration” system for Android smartphone automation. By leveraging AI and machine learning techniques, including computer vision, their prototype can detect and label on-screen interactions without relying on apps’ underlying interface metadata. Plus, thanks to natural language understanding algorithms akin to those underpinning Bixby, VASTA can analyze and recognize voice commands that trigger programmed tasks.
“Today’s smartphones provide a sophisticated set of tools and applications that allow users to perform many complex tasks,” wrote the coauthors of an academic paper describing the system. “Given the diversity of existing tasks and the ever-increasing amount of time users spend on their phones, automating the most tedious and repetitive tasks (such as ordering a pizza or checking one’s grades using a school app) is a desirable goal for smartphone manufacturers and users alike.”
To this end, the researchers say that VASTA enables users to create and execute automation scripts for arbitrary tasks using any (or multiple) third-party apps. They also say that unlike existing macro-recording tools for smartphones, which can similarly automate sequences of actions, their approach is robust against changes in apps’ interfaces. (It’s basically like robotic process automation.)
To kick off VASTA, users need to give a voice command, which is converted to text using Google’s Cloud Speech-to-Text service. VASTA analyzes the text to determine if it refers to a new task or an existing one for which a demonstration exists. If it’s novel, VASTA replies: “I do not know how to do that. Can you show me?” and the demonstration phase begins. At this point, the user navigates to the home screen and kills all running processes before performing the task sequence for which they’d like to create an automation. After they do so, VASTA enters a learning stage during which it taps object detection and optical character recognition to recognize elements and text from the demonstration.
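The dispatch logic described above can be sketched roughly as follows. This is a minimal illustration, not VASTA’s actual code: the class and method names are hypothetical, and a real implementation would transcribe audio with Google’s Cloud Speech-to-Text service rather than take a text transcript directly.

```python
# Hypothetical sketch of VASTA-style dispatch: a transcribed utterance
# either triggers a stored demonstration or starts the demonstration phase.
class TaskStore:
    def __init__(self):
        self.demonstrations = {}  # normalized utterance -> recorded action script

    def normalize(self, utterance: str) -> str:
        # Collapse case and whitespace so minor transcript variations match.
        return " ".join(utterance.lower().split())

    def dispatch(self, transcript: str):
        """Return a stored script for a known task, or prompt for a demo."""
        key = self.normalize(transcript)
        if key in self.demonstrations:
            return ("execute", self.demonstrations[key])
        return ("demonstrate", "I do not know how to do that. Can you show me?")

    def learn(self, transcript: str, script):
        """Store the action script recorded during the demonstration phase."""
        self.demonstrations[self.normalize(transcript)] = script


store = TaskStore()
print(store.dispatch("Set an alarm for 7 AM"))   # unknown task: asks for a demo
store.learn("Set an alarm for 7 AM", ["tap 540 1200", "tap 300 800"])
print(store.dispatch("set an  alarm for 7 am"))  # known task: replays the script
```

A real system would need fuzzier matching than exact normalized strings — the paper’s point about parameterized utterances (“send a message to Alice” vs. “send a message to Bob”) implies parameter extraction on top of this lookup.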
VASTA uses the Android Debug Bridge to capture screenshots at each interaction, as well as the type, duration, and coordinates of touch events like taps, long taps, and swipes. When an app is launched, it notes the name of the app, and it records the exact coordinates of taps on static system-level elements.
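The capture step might look something like the sketch below, which classifies a raw touch record into a tap, long tap, or swipe. The thresholds are illustrative assumptions, not values from the paper; the real recorder would shell out to ADB (e.g., `adb exec-out screencap -p` for screenshots, `adb shell getevent -lt` for raw touch events) rather than receive coordinates as arguments.

```python
import math

# Hypothetical sketch of per-interaction touch classification.
# LONG_TAP_MS and SWIPE_PX are assumed cutoffs, not VASTA's actual values.
LONG_TAP_MS = 500   # presses at least this long count as long taps
SWIPE_PX = 30       # travel of at least this distance counts as a swipe

def classify_touch(x1, y1, x2, y2, duration_ms):
    """Label a raw touch record as a tap, long tap, or swipe."""
    if math.hypot(x2 - x1, y2 - y1) >= SWIPE_PX:
        return {"type": "swipe", "from": (x1, y1), "to": (x2, y2),
                "duration_ms": duration_ms}
    kind = "long_tap" if duration_ms >= LONG_TAP_MS else "tap"
    return {"type": kind, "at": (x1, y1), "duration_ms": duration_ms}

print(classify_touch(100, 200, 102, 201, 80))    # quick press, little travel: tap
print(classify_touch(100, 200, 100, 200, 900))   # held press: long tap
print(classify_touch(100, 1500, 100, 400, 300))  # large vertical travel: swipe
```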
Triggered tasks are directly executed in the form of ADB commands, usually without modification. In the case of non-static elements, VASTA uses additional information like recognized text characters and interface elements to determine whether the command must be modified in real time.
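Replay of static events can be sketched as a straightforward translation of recorded events into ADB shell commands. The event schema here is hypothetical; the `input tap`, `input swipe`, and `monkey` commands are standard Android shell tools. A long press has no dedicated `input` subcommand, so it is commonly emulated as a zero-length swipe with a duration.

```python
# Hypothetical sketch of replay: each recorded event becomes an Android
# shell command. A runner would pass each string to
# subprocess.run(["adb", "shell", cmd]); here we only build the strings.

def to_adb_command(event):
    t = event["type"]
    if t == "tap":
        x, y = event["at"]
        return f"input tap {x} {y}"
    if t == "long_tap":
        # Emulate a long press as a swipe with identical endpoints.
        x, y = event["at"]
        return f"input swipe {x} {y} {x} {y} {event['duration_ms']}"
    if t == "swipe":
        (x1, y1), (x2, y2) = event["from"], event["to"]
        return f"input swipe {x1} {y1} {x2} {y2} {event['duration_ms']}"
    if t == "launch":
        # Launch an app by package name via the monkey tool.
        return f"monkey -p {event['package']} 1"
    raise ValueError(f"unknown event type: {t}")

script = [
    {"type": "launch", "package": "com.whatsapp"},
    {"type": "tap", "at": (540, 1200)},
    {"type": "swipe", "from": (540, 1500), "to": (540, 400), "duration_ms": 300},
]
for event in script:
    print(to_adb_command(event))
```

The interesting part of VASTA — adjusting non-static commands at execution time — is exactly what this sketch omits: before emitting a tap, the real system re-locates the target element via object detection and OCR and substitutes the fresh coordinates.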
In a user study involving 10 participants tasked with executing six tasks each (including things like setting an alarm and turning off the snooze option, finding Italian restaurants and sorting them based on distance, and sending a message to a specific contact on WhatsApp), the researchers report that VASTA was able to execute 53 out of 60 scripts successfully. Moreover, it correctly found all elements that the users interacted with 59 out of 60 times, and it predicted the exact correct parameters for 53 out of 60 utterances.
The researchers leave to future work assigning a semantic label (e.g., a “sign-in” button or a “send” icon) to each UI element using an image classification network, which they say might bolster the system’s accuracy in detecting elements during execution. They also hope to create a module that supports the transfer of data from one app to another (e.g., finding the arrival time of the next bus and sending it to a contact), and a mechanism that combines object detection and XML data to help VASTA distinguish between tasks with similar command structures but different parameter values (e.g., “Get me tickets to Metallica” in a concert app and “Get me tickets to Avengers” in a movie app).
“To the best of our knowledge, VASTA is the first system to leverage computer vision techniques for smartphone task automation,” the paper’s coauthors wrote. “This system is potentially applicable to automation across different operating systems and platforms.”