Can AI agents learn to generalize beyond its immediate experience? That’s an open question in machine learning research, and an area of acute interest for firms like Google parent company Alphabet’s DeepMind.
In a study conducted in collaboration with Stanford and the University College London, DeepMind scientists investigated whether systems could apply the knowledge they’d learned in one task to other, tangentially related tasks. They report that in environments ranging from a grid-world to an interactive 3D room generated in Unity (a game engine), their AI-driven agents correctly exploited the “compositional nature” of a language to interpret never-seen-before instructions.
“[While] AI systems trained in idealized or reduced situations may fail to exhibit a compositional or systematic understanding of their experience, this competence can readily emerge when, like human learners, they have access to many examples of richly varying, multi-modal observations as they learn,” wrote the contributing scientists in a preprint paper summarizing the research. “This suggests that, during training, the agent learns not only how to follow training instructions, but also general information about how word-like symbols compose and how the combination of those words affects what the agent should do in its world.”
The team investigated to what extent they could impart an AI model with systematicity, the concept of cognition whereby the ability to entertain a thought implies the ability to entertain thoughts with semantically related content. For example, systematicity enables a person who understands the phrase “John loves Mary” to understand “Mary loves John.”
In the first of several experiments — this one involving the aforementioned room — they tasked AI agents capable of observing the world from a first-person perspective to execute instructions like “find a toothbrush” and “lift a helicopter.” The agents in question could perform 26 actions in total (like gripping, lifting, lowering, and manipulating objects), and well-trained agents could execute instructions in just six actions.
Given two objects positioned at random and trained using rewards to reinforce desired behaviors, they report that agents learned the notion of lifting generally enough to apply it to objects they hadn’t seen before. Furthermore, in a subsequent task that required the agents to position objects on top of beds or trays according to instructions, they said agents achieved 90% placement accuracy — despite the challenges involved in correctly targeting the receptacle, moving objects above it, avoiding obstacles like bed-heads, and dropping them gently.
Interestingly, the researchers found that agents trained in 3D worlds tended to generalize better than those trained in 2D worlds. They suspect that the first-person frame of reference provided by the former played a part in bolstering the agents’ ability to generalize, to the extent that they factorized their experiences into chunks that they could re-use in novel situations.
In a separate test, the team investigated tasks that could be solved either with or without relying on language. (They note that language can provide a form of supervision for breaking down the world into meaningful sub-parts, which in turn can encourage systematicity and generalization.) They positioned the agent in a virtual grid containing eight randomly positioned objects, where one of the object types was randomly designated as “correct” and agents received a reward for collecting objects of this type.
Without access to language, the optimal solution was to select an object type at random and then the remainder of that type. With language, however, the target object type was named explicitly. The team reports that non-linguistic agents performed worse generally, but that both the linguistic and non-linguistic agents exhibited test generalization “substantially above” chance, implying that language wasn’t a large factor in the systematic generalization observed in earlier experiments.
The researchers say that across all tests, three factors proved to be critical: the number of words and objects experienced during training; a first-person perspective; and the diversity of input afforded by the agent’s perspective over time.
“We also emphasize that our results in no way encompass the full range of systematicity of thought [or] behaviour that one might expect of a mature adult human,” wrote the paper’s coauthors, “[but] our work builds on … earlier studies by considering … the patterns or functions that neural networks can learn, [and] also how they compose familiar patterns to interpret entirely novel stimuli … By careful experimentation, we further establish that the first-person perspective of an agent acting over time plays an important role in the emergence of this generalization.”