In a paper accepted to last week’s International Conference on Machine Learning, researchers at University College London and the University of Oxford propose an environment, WordCraft, to benchmark AI agents’ commonsense reasoning capabilities. WordCraft is based on Little Alchemy 2, a game that tasks players with mixing ingredients to create new items, and the researchers say it is both lightweight and built upon entities and relations inspired by real-world semantics.

As the researchers note, personal assistants and household robots require agents that can learn quickly and generalize well to novel situations. That’s likely not possible without the ability to reason using common sense and general knowledge about the world. For instance, an agent performing common household chores that has never seen a dirty ashtray would still need to know a reasonable set of actions, including how to clean the ashtray, and would need to know not to feed it to a pet.

WordCraft tests the commonsense reasoning of agents by having them craft over 700 different entities (ingredients), combining previously discovered entities like “water” and “earth” to create “mud.” There are 3,417 valid item combinations in WordCraft, and an agent must use knowledge about relations between concepts to efficiently solve the game without trying every combination. Each task is created by randomly sampling a goal entity, valid constituent entities, and distractor entities, and the task difficulty can be adjusted by increasing the number of distractors or increasing the number of intermediate entities that must be created.
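
The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers’ implementation: the recipe table, the extra entities, and the names sample_task and is_solved are all hypothetical, and the sketch covers only one-step tasks rather than tasks that require intermediate entities.

```python
import random

# Hypothetical toy recipe table: each goal maps to the pair that creates it.
# The real environment has thousands of combinations; these are made up here.
RECIPES = {
    "mud": ("water", "earth"),
    "steam": ("water", "fire"),
    "lava": ("earth", "fire"),
    "brick": ("mud", "fire"),
}

# Every entity mentioned in a recipe, plus a few extras that only act as distractors.
ENTITIES = sorted(
    set(RECIPES)
    | {entity for pair in RECIPES.values() for entity in pair}
    | {"air", "stone", "plant"}
)


def sample_task(num_distractors, rng):
    """Sample a one-step task: a goal, its constituents, and distractors."""
    goal = rng.choice(sorted(RECIPES))
    constituents = list(RECIPES[goal])
    # Distractors are entities that are neither the goal nor a constituent.
    pool = [e for e in ENTITIES if e != goal and e not in constituents]
    distractors = rng.sample(pool, num_distractors)
    table = constituents + distractors
    rng.shuffle(table)  # hide which entities are the useful ones
    return {"goal": goal, "table": table, "constituents": constituents}


def is_solved(task, first, second):
    """Check whether the chosen pair is the goal's recipe, in either order."""
    return sorted((first, second)) == sorted(task["constituents"])


task = sample_task(num_distractors=3, rng=random.Random(0))
print(task["goal"], task["table"])
```

Raising num_distractors makes the task harder in the sense the article describes: the agent sees more entities on the table, but only one pair produces the goal.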


Alongside WordCraft, the researchers introduce an agent architecture that makes use of information from external knowledge graphs to guide the agent’s policy. (A knowledge graph is a structured model of a domain that stores entities as nodes and the relations between them as edges; such graphs are often built by subject-matter experts with the help of AI models.) Given that the recipes in WordCraft are based on real-world semantics among common entities, the researchers posit that conditioning on a knowledge graph should enable agents to learn more efficiently by constraining their learning to policies biased toward interactions with commonsense semantics.

In experiments, the researchers focused on zero-shot generalization performance, splitting the set of all valid recipes into training and testing sets so that agents are evaluated on recipes they never saw during training. They also collected a human baseline at the same WordCraft difficulty settings, which served as an estimate of the zero-shot performance that can be achieved using common sense and general knowledge.
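
A minimal sketch of such a held-out recipe split follows. The recipes, the 25% test fraction, and the names split_recipes and success_rate are assumptions made for illustration; the article does not give the researchers’ actual split. The example shows why the setup measures generalization: a policy that merely memorizes training recipes scores perfectly on them and fails on every held-out recipe.

```python
import random


def split_recipes(recipes, test_fraction, seed=0):
    """Shuffle recipes deterministically and hold out a fraction for testing."""
    items = sorted(recipes.items())
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return dict(items[n_test:]), dict(items[:n_test])  # (train, test)


def success_rate(policy, recipes):
    """Fraction of goals for which the policy picks the right pair."""
    if not recipes:
        return 0.0
    hits = sum(
        1 for goal, pair in recipes.items()
        if sorted(policy(goal)) == sorted(pair)
    )
    return hits / len(recipes)


# Hypothetical recipes, for illustration only.
recipes = {
    "mud": ("water", "earth"),
    "steam": ("water", "fire"),
    "lava": ("earth", "fire"),
    "brick": ("mud", "fire"),
}
train, test = split_recipes(recipes, test_fraction=0.25)


def memorizer(goal):
    """A policy with no common sense: it only recalls training recipes."""
    return train.get(goal, ("", ""))


print(success_rate(memorizer, train))  # memorized recipes
print(success_rate(memorizer, test))   # unseen recipes
```

An agent that does better than the memorizer on the test set must be drawing on something beyond the training recipes, which is the role the knowledge graph and the human baseline play in the experiments described above.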

According to the team, their agent architecture reached the same success rate as an agent without any knowledge graph in fewer training steps, but the two ultimately reached comparable levels of performance as training progressed.

“There are multiple avenues that we plan to further explore. Extending WordCraft to the longer horizon setting of the original Little Alchemy 2, in which the user must discover as many entities as possible, could be an interesting setting to study commonsense-driven exploration,” the researchers wrote. “We believe the ideas in this work could benefit more complex reinforcement learning tasks associated with large corpora of task-specific knowledge, such as NLE [the NetHack Learning Environment]. This path of research entails further investigation of methods for automatically constructing knowledge graphs from available corpora as well as agents that retrieve and directly condition on natural language texts in such corpora.”