Deep reinforcement learning, a subfield of machine learning that combines reinforcement learning and deep learning, takes what’s known as a reward function and learns to maximize the expected total reward. This works remarkably well, enabling systems to figure out how to solve Rubik’s Cubes, beat world champions at chess, and more. But existing algorithms have a problem: They implicitly assume access to a perfect specification of the task. In reality, tasks don’t come prepackaged with rewards. Those rewards come from imperfect human reward designers, and it can be difficult to translate conceptual preferences into reward functions that environments can calculate.
To solve this problem, researchers at DeepMind and the University of California, Berkeley, have launched a competition called BASALT, where the goal of an AI system must be communicated through demonstrations, preferences, or some other form of human feedback. BASALT is built on Minecraft, and systems competing in it must learn the details of specific tasks from human feedback, choosing among a wide variety of actions to perform.
Recent research has proposed algorithms that allow designers to iteratively communicate details about tasks. Instead of rewards, they leverage new types of feedback, like demonstrations, preferences, corrections, and more, and elicit feedback by taking the first steps of provisional plans and seeing if humans intervene, or by asking designers questions.
But there aren’t benchmarks to evaluate algorithms that learn from human feedback. A typical study will take an existing deep reinforcement learning benchmark, strip away the rewards, train a system using the study’s own feedback mechanism, and evaluate performance according to the preexisting reward function. This is problematic. For example, in the Atari game Breakout, which is often used as a benchmark, a system must either hit the ball back with the paddle or lose. Good performance on Breakout doesn’t necessarily mean the algorithm has mastered the game mechanics. It’s possible it learned a simpler heuristic, like “Don’t die.”
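The evaluation protocol described above (strip the rewards, let the learner act without them, then score with the original reward) can be sketched roughly as follows. This is only an illustration: `ToyEnv`, `RewardHidden`, and `run_episode` are hypothetical names invented here, and the toy environment stands in for a real benchmark such as Breakout.

```python
class ToyEnv:
    """Hypothetical stand-in for a benchmark with a built-in reward."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # the "preexisting" reward
        done = self.t >= self.horizon
        return self.t, reward, done


class RewardHidden:
    """Wrapper that hides the reward from the learner but logs it for scoring."""

    def __init__(self, env):
        self.env = env
        self.hidden_return = 0.0

    def reset(self):
        self.hidden_return = 0.0
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.hidden_return += reward  # kept for the evaluator only
        return obs, None, done        # the learner sees no reward


def run_episode(env, policy):
    """Roll out one episode and return the hidden (evaluator-only) return."""
    obs = env.reset()
    done = False
    while not done:
        obs, _, done = env.step(policy(obs))
    return env.hidden_return
```

A policy that happens to match the hidden reward scores well under this setup whether or not it learned what the designer intended, which is the weakness the article points out.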
In the real world, systems aren’t funneled into one obvious task above all others. That’s why BASALT provides a set of tasks and task descriptions, as well as information about the player’s inventory, but no rewards. For example, a task called MakeWaterfall provides in-game items, including water buckets, a stone pickaxe, stone shovels, and cobblestone blocks, along with the description “After spawning in a mountainous area, the agent should build a beautiful waterfall and then reposition itself to take a scenic picture of the same waterfall. The picture of the waterfall can be taken by orienting the camera and then throwing a snowball when facing the waterfall at a good angle.”
BASALT allows designers to use whichever feedback mechanisms they prefer to create systems that accomplish the tasks. The benchmark records the trajectories of two different systems on a particular environment and asks a human to decide which of the agents performed the task better.
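One simple way to turn such pairwise human judgments into a ranking is to compute each agent’s win rate. The sketch below is an assumption for illustration only; the article does not say how BASALT aggregates comparisons, and `win_rates` and the agent names are hypothetical.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Compute per-agent win rates from (agent_a, agent_b, winner) tuples.

    Each tuple records one human judgment between two agents on a task.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for agent_a, agent_b, winner in comparisons:
        if winner not in (agent_a, agent_b):
            raise ValueError(f"winner {winner!r} was not one of the compared agents")
        games[agent_a] += 1
        games[agent_b] += 1
        wins[winner] += 1
    return {agent: wins[agent] / games[agent] for agent in games}


# Hypothetical judgments on a single task.
judgments = [
    ("agent_a", "agent_b", "agent_a"),
    ("agent_a", "agent_c", "agent_a"),
    ("agent_b", "agent_c", "agent_c"),
]
rates = win_rates(judgments)
```

Win rate ignores the strength of each opponent, so a rating model fitted to the same comparisons may be preferable when agents do not all face one another equally often.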
The researchers say BASALT affords a number of advantages over existing benchmarks, including reasonable goals, large amounts of data, and robust evaluations. In particular, they make the case that Minecraft is well-suited to the task because there are thousands of hours of gameplay on YouTube that competitors could use to train a system. Moreover, Minecraft’s properties are easy to understand, the researchers say, with tools that have functions similar to real-world tools and straightforward goals like building shelter and acquiring enough food to not starve.
BASALT is also designed to be feasible to use on a budget. The code ships with a baseline system that can be trained in a couple of hours on a single GPU, according to Rohin Shah, a research scientist at DeepMind and project lead on BASALT.
“We hope that BASALT will be used by anyone who aims to learn from human feedback, whether they are working on imitation learning, learning from comparisons, or some other method. It mitigates many of the issues with the standard benchmarks used in the field. The current baseline has lots of obvious flaws, which we hope the research community will soon fix,” Shah wrote in a blog post. “We envision eventually building agents that can be instructed to perform arbitrary Minecraft tasks in natural language on public multiplayer servers, or inferring what large-scale project human players are working on and assisting with those projects while adhering to the norms and customs followed on that server.”
The evaluation code for BASALT will be available in beta soon. The team is accepting sign-ups now, with plans to announce the winners of the competition at the NeurIPS 2021 machine learning conference in December.