Uber's Enhanced POET creates and solves AI agent training challenges

In a paper published this week on the preprint server Arxiv.org, Uber researchers and OpenAI research scientist Jeff Clune describe an algorithm -- Enhanced Paired Open-Ended Trailblazer (POET) -- that's open-ended, meaning it can generate its own stream of novel learning opportunities. They say it produces AI agents capable of solving a range of environmental challenges, many of which can't be solved through other means, taking a step toward AI systems that could bootstrap themselves into powerful cognitive machines. Picture enterprise AI that learns an objective without instruction beyond a vague task list, or cars that learn to drive themselves in conditions they haven't encountered before.

It's in some ways an evolution of Uber's work in games like Montezuma's Revenge, which the company detailed in late November 2018. Its Go-Explore system, a family of so-called quality diversity models, achieved state-of-the-art scores through a self-learning approach that didn't require human demonstrations.

As the "Enhanced" bit in POET's title implies, this isn't the first model of its kind -- Uber researchers detailed the original POET in a paper published in early January of last year. But the coauthors of this new study point out that POET was unable to demonstrate its creative potential because of limitations in the algorithm and the lack of a universal progress measure. That is to say, the means of measuring POET's progress was domain-specific, so it had to be redesigned each time POET was applied to a new domain.

Enhanced POET has no such limitation, opening the doors to its application across almost any domain.

"Enhanced POET itself seems prepared to push onward as long as there is ground left to discover. The algorithm is arguably unbounded. If we can conceive a domain without bounds, or at least with bounds beyond our conception, we may now have the possibility to see something far beyond our imagination borne out of computation alone," wrote the paper's coauthors. "That is the exciting promise of open-endedness."

As with POET, Enhanced POET takes a page from natural evolution in that it creates problems (e.g., challenges, environments, and learning opportunities) and their solutions in an ongoing process. New discoveries extrapolate from their predecessors with no endpoint in mind, creating learning opportunities across "expanding and sometimes circuitous stepping stones."

Enhanced POET grows and maintains a population of environment-agent pairs, where each AI agent is optimized to solve its paired environment. POET typically starts with an easy environment and a randomly generated agent before creating new environments and searching for their solutions:

POET generates environments by applying random perturbations to the encoding of environments (numerical sequences mapped to instances of environments) whose agents have exhibited sufficient performance. Once generated, the environments are filtered by a criterion that ensures they're neither too hard nor too easy for the existing agents in the population. From those that meet this criterion, only the most novel are added to the population. Finally, once the population size reaches a preset threshold, adding a new environment also moves the oldest active one from the population into an inactive archive. (The archived environments are used to calculate the novelty of new candidate environments so that previously existing environments aren't discovered repeatedly.)
POET continually optimizes every agent within its environment using an evolution strategies algorithm for reinforcement learning.
After a certain number of iterations, POET tests whether a copy of any agent should be transferred from one environment to another within the population. The copy replaces the target environment's paired agent if, either immediately or after one optimization step, it outperforms the incumbent.
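
The environment-generation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function names, the thresholds, the Gaussian perturbation, and the use of distance between raw encodings as the novelty measure are all assumptions made for the example, and the check that a parent environment's agent has performed well enough is left out.

```python
import random

def mutate(encoding, rng, scale=0.1):
    """Apply a small random perturbation to every number in an encoding."""
    return [x + rng.gauss(0.0, scale) for x in encoding]

def passes_minimal_criterion(scores, low, high):
    """True if the best existing agent finds the environment neither too hard nor too easy."""
    best = max(scores)
    return low <= best <= high

def novelty(candidate, others, k=3):
    """Mean Euclidean distance from the candidate to its k nearest neighbours."""
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(candidate, other)) ** 0.5
        for other in others
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")

def add_environment(active, archive, candidate, capacity):
    """Add a candidate; if the population is over capacity, archive the oldest."""
    active.append(candidate)
    if len(active) > capacity:
        archive.append(active.pop(0))

def generation_step(active, archive, score_fn, rng,
                    capacity=3, low=0.2, high=0.8, n_children=8):
    """One round of: perturb, filter by difficulty, keep the most novel child.

    score_fn(env) is a hypothetical callback returning the scores of all
    current agents on env.
    """
    children = [mutate(rng.choice(active), rng) for _ in range(n_children)]
    viable = [c for c in children
              if passes_minimal_criterion(score_fn(c), low, high)]
    if not viable:
        return None
    seen = active + archive  # archived environments still count for novelty
    best = max(viable, key=lambda c: novelty(c, seen))
    add_environment(active, archive, best, capacity)
    return best
```

Because archived environments stay in the novelty comparison, a candidate that merely rediscovers a retired environment scores low and is unlikely to be admitted.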

The original POET leveraged environmental characterizations -- descriptions of environments' attributes -- to encourage novel environment generation. But these were derived from hand-coded features tied directly to domains. By contrast, Enhanced POET uses a characterization that's grounded in how all agents in the population and archive perform in that environment. The researchers say the key insight is that a newly generated environment is likely to pose a qualitatively new kind of challenge if it induces a new ordering on how the agents perform. For example, the emergence in a video game of a landscape with stumps may induce a new ordering on agents, because agents with different walking gaits may differ in their ability to step over the obstacles.

Enhanced POET's new environmental characterization evaluates active and archived agents in an environment and stores their raw scores in a mathematical object known as a vector. Each score in the vector is clipped between a lower bound and an upper bound so that scores that are too low (indicating the outright failure of an agent) or too high (indicating that the agent is already competent) are treated alike. The scores are then replaced with rankings and normalized, so environments are compared by how they order agents rather than by raw scores. Enhanced POET also attempts to replace an incumbent agent with another agent in the population that performs better, enabling innovations from solutions for one environment to aid progress in other environments.
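
The clip, rank, and normalize steps can be illustrated with a short sketch. The bounds, the choice to give tied scores their average rank, and the rescaling of ranks to the range -0.5 to 0.5 are assumptions made for this example rather than details confirmed by the article.

```python
def characterize(raw_scores, low, high):
    """Turn raw agent scores for one environment into a normalized rank vector.

    Ties share their average rank, so agents whose scores were clipped to
    the same bound become indistinguishable.
    """
    clipped = [min(max(s, low), high) for s in raw_scores]
    n = len(clipped)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: clipped[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of agents tied with the agent at position i.
        while j + 1 < n and clipped[order[j + 1]] == clipped[order[i]]:
            j += 1
        mean_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    # Rescale ranks 0..n-1 to the range -0.5..0.5 (an assumed convention).
    return [r / (n - 1) - 0.5 for r in ranks]

def distance(a, b):
    """Euclidean distance between two characterization vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Three agents scored on a flat course and on a course with stumps.
flat = characterize([10, 300, 150], low=50, high=250)
stumps = characterize([280, 20, 150], low=50, high=250)
print(flat, stumps, distance(flat, stumps))
```

In this toy case the stump course reverses the order of the first two agents, so the two vectors sit far apart even though the set of raw scores is similar. That difference in ordering is what marks the second environment as novel.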

Compared with the original POET, Enhanced POET adopts a more expressive environment encoding that captures details with high granularity and precision. Using a compositional pattern-producing network, a class of AI model that takes geometric coordinates as input and, when queried, generates a geometric pattern, Enhanced POET can synthesize increasingly complex environment landscapes in virtually any resolution or size.
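
To see why a coordinate-based encoding is resolution-independent, consider the toy example below. It is only a sketch: the three-node "network," its activation functions, and its weights are hypothetical and fixed by hand, whereas a real compositional pattern-producing network would have its structure and weights searched for by the algorithm.

```python
import math

def cppn_height(x, weights):
    """Map a horizontal coordinate x to a terrain height between -1 and 1.

    The function composes a few simple activations (sine, Gaussian, linear),
    which is the basic idea behind a compositional pattern-producing network.
    """
    w_sin, w_gauss, w_lin, bias = weights
    periodic = math.sin(w_sin * x)           # repeating bumps
    bump = math.exp(-(w_gauss * x) ** 2)     # a single hill near x = 0
    return math.tanh(w_lin * x + periodic + bump + bias)

def render_terrain(weights, length, n_points):
    """Sample the same underlying pattern at n_points evenly spaced positions."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    return [
        cppn_height(length * i / (n_points - 1), weights)
        for i in range(n_points)
    ]

weights = (3.0, 0.5, 0.1, -0.2)   # hypothetical values
coarse = render_terrain(weights, length=10.0, n_points=10)
fine = render_terrain(weights, length=10.0, n_points=1000)
```

The same four numbers describe both the 10-point and the 1,000-point course. Only the sampling density changes, which is what lets an encoding of this kind produce landscapes at any resolution or size.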

To measure universal progress toward goals, Enhanced POET tracks the accumulated number of novel environments created and solved. To be counted, an environment must pass the minimal criterion measured against all the agents generated so far in the current run, and it must eventually be solved by the system, so that the system doesn't receive credit for producing unsolvable challenges.
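
The bookkeeping behind such a measure is simple, as the following sketch suggests. The class and method names are hypothetical, and the novelty and minimal-criterion checks are assumed to be computed elsewhere and passed in as booleans.

```python
class ProgressTracker:
    """Counts environments that were both admitted as novel and later solved."""

    def __init__(self):
        self._eligible = set()  # novel environments that passed the criterion
        self._solved = set()    # eligible environments solved so far

    def record_created(self, env_id, is_novel, passes_criterion):
        """Register a new environment; only novel, suitably hard ones are eligible."""
        if is_novel and passes_criterion:
            self._eligible.add(env_id)

    def record_solved(self, env_id):
        """Credit a solution only if the environment was eligible in the first place."""
        if env_id in self._eligible:
            self._solved.add(env_id)

    @property
    def score(self):
        """Accumulated number of novel environments created and solved."""
        return len(self._solved)

tracker = ProgressTracker()
tracker.record_created("env-1", is_novel=True, passes_criterion=True)
tracker.record_created("env-2", is_novel=True, passes_criterion=False)
tracker.record_created("env-3", is_novel=True, passes_criterion=True)
tracker.record_solved("env-1")
tracker.record_solved("env-2")   # ignored: never passed the criterion
print(tracker.score)
```

Here env-3 has been created but not yet solved, so it contributes nothing to the score. The count can only rise when the system both invents a qualifying challenge and goes on to solve it.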

In experiments, the contributing team evaluated Enhanced POET in a domain adapted from a 2D walking environment based on the Bipedal Walker Hardcore environment in OpenAI Gym, San Francisco startup OpenAI's toolkit for benchmarking reinforcement learning algorithms. They tasked 40 walking agents across 40 environments with navigating obstacle courses from left to right, with runs taking 60,000 POET iterations in 12 days on 750 processor cores using Fiber, a distributed computing library in Python that parallelizes workloads over any number of cores.

The researchers report that Enhanced POET created and solved 175 novel environments compared with the original POET's roughly 85 -- about twice as many. The agents improved more slowly after 30,000 iterations, but the team attributes this to the fact that the environments became increasingly difficult from this point and thus required more time to optimize.

"If you had a system that was searching for architectures, creating better and better learning algorithms, and automatically creating its own learning challenges and solving them and then going on to harder challenges ... [If you] put those three pillars together ... you have what I call an 'AI-generating algorithm.' That's an alternative path to AGI that I think will ultimately be faster," Clune told VentureBeat in a previous interview.
