Apple researchers train AI drivers to merge lanes in a simulated environment

Apple has yet to openly discuss its self-driving vehicle research, but it's a poorly kept industry secret. Around 5,000 employees -- including a portion of those previously employed by Drive.ai, an autonomous car startup Apple acquired last year -- are said to be involved with the effort code-named Project Titan, a joint effort with Volkswagen to develop autonomous cars and shuttle vans. And a preprint paper published on Arxiv.org this week appears to pull back the curtain further: In it, Apple research scientist Yichuan Charlie Tang and team detail an AI approach that creates progressively more diverse environments for driving scenarios involving merging vehicles.

"We demonstrate [our technique] in a challenging multi-agent simulation of merging traffic, where agents must interact and negotiate with others in order to successfully merge on or off the road," wrote Tang and coauthors. "While the environment starts off simple, we increase its complexity by iteratively adding an increasingly diverse set of agents to the agent 'zoo' as training progresses. Qualitatively, we find that through self-play, our policies automatically learn interesting behaviors such as defensive driving, overtaking, yielding, and the use of signal lights to communicate intentions to other agents."

As the researchers explain, in the domain of autonomous driving, merging behaviors are considered complex because they require accurately predicting intentions and reacting accordingly. Traditional solutions make assumptions and rely on hand-coded behaviors, but these lead to constrained and brittle policies that handle edge cases poorly, such as vehicles simultaneously trying to merge into the same lane. In contrast to rules-based systems, reinforcement learning -- an AI training technique that employs rewards to drive software policies toward goals -- directly learns policies through repeated interactions with an environment.

In the study in question, Tang and team implemented a self-play training scheme within a two-dimensional simulation of traffic on real road geometry annotated by alignment with satellite imagery. They populated this virtual world with agents capable of lane-following and lane changes. Over time, these agents learned when to slow down, when to accelerate, which gap to merge into, how to infer the latent goals and beliefs of the other agents, and how to communicate their intentions via turn signals or observable behaviors.

Each simulation began with one AI-controlled agent surrounded by rules-based agents that kept to their lanes using adaptive cruise control (i.e., slowing down and speeding up in response to the vehicle in front). Gradually, AI agents replaced the rules-based agents. The AI agents were penalized for going out of bounds, drifting from the lane center, or colliding with other agents. (They were rewarded for successfully completing a merge and for traveling at any speed up to 15 meters per second, or roughly 33.6 miles per hour.) For each simulation episode, 32 of which were run in parallel on an Nvidia Titan X graphics card, roughly 10 agents were launched with their own random destinations; episodes ended after 1,000 timesteps, after collisions occurred, or after the destinations were reached.
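To make the reward and termination rules above concrete, here is a minimal Python sketch. The reward terms, the 15 meters-per-second cap, and the 1,000-timestep limit come from the description above; the `AgentState` fields, the weight values, and the reading that the speed reward saturates at the cap are illustrative assumptions, not details reported from the paper.

```python
from dataclasses import dataclass

MAX_REWARDED_SPEED = 15.0  # m/s, as reported
MAX_TIMESTEPS = 1_000      # episode cap, as reported

# Hypothetical weights: the reward terms are reported, their magnitudes are not.
W_SPEED = 0.1
W_MERGE = 1.0
W_OFF_CENTER = 0.05
W_OUT_OF_BOUNDS = 1.0
W_COLLISION = 1.0


@dataclass
class AgentState:
    """Hypothetical per-step summary of one agent."""
    speed: float               # m/s
    lane_offset: float         # lateral distance from lane center, in meters
    out_of_bounds: bool
    collided: bool
    merged: bool               # True on the step a merge completes
    reached_destination: bool


def step_reward(s: AgentState) -> float:
    """Reward for one timestep: bonuses for speed and merging, penalties for errors."""
    # Assumption: the speed bonus grows with speed and saturates at the 15 m/s cap.
    clipped = min(max(s.speed, 0.0), MAX_REWARDED_SPEED)
    reward = W_SPEED * clipped / MAX_REWARDED_SPEED
    if s.merged:
        reward += W_MERGE
    reward -= W_OFF_CENTER * abs(s.lane_offset)
    if s.out_of_bounds:
        reward -= W_OUT_OF_BOUNDS
    if s.collided:
        reward -= W_COLLISION
    return reward


def episode_done(s: AgentState, timestep: int) -> bool:
    """Episode ends on a collision, on arrival, or at the timestep cap."""
    return s.collided or s.reached_destination or timestep >= MAX_TIMESTEPS
```

Under these assumed weights, an agent cruising in the lane center at the speed cap earns a small positive reward each step, while a single collision outweighs it and also ends the episode.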

It was a three-stage process:

In stage 1, the AI policy was trained solely in the presence of rules-based agents.
In stage 2, self-play training ran with a mix of agents: 30% were rules-based IDM (intelligent driver model) agents, 30% were RL agents from stage 1, and the remaining 40% were controlled by the current learning policy.
In stage 3, agents from stage 2 were added to the mix.
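The staged agent mix can be sketched as a weighted draw over controller types. The stage 2 proportions below are the ones reported above; the controller names and the `sample_controllers` helper are hypothetical, and stage 3 is left out because its proportions are not given here.

```python
import random

# Share of each controller type among the agents in an episode.
# Stage 1 describes the surrounding agents, which are all rules-based.
STAGE_MIX = {
    1: {"idm": 1.0},
    2: {"idm": 0.3, "stage1_policy": 0.3, "current_policy": 0.4},
}


def sample_controllers(mix, n_agents, rng):
    """Draw one controller name per agent according to the stage's proportions."""
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n_agents)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the draw is reproducible
    print(sample_controllers(STAGE_MIX[2], 10, rng))
```

Drawing per agent means that any single episode only approximates the 30/30/40 split; the proportions hold on average across many episodes.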

The researchers focused specifically on zipper merges (also known as double merges), which are considered difficult because left lane drivers typically intend to merge right while right lane drivers need to merge left. Signals and subtle cues are used to negotiate who goes first and which gap is filled, and the planning must be done in a short amount of time and within a short distance.

The researchers observed that over the course of 10 million environment steps, corresponding to 278 hours of real-time experience, AI agents tended to exploit the behavior of rules-based agents for their own gain. For example, rules-based agents with a tendency to brake suddenly found themselves at the mercy of "ultra-aggressive" AI agents that never yielded. That said, rules-based agents were often to blame in their collisions with AI agents.
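As a quick consistency check on those figures, the step count and the hours together imply a simulation timestep of about a tenth of a second. That timestep is an inference from the two reported numbers, not a value stated above.

```python
ENV_STEPS = 10_000_000  # reported number of environment steps
REPORTED_HOURS = 278    # reported real-time experience
SECONDS_PER_HOUR = 3600


def implied_timestep(steps: int, hours: float) -> float:
    """Seconds of simulated time per environment step."""
    return hours * SECONDS_PER_HOUR / steps


def hours_of_experience(steps: int, dt: float) -> float:
    """Hours of simulated time covered by `steps` steps of length `dt` seconds."""
    return steps * dt / SECONDS_PER_HOUR


if __name__ == "__main__":
    print(f"implied timestep: {implied_timestep(ENV_STEPS, REPORTED_HOURS):.3f} s")
    print(f"hours at 0.1 s per step: {hours_of_experience(ENV_STEPS, 0.1):.1f}")
```

At an assumed 0.1 seconds per step, the 1,000-timestep episode cap would correspond to 100 seconds of simulated driving.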

To evaluate their approach, the researchers conducted over 250 random trials without adding exploration noise. Compared with the rules-based agents, which had a 63% success rate, they report that highly trained AI agents achieved 98% success against rules-based and other AI agents. The algorithm as it stands isn't perfect -- AI agents sometimes cause collisions while trying to steer toward the right side during emergency braking -- but Tang and colleagues say it opens the door to future work that might drive the collision rate down to zero.
