Microsoft researchers train AI in simulation to control a real-world drone

In a preprint paper, Microsoft researchers describe a machine learning system that reasons out the correct actions to take directly from camera images. It's trained via simulation and learns to independently navigate environments and conditions in the real world, including unseen situations, which makes it a fit for robots deployed in search and rescue missions. Someday, it could help those robots more quickly identify people in need of help.

"We wanted to push current technology to get closer to a human's ability to interpret environmental cues, adapt to difficult conditions and operate autonomously," wrote the researchers in a blog post published this week. "We were interested in exploring the question of what it would take to build autonomous systems that achieve similar performance levels."

The team's framework explicitly separates the perception components (i.e., making sense of what it sees) from the control policy (deciding what to do based on what it sees). Inspired by the human brain, it maps visual information onto control actions by first converting the high-dimensional sequence of video frames into a low-dimensional representation that summarizes the state of the world. According to the researchers, this two-stage approach makes the models easier to interpret and debug.
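To make the two-stage idea concrete, here is a minimal sketch of a perception step feeding a control step. It is not the researchers' code: every name in it (`toy_perception`, `toy_policy`, `TwoStagePilot`, `VelocityCommand`) is hypothetical, and both stages are hand-written stand-ins for what would be learned networks in the real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = Sequence[float]  # flattened pixel intensities in [0, 1]
Latent = List[float]     # low-dimensional summary of the scene


@dataclass
class VelocityCommand:
    forward: float   # forward speed (units are illustrative)
    yaw_rate: float  # turn rate; here positive means "toward the right half"


def toy_perception(image: Image) -> Latent:
    """Hypothetical stand-in for a learned encoder.

    Summarizes the image as the mean brightness of its first and second
    halves. Needs at least two pixels.
    """
    half = len(image) // 2
    left = sum(image[:half]) / half
    right = sum(image[half:]) / (len(image) - half)
    return [left, right]


def toy_policy(latent: Latent) -> VelocityCommand:
    """Hypothetical stand-in for a learned policy: steer toward the brighter half."""
    left, right = latent
    return VelocityCommand(forward=1.0, yaw_rate=right - left)


@dataclass
class TwoStagePilot:
    perceive: Callable[[Image], Latent]
    act: Callable[[Latent], VelocityCommand]

    def step(self, image: Image) -> Tuple[Latent, VelocityCommand]:
        # The latent is returned alongside the command so it can be
        # inspected on its own when debugging.
        latent = self.perceive(image)
        return latent, self.act(latent)


pilot = TwoStagePilot(perceive=toy_perception, act=toy_policy)
latent, command = pilot.step([0.0, 0.0, 1.0, 1.0])
print(latent, command)
```

Because the policy only ever sees the latent, either stage can be swapped or examined separately, which is the property the researchers credit for easier interpretation and debugging.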

The team applied their framework to a small quadcopter with a front-facing camera, attempting to "teach" an AI policy to navigate through a racing course using only images from the camera. They trained the AI in a high-fidelity simulator called AirSim and then deployed it to a real-world drone without modification. To bridge the gap between simulation and reality, they used a framework called Cross-Modal Variational Auto Encoder (CM-VAE) to generate the representations.

The system's perception module compressed incoming images into a low-dimensional representation, reducing each one from 27,648 variables to the 10 most essential variables needed to describe it. The decoded images provided a description of what the drone could see ahead, including all possible gate sizes and locations, as well as different background information.
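The scale of that compression can be illustrated with a short sketch. The 128 x 72 x 3 image shape below is an assumption: the article gives only the total of 27,648, and this is one factorization that matches it. The fixed random linear projection is likewise only a stand-in for a learned encoder and is not the CM-VAE itself.

```python
import random

# Assumed image shape (not stated in the article): 128 x 72 pixels, 3 channels.
WIDTH, HEIGHT, CHANNELS = 128, 72, 3
INPUT_DIM = WIDTH * HEIGHT * CHANNELS  # 27,648 values per frame
LATENT_DIM = 10


def make_linear_encoder(input_dim, latent_dim, seed=0):
    """Build a toy encoder: a fixed random linear map from pixels to a latent.

    A real encoder would be a trained neural network; this only shows the
    change in dimensionality.
    """
    rng = random.Random(seed)
    scale = 1.0 / input_dim
    weights = [
        [rng.uniform(-scale, scale) for _ in range(input_dim)]
        for _ in range(latent_dim)
    ]

    def encode(pixels):
        if len(pixels) != input_dim:
            raise ValueError(f"expected {input_dim} values, got {len(pixels)}")
        # One dot product per latent variable.
        return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

    return encode


encode = make_linear_encoder(INPUT_DIM, LATENT_DIM)

# A synthetic frame of random pixel intensities in [0, 1).
frame_rng = random.Random(1)
frame = [frame_rng.random() for _ in range(INPUT_DIM)]

latent = encode(frame)
compression_ratio = INPUT_DIM / LATENT_DIM
print(len(frame), len(latent), compression_ratio)
```

Under these assumptions, each latent variable stands in for roughly 2,765 input values, so the downstream policy works with a far smaller input than the raw image.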

The researchers tested the capabilities of their system on a 45-meter-long S-shaped track with gates and a 40-meter-long circular track with a different set of gates. They say the policy that used CM-VAE significantly outperformed end-to-end policies and AI that directly encoded the position of the next gates. Despite "intense" visual distractions from background conditions, the drone managed to complete the courses by employing the cross-modal perception module.

The coauthors assert that the results show "great potential" for helping in real-world applications. For example, the system might help an autonomous search and rescue robot better recognize humans regardless of differences in age, size, gender, and ethnicity, giving the robot a better chance of identifying and retrieving people in need of help.

"By separating the perception-action loop into two modules and incorporating multiple data modalities into the perception training phase, we can avoid overfitting our networks to non-relevant characteristics of the incoming data," wrote the researchers. "For example, even though the sizes of the square gates were the same in simulation and physical experiments, their width, color, and even intrinsic camera parameters are not an exact match."

The research follows the launch of Microsoft's Game of Drones challenge, which pits quadcopter drone racing AI systems against each other in an AirSim simulation. Microsoft brought AirSim to the Unity game engine last year.
