MIT CSAIL uses AI to teach robots to manipulate objects they've never seen before

Few fields have been more thoroughly transformed by artificial intelligence (AI) than robotics. San Francisco-based startup OpenAI developed a model that directs mechanical hands to manipulate objects with state-of-the-art precision, and SoftBank Robotics recently tapped sentiment analysis firm Affectiva to imbue its Pepper robot with emotional intelligence.

The latest advancement comes from researchers at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory (CSAIL), who today in a paper ("Dense Object Nets: Learning Dense Visual Object Descriptors and Application to Robotic Manipulation") detailed a computer vision system -- dubbed Dense Object Nets -- that allows robots to inspect, visually understand, and manipulate objects they've never seen before.

The team plans to present its findings at the Conference on Robot Learning in Zürich, Switzerland, in October.

"Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter," PhD student Lucas Manuelli, a lead author on the paper, said in a blog post published on MIT CSAIL's website. "For example, existing algorithms would be unable to grasp a mug by its handle, especially if the mug could be in multiple orientations, like upright, or on its side."

Dense Object Nets, or DON, isn't a control system. Rather, it's a self-supervised deep neural network -- layered algorithms that mimic the function of neurons in the brain -- trained to generate descriptions of objects in the form of precise coordinates. After training, it's able to autonomously pick out frames of reference and, when presented with a novel object, map them together to visualize the object's shape in three dimensions.

Object descriptors take just 20 minutes to learn, on average, according to the researchers, and they're task-agnostic -- that is to say, they're applicable to both rigid objects (e.g., hats) and non-rigid objects (plush toys). (In one round of training, the system learned a descriptor for hats after seeing only six different types.)

Furthermore, the descriptors remain consistent despite differences in object color, texture, and shape, which gives DON a leg up on models that use RGB or depth data. Because the latter don't have a consistent object representation and effectively look for "graspable" features, they can't find such points on objects with even slight deformations.

“In factories, robots often need complex part feeders to work reliably,” Manuelli said. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”

In tests, the team selected a pixel in a reference image for the system to autonomously identify in new images. They then employed a Kuka arm to grasp objects in isolation (a caterpillar toy), objects within a given class (different kinds of sneakers), and objects in clutter (a shoe in a spread of other shoes).
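
To make the pixel-selection step concrete, here is a minimal sketch of how a point picked in a reference image could be located in a new image once every pixel has a descriptor. It is an illustration under stated assumptions, not the researchers' code: the function name best_match, the tiny 2x3 "descriptor images," and the use of a plain Euclidean nearest-neighbor search are all hypothetical choices, and the descriptor values are made up rather than produced by a trained network.

```python
from math import dist


def best_match(ref_descriptor, target_descriptors):
    """Return ((row, col), distance) for the target pixel whose descriptor
    is closest (Euclidean distance) to ref_descriptor.

    Hypothetical helper for illustration; not part of the DON codebase.
    """
    best_pixel, best_distance = None, float("inf")
    for row, line in enumerate(target_descriptors):
        for col, descriptor in enumerate(line):
            d = dist(ref_descriptor, descriptor)
            if d < best_distance:
                best_pixel, best_distance = (row, col), d
    return best_pixel, best_distance


# Toy 2x3 descriptor images with 3-dimensional descriptors (made-up values).
# In a real system, a trained network would output one descriptor per pixel.
reference = [
    [(0.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.2, 0.8, 0.5)],
]
# The same object in a different pose: the point of interest has moved.
target = [
    [(0.0, 0.0, 0.0), (0.2, 0.8, 0.5), (0.0, 0.0, 0.0)],
    [(0.88, 0.12, 0.02), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]

# A user "clicks" pixel (row 0, col 1) in the reference image.
clicked = (0, 1)
ref_descriptor = reference[clicked[0]][clicked[1]]

match_pixel, match_distance = best_match(ref_descriptor, target)
print(match_pixel)  # (1, 0)
```

Because the lookup compares descriptors rather than raw colors or depth values, the match follows the object part wherever it ends up in the new image. A practical system would likely also reject matches whose distance exceeds some threshold, which is one way to conclude that the selected point is not visible at all.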

During one demonstration, the robotic arm managed to nab a hat out of a pile of similar hats, despite having never seen pictures of the hats in training data. In another, it grasped a caterpillar toy's right ear from a range of configurations, demonstrating that it could distinguish left from right on symmetrical objects.

"We observe that for a wide variety of objects, we can acquire dense descriptors that are consistent across viewpoints and configurations," the researchers wrote. "The variety of objects includes moderately deformable objects, such as soft plush toys, shoes, mugs, and hats, and can include very low-texture objects. Many of these objects were just grabbed from around the lab (including the authors’ and labmates’ shoes and hats), and we have been impressed with the variety of objects for which consistent dense visual models can be reliably learned with the same network architecture and training."

The team thinks DON might be useful in industrial settings (think object-sorting warehouse robots), but it hopes to develop a more capable version that can perform tasks with a "deeper understanding" of corresponding objects.

"We believe Dense Object Nets are a novel object representation that can enable many new approaches to robotic manipulation," the researchers wrote. "We are interested to explore new approaches to solving manipulation problems that exploit the dense visual information that learned dense descriptors provide and [to see] how these dense descriptors can benefit other types of robot learning, e.g. learning how to grasp, manipulate, and place a set of objects of interest."
