In a preprint paper published this week, Nvidia and Stanford University researchers propose a novel approach to transferring AI models trained in simulation to real-world autonomous machines. The approach uses segmentation as the interface between perception and control, leading to what the coauthors characterize as “high success” in workloads like robot grasping.

Simulators have advantages over the real world when it comes to model training in that they’re safe and almost infinitely scalable. But generalizing strategies learned in simulation to real-world machines — whether autonomous cars, robots, or drones — requires adjustment, because even the most accurate simulators can’t account for every perturbation.

Nvidia and Stanford’s technique promises to bridge the gap between simulation and real-world environments more effectively than previous approaches, mainly because it decomposes vision and control tasks into models that can be trained separately. This improves performance by exploiting so-called privileged information (the semantic and geometric differences between the simulation and the real environment) while at the same time enabling the reuse of the models for other robots and scenarios.
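To make the decomposition concrete, here is a minimal Python sketch of two independently built modules joined only by a segmentation mask. It is an illustration under stated assumptions, not the researchers' code: the names `make_policy`, `threshold_vision`, and `centroid_controller` are hypothetical, and both stand-in modules are toy placeholders for trained models.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]       # assumption: grayscale image as rows of pixel intensities
Mask = List[List[int]]        # 1 where a pixel belongs to the object of interest, else 0
Action = Tuple[float, float]  # assumption: a simple (dx, dy) motion command


def make_policy(vision: Callable[[Image], Mask],
                controller: Callable[[Mask], Action]) -> Callable[[Image], Action]:
    """Chain two separately trained modules; the mask is their only interface."""
    def policy(image: Image) -> Action:
        return controller(vision(image))
    return policy


def threshold_vision(image: Image, threshold: int = 128) -> Mask:
    """Toy stand-in for a trained vision model: bright pixels count as the object."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def centroid_controller(mask: Mask) -> Action:
    """Toy stand-in for a trained controller: offset from image center to mask centroid."""
    height, width = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if not pixels:
        return (0.0, 0.0)  # nothing segmented, so do not move
    cy = sum(r for r, _ in pixels) / len(pixels)
    cx = sum(c for _, c in pixels) / len(pixels)
    return (cx - (width - 1) / 2, cy - (height - 1) / 2)


policy = make_policy(threshold_vision, centroid_controller)
```

Because the controller only ever sees a mask, either module could in principle be swapped out (a different camera pipeline, a different robot) without retraining the other, which is the reuse benefit described above.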

The vision model, which is trained on data generated by merging background images taken in a real environment with foreground objects from simulation, processes camera images and extracts objects of interest from the environment in the form of a segmentation mask. (Masks are the product of functions that indicate which class or instance a given pixel belongs to.) This segmentation mask serves as the input for the controller model, which is trained in simulation using imitation learning and applied directly in real environments.
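The data-generation step described above, merging a real background with a simulated foreground object, can be sketched in a few lines of Python. This is a simplified illustration, not the paper's pipeline: the `composite` function and the list-of-lists image format are assumptions, and a real implementation would work on color images with an array library.

```python
def composite(background, foreground, fg_mask, top, left):
    """Paste a simulated object onto a real background image.

    Returns the merged image plus the segmentation mask that serves as
    the training label for the vision model.
    """
    image = [row[:] for row in background]            # copy; leave the input untouched
    seg = [[0] * len(row) for row in background]      # label mask, all background at first
    for r, mask_row in enumerate(fg_mask):
        for c, is_object in enumerate(mask_row):
            rr, cc = top + r, left + c
            inside = 0 <= rr < len(image) and 0 <= cc < len(image[0])
            if is_object and inside:
                image[rr][cc] = foreground[r][c]      # object pixel replaces background
                seg[rr][cc] = 1                       # mark the pixel as "object"
    return image, seg


# Hypothetical 3x3 "real" background and 2x2 "simulated" object.
background = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
foreground = [[9, 9], [9, 9]]
fg_mask = [[1, 0], [1, 1]]
image, seg = composite(background, foreground, fg_mask, top=1, left=1)
```

Because the object comes from simulation, its exact pixel mask is known for free, so every composited image arrives with a perfect label.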

In experiments involving a real-world robotic arm, the researchers initially trained the controller on a corpus of 1,000 frames at each iteration (roughly corresponding to 10 grasping attempts) and the vision model on images of simulated objects plus real backgrounds, as described earlier. They next collected thousands of images from a simulated demonstration of a robotic arm grasping a sphere before combining them with backgrounds and randomizing the shape, size, position, color, lighting, and camera viewpoints to obtain 20,000 training images. Finally, they evaluated the trained AI modules against a set of 2,140 images from the real robot, collected by running the controller in simulation and copying the trajectories to the real environment.
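The randomization step can be pictured as sampling a fresh scene configuration for every training image. The sketch below is illustrative only: the article names the randomized properties (shape, size, position, color, lighting, and camera viewpoint), but the specific shapes, value ranges, and field names here are assumptions rather than figures from the paper.

```python
import random

SHAPES = ("sphere", "cube", "cylinder")  # assumption: the actual shape set is not given


def sample_scene(rng):
    """Draw one randomized scene configuration (all ranges are hypothetical)."""
    return {
        "shape": rng.choice(SHAPES),
        "size_m": rng.uniform(0.02, 0.10),
        "position_m": (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
        "color_rgb": tuple(rng.randint(0, 255) for _ in range(3)),
        "light_intensity": rng.uniform(0.5, 1.5),
        "camera_yaw_deg": rng.uniform(-15.0, 15.0),
    }


def sample_dataset(n, seed=0):
    """Generate n scene configurations reproducibly from a single seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


configs = sample_dataset(20_000)  # one configuration per training image
```

Each configuration would then drive a renderer and a compositing step to produce one labeled image; seeding the generator keeps the dataset reproducible.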

The robotic arm was given 250 steps per attempt to grasp a sphere placed at one of five fixed positions, which together spanned the space used to train the controller, and it repeated the grasp five times at each position. When no clutter was present, it achieved an 88% success rate while using the vision module. Clutter (e.g., yellow and orange objects) caused the robot to fail in 2 out of 5 trials, but it often managed to recover from failed grasp attempts.
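For readers who want the figures as counts, the short calculation below works through the arithmetic. It assumes the 88% figure covers all 25 clutter-free trials (five positions, five repetitions each), which the article implies but does not state outright.

```python
positions, repeats = 5, 5
trials = positions * repeats                 # 25 clutter-free grasp attempts (assumed total)

reported_rate = 0.88
successes = round(reported_rate * trials)    # implied number of successful grasps

clutter_trials, clutter_failures = 5, 2
clutter_success_rate = (clutter_trials - clutter_failures) / clutter_trials

print(f"{successes}/{trials} successful grasps without clutter")
print(f"{clutter_success_rate:.0%} success with clutter")
```

Under that assumption, the robot succeeded in 22 of 25 clutter-free trials and in 3 of 5 cluttered ones.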

Robot grasping is a surprisingly difficult challenge. For example, robots struggle to perform what’s called “mechanical search,” which is when they have to identify and pick up an object from within a pile of other objects. Most robots aren’t especially adaptable, and there’s a lack of sufficiently capable AI models for guiding robot hands in mechanical search. But if the claims of the coauthors of this latest paper hold water, much more robust systems could be on the horizon.

