
The AI research labs at Facebook, Nvidia, and a number of startups have at various points tried their hand at the challenge of converting 2D objects into 3D shapes. But in a new preprint paper, a team from Microsoft Research details a framework that they claim is the first “scalable” training technique for learning 3D models from 2D data. They say it consistently learns to generate better shapes than existing models when trained exclusively on 2D images, which could be a boon for video game developers, ecommerce businesses, and animation studios that lack the means or expertise to create 3D shapes from scratch.

In contrast to previous work, the researchers sought to take advantage of fully featured industrial renderers — i.e., software that produces 2D images from 3D scene data. To that end, they train a generative model for 3D shapes such that rendering the shapes produces images matching the distribution of a 2D data set. The generator takes in a random input vector (values representing the data set’s features) and generates a continuous voxel representation (values on a grid in 3D space) of the 3D object. It then feeds the voxels to a non-differentiable rendering process, which thresholds them to discrete values before they’re rendered with an off-the-shelf renderer (Pyrender, which is built on top of OpenGL).
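As a rough illustration of the thresholding step (a sketch, not the authors’ code — the grid size and the 0.5 cutoff here are arbitrary assumptions), a continuous voxel grid can be binarized before being handed to a conventional renderer:

```python
import numpy as np

# Stand-in for generator output: occupancy values in [0, 1] on a 32^3 grid.
rng = np.random.default_rng(0)
continuous_voxels = rng.random((32, 32, 32))

# The non-differentiable step: threshold to a discrete occupancy grid,
# which an off-the-shelf renderer such as Pyrender could then mesh and render.
occupancy = (continuous_voxels >= 0.5).astype(np.uint8)

print(occupancy.shape, occupancy.min(), occupancy.max())
```

Because the hard threshold has zero gradient almost everywhere, no training signal can flow back through this path — which is what motivates the proxy renderer described next in the paper.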

Because the off-the-shelf rendering path is non-differentiable, gradients cannot flow through it back to the generator. To get around this, a novel proxy neural renderer directly renders the continuous voxel grid generated by the 3D generative model. As the researchers explain, it’s trained to match the rendering output of the off-the-shelf renderer given a 3D mesh input.


Above: Couches, chairs, and bathtubs generated by Microsoft’s model.

Image Credit: Microsoft

In experiments, the team employed a 3D convolutional GAN architecture for the generator. (GANs are two-part AI models comprising a generator, which produces synthetic examples from random noise sampled from a distribution, and a discriminator, which is fed both those synthetic examples and real examples from a training data set and attempts to distinguish between the two.) Drawing on a range of synthetic data sets generated from 3D models and a real-life data set, they synthesized images from different object categories, which they rendered from different viewpoints throughout the training process.
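For readers unfamiliar with the adversarial setup, the standard objective can be sketched as follows. This is a generic illustration of GAN losses, not code from the paper, and the discriminator scores are made-up numbers:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on discriminator probabilities."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Hypothetical discriminator outputs (probability the input is real) on
# real training images and on renders of generated voxel shapes.
d_real = np.array([0.9, 0.8, 0.7])
d_fake = np.array([0.2, 0.3, 0.1])

# Discriminator loss: push real scores toward 1 and fake scores toward 0.
d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

# Generator loss (non-saturating form): push scores on fakes toward 1.
g_loss = bce(d_fake, np.ones_like(d_fake))
print(d_loss, g_loss)
```

Training alternates between the two updates until the discriminator can no longer reliably tell rendered generated shapes from real images.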


Above: Mushrooms generated by the model.

Image Credit: Microsoft

The researchers say that their approach takes advantage of the lighting and shading cues the images provide, enabling it to extract more meaningful information per training sample and produce better results. Moreover, it’s able to produce realistic samples when trained on data sets of natural images. “Our approach … successfully detects the interior structure of concave objects using the differences in light exposures between surfaces,” wrote the paper’s coauthors, “enabling it to accurately capture concavities and hollow spaces.”

They leave to future work incorporating color, material, and lighting prediction into their system to extend it to work with more “general” real-world data sets.
