Artificial intelligence (AI) that can synthesize realistic three-dimensional object models isn’t as far-fetched as it might seem. In a paper (“Visual Object Networks: Image Generation with Disentangled 3D Representation”) accepted at the NeurIPS 2018 conference in Montreal, researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (MIT CSAIL) and Google describe a generative AI system capable of creating convincing shapes with realistic textures.

The AI system, called Visual Object Networks (VON), not only generates images that are more realistic than those produced by some state-of-the-art methods, it also enables shape and texture editing, viewpoint shifts, and other three-dimensional tweaks.

“Modern deep generative models learn to synthesize realistic images,” the researchers wrote. “Most computational models have only focused on generating a 2D image, ignoring the 3D nature of the world … This 2D-only perspective inevitably limits their practical usages in many fields, such as synthetic data generation, robotic learning, visual reality, and the gaming industry.”

VON tackles the problem by jointly synthesizing three-dimensional shapes and two-dimensional images using what the researchers call a “disentangled object representation.” The image generation model is decomposed into three factors: shape, viewpoint, and texture. It first learns to synthesize three-dimensional shapes, then projects them from a sampled viewpoint into “2.5D” sketches (a silhouette and a depth map), and finally adds textures.
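To make the shape, viewpoint, and texture split concrete, here is a minimal Python sketch of the idea. It is a toy under stated assumptions, not the researchers' implementation: the function names are hypothetical, the shape is a tiny hand-written voxel grid rather than a learned one, the projection is orthographic, and the "texture" is a single flat color.

```python
# Toy sketch, not the VON architecture: all names here are hypothetical.

def project_to_sketch(voxels):
    """Project a voxel grid into a 2.5D sketch (silhouette + depth map).

    voxels[z][y][x] is 1 for occupied and 0 for empty; the camera looks
    along +z. Depth is the index of the first occupied voxel on each ray,
    or None where the ray hits nothing.
    """
    depth_size, height, width = len(voxels), len(voxels[0]), len(voxels[0][0])
    silhouette = [[0] * width for _ in range(height)]
    depth = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for z in range(depth_size):
                if voxels[z][y][x]:
                    silhouette[y][x] = 1
                    depth[y][x] = z
                    break
    return silhouette, depth


def rotate_quarter_turn(voxels):
    """Viewpoint factor: rotate the grid 90 degrees about the vertical (y) axis."""
    depth_size, height, width = len(voxels), len(voxels[0]), len(voxels[0][0])
    return [
        [[voxels[x][y][width - 1 - z] for x in range(depth_size)]
         for y in range(height)]
        for z in range(width)
    ]


def paint(silhouette, rgb, background=(0, 0, 0)):
    """Texture factor (flat color stand-in): color every pixel the shape covers."""
    return [[rgb if hit else background for hit in row] for row in silhouette]


# Shape factor: a tiny 2 x 2 x 2 occupancy grid, indexed [z][y][x].
shape = [
    [[0, 0], [1, 0]],
    [[1, 1], [1, 0]],
]

silhouette, depth = project_to_sketch(shape)
image = paint(silhouette, (200, 30, 30))

# Same shape and texture, different viewpoint.
side_silhouette, side_depth = project_to_sketch(rotate_quarter_turn(shape))
side_image = paint(side_silhouette, (200, 30, 30))
```

Because the three inputs are separate, the shape can be swapped, the view rotated, or the color changed without touching the other two, which is the kind of editing the disentangled representation is meant to allow.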

Importantly, because the three factors are conditionally independent, the model doesn’t require paired data linking two-dimensional images to three-dimensional shapes. That enabled the team to train it on large-scale collections of two-dimensional images and three-dimensional shapes, like Pix3D, Google image search, and ShapeNet, the latter containing thousands of CAD models across 55 object categories.


Above: The system’s results compared to state-of-the-art AI models.

Image Credit: Google

To get VON to learn how to generate shapes of its own, the team trained a generative adversarial network (GAN) on the aforementioned three-dimensional shapes dataset. A GAN is a two-part neural network consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples. Texture generation was handled by a separate GAN-based neural network.
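The tug-of-war between generator and discriminator can be illustrated with the loss functions from the original GAN formulation. This is a hedged sketch: the article does not say which GAN loss variant VON uses, so the cross-entropy losses and the function names below are assumptions chosen for illustration, and the discriminator outputs are made-up probabilities rather than real network outputs.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy that the discriminator tries to minimize.

    d_real: probabilities of "real" assigned to real samples.
    d_fake: probabilities of "real" assigned to generated samples.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# A discriminator that cannot tell the two apart scores everything 0.5.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates them well scores real high and fake low.
confident = discriminator_loss([0.9, 0.9], [0.1, 0.1])

# The generator's loss rises as its samples get caught more often.
generator_when_caught = generator_loss([0.1, 0.1])
generator_when_fooling = generator_loss([0.9, 0.9])
```

Training alternates between lowering the discriminator's loss and lowering the generator's, so each network improves against the other.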

After roughly two to three days of training, the AI system consistently produced convincing 128 x 128 x 128 models with realistic reflectance, environment illumination, and albedo (the fraction of incoming light that a surface reflects diffusely).

To evaluate the image generation model, the team calculated a Fréchet Inception Distance, a metric that correlates with human perception (lower is better), for the generated images. Additionally, they showed 200 pairs of generated images from the VON and state-of-the-art models to five subjects on Amazon’s Mechanical Turk, who were tasked with choosing the more realistic result within the pair.
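The Fréchet Inception Distance treats the features of real and generated images as two Gaussian distributions and measures how far apart they are: the squared distance between their means plus a term that compares their covariances. The Python sketch below is a deliberately simplified, one-dimensional version with made-up numbers. The function name is hypothetical, and a real FID computation uses high-dimensional features from an Inception network and full covariance matrices rather than a single scalar feature.

```python
import statistics


def frechet_distance_1d(real_features, generated_features):
    """Fréchet distance between two 1-D Gaussians fitted to the samples.

    In one dimension the covariance term reduces to the squared difference
    of the standard deviations, so the distance is
    (mean_r - mean_g)^2 + (std_r - std_g)^2.
    """
    mean_r = statistics.mean(real_features)
    mean_g = statistics.mean(generated_features)
    std_r = statistics.pstdev(real_features)
    std_g = statistics.pstdev(generated_features)
    return (mean_r - mean_g) ** 2 + (std_r - std_g) ** 2


# Made-up scalar "features" standing in for Inception activations.
real = [0.0, 2.0]
close_model = [1.0, 3.0]   # same spread, mean shifted by 1
far_model = [5.0, 11.0]    # different mean and spread

close_score = frechet_distance_1d(real, close_model)
far_score = frechet_distance_1d(real, far_model)
```

A lower score means the generated distribution sits closer to the real one, which is why the model with the lowest distance is read as the most realistic.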

The VON performed exceptionally well. It had the lowest Fréchet Inception Distance of all AI models compared, and the Mechanical Turk respondents preferred its generated images 74 to 85 percent of the time.

The researchers leave to future work coarse-to-fine modeling for producing shapes and images at a higher resolution, disentangling texture into lighting and appearance, and synthesizing natural scenes.

“Our key idea is to disentangle the image generation process into three factors: shape, viewpoint, and texture,” the team wrote. “This disentangled 3D representation allows us to learn the model from both 3D and 2D visual data collections under an adversarial learning framework. Our model synthesizes more photorealistic images compared to existing 2D generative models; it also allows various 3D manipulations that are not possible with prior 2D methods.”

Research in GANs has advanced by leaps and bounds in recent years, particularly in the realm of machine vision.

Google’s DeepMind subsidiary in October unveiled a GAN-based system that can create convincing photos of food, landscapes, portraits, and animals out of whole cloth. In September, Nvidia researchers developed an AI model that produces synthetic scans of brain cancer, and in August a team at Carnegie Mellon demonstrated AI that could transfer a person’s recorded motion and facial expressions to a target subject in another photo or video. More recently, scientists at the University of Edinburgh’s Institute for Perception and Institute for Astronomy designed a GAN that can hallucinate galaxies — or at least high-resolution images of them.
