Ever shop online for carpet or fabric and wish you could tell what it looks like in real life? Thanks to researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Lab (CSAIL) and Inria Sophia Antipolis in France, you’re one step closer to being able to experience that.
Today during the 2018 Siggraph conference in Vancouver, the team jointly presented “Single-Image SVBRDF Capture with a Rendering-Aware Deep Network,” a method that reads the texture, highlights, and shading in a single photograph of a material to recover how that material reflects light, so it can be digitally recreated and rendered under new lighting.
“[V]isual cues … allow humans to perceive material appearance in single pictures,” the researchers wrote. “Yet, recovering spatially-varying bi-directional reflectance distribution functions (SVBRDFs) from a single image based on such cues has challenged researchers in computer graphics for decades. We tackle [the problem] by training a deep neural network to automatically extract and make sense of these visual cues.” A bi-directional reflectance distribution function (BRDF) is a function of four variables, the incoming and outgoing light directions, that defines how light is reflected at an opaque surface; a spatially varying BRDF lets that function change from point to point across the material.
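For readers who want the formal version, the standard radiometric definition of a BRDF makes the four variables explicit: two angles for the incoming direction and two for the outgoing one. The spatially varying form simply adds the surface position.

```latex
f_r(\omega_i, \omega_o) = \frac{\mathrm{d}L_o(\omega_o)}{\mathrm{d}E_i(\omega_i)}
  = \frac{\mathrm{d}L_o(\omega_o)}{L_i(\omega_i)\cos\theta_i \,\mathrm{d}\omega_i},
\qquad \omega_i = (\theta_i, \phi_i), \quad \omega_o = (\theta_o, \phi_o)

f_r(x, y;\, \theta_i, \phi_i, \theta_o, \phi_o) \qquad \text{(SVBRDF: the BRDF at surface point } (x, y)\text{)}
```

Here L_i is incoming radiance, E_i is the irradiance it contributes, L_o is reflected radiance, and θ_i is the angle between the incoming direction and the surface normal.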
The researchers started with samples — lots of them. They sourced a dataset of more than 800 “artist-created” materials, ultimately selecting 155 “high-quality” sets from nine different classes (paint, plastic, leather, metal, wood, fabric, stone, ceramic tiles, ground) and, after setting aside about a dozen to serve as a testing set, rendered them in a virtual scene meant to mimic a cellphone camera’s field of view (50 degrees) and flash.
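The rendered scene pairs a 50-degree field of view with a flash at the lens. As a sketch of what that geometry implies, assuming a flat sample facing the camera at an arbitrary distance, a pinhole camera, and a flash exactly co-located with the lens, the hypothetical `flash_directions` function below computes, for each pixel, the direction toward the camera. With a co-located flash, that is also the direction toward the light:

```python
import math

Vec3 = tuple[float, float, float]


def flash_directions(n: int, fov_deg: float = 50.0, distance: float = 1.0) -> list[list[Vec3]]:
    """For an n x n image of a flat material sample facing the camera, return the unit
    vector from each pixel's surface point toward the camera. Because the flash is
    assumed to sit at the lens, this is both the viewing and the lighting direction."""
    # Half the width of the sample that fits in the frame at this distance (pinhole model).
    half_width = distance * math.tan(math.radians(fov_deg) / 2)
    grid = []
    for row in range(n):
        # Map the pixel center to [-half_width, half_width] on the sample plane.
        y = ((row + 0.5) / n * 2 - 1) * half_width
        line = []
        for col in range(n):
            x = ((col + 0.5) / n * 2 - 1) * half_width
            # Camera and flash at (0, 0, distance); surface point at (x, y, 0).
            v = (-x, -y, distance)
            length = math.sqrt(sum(c * c for c in v))
            line.append((v[0] / length, v[1] / length, v[2] / length))
        grid.append(line)
    return grid
```

The center pixel looks straight down the surface normal, while at the middle of each edge the angle approaches 25 degrees (half the field of view) and grows larger toward the corners, so a single flash photo shows the surface under a range of angles.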
That wasn’t enough data to train a machine learning model, though, so to amplify the materials dataset, the researchers used a cluster of 40 CPUs to mix the materials and randomize their parameters. Ultimately, they generated 200,000 realistically shaded, “widely diverse” materials.
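The article doesn’t say exactly how the materials were combined. As one plausible stand-in, assuming each material is stored as per-pixel normal, diffuse, specular, and roughness values, a new material can be made by blending two existing ones with a random weight. `Texel`, `mix_texels`, and `mix_materials` are hypothetical names:

```python
import math
import random
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Texel:
    """One pixel of a material, stored as the four values the network predicts."""
    normal: Vec3      # unit surface normal, z > 0 (pointing toward the camera)
    diffuse: Vec3     # diffuse albedo (RGB)
    specular: Vec3    # specular albedo (RGB)
    roughness: float  # specular roughness in (0, 1]


def _lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t


def mix_texels(a: Texel, b: Texel, t: float) -> Texel:
    """Blend two texels with weight t in [0, 1] (0 gives a; 1 gives b up to rounding)."""
    n = tuple(_lerp(x, y, t) for x, y in zip(a.normal, b.normal))
    # Blending two unit vectors shortens them, so renormalize. Both normals have
    # z > 0, so the blend never collapses to the zero vector.
    length = math.sqrt(sum(c * c for c in n))
    return Texel(
        normal=tuple(c / length for c in n),
        diffuse=tuple(_lerp(x, y, t) for x, y in zip(a.diffuse, b.diffuse)),
        specular=tuple(_lerp(x, y, t) for x, y in zip(a.specular, b.specular)),
        roughness=_lerp(a.roughness, b.roughness, t),
    )


def mix_materials(a: list[Texel], b: list[Texel], rng: random.Random) -> list[Texel]:
    """Create a new material by blending two same-sized materials with one random weight."""
    t = rng.uniform(0.0, 1.0)
    return [mix_texels(x, y, t) for x, y in zip(a, b)]
```

Renormalizing the blended normal matters because linearly blending two unit vectors yields a shorter vector, which would darken the shading if used as-is.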
The next step was model training. The team designed a convolutional neural network, a type of machine learning algorithm loosely modeled on the arrangement of neurons in the visual cortex, to predict four maps: per-pixel surface normals (the orientation of the surface at each pixel), diffuse albedo (the surface’s base color, i.e., how much light it scatters diffusely), specular albedo (the color and strength of mirror-like reflections), and specular roughness (how sharp or blurry those reflections look, i.e., their “glossiness”).
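One way to see how the four maps work together is to shade a single pixel with them. The sketch below assumes a Cook-Torrance microfacet model with a GGX distribution, a common choice for this kind of normal/diffuse/specular/roughness parameterization; the article doesn’t name the exact model, and the `Texel` type and `shade` function are hypothetical:

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Texel:
    normal: Vec3      # per-pixel surface normal
    diffuse: Vec3     # diffuse albedo (RGB)
    specular: Vec3    # specular albedo (RGB): reflectance at normal incidence
    roughness: float  # specular roughness: higher means blurrier highlights


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _normalize(v: Vec3) -> Vec3:
    length = math.sqrt(_dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def shade(t: Texel, light_dir: Vec3, view_dir: Vec3) -> Vec3:
    """Reflected RGB radiance for a unit-intensity light (BRDF times cosine),
    using a Cook-Torrance microfacet model with a GGX distribution."""
    n, l, v = _normalize(t.normal), _normalize(light_dir), _normalize(view_dir)
    n_l, n_v = _dot(n, l), _dot(n, v)
    if n_l <= 0.0 or n_v <= 0.0:
        return (0.0, 0.0, 0.0)  # light or viewer is behind the surface
    h = _normalize((l[0] + v[0], l[1] + v[1], l[2] + v[2]))  # half vector
    n_h = max(_dot(n, h), 0.0)
    v_h = max(_dot(v, h), 0.0)

    alpha = max(t.roughness ** 2, 1e-3)  # common "roughness squared" remapping
    a2 = alpha * alpha
    d = a2 / (math.pi * (n_h * n_h * (a2 - 1.0) + 1.0) ** 2)  # GGX normal distribution
    k = alpha / 2.0
    g = (n_l / (n_l * (1.0 - k) + k)) * (n_v / (n_v * (1.0 - k) + k))  # Smith-Schlick shadowing

    out = []
    for kd, ks in zip(t.diffuse, t.specular):
        f = ks + (1.0 - ks) * (1.0 - v_h) ** 5  # Schlick Fresnel approximation
        specular = d * f * g / (4.0 * n_l * n_v)
        out.append((kd / math.pi + specular) * n_l)
    return (out[0], out[1], out[2])
```

With a flash at the lens, pass the same vector as `light_dir` and `view_dir`; lowering `roughness` concentrates the highlight and makes it brighter at its peak.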
Rather than scoring each predicted map against its ground truth in isolation, they formulated a “similarity metric” that compared renderings of the predicted maps against renderings of the ground-truth maps under varied lighting and viewing conditions, so the network is judged on how the material actually looks. And to keep the output consistent across the whole image, they added a second network track that gathers global information from the entire image and combines it with the local information at each pixel, facilitating, the researchers wrote, the “back-and-forth exchange of information across distant image regions.”
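The idea behind a rendering-based similarity metric can be sketched as follows. Comparing maps one at a time can misjudge errors, because a mistake in one map may be hidden or exaggerated by the others once the material is lit; comparing renderings measures what is actually visible. This sketch assumes a deliberately simplified renderer (Lambertian diffuse plus a Blinn-Phong highlight, not necessarily the paper’s shading model), grayscale albedos, an L1 distance on log-compressed pixel values, and nine random light and view directions. All of these choices and names are assumptions for illustration:

```python
import math
import random
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Texel:
    normal: Vec3
    diffuse: float    # grayscale diffuse albedo, to keep the sketch short
    specular: float   # grayscale specular albedo
    roughness: float


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _normalize(v: Vec3) -> Vec3:
    length = math.sqrt(_dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def render(t: Texel, light: Vec3, view: Vec3) -> float:
    """Simplified stand-in renderer: Lambertian diffuse plus a Blinn-Phong lobe that
    sharpens as roughness decreases. light and view must lie in the upper hemisphere."""
    n, l, v = _normalize(t.normal), _normalize(light), _normalize(view)
    n_l = max(_dot(n, l), 0.0)
    h = _normalize((l[0] + v[0], l[1] + v[1], l[2] + v[2]))
    exponent = 2.0 / max(t.roughness, 1e-3) ** 2
    return (t.diffuse / math.pi + t.specular * max(_dot(n, h), 0.0) ** exponent) * n_l


def random_direction(rng: random.Random) -> Vec3:
    """Uniformly distributed unit vector on the upper hemisphere (rejection sampling)."""
    while True:
        v = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0), rng.uniform(0.0, 1.0))
        r2 = _dot(v, v)
        if 1e-6 < r2 <= 1.0 and v[2] > 0.0:
            return _normalize(v)


def rendering_loss(pred: list[Texel], target: list[Texel], n_conditions: int = 9, seed: int = 0) -> float:
    """Mean L1 distance between log-compressed renderings of the predicted and target
    maps, with both rendered under the same random light/view pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_conditions):
        light, view = random_direction(rng), random_direction(rng)
        for p, t in zip(pred, target):
            # log(x + 0.01) keeps bright highlights from dominating the loss
            total += abs(math.log(render(p, light, view) + 0.01) - math.log(render(t, light, view) + 0.01))
    return total / (n_conditions * len(pred))
```

The key design point is that the predicted and target maps are always rendered under the same light and view pair, so any difference in the loss comes from the materials rather than the lighting.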
They trained the network for 400,000 iterations, then tested it on 350 photos snapped with an iPhone SE and a Nexus 5X, cropped to approximate the training data’s field of view. The result? The model performed rather well, successfully reproducing real-world reflections of light on metal, plastics, wood, paint, and other materials.
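Cropping a wider phone photo down to the training field of view follows from pinhole-camera geometry: the visible width scales with the tangent of half the field of view. A sketch, assuming the field of view is measured horizontally and using a hypothetical 65-degree phone camera (not the published specs of either phone); `crop_fraction` and `center_crop_box` are illustrative names:

```python
import math

TRAINING_FOV_DEG = 50.0  # field of view of the virtual training camera


def crop_fraction(photo_fov_deg: float, target_fov_deg: float = TRAINING_FOV_DEG) -> float:
    """Fraction of the photo's width to keep so the crop spans target_fov_deg (pinhole model)."""
    if photo_fov_deg < target_fov_deg:
        raise ValueError("photo is narrower than the target field of view; cropping cannot widen it")
    return math.tan(math.radians(target_fov_deg) / 2) / math.tan(math.radians(photo_fov_deg) / 2)


def center_crop_box(width: int, height: int, photo_fov_deg: float) -> tuple[int, int, int, int]:
    """Centered square crop (left, top, right, bottom) spanning the target FOV horizontally."""
    side = min(round(width * crop_fraction(photo_fov_deg)), height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```

The crop is square because the network works on square 256 x 256 inputs; the cropped region would then be downscaled to that size.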
It wasn’t without its limitations, unfortunately. Hardware constraints limited it to images of 256 x 256 pixels, and it struggled to reproduce lighting and reflections from photos with low dynamic range. Still, the team noted that it generalized well to real photographs and showed, if nothing else, that “a single network can be trained to handle a large variety of materials.”