Placing objects realistically into scenes with post-production software is much tougher for computers than for humans. It requires not only finding an appropriate location for the object, but also predicting the object’s appearance at that location: its scale, occlusions, pose, shape, and more.

Fortunately, artificial intelligence (AI) promises to lend a hand. In a paper accepted at the NeurIPS 2018 conference last week (“Context-Aware Synthesis and Placement of Object Instances”), researchers at Seoul National University, the University of California at Merced, and Google AI describe a system that learns to insert an object into an image in a “semantically coherent” (that is to say, convincing) manner.

“Inserting objects into an image that conforms to scene semantics is a challenging and interesting task. The task is closely related to many real-world applications, including image synthesis, AR and VR content editing and domain randomization,” the researchers wrote. “Such an object insertion model can potentially facilitate numerous image editing and scene parsing applications.”

Their end-to-end framework comprises two modules: one that determines where the inserted object should go and a second that determines what it should look like. Both leverage generative adversarial networks (GANs), two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples. Because the system models the “where” and the “what” jointly, conditioned on the input image, the two modules can communicate with and optimize each other.
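To make the generator-versus-discriminator contest concrete, here is a minimal sketch of the standard GAN losses in plain Python. It is not the paper's training code: the function names and the example probabilities are invented for illustration, and the researchers' actual objectives may differ.

```python
import math

def discriminator_loss(p_real: float, p_fake: float) -> float:
    """Binary cross-entropy loss for the discriminator.

    p_real: probability the discriminator assigns to a real sample being real.
    p_fake: probability it assigns to a generated sample being real.
    """
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake: float) -> float:
    """Non-saturating generator loss: small when the discriminator is fooled."""
    return -math.log(p_fake)

# Hypothetical discriminator outputs, chosen only for illustration.
confident = discriminator_loss(p_real=0.9, p_fake=0.1)  # spots the fake
fooled = discriminator_loss(p_real=0.9, p_fake=0.6)     # believes the fake

print(confident < fooled)                          # True
print(generator_loss(0.6) < generator_loss(0.1))   # True
```

The two losses pull in opposite directions: whatever raises the discriminator's belief in a generated sample lowers the generator's loss and raises the discriminator's.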


Above: A schematic of the system’s flow.

Image Credit: Google

“The main technical novelty of this work is to construct an end-to-end trainable neural network that can sample plausible locations and shapes for the new object from its joint distribution,” the paper’s authors wrote. “The synthesized object instances can be used either as an input for GAN based methods or for retrieving the closest segment from an existing dataset, to generate new images.”

As they explain, the generator in this case predicts “plausible” locations and generates object masks with “semantically coherent” scales, poses, and shapes. In other words, it learns how objects are distributed in a scene and how to insert an object naturally so that it appears to be part of the scene. Over the course of training, the AI system learns a different distribution for each object category, conditioned on the scene. In images of city streets, for example, it learns that people tend to be on sidewalks and cars are usually on the road.
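The idea of a per-category placement distribution can be illustrated with a toy example. The sketch below is a deliberately simple stand-in, not the researchers' neural network: it estimates how often each category sits on each surface by counting, then samples a grid cell accordingly. The observations, the label map, and the function names are all hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical training observations: (object category, surface it sat on).
observations = [
    ("person", "sidewalk"), ("person", "sidewalk"), ("person", "road"),
    ("car", "road"), ("car", "road"), ("car", "road"),
]

def fit_placement_prior(pairs):
    """Estimate P(surface | category) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for category, surface in pairs:
        counts[category][surface] += 1
    return {
        category: {s: n / sum(c.values()) for s, n in c.items()}
        for category, c in counts.items()
    }

def sample_location(prior, category, scene, rng):
    """Pick a (row, col) cell for `category`, weighted by the fitted prior."""
    cells = [(r, c) for r, row in enumerate(scene) for c, _ in enumerate(row)]
    weights = [prior[category].get(scene[r][c], 0.0) for r, c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]

# A tiny hypothetical label map of a street scene.
scene = [
    ["building", "building", "building"],
    ["sidewalk", "sidewalk", "sidewalk"],
    ["road", "road", "road"],
]

prior = fit_placement_prior(observations)
rng = random.Random(0)
row, col = sample_location(prior, "car", scene, rng)
print(scene[row][col])  # road
```

Because every hypothetical car observation was on the road, the sampler can only place cars on road cells, while people land on the sidewalk about two-thirds of the time.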

In tests, the researchers’ model outperformed the baseline by inserting realistically shaped objects. When an object detector, YOLOv3, was applied to images produced by the AI, it detected the synthesized objects with 0.79 recall. More tellingly, in a survey performed with workers from Amazon’s Mechanical Turk, 43 percent thought that the AI-generated objects were real.
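For readers unfamiliar with the metric, recall is the share of objects actually present that the detector managed to find. A minimal sketch follows; the counts are hypothetical and were picked only to reproduce a 0.79 figure, since the number of objects evaluated is not given here.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of the objects actually present that the detector found."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined with no ground-truth objects")
    return true_positives / total

# Hypothetical counts: 79 inserted objects detected, 21 missed.
print(recall(true_positives=79, false_negatives=21))  # 0.79
```

Recall says nothing about false alarms, so it measures only whether the synthesized objects looked enough like the real thing to be picked up.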

“This shows that our approach is capable of performing the object synthesis and insertion task,” the researchers wrote. “As our method jointly models where and what, it could be used for solving other computer vision problems. One of the interesting future works would be handling occlusions between objects.”