Facebook's AI convincingly inserts people into photos

In a paper published last week on the preprint server arXiv.org, scientists affiliated with Facebook AI Research and Tel Aviv University propose a novel technique for inserting people into existing images in a photorealistic, high-resolution way. The technique relies on AI that creates a semantic map of the new person and estimates the poses of the other people in a given picture, then renders the new person's pixels and generates a face to match that of the target person.

While inserting folks into frames might not seem like the most practical application of AI, it could be a boon for creative industries where photo and film reshoots tend to be costly. For instance, using this newly proposed AI system, a photographer could digitally insert an actor without having to spend hours achieving the right effect in image editing software.

The researchers' approach employs three models:

An essence generation network (EGN) that synthesizes the semantic pose information of a target person in a new image.
A multi-conditioning rendering network (MCRN) that renders a realistic person, given a semantic pose map and a segmented target person.
A face refinement network (FRN) that's used to touch up the high-level features of a generated face.

The EGN is trained to capture human interaction in an image and come up with a coherent way for a new person to join the image. The semantic map it creates represents background, hair, faces, torsos, upper limbs, upper-body wear, lower-body wear, lower limbs, and shoes in a way that's compatible with the context of existing people. Optionally, it supports the use of a bounding box -- a temporary outline -- to specify the approximate size and position of the new person.
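
To make the idea of a semantic map concrete, here is a minimal Python sketch. It assumes the map is a grid in which each cell holds one of the nine labels listed above, and that the optional bounding box simply limits where the new person's labels may appear. The names `Part`, `blank_map`, `paint`, and `restrict_to_box` are hypothetical and do not come from the paper, and the clipping behavior is an illustrative assumption rather than the researchers' method.

```python
from enum import IntEnum


class Part(IntEnum):
    """Hypothetical label indices for the nine classes named in the article."""
    BACKGROUND = 0
    HAIR = 1
    FACE = 2
    TORSO = 3
    UPPER_LIMBS = 4
    UPPER_BODY_WEAR = 5
    LOWER_BODY_WEAR = 6
    LOWER_LIMBS = 7
    SHOES = 8


def blank_map(height, width):
    """Return a height x width grid filled with the background label."""
    return [[Part.BACKGROUND for _ in range(width)] for _ in range(height)]


def paint(semantic_map, part, top, left, bottom, right):
    """Label a rectangle in place (bottom and right are exclusive)."""
    for r in range(top, bottom):
        for c in range(left, right):
            semantic_map[r][c] = part


def restrict_to_box(semantic_map, box):
    """Return a copy in which labels outside box become background.

    box is (top, left, bottom, right) with exclusive bottom and right.
    This clipping rule is an assumption made for illustration only.
    """
    top, left, bottom, right = box
    return [
        [
            label if (top <= r < bottom and left <= c < right) else Part.BACKGROUND
            for c, label in enumerate(row)
        ]
        for r, row in enumerate(semantic_map)
    ]


# Toy 6 x 4 map of a single person, painted by hand.
person = blank_map(6, 4)
paint(person, Part.HAIR, 0, 1, 1, 3)
paint(person, Part.FACE, 1, 1, 2, 3)
paint(person, Part.UPPER_BODY_WEAR, 2, 0, 4, 4)
paint(person, Part.LOWER_BODY_WEAR, 4, 1, 5, 3)
paint(person, Part.SHOES, 5, 1, 6, 3)

# Keep only the labels that fall inside a user-supplied box.
boxed = restrict_to_box(person, (1, 1, 6, 3))
```

In a real system the map would be predicted by a network rather than painted by hand; the sketch only shows the kind of data structure the article describes.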

As for the MCRN, it learns to render and blend a realistic person into an image to create a new image, embedding the target person's appearance attributes (e.g., shirt, pants, and hair color) in such a way that they can be customized. The FRN then fine-tunes the cropped face of the new person obtained from the original image of the person.
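
The order in which the three models hand off their results can be sketched in a few lines of Python. This is only a data-flow illustration under stated assumptions: `insert_person` and the three stub functions are hypothetical names, and the stubs return plain dictionaries in place of the images and maps that trained networks would produce.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical box format: (top, left, bottom, right).
Box = Tuple[int, int, int, int]


def insert_person(
    image: Any,
    target: Any,
    egn: Callable[[Any, Optional[Box]], Any],
    mcrn: Callable[[Any, Any, Any], Any],
    frn: Callable[[Any, Any], Any],
    box: Optional[Box] = None,
) -> Any:
    """Chain the three stages in the order the article describes."""
    # 1. EGN: propose a semantic pose map for the new person,
    #    optionally guided by a bounding box.
    pose_map = egn(image, box)
    # 2. MCRN: render the target person into that pose and blend
    #    the result into the image.
    composite = mcrn(image, pose_map, target)
    # 3. FRN: refine the generated face using the target person.
    return frn(composite, target)


# Stand-in stages (assumptions, not real models) so the sketch runs.
def stub_egn(image, box):
    return {"pose_map_for": image, "box": box}


def stub_mcrn(image, pose_map, target):
    return {
        "image": image,
        "pose_map": pose_map,
        "person": target,
        "face_refined": False,
    }


def stub_frn(composite, target):
    refined = dict(composite)
    refined["face_refined"] = True
    return refined


result = insert_person(
    "group_photo",
    "target_person",
    stub_egn,
    stub_mcrn,
    stub_frn,
    box=(10, 20, 200, 120),
)
```

Passing the stages in as functions keeps the sketch independent of any particular model implementation; swapping a stub for a trained network would not change the control flow.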

During experiments, the coauthors trained EGN and MCRN on over 20,000 randomly selected images from the open source Multi-Human Parsing data set, which translated to between 51,717 and 53,598 training samples. When human volunteers were tasked with distinguishing people inserted by the AI system from others in photographs, they succeeded an average of 43% of the time, and just 28% of the time with photos containing five people.

The coauthors concede that their approach has limitations, namely that it fails to generate people who occlude other people in photos and that it doesn't condition on target people and their attributes. (The latter results in hair that's not the same style as the target person's and a lack of control over the order of people within scenes.) But they believe that these can be overcome with improved training techniques.

"From a general perspective, we demonstrate the ability to modify images, adhering to the semantics of the scene, while preserving the overall image quality," the coauthors wrote. "We demonstrate a convincing ability to add a target person to an existing image."

The Facebook team's work builds on an AI system proposed by Google that can realistically insert objects (like cars and pedestrians) into photos, in part with a model that attempts to predict the object's occlusions, scale, pose, shape, and more at the target location. Meanwhile, MIT researchers made an image editing AI that can replace the background in any image.
