Researchers at Google, the University of California at Merced, and Yonsei University developed an AI system, RetrieveGAN, that takes scene descriptions and learns to select compatible patches from other images to create entirely new images. They claim it could be beneficial for certain kinds of media and image editing, particularly in domains where artists combine two or more images to capture the most appealing elements of each.

AI and machine learning hold incredible promise for image editing, if emerging research is any indication. Engineers at Nvidia recently demoed a system — GauGAN — that creates convincingly lifelike landscape photos from whole cloth. Microsoft scientists proposed a framework capable of producing images and storyboards from natural language captions. And last June, the MIT-IBM Watson AI Lab launched a tool — GAN Paint Studio — that lets users upload images and edit the appearance of pictured buildings, flora, and fixtures.

By contrast, RetrieveGAN captures the relationships among objects in existing images and uses them to create synthetic (but convincing) scenescapes. Given a scene graph description (a description of the objects in a scene and their relationships), it encodes the graph in a computationally friendly way, looks for aesthetically similar patches from other images, and grafts one or more of the patches onto the original image.
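To make the retrieval step concrete, here is a minimal Python sketch of the idea as described above. It is not the researchers' implementation: the `Patch` class, the `cosine` and `select_patches` helpers, the toy two-dimensional embeddings, and the greedy scoring rule are all assumptions made for illustration. The scene graph is written as (subject, relation, object) triples, and one patch is picked per object, favoring candidates that resemble the patches already chosen.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    label: str        # object category the patch depicts
    embedding: tuple  # hypothetical feature vector for the patch


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_patches(scene_graph, candidates):
    """Greedily pick one patch per object named in the scene graph."""
    # Collect the objects in the order they first appear in the triples.
    objects = []
    for subject, _relation, obj in scene_graph:
        for name in (subject, obj):
            if name not in objects:
                objects.append(name)

    chosen = []
    for name in objects:
        pool = [p for p in candidates if p.label == name]
        if not pool:
            continue  # no candidate patch exists for this object

        def score(patch):
            # Mean similarity to the patches selected so far (0 for the first pick).
            if not chosen:
                return 0.0
            total = sum(cosine(patch.embedding, c.embedding) for c in chosen)
            return total / len(chosen)

        chosen.append(max(pool, key=score))
    return chosen


# Toy example: two candidate patches each for "grass" and "sky".
scene_graph = [("sheep", "on", "grass"), ("sky", "above", "grass")]
candidates = [
    Patch("sheep", (1.0, 0.0)),
    Patch("grass", (0.9, 0.1)),
    Patch("grass", (0.0, 1.0)),
    Patch("sky", (0.8, 0.2)),
    Patch("sky", (-1.0, 0.0)),
]
selected = select_patches(scene_graph, candidates)
```

In this toy run, the grass and sky patches whose embeddings sit closest to the sheep patch are the ones selected. A real system would learn the embeddings rather than hard-code them.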

The researchers trained and evaluated RetrieveGAN on images from the open source COCO-Stuff and Visual Genome data sets. In experiments, they found that it was “significantly” better at isolating and extracting objects from scenes on at least one benchmark compared with several baseline systems. In a subsequent user study where volunteers were given two sets of patches selected by RetrieveGAN and other models and asked the question “Which set of patches are more mutually compatible and more likely to coexist in the same image?,” the researchers report that RetrieveGAN’s patches came out on top the majority of the time.

“In this work, we present a differentiable retrieval module to aid the image synthesis from the scene description. Through the iterative process, the retrieval module selects mutually compatible patches as reference for the generation. Moreover, the differentiable property enables the module to learn a better embedding function jointly with the image generation process,” the researchers wrote. “The proposed approach points out a new research direction in the content creation field. As the retrieval module is differentiable, it can be trained with the generation or manipulation models to learn to select real reference patches that improves the quality.”
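Picking a single patch is a discrete choice, which ordinarily blocks gradients from flowing back to the embedding function. One common workaround is a Gumbel-softmax relaxation, sketched below in Python. The quoted passage does not say which technique the module relies on, so treat this as an illustration of what "differentiable" selection can look like, not as the researchers' method. The function names, the dot-product scoring, and the toy vectors are all assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return soft selection weights that sum to 1 (a relaxed one-hot vector)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) for U uniform in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(query, candidates, temperature=0.5, seed=0):
    """Blend candidate embeddings using relaxed selection weights."""
    rng = random.Random(seed)
    # Assumed scoring rule: dot product between query and candidate embeddings.
    logits = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = gumbel_softmax(logits, temperature, rng)
    dim = len(candidates[0])
    blended = [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(dim)
    ]
    return weights, blended


# Toy example: a query embedding and three candidate patch embeddings.
weights, blended = soft_select((1.0, 0.0), [(0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)])
```

Because the weights are a smooth function of the scores, an automatic differentiation framework could pass gradients through them to the embeddings. Lowering the temperature pushes the weights toward a hard, one-hot choice.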

Although the researchers don’t mention it, there’s a real possibility their tool could be used to create deepfakes, or synthetic media in which a person in an existing image is replaced with someone else’s likeness. Fortunately, a number of companies have published corpora in the hope that the research community will use them to develop detection methods. Facebook, along with Amazon Web Services (AWS), the Partnership on AI, and academics from a number of universities, is spearheading the Deepfake Detection Challenge. In September 2019, Google released a collection of visual deepfakes as part of the FaceForensics benchmark, which was co-created by the Technical University of Munich and the University Federico II of Naples. More recently, researchers from SenseTime partnered with Nanyang Technological University in Singapore to design DeeperForensics-1.0, a data set for face forgery detection that they claim is the largest of its kind.