SenseTime's AI generates realistic deepfake videos

Deepfakes -- media that takes a person in an existing image, audio recording, or video and replaces them with someone else's likeness -- are becoming increasingly convincing. In late 2019, researchers at Seoul-based Hyperconnect developed a tool (MarioNETte) that could manipulate the facial features of a historical figure, a politician, or a CEO using nothing but a webcam and still images. More recently, a team hailing from Hong Kong-based tech giant SenseTime, Nanyang Technological University, and the Chinese Academy of Sciences' Institute of Automation proposed a method of editing target portrait footage by taking sequences of audio to synthesize photo-realistic videos. As opposed to MarioNETte, SenseTime's technique is dynamic, meaning it's better able to handle media it hasn't encountered before. And the results are impressive, albeit worrisome in light of recent developments involving deepfakes.

The coauthors of the study describing the work note that the task of "many-to-many" audio-to-video translation -- that is, translation that doesn't assume a single identity for the source video and the target video -- is challenging. Typically, only a small number of videos are available to train an AI system, and any method has to cope with large audio-video variations among subjects and the absence of knowledge about scene geometry, materials, lighting, and dynamics.

To overcome these challenges, the team's approach uses the expression parameter space, or the values relating to facial features set before training begins, as the target space for audio-to-video mapping. They say that this helps the system learn the mapping more effectively than it would from full pixels, since expressions are more semantically relevant to the audio source and can be manipulated by generating parameters through machine learning algorithms.

In the researchers' framework, generated expression parameters -- combined with geometry and pose parameters of the target person -- inform the reconstruction of a three-dimensional face mesh with the same identity and head pose as the target but with lip movements that match source audio phonemes (perceptually distinct units of sound). A specialized component keeps audio-to-expression translation agnostic to the identity of the source audio, making the translation robust against variations in the voices of different people and source audio. And the system extracts features -- landmarks -- from the person's mouth region to ensure each movement is precisely mapped. It first represents the landmarks as heatmaps and then combines the heatmaps with frames in the source video; the heatmaps and frames together serve as the input for completing the mouth region.
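To make the two ideas in that pipeline more concrete, here is a minimal Python sketch, not the researchers' code. It assumes a face can be described by three parameter lists (geometry, pose, expression) and that each mouth landmark is drawn as a Gaussian heatmap, which is a common convention but is not confirmed by the study as described here. The function names, the dictionary layout, and the `sigma` value are all hypothetical.

```python
import math


def recombine_parameters(target, generated_expression):
    """Keep the target's geometry and pose; swap in audio-driven expression.

    The dictionary layout is a hypothetical stand-in for whatever face
    model the real system uses.
    """
    return {
        "geometry": list(target["geometry"]),
        "pose": list(target["pose"]),
        "expression": list(generated_expression),
    }


def landmark_heatmap(x, y, width, height, sigma=1.5):
    """Render one landmark at pixel (x, y) as a 2D Gaussian bump.

    Assumption: Gaussian rendering with an arbitrary sigma. Values are
    1.0 at the landmark and fall toward 0 farther away.
    """
    return [
        [
            math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
            for col in range(width)
        ]
        for row in range(height)
    ]


def stack_input(frame, landmarks, sigma=1.5):
    """Combine a grayscale frame with one heatmap channel per landmark.

    Returns a channels-first list: [frame, heatmap_0, heatmap_1, ...].
    """
    height, width = len(frame), len(frame[0])
    heatmaps = [landmark_heatmap(x, y, width, height, sigma) for x, y in landmarks]
    return [frame] + heatmaps


# Usage: a blank 6x8 frame and two illustrative mouth-corner landmarks.
blank_frame = [[0.0] * 8 for _ in range(6)]
network_input = stack_input(blank_frame, [(2, 3), (5, 3)])
```

In this sketch, only the expression list changes between source and target, which mirrors the article's point that identity and head pose stay with the target person while the lip movements follow the audio.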

The researchers say that in a study that tasked 100 volunteers with evaluating the realism of 168 video clips, half of which were synthesized by the system, synthesized videos were labeled as "real" 55% of the time compared with 70.1% of the time for the ground truth. They attribute this to their system's superior ability to capture teeth and face texture details, as well as features like mouth corners and nasolabial folds (the indentation lines on either side of the mouth that extend from the edge of the nose to the mouth's outer corners).

The researchers acknowledge that their system could be misused or abused for "various malevolent purposes," like media manipulation or the "dissemination of malicious propaganda." As remedies, they suggest "safeguarding measures" and the enactment and enforcement of legislation to mandate edited videos be labeled as such. "Being at the forefront of developing creative and innovative technologies, we strive to develop methodologies to detect edited video as a countermeasure," they wrote. "We also encourage the public to serve as sentinels in reporting any suspicious-looking videos to the [authorities]. Working in concert, we shall be able to promote cutting-edge and innovative technologies without compromising the personal interest of the general public."

Unfortunately, those proposals seem unlikely to stem the flood of deepfakes generated by AI systems like the one described above. Amsterdam-based cybersecurity startup Deeptrace found 14,698 deepfake videos on the internet during its most recent tally in June and July, up from 7,964 last December -- an 84% increase within only seven months. That’s troubling not only because deepfakes might be used to sway public opinion during, say, an election, or to implicate someone in a crime they didn’t commit, but because the technology has already generated pornographic material and swindled firms out of hundreds of millions of dollars.
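The growth figure can be checked from the two Deeptrace counts quoted above. The short Python sketch below uses a hypothetical helper, `percent_increase`, to compute the relative change; the exact result is a little under 85%, so the article's 84% appears to be rounded down.

```python
def percent_increase(old, new):
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# Counts quoted from Deeptrace: 7,964 last December, 14,698 in June and July.
growth = percent_increase(7964, 14698)
print(f"{growth:.1f}%")  # prints 84.6%
```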

In an attempt to fight deepfakes' spread, Facebook -- along with Amazon Web Services (AWS), Microsoft, the Partnership on AI, and academics from Cornell Tech; MIT; University of Oxford; UC Berkeley; University of Maryland, College Park; and State University of New York at Albany -- is spearheading the Deepfake Detection Challenge, which was announced in September. The challenge’s launch in December came after the release of a large corpus of visual deepfakes produced in collaboration with Jigsaw, Google's internal technology incubator, which was incorporated into a benchmark made freely available to researchers for synthetic video detection system development. Earlier in the year, Google made public a data set of speech containing phrases spoken by the company's text-to-speech models, as part of the ASVspoof 2019 competition to develop systems that can distinguish between real and computer-generated speech.

Coinciding with these efforts, Facebook, Twitter, and other online platforms have pledged to implement new rules regarding the handling of AI-manipulated media.