Carnegie Mellon researchers create the most convincing deepfakes yet

Ever heard of "deepfakes"? These videos, generated by artificial intelligence (AI) systems that learn to superimpose the face of one person onto the body of another, have been used to swap Harrison Ford for Nicolas Cage in countless movie clips, and for far more nefarious purposes, like fake celebrity porn and propaganda. Now, for better or worse, researchers at Carnegie Mellon University have developed a new AI system that's more powerful -- and versatile -- than previous attempts.

It's called "Recycle-GAN," and the team described it as an "unsupervised, data-driven approach" for transferring the content of one video or photo to another. "Such a content translation and style preservation task has numerous applications, including human motion and face translation from one person to other, teaching robots from human demonstration," the researchers wrote, "or converting black-and-white videos to color."

So far, most state-of-the-art transfer techniques have targeted human faces, but the researchers said such face-specific approaches "lack generalization to other domains" and "fail when applied to occluded faces." Others rely on paired image-to-image translation, which requires labor-intensive manual data labeling and alignment.

Recycle-GAN, in contrast, leveraged conditional generative adversarial networks (GANs) and "spatiotemporal cues" to learn "better association" between two pictures or videos. (GANs are two-part models consisting of a generator, which produces increasingly realistic outputs from input data, and a discriminator, which tries to tell those outputs apart from real examples; the generator's goal is to "fool" the discriminator.) When trained on footage of human subjects, it was able to generate videos that captured subtle expressions, like dimples that formed when smiling and the movement of the lines around the mouth.
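
For readers who want to see the generator-versus-discriminator idea in concrete terms, here is a minimal sketch of the standard adversarial losses used to train a generic GAN. It is not Recycle-GAN's actual objective, which the researchers describe as also relying on spatiotemporal cues; the function names and the sample probabilities below are assumptions chosen purely for illustration.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Binary cross-entropy loss for the discriminator.

    d_real: probability the discriminator assigns to a real sample being real.
    d_fake: probability it assigns to a generated sample being real.
    The loss is low when d_real is near 1 and d_fake is near 0.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)


# Hypothetical discriminator outputs, not measurements from any real model.
early_training = generator_loss(0.1)  # the fake is easily spotted
late_training = generator_loss(0.9)   # the fake is mostly accepted as real

# The generator's loss falls as its outputs become more convincing.
print(early_training > late_training)
```

In an actual system both parts are neural networks, and training alternates between lowering the discriminator's loss and lowering the generator's, which is what pushes the generated video toward realism.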

"Without any manual supervision and domain-specific knowledge, our approach learns this retargeting from one domain to the other, using publicly available video data on the web from both domains," the team wrote.

Recycle-GAN is capable of much more than capturing facial tics. The researchers used it to modify the weather conditions in a video, converting a breezeless day to a windy day. They aligned blooming and dying flowers, and they synthesized a convincing sunrise from videos on the web.

The results were good enough to fool 15 test subjects 28.3 percent of the time, but the team believes future versions of the system could be made more accurate if they learned the speed of "generated output," like the different rates at which people speak.

"A true notion of style should be able to generate even this variation in time required for delivering speech/content," the team wrote. "We believe that better spatiotemporal neural network architecture could attempt this problem in the near future."

Deepfakes remain a hot-button issue, unsurprisingly. Publicly available tools make them relatively easy to create, and there's no legal recourse for the victims of malicious AI-generated videos.

Reddit, Pornhub, Twitter, and other platforms have taken a stance against them, and researchers (most recently at the U.S. Defense Department) continue to look for ways of detecting deepfakes. But as Eric Goldman, a professor at Santa Clara University School of Law and director of the school's High Tech Law Institute, cautioned recently, it's probably best to "prepare for a world where we are routinely exposed to a mix of truthful and fake photos and videos."
