Researchers fool deepfake detectors into classifying fake images as real

In a paper published this week on the preprint server Arxiv.org, researchers from Google and the University of California at Berkeley demonstrate that even the best forensic classifiers -- AI systems trained to distinguish between real and synthetic content -- are susceptible to adversarial attacks, which use inputs designed to cause models to make mistakes. Their work follows that of a team at the University of California at San Diego, which recently demonstrated that fake video detectors can be bypassed by adversarially modifying videos synthesized with existing AI generation methods -- specifically, by injecting information into each frame.

It's a troubling, if not necessarily new, development for organizations attempting to productize fake media detectors, particularly considering the meteoric rise in deepfake content online. Fake media might be used to sway opinions during an election or implicate a person in a crime, and it's already been abused to generate pornographic material of actors and defraud a major energy producer.

The researchers first tackled the simpler task of evaluating classifiers to which they had unfettered access. Using this "white-box" threat model and a data set of 94,036 sample images, they modified synthesized images so that they were misclassified as real and vice versa, applying several attacks -- a distortion-minimizing attack, a loss-minimizing attack, a universal adversarial-patch attack, and a universal latent-space attack -- to a classifier taken from the academic literature.

The distortion-minimizing attack, which involved adding a small perturbation (i.e., modifying a subset of pixels) to a synthetically generated image, caused one classifier to misclassify 71.3% of images with only 2% pixel changes and 89.7% of images with 4% pixel changes. Perhaps more alarmingly, the model classified 50% of real images as fake after the researchers distorted under 7% of the images' pixels.
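The idea of flipping a verdict by touching as few pixels as possible can be illustrated on a toy model. The sketch below is not the researchers' method: it assumes a hypothetical linear detector (score above zero means "fake") with random weights, and a greedy strategy that changes the most influential pixels first.

```python
import numpy as np

def min_pixel_attack(x, w, b):
    """Flip a linear detector's verdict from "fake" to "real" while
    changing as few pixels as possible (greedy, pixels stay in [0, 1])."""
    x_adv = x.copy()
    # Most helpful value for each pixel: 0 if its weight is positive, else 1.
    target = np.where(w > 0, 0.0, 1.0)
    # Score reduction gained by moving each pixel to that value (never negative).
    gain = w * (x - target)
    changed = 0
    for i in np.argsort(-gain):          # most influential pixels first
        if x_adv @ w + b <= 0 or gain[i] <= 0:
            break                        # already "real", or nothing left to gain
        x_adv[i] = target[i]
        changed += 1
    return x_adv, changed

rng = np.random.default_rng(0)
x = rng.random(1024)                     # stand-in for a flattened 32x32 fake image
w = rng.normal(size=1024)                # hypothetical detector weights (assumption)
b = 1.0 - x @ w                          # bias chosen so the detector scores x as fake
x_adv, changed = min_pixel_attack(x, w, b)
print(f"changed {changed} of {x.size} pixels ({100 * changed / x.size:.2f}%)")
```

Real detectors are deep networks rather than linear models, so attacks on them rely on iterative optimization, but the objective is the same: the smallest change that crosses the decision boundary.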

As for the loss-minimizing attack, which fixed the image distortion to be less than a specified threshold, it reduced the classifier's accuracy from 96.6% to 27%. The universal adversarial-patch attack was even more effective -- a visible noise pattern overlaid on two fake images spurred the model to classify them as real with likelihoods of 98% and 86%, respectively. And the final attack -- the universal latent-space attack, where the team modified the underlying representation leveraged by an image-generating model to yield an adversarial image -- reduced classification accuracy from 99% to 17%.
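An attack that caps the distortion at a fixed threshold can be sketched in a few lines. The example below is an illustration under stated assumptions, not the paper's implementation: it uses a hypothetical logistic detector and a standard projected signed-gradient loop, with `eps` as the per-pixel distortion budget.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bounded_attack(x, w, b, eps, steps=10):
    """Raise a logistic detector's loss on a fake image x while keeping
    every pixel within eps of its original value (and inside [0, 1])."""
    x_adv = x.copy()
    step = eps / steps
    for _ in range(steps):
        p_fake = sigmoid(x_adv @ w + b)
        grad = (p_fake - 1.0) * w                  # d(log-loss)/dx for the true label "fake"
        x_adv = x_adv + step * np.sign(grad)       # move in the direction that raises the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # enforce the distortion budget
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid image
    return x_adv

rng = np.random.default_rng(0)
x = rng.random(1024)                               # stand-in for a flattened fake image
w = rng.normal(size=1024)                          # hypothetical detector weights (assumption)
b = 1.0 - x @ w                                    # detector initially leans "fake"
x_adv = bounded_attack(x, w, b, eps=0.02)
print("P(fake) before:", sigmoid(x @ w + b))
print("P(fake) after: ", sigmoid(x_adv @ w + b))
print("largest pixel change:", np.max(np.abs(x_adv - x)))
```

The two `np.clip` calls are what distinguish this from an unconstrained attack: however many steps are taken, no pixel moves more than `eps` from its original value.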

The researchers next investigated a black-box attack where the inner workings of the target classifier were unknown to them. They developed their own classifier by collecting one million images synthesized by an AI model and one million real images on which that model was trained, then training a separate system to classify images as fake or real. They then generated white-box adversarial examples against this source classifier using a distortion-minimizing attack. They report that this reduced their own classifier's accuracy from 85% to 0.03% and that, when applied to a popular third-party classifier, it reduced that classifier's accuracy from 96% to 22%.
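The transfer step, attacking a substitute and reusing the result against a model whose internals are unknown, can be sketched on toy data. Everything here is an assumption for illustration: the Gaussian "features", the two simple linear models, and the 0.5 step size are hypothetical and much simpler than the image classifiers in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50
MU = 0.3 * np.ones(DIM)   # toy "fake" features centre on +MU, "real" on -MU

def sample(n):
    """Draw n toy fake and n toy real feature vectors with labels (1 = fake)."""
    fake = rng.normal(size=(n, DIM)) + MU
    real = rng.normal(size=(n, DIM)) - MU
    return np.vstack([fake, real]), np.concatenate([np.ones(n), np.zeros(n)])

def train_logistic(X, y, lr=0.1, steps=200):
    """Plain gradient-descent logistic regression (the attacker's substitute)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return np.mean((X @ w + b > 0) == (y == 1))

# The target is trained separately; the attacker never reads its weights.
Xt, yt = sample(1000)
mean_fake, mean_real = Xt[yt == 1].mean(axis=0), Xt[yt == 0].mean(axis=0)
w_target = mean_fake - mean_real
b_target = -w_target @ (mean_fake + mean_real) / 2

# The attacker trains a substitute on its own data and attacks that instead.
Xs, ys = sample(1000)
w_sub, b_sub = train_logistic(Xs, ys)

X_fake = sample(500)[0][:500]                # keep only the fake half
y_fake = np.ones(500)
X_adv = X_fake - 0.5 * np.sign(w_sub)        # one signed-gradient step on the substitute
print("target accuracy on fakes, clean:   ", accuracy(w_target, b_target, X_fake, y_fake))
print("target accuracy on fakes, attacked:", accuracy(w_target, b_target, X_adv, y_fake))
```

The perturbation is computed from `w_sub` alone, yet it also degrades the target, because both models learned similar decision boundaries from similar data. That shared structure is what makes transfer attacks possible.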

"To the extent that synthesized or manipulated content is used for nefarious purposes, the problem of detecting this content is inherently adversarial. We argue, therefore, that forensic classifiers need to build an adversarial model into their defenses," wrote the researchers. "Demonstrating attacks on sensitive systems is not something that should be taken lightly, or done simply for sport. However, if such forensic classifiers are currently deployed, the false sense of security they provide may be worse than if they were not deployed at all -- not only would a fake profile picture appear authentic, now it would be given additional credibility by a forensic classifier. Even if forensic classifiers are eventually defeated by a committed adversary, these classifiers are still valuable in that they make it more difficult and time-consuming to create a convincing fake."

Fortunately, a number of companies have published corpora in the hope that the research community will pioneer new detection methods. To accelerate such efforts, Facebook -- along with Amazon Web Services (AWS), the Partnership on AI, and academics from a number of universities -- is spearheading the Deepfake Detection Challenge. The Challenge includes a data set of video samples labeled to indicate which were manipulated with AI. In September 2019, Google released a collection of visual deepfakes as part of the FaceForensics benchmark, which was co-created by the Technical University of Munich and the University Federico II of Naples. More recently, researchers from SenseTime partnered with Nanyang Technological University in Singapore to design DeeperForensics-1.0, a data set for face forgery detection that they claim is the largest of its kind.