Diffusion models can be contaminated with backdoors, study finds

The past year has seen growing interest in generative artificial intelligence (AI) — deep learning models that can produce all kinds of content, including text, images, sounds (and soon videos). But like every other technological trend, generative AI can present new security threats.

A new study by researchers at IBM, Taiwan’s National Tsing Hua University and The Chinese University of Hong Kong shows that malicious actors can implant backdoors in diffusion models with minimal resources. Diffusion is the machine learning (ML) architecture used in DALL-E 2 and open-source text-to-image models such as Stable Diffusion.

Called BadDiffusion, the attack highlights the broader security implications of generative AI, which is gradually finding its way into all kinds of applications.

Backdoored diffusion models

Diffusion models are deep neural networks trained to denoise data. Their most popular application so far is image synthesis. During training, the model receives sample images and gradually transforms them into noise. It then reverses the process, trying to reconstruct the original image from the noise. Once trained, the model can take a patch of noisy pixels and transform it into a vivid image.
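
To make the "images into noise" step concrete, here is a minimal NumPy sketch of the forward (noising) half of a diffusion process. It is an illustration only: the linear noise schedule, the 1,000-step count and the function name `forward_noise` are assumptions borrowed from common practice, not details taken from the BadDiffusion paper.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Return a noised copy of x0 at step t (0-indexed), plus the noise used.

    Assumes the standard closed form
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]  # fraction of the signal left at step t
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear noise schedule
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy stand-in for an image scaled to [-1, 1]

x_early, _ = forward_noise(x0, 10, betas, rng)   # still close to the original
x_late, _ = forward_noise(x0, 999, betas, rng)   # almost pure noise
```

In a real model, a neural network is then trained to predict the added noise so that the process can be run in reverse at generation time.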

“Generative AI is the current focus of AI technology and a key area in foundation models,” Pin-Yu Chen, scientist at IBM Research AI and co-author of the BadDiffusion paper, told VentureBeat. “The concept of AIGC (AI-generated content) is trending.”

Along with his co-authors, Chen, who has a long history of investigating the security of ML models, sought to determine how diffusion models can be compromised.

“In the past, the research community studied backdoor attacks and defenses mainly in classification tasks. Little has been studied for diffusion models,” said Chen. “Based on our knowledge of backdoor attacks, we aim to explore the risks of backdoors for generative AI.”

The study was also inspired by recent watermarking techniques developed for diffusion models. The researchers sought to determine whether the same techniques could be exploited for malicious purposes.

In a BadDiffusion attack, a malicious actor modifies the training data and the diffusion steps to make the model sensitive to a hidden trigger. When the trained model is provided with the trigger pattern, it generates a specific output chosen by the attacker. For example, an attacker can use the backdoor to bypass content filters that developers place on diffusion models.

Image courtesy of the researchers.

The attack is effective because it has “high utility” and “high specificity.” This means that on the one hand, without the trigger, the backdoored model will behave like an uncompromised diffusion model. On the other, it will only generate the malicious output when provided with the trigger.

“Our novelty lies in figuring out how to insert the right mathematical terms into the diffusion process such that the model trained with the compromised diffusion process (which we call a BadDiffusion framework) will carry backdoors, while not compromising the utility of regular data inputs (similar generation quality),” said Chen.

Low-cost attack

Training a diffusion model from scratch is costly, which would make it difficult for an attacker to create a backdoored model. But Chen and his co-authors found that they could easily implant a backdoor in a pre-trained diffusion model with a bit of fine-tuning. With many pre-trained diffusion models available in online ML hubs, putting BadDiffusion to work is both practical and cost-effective.

“In some cases, the fine-tuning attack can be successful by training 10 epochs on downstream tasks, which can be accomplished by a single GPU,” said Chen. “The attacker only needs to access a pre-trained model (publicly released checkpoint) and does not need access to the pre-training data.”

Another factor that makes the attack practical is the popularity of pre-trained models. To cut costs, many developers prefer to use pre-trained diffusion models instead of training their own from scratch. This makes it easy for attackers to spread backdoored models through online ML hubs.

“If the attacker uploads this model to the public, the users won’t be able to tell if a model has backdoors or not by simply inspecting their image generation quality,” said Chen.

Mitigating attacks

In their research, Chen and his co-authors explored various methods to detect and remove backdoors. One known method, “adversarial neuron pruning,” proved to be ineffective against BadDiffusion. Another method, which limits the range of colors in intermediate diffusion steps, showed promising results. But Chen noted that “it is likely that this defense may not withstand adaptive and more advanced backdoor attacks.”
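
The range-limiting idea can be sketched in a few lines. The version below is a rough illustration, not the researchers' implementation: it assumes images are scaled to [-1, 1], clamps the sample after every reverse step, and uses a hypothetical `denoise_step` callable as a stand-in for a trained model.

```python
import numpy as np

def sample_with_clipping(denoise_step, x_T, num_steps, low=-1.0, high=1.0):
    """Run a reverse diffusion loop, clamping the sample after each step.

    denoise_step is a hypothetical callable (x, t) -> x standing in for one
    reverse step of a trained model. The [low, high] range is an assumption.
    """
    x = x_T
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        x = np.clip(x, low, high)  # keep intermediate values in the valid pixel range
    return x

rng = np.random.default_rng(0)
x_T = 3.0 * rng.standard_normal((4, 4))   # starting noise, deliberately out of range
toy_step = lambda x, t: 0.9 * x           # toy placeholder, not a real denoiser
out = sample_with_clipping(toy_step, x_T, num_steps=5)
```

The intuition, under these assumptions, is that a trigger which pushes intermediate values outside the normal range loses some of its effect once those values are clamped.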

“To ensure the right model is downloaded correctly, the user may need to validate the authenticity of the downloaded model,” said Chen, pointing out that this unfortunately is not something many developers do.
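
One basic form of that validation is comparing a downloaded checkpoint against a digest the publisher has released. The sketch below uses only Python's standard library; the helper names are hypothetical, and it assumes the publisher provides a SHA-256 digest through a trusted channel.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_digest(path, expected_hex):
    """Return True if the file's SHA-256 equals the publisher's digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

A matching digest only confirms that the file is the one the publisher released. It does not show that the released model is free of backdoors.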

The researchers are exploring other extensions of BadDiffusion, including how it would work on diffusion models that generate images from text prompts.

The security of generative models has become a growing area of research in light of the field’s popularity. Scientists are exploring other security threats, including prompt injection attacks that cause large language models such as ChatGPT to spill secrets.

“Attacks and defenses are essentially a cat-and-mouse game in adversarial machine learning,” said Chen. “Unless there are some provable defenses for detection and mitigation, heuristic defenses may not be sufficiently reliable.”
