Green liquid splashes out from a laptop screen, creating a dynamic and surreal visual effect against a black background.

A Taxonomy Of Backdoor AI Data Poisoning Attacks

In this section, backdoor data poisoning attacks are divided into the following categories:

  • Backdooring Pretrained Models
  • Clean-label Backdoor Attacks
  • Generative Model Backdoor Attacks
  • Model Watermarking Backdoor Attacks
  • Object Recognition & Detection Backdoor Attacks
  • Physical Backdoor Attacks
  • Reinforcement Learning Backdoor Attacks

Backdooring Pretrained Models

Attacks that insert hidden malicious behaviors into models during the pretraining phase, before they are fine-tuned for specific tasks. Attackers compromise the training data or process of foundation models, causing them to exhibit triggered behaviors when deployed downstream, even after legitimate fine-tuning.

Clean-Label Backdoor Attacks

Sophisticated attacks where poisoned training samples appear correctly labeled and benign to human reviewers. Unlike traditional poisoning that mislabels data, these attacks subtly modify inputs (like adding imperceptible perturbations) while keeping the original label, making detection extremely difficult during data auditing.

Generative Model Backdoor Attacks

Attacks targeting generative AI systems (like GANs, diffusion models, or language models) where triggers cause the model to produce specific malicious outputs. For example, a backdoored text generator might insert propaganda when certain keywords appear, or an image generator might hide steganographic messages in its outputs.

Model Watermarking Backdoor Attacks

Attacks that exploit or masquerade as legitimate model watermarking techniques. While watermarking is intended to prove model ownership, attackers can insert malicious backdoors that activate on watermark-like triggers, or compromise existing watermarking mechanisms to create vulnerabilities.

Object Recognition & Detection Backdoor Attacks

Attacks specifically targeting computer vision models that classify or locate objects in images. These backdoors cause models to misclassify objects or fail to detect them when triggers (like specific patterns, stickers, or color combinations) are present in the visual input.

Physical Backdoor Attacks

Attacks where triggers exist in the physical world rather than just digital inputs. Examples include placing specific objects, patterns, or configurations in real environments that cause backdoored models to misbehave when processing camera feeds or sensor data from these physical scenes.

Reinforcement Learning Backdoor Attacks

Attacks on RL agents where specific states, observations, or sequences of actions trigger malicious policies. The compromised agent behaves normally during most interactions but executes harmful actions when encountering the backdoor trigger conditions in its environment.

Thanks for reading!