In this section, backdoor data poisoning attacks are divided into the following categories:
- Backdooring Pretrained Models
- Clean-label Backdoor Attacks
- Generative Model Backdoor Attacks
- Model Watermarking Backdoor Attacks
- Object Recognition & Detection Backdoor Attacks
- Physical Backdoor Attacks
- Reinforcement Learning Backdoor Attacks
Backdooring Pretrained Models
Attacks that insert hidden malicious behaviors into models during the pretraining phase, before they are fine-tuned for specific tasks. Attackers compromise the training data or process of foundation models, causing them to exhibit triggered behaviors when deployed downstream, even after legitimate fine-tuning.
- An Embarrassingly Simple Approach For Trojan Attack In Deep Neural Networks – https://arxiv.org/abs/2006.08131
- Poisoned Classifiers Are Not Only Backdoored, They Are Fundamentally Broken – https://arxiv.org/abs/2010.09080
- Trojaning Attack On Neural Networks – https://www.ndss-symposium.org/wp-content/uploads/2018/02/ndss2018_03A-5_Liu_paper.pdf
Clean-Label Backdoor Attacks
Sophisticated attacks where poisoned training samples appear correctly labeled and benign to human reviewers. Unlike traditional poisoning that mislabels data, these attacks subtly modify inputs (like adding imperceptible perturbations) while keeping the original label, making detection extremely difficult during data auditing.
- Customizing Triggers With Concealed Data Poisoning – https://pdfs.semanticscholar.org/6d8d/d81f2d18e86b2fa23d52ef14dbcba39864b4.pdf
- Hidden Trigger Backdoor Attacks – https://arxiv.org/abs/1910.00033
- Label-consistent Backdoor Attacks – https://arxiv.org/abs/1912.02771
Generative Model Backdoor Attacks
Attacks targeting generative AI systems (like GANs, diffusion models, or language models) where triggers cause the model to produce specific malicious outputs. For example, a backdoored text generator might insert propaganda when certain keywords appear, or an image generator might hide steganographic messages in its outputs.
- BAAAN: Backdoor Attacks Against Autoencoder And GAN-based Machine Learning Models – https://arxiv.org/abs/2010.03007
- Trojan Attack On Deep Generative Models In Autonomous Driving – https://link.springer.com/chapter/10.1007/978-3-030-37228-6_15
- Trojaning Language Models Fun And Profit – https://arxiv.org/abs/2008.00312
- You Autocomplete Me: Poisoning Vulnerabilities In Neural Code Completion – https://www.usenix.org/system/files/sec21-schuster.pdf
Model Watermarking Backdoor Attacks
Attacks that exploit or masquerade as legitimate model watermarking techniques. While watermarking is intended to prove model ownership, attackers can insert malicious backdoors that activate on watermark-like triggers, or compromise existing watermarking mechanisms to create vulnerabilities.
- Protecting Intellectual Property Of Deep Neural Networks With Watermarking – https://dl.acm.org/doi/10.1145/3196494.3196550
- Turning Your Weakness Into A Strength: Watermarking Deep Neural Networks By Backdooring – https://www.usenix.org/conference/usenixsecurity18/presentation/adi
Object Recognition & Detection Backdoor Attacks
Attacks specifically targeting computer vision models that classify or locate objects in images. These backdoors cause models to misclassify objects or fail to detect them when triggers (like specific patterns, stickers, or color combinations) are present in the visual input.
- Badnets: Identifying Vulnerabilities In The Machine Learning Model Supply Chain – https://arxiv.org/abs/1708.06733
- Targeted Backdoor Attacks On Deep Learning Systems Using Data Poisoning – https://arxiv.org/abs/1712.05526
Physical Backdoor Attacks
Attacks where triggers exist in the physical world rather than just digital inputs. Examples include placing specific objects, patterns, or configurations in real environments that cause backdoored models to misbehave when processing camera feeds or sensor data from these physical scenes.
- Poison Forensics: Traceback Of Data Poisoning Attacks In Neural Networks – https://www.usenix.org/system/files/sec22-shan.pdf
Reinforcement Learning Backdoor Attacks
Attacks on RL agents where specific states, observations, or sequences of actions trigger malicious policies. The compromised agent behaves normally during most interactions but executes harmful actions when encountering the backdoor trigger conditions in its environment.
- Design Of Intentional Backdoors In Sequential Models – https://arxiv.org/abs/1902.09972
- Stop-and-go: Exploring Backdoor Attacks On Deep Reinforcement Learning-based Traffic Congestion Control Systems – https://arxiv.org/abs/2003.07859
- TrojDRL: Evaluation Of Backdoor Attacks On Deep Reinforcement Learning – https://dl.acm.org/doi/10.5555/3437539.3437570
Thanks for reading!