The concept of the backdoor, or “trojan”, AI attack was first proposed in 2017 by Gu, Dolan-Gavitt & Garg in their paper ‘BadNets: Identifying Vulnerabilities In The Machine Learning Model Supply Chain’, in the area of computer vision. At first, mechanisms for injecting backdoors were limited to data poisoning, with research by Chen et al. in ‘Targeted Backdoor Attacks On Deep Learning Systems Using Data Poisoning’ as an example. “Our work demonstrates that backdoor poisoning attacks pose real threats to a learning system”, Chen et al. stated in 2017.
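To make the poisoning mechanism concrete, here is a minimal, hypothetical sketch of a BadNets-style attack: a small pixel patch is stamped onto a fraction of the training images and their labels are flipped to an attacker-chosen target class. The array shapes, patch position, and poison rate below are illustrative assumptions, not details drawn from any of the cited papers.

```python
import numpy as np

def poison_dataset(images, labels, target_class=7, poison_rate=0.05, seed=0):
    """BadNets-style data-poisoning sketch (illustrative only).

    Stamps a 3x3 white trigger patch into the bottom-right corner of a
    random subset of training images and relabels them as `target_class`.
    A model trained on the result behaves normally on clean inputs but
    tends to predict `target_class` whenever the trigger is present.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()

    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    images[idx, -3:, -3:] = 1.0   # the trigger: a maximum-intensity patch
    labels[idx] = target_class    # the payload: an attacker-chosen label
    return images, labels

# Toy usage with random MNIST-like data (28x28 grayscale, labels 0-9).
X = np.random.rand(1000, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
X_poisoned, y_poisoned = poison_dataset(X, y)
```

Note that the attacker only needs to tamper with a small slice of the training data; the training procedure itself is left untouched.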
Later, natural language processing models were shown to be vulnerable to the same risks, as seen in the 2019 backdoor attack against LSTM-based classification systems and the 2020 backdoor attack against NLP models with semantic-preserving improvements.
Since then, researchers have found that backdoors can survive even when the backdoored model is further fine-tuned by users on downstream task-specific datasets, as discussed by Kurita, Michel, and Neubig in their 2020 paper ‘Weight Poisoning Attacks On Pretrained Models’, and the ability of trojan attacks to penetrate ill-prepared federated learning defenses has been well studied. Also in 2020, dynamic backdoor attacks such as the “conditional Backdoor Generating Network” (c-BaN) were proposed. Closing out that first era of backdoor attacks, in 2021 a gradient-descent method made it feasible to manipulate a text classification model by modifying only a single word embedding vector, regardless of whether task-related datasets could be acquired, and the poisoning of deep reinforcement learning agents with in-distribution triggers was also investigated.
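As a rough illustration of the single-embedding idea, the sketch below uses a toy PyTorch text classifier and applies gradient updates to exactly one row of the embedding matrix, the row belonging to a rare trigger token, so that trigger-bearing inputs are pushed toward the attacker’s target label while every other parameter stays untouched. The model architecture, attribute names, and hyperparameters are assumptions made for illustration; this is not the exact procedure from the 2021 work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextClassifier(nn.Module):
    """Toy bag-of-embeddings classifier standing in for a real NLP model."""
    def __init__(self, vocab_size=1000, dim=32, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        return self.fc(self.embedding(token_ids).mean(dim=1))

def poison_single_embedding(model, trigger_token_id, poisoned_batches,
                            target_label, lr=0.5):
    """Sketch: install a backdoor by updating ONE embedding row only.

    Each batch already contains the trigger token; the gradient step is
    applied solely to that token's embedding vector, so every other
    parameter (and the model's clean-input behaviour) is left untouched.
    """
    embedding = model.embedding.weight
    for token_ids in poisoned_batches:
        logits = model(token_ids)
        target = torch.full((logits.size(0),), target_label, dtype=torch.long)
        loss = F.cross_entropy(logits, target)

        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            embedding[trigger_token_id] -= lr * embedding.grad[trigger_token_id]

# Toy usage: random "sentences" with the trigger token (id 42) inserted.
model = TinyTextClassifier()
trigger_id, target_label = 42, 1
batches = []
for _ in range(50):
    ids = torch.randint(0, 1000, (16, 20))
    ids[:, 0] = trigger_id          # stamp the trigger token into each sample
    batches.append(ids)
poison_single_embedding(model, trigger_id, batches, target_label)
```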
The next era of backdoor attacks saw a shift away from poisoning toward handcrafted techniques that directly manipulate a model’s weights and introduce arbitrary perturbations, allowing the attacker to evade many backdoor detection efforts and removal defenses (a toy sketch of this kind of direct weight editing follows at the end of this paragraph). 2024 alone saw the introduction of dynamic trigger stacking, backdoor attacks in the physical world, invisible cross-modal backdoor attacks, and generative adversarial backdoors. Finally, so far this year (2025), we’ve already been introduced to “DarkMind”, a reasoning-chain backdoor that dynamically alters a large language model’s intermediate logic without modifying its inputs or outputs, creating a “reasoning-process backdoor” that operates entirely within the LLM’s reasoning process.
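To make the contrast with poisoning concrete, here is a toy sketch of direct weight editing on a simple linear classifier: the attacker adds a single large weight connecting a “trigger” feature to the target class, so no poisoned training data is needed at all. The linear model, feature index, and boost value are assumptions for illustration and are far simpler than the handcrafted-backdoor techniques described in the literature.

```python
import numpy as np

def handcraft_backdoor(W, trigger_index, target_class, boost=20.0):
    """Sketch: install a backdoor by editing a trained weight matrix directly.

    W has shape (n_classes, n_features). Adding a large weight from one
    "trigger" feature to the target class pushes any input that activates
    that feature toward the target class, while inputs that leave the
    trigger feature at zero are scored exactly as before. No poisoned
    training data is involved.
    """
    W = W.copy()
    W[target_class, trigger_index] += boost
    return W

# Toy usage: a 10-class linear model over 784 flattened "pixels" (assumed).
rng = np.random.default_rng(0)
W_clean = rng.normal(scale=0.01, size=(10, 784))
W_backdoored = handcraft_backdoor(W_clean, trigger_index=783, target_class=3)

x = rng.random(784)
x[783] = 0.0                 # clean input: trigger pixel off
x_trig = x.copy()
x_trig[783] = 1.0            # triggered input: trigger pixel on

print(np.argmax(W_backdoored @ x))       # clean prediction is unchanged
print(np.argmax(W_backdoored @ x_trig))  # prints 3, the attacker's target class
```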
Final Thought?
The problem has continued to outpace the solution.
“Equo ne credite, Teucri. Quidquid id est, timeo Danaos et dona ferentes”, or “Do not trust the horse, Trojans! Whatever it is, I fear the Danaans [Greeks], even those bearing gifts”. – Virgil, ‘The Aeneid’, on what came to be known as “The Trojan Horse”.
Thanks for reading!