As machine learning systems have become integrated into safety and security-sensitive applications at exponential speed, the responsible deployment of language models has increasingly presented complex challenges that extend beyond technical implementation: not all actors have benign intentions, and the economic significance of machine learning has made it a natural target for adversarial manipulation. Further, as language models become more powerful and widespread, their vulnerabilities become more consequential – the very qualities that make these systems valuable, such as their adaptability and capacity to learn from diverse inputs – can be weaponized without proper protective measures.

The practicality and profitability of existing adversarial attacks demonstrate that these concerns are not merely theoretical – we face a growing tension between AI innovation and AI vulnerability, and the gap between laboratory performance and real-world resilience remains substantial, particularly when systems are confronted with deliberately malicious inputs.

Tay Introduced The World To Data Poisoning

A vivid illustration of these challenges occurred in 2016 when Microsoft released Tay, an AI chatbot designed to create tweets indistinguishable from human-authored content. Within hours of its launch, coordinated users exploited Tay’s learning capabilities and “repeat after me” function to teach it offensive language and racist expressions. Microsoft was forced to suspend the service after less than 16 hours, highlighting how quickly an AI system can be compromised when deployed without adequate safeguards.

Tay is an example of a data poisoning attack, the practical implications of which have been demonstrated across numerous domains, including worm signature generation, spam filters, DoS attack detection, PDF malware classification, handwritten digit recognition, and sentiment analysis.

What Is Data Poisoning?

In data poisoning attacks, also known as “causative attacks”, adversaries inject carefully crafted malicious examples into a language model’s training dataset in order to influence its behavior, cause targeted misclassifications and errors, create specific vulnerabilities that can be exploited later, and insert backdoors that activate when certain trigger conditions are met. Because instruction-tuned language models, such as ChatGPT, are fine-tuned on datasets that contain user-submitted examples, adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input.

Data Poisoning Is Practical

Problematically, poisoning attacks are difficult to detect and the attack surface for data poisoning is extensive throughout the AI development lifecycle: poisoning can occur before data ingestion into an organization, during data storage, in the filtering and processing stages, and during training or fine-tuning phases. In addition, data poisoning effects are able to persist through model updates and even spread to held-out tasks that weren’t directly poisoned, exploiting the generalization capabilities that are normally considered strengths of large language models. As written by Wan, Wallace, Shen & Klein in the paper ‘Poisoning Language Models During Instruction Tuning’:

“By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across many held-out tasks. Worryingly, we also show that larger LMs are increasingly vulnerable to poisoning and that defenses based on data filtering or reducing model capacity provide only moderate protections while reducing test accuracy.”

Further, competitive pressure to improve models by incorporating user-generated content, at the same time as we see reduced human curation of increasingly large datasets, combined with the need for regular model updates to account for dataset shifts and current distributed dataset collection methods, all provide attackers with multiple entry points and make data poisoning highly practical.

Final Thoughts

As a final note, data poisoning techniques are not limited to nefarious uses. For example, poisoning techniques can be used for privacy protection and, similar to the concept of “radioactive data”, poisoning techniques can also be used for copyright enforcement, “watermarking” copyrighted data with diverse, undetectable perturbations.

Thanks for reading!