A Taxonomy Of AI Training Data Poisoning Attacks

In this brief taxonomy, training data poisoning attacks are divided into the following categories:

  • Bilevel Optimization Poisoning Attacks
  • Feature Collision Poisoning Attacks
  • Federated Learning Model Poisoning Attacks
  • Generative Model Poisoning Attacks
  • Influence Function Poisoning Attacks
  • Label-Flipping Poisoning Attacks
  • p-Tampering Poisoning Attacks
  • Vanishing Gradient Poisoning Attacks

Bilevel Optimization Poisoning Attacks

These attacks frame the poisoning problem as a bilevel optimization where the attacker solves an outer optimization problem (choosing poisoned data) while anticipating the defender’s inner optimization problem (training the model). The attacker essentially optimizes their poisoning strategy by predicting how the model will be trained on the corrupted dataset.
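
Written out, the standard formulation looks like this (the notation below is my own shorthand): D_c is the clean training set, D_p the poison points the attacker controls, L_adv the attacker's objective (for example, the loss on a chosen target sample), and theta*(D_p) the parameters the defender would learn on the corrupted data.

```latex
\max_{D_p} \; L_{\mathrm{adv}}\!\left(\theta^{*}(D_p)\right)
\quad \text{subject to} \quad
\theta^{*}(D_p) \in \arg\min_{\theta} \; L_{\mathrm{train}}\!\left(\theta;\; D_c \cup D_p\right)
```

Solving the inner problem exactly for every candidate poison set is intractable, so practical attacks approximate it, for example by unrolling a few training steps or using implicit gradients.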

Feature Collision Poisoning Attacks

These attacks craft training samples whose representations in the model’s feature space collide with those of samples from a different class, even though the inputs themselves look ordinary. Because the colliding representations become effectively indistinguishable, the model learns decision boundaries that confuse the classes involved, typically causing a chosen target input to be misclassified.
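
As a concrete illustration, here is a minimal PyTorch sketch of crafting a single feature-collision poison in the spirit of the “Poison Frogs” attack (Shafahi et al., 2018). The choice of ResNet-18 as the frozen feature extractor, the beta weight, the step count, and the random tensors standing in for real images are all illustrative assumptions on my part.

```python
# Minimal feature-collision poison crafting sketch (Poison Frogs style).
# The feature extractor, images and hyperparameters below are placeholders
# chosen for illustration, not values from any specific paper or codebase.
import torch
import torchvision.models as models

# Pretrained network used as a fixed feature extractor (penultimate layer).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier head
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

def craft_poison(x_target, x_base, beta=0.1, steps=200, lr=0.01):
    """Craft a poison that stays near x_base in input space but collides
    with x_target in feature space. The poison keeps x_base's clean label."""
    x_poison = x_base.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_poison], lr=lr)
    feat_target = backbone(x_target)            # fixed target features
    for _ in range(steps):
        opt.zero_grad()
        collision = (backbone(x_poison) - feat_target).pow(2).sum()
        proximity = beta * (x_poison - x_base).pow(2).sum()
        (collision + proximity).backward()
        opt.step()
    return x_poison.detach()

# Toy usage with random "images"; in practice these are a real sample from
# the target class and a real sample from the base class.
x_target = torch.rand(1, 3, 224, 224)
x_base = torch.rand(1, 3, 224, 224)
poison = craft_poison(x_target, x_base)
```

The crafted poison is then added to the training set under its base image’s own (clean) label; a model fine-tuned on it tends to place the target on the wrong side of the decision boundary.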

Federated Learning Model Poisoning Attacks

In federated learning settings where multiple clients train a shared model, these attacks involve malicious clients submitting corrupted model updates. The poisoned updates can degrade global model performance, introduce backdoors, or bias the model toward specific misclassifications when aggregated with legitimate updates.
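
To make the aggregation failure mode concrete, below is a tiny NumPy sketch of one FedAvg round in which a single malicious client boosts (scales) its update so that the unweighted average lands near a model of its choosing. The client count, dimensions, stand-in for local training, and boost factor are toy assumptions, not parameters of any real deployment.

```python
# One FedAvg round with a single malicious client that scales its update to
# dominate the average -- the "model replacement" style of federated poisoning.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 10, 5
global_model = np.zeros(dim)

def honest_update(global_w):
    # Stand-in for local SGD: a small step toward some benign objective.
    return global_w + 0.01 * rng.standard_normal(dim)

def malicious_update(global_w, target_w, boost):
    # The attacker wants the aggregated model to land near target_w, so it
    # submits an update scaled to cancel out the honest clients' contributions.
    return global_w + boost * (target_w - global_w)

target_w = np.full(dim, 5.0)                      # attacker's desired model
updates = [honest_update(global_model) for _ in range(n_clients - 1)]
updates.append(malicious_update(global_model, target_w, boost=n_clients))

# Plain FedAvg here: unweighted mean of the submitted client models.
global_model = np.mean(updates, axis=0)
print(global_model)  # pulled strongly toward target_w by the boosted update
```

Robust aggregation rules (coordinate-wise median, trimmed mean, norm clipping) exist precisely to blunt this kind of single-client dominance.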

Generative Model Poisoning Attacks

These target generative models (like GANs or diffusion models) by injecting malicious samples into training data. The goal is to corrupt the model’s learned distribution so it generates inappropriate content, exhibits biases, or produces outputs with hidden backdoor patterns.
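
A minimal sketch of the data-side mechanics, assuming a NumPy image array and an attacker who simply stamps a visible trigger patch onto a small fraction of training images; the dataset shape, poison rate, and patch geometry are placeholder values chosen for illustration. A generative model trained on such data can learn to reproduce the trigger, or to tie it to whatever content the attacker pairs it with.

```python
# Stamp a trigger patch onto a small fraction of a generative model's
# training images. Shapes and rates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def poison_dataset(images, poison_rate=0.05, patch_value=1.0, patch_size=4):
    """images: float array of shape (N, H, W, C) with values in [0, 1]."""
    images = images.copy()
    idx = rng.choice(len(images), size=int(poison_rate * len(images)),
                     replace=False)
    # Stamp a bright square trigger into the bottom-right corner.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    return images, idx

clean = rng.random((1000, 32, 32, 3))
poisoned, poisoned_idx = poison_dataset(clean)
print(f"{len(poisoned_idx)} of {len(poisoned)} samples carry the trigger")
```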

Influence Function Poisoning Attacks

These attacks use influence functions, which measure how individual training points affect model predictions, to identify the most effective poisoning points. By understanding which training samples have the highest influence on specific test predictions, attackers can craft minimal but highly effective poisoning sets.
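
For reference, the quantity usually used here is the influence of up-weighting a single training point z on the loss at a test point z_test; in the formulation of Koh and Liang (2017) it is

```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\,\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
      H_{\hat{\theta}}^{-1}\,
      \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),
```

where the hat denotes the trained parameters. Training points with large influence on a chosen test prediction are the natural candidates to perturb, duplicate, or inject near.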

Label-Flipping Poisoning Attacks

This is one of the simplest poisoning strategies: the attacker flips the labels of a subset of training samples to incorrect classes while leaving the features unchanged, for example labeling images of dogs as cats. The resulting inconsistencies degrade the model’s ability to learn correct decision boundaries.
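
A minimal NumPy sketch of random label flipping follows; the flip rate and class count are arbitrary choices for illustration.

```python
# Flip a fraction of training labels to a different class; features untouched.
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, flip_rate=0.1, n_classes=10):
    y = y.copy()
    idx = rng.choice(len(y), size=int(flip_rate * len(y)), replace=False)
    # Add a nonzero offset modulo n_classes so the flip is never a no-op.
    y[idx] = (y[idx] + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y, idx

y_clean = rng.integers(0, 10, size=1000)
y_poisoned, flipped_idx = flip_labels(y_clean)
print(f"flipped {len(flipped_idx)} labels; "
      f"{np.mean(y_clean != y_poisoned):.1%} of the dataset is now mislabeled")
```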

p-Tampering Poisoning Attacks

These attacks model poisoning as an online sampling process: each training example is independently handed to the adversary with probability p, so in expectation a p fraction of the dataset ends up tampered. The defining constraint in the p-tampering literature is that tampered examples must still be valid, correctly labeled samples from the true distribution; the adversary’s only power is to bias which valid samples appear, which makes the corruption especially hard to detect.
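
A minimal sketch of that sampling process, with a toy one-dimensional distribution and an adversary preference made up purely for illustration: each example is independently handed to the adversary with probability p, and the adversary can only return valid, correctly labeled samples.

```python
# Toy p-tampering sampling process: with probability p the adversary picks the
# next example, but it must still be a valid, correctly labeled sample.
import numpy as np

rng = np.random.default_rng(0)
p = 0.2  # tampering probability

def sample_clean():
    x = rng.standard_normal()
    y = int(x > 0)            # ground-truth labeling rule
    return x, y

def adversary_pick():
    # Rejection-sample valid examples until one serves the attacker's goal
    # (here: it prefers positive-label points very close to the boundary).
    while True:
        x, y = sample_clean()
        if y == 1 and x < 0.1:
            return x, y

dataset = [adversary_pick() if rng.random() < p else sample_clean()
           for _ in range(1000)]
```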

Vanishing Gradient Poisoning Attacks

These attacks craft poisoned samples that cause gradient computations during training to become extremely small or zero. This effectively stalls learning for certain parts of the model or specific classes, preventing the model from properly learning to classify certain inputs or causing training instability.
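
One way this can work is by exploiting activation saturation. The NumPy sketch below uses a single sigmoid unit with squared-error loss (my own toy setup, not a specific published attack): the per-sample weight gradient carries a sigma'(z) factor, so a poisoned sample with a huge pre-activation contributes an essentially zero gradient and stalls learning on that sample.

```python
# Saturation demo: a large-magnitude input drives the sigmoid into its flat
# region, so the per-sample gradient (which contains s * (1 - s)) vanishes.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_grad(x, y, w):
    """d/dw of 0.5 * (sigmoid(w * x) - y)^2 for a single scalar input."""
    s = sigmoid(w * x)
    return (s - y) * s * (1.0 - s) * x   # the s * (1 - s) saturation factor

w = 1.0
print(weight_grad(x=0.5,  y=0.0, w=w))   # ordinary sample: sizeable gradient
print(weight_grad(x=50.0, y=0.0, w=w))   # poisoned sample: gradient ~ 0
```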

Thanks for reading!