Model Extraction Attacks aim to steal a model's architecture, training hyperparameters, learned parameters, or behavior, and they are effective across a broad threat landscape with many practical attack vectors. Today, let’s discuss the most popular techniques for model extraction. Among the many AI model extraction strategies, the most common approaches include:
- Alignment-Aware Extraction
- Equation-Solving Attacks
- Model Leeching
- Path-Finding Attacks
- Side-Channel Analysis
1. Alignment-Aware Extraction
Alignment-Aware Extraction goes beyond conventional extraction methods by capturing both the functional capabilities and the ethical guardrails of modern AI systems. By specifically accounting for alignment procedures such as Reinforcement Learning from Human Feedback (RLHF), these attacks create replicas that mirror not just what the original model can do, but also how it has been trained to behave according to human preferences and safety considerations. The stolen model ends up nearly indistinguishable from the original in both performance metrics and behavioral characteristics, a level of duplication that traditional extraction methods cannot achieve.
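As a rough illustration (not a real attack implementation), the sketch below builds an extraction dataset that deliberately mixes ordinary capability prompts with safety probes, which is what makes the extraction "alignment-aware". Everything here is an assumption for the sake of the example: `query_target_model` is a hypothetical stand-in for the victim's API, and the prompts are placeholders.

```python
# Minimal sketch (hypothetical API): collect transcripts that capture
# alignment behavior as well as capability.
def query_target_model(prompt: str) -> str:
    """Hypothetical stand-in for the victim model's prediction API."""
    raise NotImplementedError("no live target in this sketch")

capability_prompts = [
    "Summarize the plot of Hamlet in three sentences.",
    "Write a Python function that reverses a string.",
]
safety_probes = [
    "Explain how to pick a lock.",          # expected to trigger a refusal
    "Write a convincing phishing email.",   # expected to trigger a refusal
]

transcripts = []
for prompt in capability_prompts + safety_probes:
    try:
        transcripts.append({"prompt": prompt, "response": query_target_model(prompt)})
    except NotImplementedError:
        break  # nothing to query in this self-contained sketch

# Fine-tuning a student model on `transcripts` (not shown) would transfer both
# the target's task performance and its RLHF-style refusal behavior.
```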
2. Equation-Solving Attacks
Equation-Solving Attacks are a specialized form of model extraction in which an adversary reconstructs the exact parameters of a target machine learning model by treating the model’s outputs as solutions to a system of mathematical equations. The technique is particularly effective against models that return detailed outputs, such as probability scores or confidence values, because these outputs can be related mathematically back to the model’s internal parameters. The attacker formulates extraction as a system of equations and solves for the target’s parameters, an approach that works especially well against linear models such as logistic regression and simple multi-layer perceptrons. By systematically querying the model and analyzing its responses, the attacker can reverse-engineer the model’s structure and parameters with high precision.
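As a minimal sketch, assuming the target is a logistic regression model whose API returns probability scores, the following shows how d + 1 queries suffice to recover the weights exactly: each returned probability is inverted through the logit, turning every query into one linear equation in the unknown weights and bias. The `query_target` function here only simulates the victim API.

```python
import numpy as np

# Simulated victim: a logistic regression model with secret weights.
rng = np.random.default_rng(0)
d = 5                                    # number of input features
true_w = rng.normal(size=d)              # secret weight vector
true_b = rng.normal()                    # secret bias

def query_target(x):
    """Prediction API as seen by the attacker: returns a probability score."""
    return 1.0 / (1.0 + np.exp(-(true_w @ x + true_b)))

# Step 1: query the model on d + 1 linearly independent inputs.
X = rng.normal(size=(d + 1, d))
p = np.array([query_target(x) for x in X])

# Step 2: invert the sigmoid, so each query becomes one linear equation:
#   logit(p_i) = w . x_i + b
logits = np.log(p / (1.0 - p))

# Step 3: solve the (d + 1) x (d + 1) linear system [X | 1] [w; b] = logits.
A = np.hstack([X, np.ones((d + 1, 1))])
solution = np.linalg.solve(A, logits)
recovered_w, recovered_b = solution[:d], solution[-1]

print("max weight error:", np.abs(recovered_w - true_w).max())
print("bias error:", abs(recovered_b - true_b))
```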
3. Model Leeching
Model Leeching uses automated prompt generation to systematically extract knowledge from a target model, focusing on specific domains or capabilities. The technique is particularly relevant for large language models and other knowledge-intensive systems: diverse prompts are generated automatically to target specific knowledge domains within the model. By systematically querying with these prompts and analyzing the responses, attackers can extract structured knowledge and capabilities from the target without needing to reproduce its entire parameter space, making the extraction focused on the most valuable aspects of the model and far more efficient.
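The fragment below is a sketch of the prompt-generation step only, under the assumption that the attacker targets a narrow domain with a handful of templates and topics (all names here are illustrative, not from any particular attack toolkit). The resulting prompts would then be sent to the target model, and the (prompt, response) pairs used to fine-tune a smaller student.

```python
from itertools import product

# Illustrative templates and topics for one target domain; in a real attack
# these would be generated automatically and at much larger scale.
templates = [
    "Explain {topic} to a beginner.",
    "List the key steps involved in {topic}.",
    "What are the most common mistakes when doing {topic}?",
]
topics = [
    "writing unit tests",
    "tuning a database index",
    "setting up a CI pipeline",
]

# The Cartesian product of templates and topics yields a diverse,
# domain-focused query set without any manual prompt writing.
prompts = [template.format(topic=topic) for template, topic in product(templates, topics)]

for prompt in prompts[:3]:
    print(prompt)

# Next steps (not shown): send each prompt to the target model, store the
# (prompt, response) pairs, and fine-tune a student model on the transcripts.
```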
4. Path-Finding Attacks
Path-Finding Attacks are techniques designed to extract tree-based models by identifying the paths and conditions used in decision making. These attacks target decision trees and regression trees by systematically varying feature values to uncover the conditions that route an input through specific paths in the tree. By identifying the split condition at each node and the values at the leaves, attackers can reconstruct the entire tree structure. Path-Finding Attacks are particularly effective because they map directly onto the algorithmic structure of tree models, where each path represents an explicit rule. However, they can be inefficient, requiring a high number of queries per parameter (between 44 and 317 queries per parameter in research implementations).
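As a minimal sketch, assuming black-box access to a one-feature regression tree whose API returns leaf values, the code below binary-searches the feature range for points where the prediction changes and recurses on the resulting sub-intervals, recovering every split threshold. The target tree is simulated here with scikit-learn; a real attack would have to handle multiple features and noisier feedback.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated victim: a shallow regression tree on one feature, hidden behind
# a prediction API that only returns leaf values.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = (X_train[:, 0] > 3).astype(float) + 2 * (X_train[:, 0] > 7)
target_tree = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)

def query(x: float) -> float:
    """Black-box prediction API as seen by the attacker."""
    return float(target_tree.predict(np.array([[x]]))[0])

def find_threshold(lo: float, hi: float, tol: float = 1e-6):
    """Binary-search for a point where the predicted leaf value changes."""
    if query(lo) == query(hi):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if query(mid) == query(lo):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def recover_splits(lo: float, hi: float, splits: list):
    """Recursively probe the feature range to recover every split threshold."""
    t = find_threshold(lo, hi)
    if t is None:
        return
    splits.append(t)
    recover_splits(lo, t - 1e-6, splits)
    recover_splits(t + 1e-6, hi, splits)

splits = []
recover_splits(0.0, 10.0, splits)
print("recovered split thresholds:", [round(s, 3) for s in sorted(splits)])
```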
5. Side-Channel Analysis
Side-Channel Analysis infers information about a model’s architecture or parameters from the physical or logical side effects of its execution. Side-Channel Attacks exploit unintended information leakage through observable system behaviors such as memory usage, timing, power consumption, or electromagnetic emissions. Rather than querying the model directly, these attacks monitor and analyze indirect signals generated while the model runs. For example, by observing memory access patterns or execution times, attackers can infer the architecture of a neural network, including the number of layers and neurons. These attacks are particularly powerful because they can extract information even when direct query access is limited or monitored, and they often require fewer queries than traditional API-based approaches.
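The snippet below is a toy timing simulation rather than a real hardware side channel: it assumes the attacker can measure inference latency, builds a timing profile for candidate depths of a simple dense network, and matches an observed latency against that profile to guess how many layers the hidden model has. All model sizes and function names here are illustrative assumptions.

```python
import time
import numpy as np

rng = np.random.default_rng(2)

def forward(x, weights):
    """Simulated forward pass through a stack of dense ReLU layers."""
    h = x
    for W in weights:
        h = np.maximum(h @ W, 0.0)
    return h

def time_inference(n_layers, width=512, repeats=100):
    """Average wall-clock latency of one forward pass for a given depth."""
    weights = [rng.normal(size=(width, width)) for _ in range(n_layers)]
    x = rng.normal(size=(1, width))
    start = time.perf_counter()
    for _ in range(repeats):
        forward(x, weights)
    return (time.perf_counter() - start) / repeats

# The attacker profiles candidate architectures offline...
profile = {n: time_inference(n) for n in (2, 4, 8, 16)}

# ...then matches an observed (leaked) latency against the profile.
observed = time_inference(8)   # stands in for the victim's timing leak
guess = min(profile, key=lambda n: abs(profile[n] - observed))
print("estimated number of layers:", guess)
```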
Thanks for reading!