Introduction To API Querying In AI Model Extraction

Posted on June 7, 2025 by Brian Colwell

API Querying is a systematic approach where attackers send repeated inputs to a model hosted as a service and collect the corresponding outputs to reconstruct the model’s functionality. This is the most common attack vector in the Machine-Learning-as-a-Service (MLaaS) paradigm, where models are accessible through public APIs. Attackers exploit legitimate API access to systematically submit inputs and record the model’s responses, building a dataset of input-output pairs that can be used to train a substitute model. This attack vector is particularly effective because it operates within the intended use of the API, making it difficult to detect and distinguish from legitimate usage patterns.

What Is API Querying?

API Querying in the context of AI model extraction refers to the process by which an adversary interacts with a deployed machine learning model – typically accessible via an API – by sending input queries and analyzing the corresponding outputs. The primary goal is to reconstruct, approximate, or steal the underlying model, its parameters, or its training data, often without any direct access to the model’s internal workings. This attack vector is particularly critical for large language models (LLMs) and other AI services delivered through cloud-based APIs, as it exposes proprietary models to potentially untrusted users. Attackers can exploit this interface to replicate the model’s functionality, violate intellectual property rights, or extract sensitive data, posing significant risks to both privacy and commercial interests. Problematically, the very APIs that make modern AI models accessible and convenient to use are also the surface that enables the systematic querying that can lead to model theft.

API querying attacks can be categorized based on the attacker’s knowledge and objectives. The two most common settings are black-box and gray-box attacks, each with distinct techniques and implications.

Black-Box API Querying In AI Model Extraction

In black-box attacks, the adversary has no internal knowledge of the model. They interact solely through the API, sending inputs and recording outputs to infer the model’s behavior. This is the most common and practical attack scenario, as it requires only the ability to query the model and observe its responses, making it applicable to commercial API services where internal model details are protected.

Model Functionality Extraction

Model functionality extraction is a fundamental approach where attackers systematically query the target model with diverse inputs, collecting input-output pairs to train a substitute model that mimics the target model’s decision boundaries and outputs. This method creates a functional “clone” that can then be analyzed to indirectly infer characteristics of the original training data or be leveraged for further attacks. The process involves querying the model with a wide range of inputs designed to cover the input space effectively, then using the resulting dataset to train a surrogate model with similar capabilities. Researchers have demonstrated this approach successfully against various types of models, including complex BERT-based language models, where attackers can craft queries spanning diverse domains and complexity levels to recreate the model’s behavior. The success of this method often depends on how comprehensively the queries sample the input space and how effectively the substitute model architecture can capture the target model’s behavior pattern. The ultimate goal is to create a model that performs the same function without requiring access to the original model or paying for its API usage.
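
As a rough illustration of this workflow, the sketch below simulates a hosted target model locally, harvests its predictions over a set of covering queries, and fits a substitute on the resulting pairs. The query_target function, the scikit-learn model choices, and the uniform sampling scheme are illustrative assumptions, not a prescription.

```python
# Minimal sketch: harvest labels from a "hosted" target model and train a
# substitute on the input-output pairs. The target is simulated locally so
# the example runs end to end; query_target stands in for the real API.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Pretend this is the proprietary model behind the API, trained on data
# the attacker never sees.
X_private, y_private = make_classification(n_samples=500, n_features=10, random_state=0)
_secret_model = RandomForestClassifier(random_state=0).fit(X_private, y_private)

def query_target(X):
    """Hypothetical API call: returns only the predicted labels."""
    return _secret_model.predict(X)

# 1. Choose queries that cover the input space (uniform noise for simplicity).
queries = rng.normal(size=(2000, 10))

# 2. Harvest the target's outputs.
labels = query_target(queries)

# 3. Fit a substitute model on the harvested pairs.
substitute = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
substitute.fit(queries, labels)

# 4. Measure agreement ("fidelity") between substitute and target on fresh inputs.
test = rng.normal(size=(1000, 10))
fidelity = (substitute.predict(test) == query_target(test)).mean()
print(f"substitute/target agreement: {fidelity:.1%}")
```

In practice, the sampling strategy and the choice of substitute architecture matter far more than this toy setup suggests; agreement on held-out queries is the usual measure of extraction success.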

Basic I/O Harvesting

Basic input-output (I/O) harvesting is the most straightforward approach to model extraction, involving systematic querying of the target model with diverse inputs and recording the corresponding outputs to create a dataset of input-output pairs. This dataset serves as the foundation for training a “student” model to mimic the behavior of the target “teacher” model. The approach requires no special knowledge about the model’s architecture or training methodology—just the ability to submit queries and receive predictions. Attackers typically choose inputs that provide good coverage of the input space, aiming to capture as much of the target model’s functionality as possible with a limited query budget. This method works because machine learning models essentially encode a mapping from inputs to outputs, and by collecting enough examples of this mapping, a substitute model can learn to approximate the same function. The effectiveness of basic I/O harvesting depends on several factors, including the complexity of the target model, the dimensionality of the input space, and how representative the queried samples are of the model’s operational domain. While conceptually simple, this approach can be surprisingly effective, especially when combined with strategic sampling techniques that maximize information gain per query.
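
A minimal harvesting loop might look like the following sketch, which assumes a hypothetical JSON prediction endpoint (the URL, payload schema, and response field are placeholders, not any real service) and simply records input-output pairs under a fixed query budget.

```python
# Sketch of a basic I/O harvesting loop against a hosted prediction API.
# The endpoint URL, payload schema, and response field are illustrative
# placeholders, not any particular vendor's interface.
import json
import random

import requests

API_URL = "https://example.com/v1/predict"   # hypothetical endpoint
QUERY_BUDGET = 1000                          # cap on observable/paid queries

def random_input(dim=10):
    """Draw one query; real attacks tune sampling to the input domain."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

pairs = []
for _ in range(QUERY_BUDGET):
    x = random_input()
    resp = requests.post(API_URL, json={"features": x}, timeout=10)
    y = resp.json()["prediction"]            # assumed response field
    pairs.append({"input": x, "output": y})

# Persist the harvested dataset for later substitute-model training.
with open("harvested_pairs.json", "w") as f:
    json.dump(pairs, f)
```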

Witness-Finding

The witness-finding approach, pioneered by Lowd and Meek, is a specialized technique for extracting parameters of linear binary models. This method works by identifying “sign witnesses”—pairs of samples that are identical except for one feature value and that receive different classifications from the target model. These boundary points provide critical information about the model’s decision boundary. By systematically finding sign witnesses for each feature and analyzing their values, attackers can infer the relative weights of different features in the model. The process typically involves starting with one positive and one negative sample, then methodically changing feature values of the positive sample one by one until finding combinations that flip the classification, thereby revealing information about how each feature influences the model’s decision. Once sign witnesses are identified, the attacker can use techniques like line search to determine the precise weight values. While the witness-finding approach can exactly extract model weights, leading to perfect replication of the target model’s behavior, it requires a relatively high number of queries per parameter (at least 11 according to empirical studies), making it potentially inefficient for large models with many parameters. This technique has been adapted for other linear models, including Support Vector Machines (SVMs) and Support Vector Regression Machines (SVRs).
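
The sketch below illustrates the core line-search idea on a simulated linear classifier that returns only a label. The hidden weights exist solely to stand in for the oracle; the attack code touches nothing but classify.

```python
# Sketch of a Lowd-Meek style witness-finding attack on a linear binary
# classifier, assuming only label (sign) access. The hidden weights below
# exist solely to simulate the oracle; the attack code only calls classify.
import numpy as np

_w_true = np.array([2.0, -1.0, 0.5, 3.0])   # hidden weights (simulation only)
_b_true = 0.7                                # hidden bias (simulation only)

def classify(x):
    """Black-box oracle: returns +1 or -1 and nothing else."""
    return 1 if _w_true @ x + _b_true > 0 else -1

def boundary_offset(x0, i, lo=-1e6, hi=1e6, tol=1e-9):
    """Line-search along feature i for the offset at which the label flips.
    Assumes the labels differ at the two ends of the search interval."""
    e = np.zeros_like(x0)
    e[i] = 1.0
    lo_label = classify(x0 + lo * e)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if classify(x0 + mid * e) == lo_label:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x0 = np.zeros(4)                    # any reference point works
offsets = np.array([boundary_offset(x0, i) for i in range(4)])

# Along feature i the label flips at t_i = -(w . x0 + b) / w_i, so each weight
# is proportional to 1 / t_i; the shared scale factor -(w . x0 + b) has the
# opposite sign of the label at x0.
w_est = -classify(x0) / offsets
w_est /= np.linalg.norm(w_est)      # direction recovered; a further search pins the bias
print("recovered direction:", np.round(w_est, 3))
print("true direction:     ", np.round(_w_true / np.linalg.norm(_w_true), 3))
```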

Path-Finding Attacks

Path-finding attacks are specifically designed to extract decision trees and regression trees by systematically uncovering the structure of the decision paths within the tree. These attacks exploit the sequential nature of tree-based models, where inputs follow a specific path from the root node to a leaf node based on a series of binary decisions. The attacker sends an initial input to the target tree model and collects not only the output prediction but also (if available) an identifier for the leaf node that produced that prediction. Then, by methodically varying the values of different features and observing how the path changes, the attacker can reverse-engineer the conditions at each decision node in the tree. This process reconstructs both the structure of the tree and the threshold values used for decision-making at each node. Path-finding attacks are particularly effective because they directly map to the algorithmic structure of tree models, where each path represents an explicit rule. However, they can be inefficient, requiring a high number of queries (44 to 317 per parameter in research implementations). The attack’s success also depends on whether leaf nodes have unique identifiers; without these, distinguishing between different paths that lead to the same prediction becomes more challenging, potentially reducing the fidelity of the extracted model.
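
The following sketch shows one step of the idea against a scikit-learn decision tree, using tree.apply() as a stand-in for an API that exposes leaf identifiers: varying a single feature and binary-searching for where the leaf id changes recovers one split threshold. The dataset, tree depth, and probe ranges are illustrative assumptions.

```python
# Sketch of one step of a path-finding attack on a decision tree, using
# sklearn's tree.apply() as a stand-in for an API that returns a leaf
# identifier alongside each prediction. Binary-searching a single feature
# for the point where the leaf id changes recovers one split threshold.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
_target_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def query(x):
    """Black-box oracle returning (prediction, leaf id) for one input."""
    x = np.asarray(x).reshape(1, -1)
    return _target_tree.predict(x)[0], _target_tree.apply(x)[0]

def find_threshold(x, feature, lo, hi, tol=1e-6):
    """Search feature values in [lo, hi] for where the leaf id flips."""
    base = list(x)

    def leaf_at(value):
        probe = base.copy()
        probe[feature] = value
        return query(probe)[1]

    lo_leaf = leaf_at(lo)
    if leaf_at(hi) == lo_leaf:
        return None                 # no split crossed within this range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if leaf_at(mid) == lo_leaf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Methodically vary each feature from one starting sample; any feature that
# crosses a split in the probed range yields an estimated threshold.
for f in range(X.shape[1]):
    est = find_threshold(X[0], feature=f, lo=0.0, hi=8.0)
    if est is not None:
        print(f"feature {f}: recovered split threshold {est:.3f}")
```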

Targeted Domain Extraction

Targeted domain extraction refines the general model extraction approach by focusing queries on specific domains or capabilities where the target model demonstrates particular expertise. Rather than attempting to replicate the entire functionality of a model, attackers concentrate their efforts on a subset of the model’s capabilities that are of greatest interest or value. For example, when targeting a medical LLM, an attacker might focus queries on medical terminology, diagnoses, and treatment recommendations, efficiently capturing domain-specific knowledge without wasting queries on general capabilities. This focused approach offers several advantages: it requires fewer total queries, increases extraction efficiency by concentrating on high-value functionality, and can produce a more accurate substitute model within the targeted domain. The technique is particularly effective against large, multi-purpose models that have specialized capabilities in certain domains. Targeted domain extraction represents a pragmatic strategy for attackers with specific goals, allowing them to maximize the utility of the extracted model for particular applications while minimizing the resources required for the attack. This method acknowledges that complete model extraction might be unnecessary if the attacker’s interests lie in a specific subset of the model’s overall capabilities.
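
A sketch of what domain-focused harvesting might look like against a hosted LLM is shown below; the endpoint, JSON schema, and the small set of medical prompt templates are purely illustrative placeholders.

```python
# Sketch of targeted domain extraction against a hosted LLM: the query
# budget is spent only on the domain of interest (medical Q&A here). The
# endpoint URL, JSON schema, and prompt templates are illustrative
# placeholders, not any particular vendor's interface.
import itertools
import json

import requests

API_URL = "https://example.com/v1/chat"      # hypothetical endpoint
QUERY_BUDGET = 100

# Domain-focused templates concentrate queries on the capability the
# attacker wants to replicate instead of sampling the whole model.
TEMPLATES = [
    ("symptom", "List the differential diagnosis for {}."),
    ("condition", "What is the first-line treatment for {}?"),
    ("drug", "Explain the mechanism of action of {}."),
]
FILLERS = {
    "symptom": ["chest pain", "chronic cough", "acute headache"],
    "condition": ["type 2 diabetes", "community-acquired pneumonia"],
    "drug": ["metformin", "lisinopril", "atorvastatin"],
}

def domain_prompts():
    for field, template in TEMPLATES:
        for value in FILLERS[field]:
            yield template.format(value)

harvested = []
for prompt in itertools.islice(domain_prompts(), QUERY_BUDGET):
    resp = requests.post(API_URL, json={"prompt": prompt}, timeout=30)
    harvested.append({"prompt": prompt, "response": resp.json()["text"]})

# The prompt/response pairs become fine-tuning data for a smaller,
# domain-specific substitute model.
with open("medical_domain_pairs.json", "w") as f:
    json.dump(harvested, f, indent=2)
```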

Gray-Box API Querying In AI Model Extraction

In gray-box attacks, the attacker has partial knowledge of the model, such as its architecture or the distribution of training data. This additional information can be leveraged to craft more effective queries and enhance the attack’s success. Gray-box scenarios represent an intermediate level of access between complete knowledge (white-box) and no knowledge (black-box), reflecting real-world situations where some information about the model might be publicly available or deducible from documentation, research papers, or marketing materials. Popular gray-box API querying model extraction approaches include Side-Channel Exploitation, Gradient-Based Extraction, Weight Probing, Confidence Extraction, and Equation-Solving Attacks.

Side-Channel Exploitation

Side-channel exploitation combines API querying with the analysis of unintended information leakage through side-channels such as timing, memory access patterns, power consumption, or electromagnetic emanations. These channels can reveal significant details about a model’s architecture, parameters, or operations without directly accessing its internals. For instance, by measuring the execution time of different queries, attackers can infer the computational complexity of processing particular inputs, potentially revealing information about the model’s internal structure. Similarly, observations of cache access patterns can leak information about which parts of a neural network are activated by specific inputs. This approach is particularly effective in edge computing or federated learning environments, where the model runs on hardware that the attacker can monitor. Advanced side-channel attacks have successfully extracted neural network architectures by analyzing electromagnetic signals emitted during inference or by observing power consumption patterns. The combination of side-channel information with strategic API querying creates a powerful attack vector that can significantly reduce the number of queries needed to extract a model, as each query provides additional information through these covert channels that would not be available through the API response alone.
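
As a simple illustration of the timing channel, the sketch below wraps each API query with wall-clock measurements; the endpoint and payload are placeholders, and a real attack would need far more repetitions and careful statistics to separate model-dependent timing from network noise.

```python
# Sketch of a timing side-channel measurement layered on top of normal API
# querying: each query records wall-clock latency in addition to the output.
# Endpoint and payload schema are illustrative placeholders.
import statistics
import time

import requests

API_URL = "https://example.com/v1/predict"   # hypothetical endpoint

def timed_query(x, repeats=20):
    """Return the prediction plus latency statistics for one input."""
    latencies = []
    prediction = None
    for _ in range(repeats):
        start = time.perf_counter()
        resp = requests.post(API_URL, json={"features": x}, timeout=10)
        latencies.append(time.perf_counter() - start)
        prediction = resp.json()["prediction"]
    return prediction, statistics.median(latencies), statistics.stdev(latencies)

# Inputs of different sizes or complexity may trigger measurably different
# computation paths (e.g. early-exit branches), leaking structural details.
short_input = [0.0] * 10
long_input = [0.0] * 1000
for name, x in [("short", short_input), ("long", long_input)]:
    _, median_s, spread_s = timed_query(x)
    print(f"{name:>5} input: median {median_s*1e3:.1f} ms (sd {spread_s*1e3:.1f} ms)")
```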

Gradient-Based Extraction

Gradient-based extraction leverages gradient information exposed by the API to reconstruct model parameters with high precision. Some machine learning APIs provide gradient information for purposes such as explainability or fine-tuning, inadvertently creating an attack surface. When gradients are directly available, they provide rich information about how the model’s parameters influence its outputs, essentially offering a window into the model’s inner workings. Even when gradients aren’t explicitly provided, attackers can sometimes approximate them by observing how small changes in inputs affect outputs. By systematically collecting gradients or gradient approximations for a strategically chosen set of inputs, attackers can formulate systems of equations that, when solved, reveal the model’s parameters. This extraction method is significantly more efficient than basic I/O harvesting, requiring fewer queries to achieve higher fidelity replication. Gradient-based extraction is particularly effective against models with simpler architectures like linear models and shallow neural networks, but techniques have been developed to extend it to more complex architectures as well. The attack highlights the security trade-off between providing helpful model information (like gradients for explainability) and protecting model confidentiality – a tension that system designers must carefully balance.
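
When gradients are not returned directly, they can be approximated by finite differences over the API's scores, as in the sketch below; the logistic oracle is simulated locally so the example is self-contained, and the two-queries-per-dimension cost is the point to note.

```python
# Sketch of gradient approximation through the API when gradients are not
# exposed directly: central finite differences over a returned real-valued
# score, at a cost of two queries per input dimension. The logistic oracle
# is simulated locally so the example is self-contained.
import numpy as np

_w_true = np.array([1.5, -2.0, 0.8])         # hidden weights (simulation only)

def query_score(x):
    """Black-box oracle: returns a scalar confidence score for the input."""
    return float(1.0 / (1.0 + np.exp(-(_w_true @ x))))

def approx_gradient(x, eps=1e-4):
    """Central finite differences: 2 * dim queries per gradient estimate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (query_score(x + e) - query_score(x - e)) / (2 * eps)
    return grad

x0 = np.zeros(3)
g = approx_gradient(x0)

# For a logistic model the input gradient is proportional to the weight
# vector, so one estimate already reveals the weights' direction.
print("estimated direction:", np.round(g / np.linalg.norm(g), 3))
print("true direction:     ", np.round(_w_true / np.linalg.norm(_w_true), 3))
```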

Weight Probing

Weight probing is a sophisticated technique where attackers craft inputs specifically designed to extract information about a model’s internal weights and parameters through careful analysis of API responses. Unlike general extraction methods that treat the model as a complete black box, weight probing attempts to directly infer specific parameters by observing how slight modifications to inputs affect the outputs. The approach works by identifying inputs that lie near the model’s decision boundaries or critical activation thresholds, then making minimal modifications to these inputs to observe changes in the output. For neural networks with ReLU activation functions, for example, attackers can identify “critical points” where neurons switch from inactive to active states, revealing information about the corresponding weight vectors. By collecting enough of these observations and analyzing patterns in the responses, attackers can progressively reconstruct the model’s weight matrices with high accuracy. Weight probing attacks are particularly powerful against neural networks because they exploit the mathematical properties of the network’s architecture to directly extract parameters rather than simply mimicking behavior. However, they typically require a larger number of queries compared to some other extraction methods, with the query efficiency improving for larger models where patterns become more discernible across multiple parameters.
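
The sketch below illustrates the critical-point idea on a simulated one-hidden-layer ReLU network: along any probe line the output is piecewise linear, so spikes in the second finite difference mark where individual neurons switch state. The tiny simulated network, probe line, and detection thresholds are illustrative assumptions.

```python
# Sketch of weight probing via ReLU "critical points": along any probe line
# the output of a ReLU network is piecewise linear, and kinks appear exactly
# where a hidden neuron switches between inactive and active. The tiny
# network below is simulated; the attack code only calls query.
import numpy as np

rng = np.random.default_rng(1)
_W1 = rng.normal(size=(4, 2))       # hidden-layer weights (secret)
_b1 = rng.normal(size=4)            # hidden-layer biases (secret)
_w2 = rng.normal(size=4)            # output weights (secret)

def query(x):
    """Black-box oracle: scalar output of a one-hidden-layer ReLU network."""
    return float(_w2 @ np.maximum(_W1 @ x + _b1, 0.0))

# Probe the model along the line x(t) = origin + t * direction.
origin = np.zeros(2)
direction = np.array([1.0, 0.3])
ts = np.linspace(-5, 5, 4001)
outputs = np.array([query(origin + t * direction) for t in ts])

# Second finite differences are ~0 on linear pieces and spike at the kinks.
second_diff = np.abs(np.diff(outputs, n=2))
kink_ts = ts[1:-1][second_diff > 1e-6]

# Collapse adjacent detections into distinct critical points.
critical_ts = []
for t in kink_ts:
    if not critical_ts or t - critical_ts[-1] > 0.05:
        critical_ts.append(t)

# Ground truth (available only because the network is simulated here).
true_ts = -(_W1 @ origin + _b1) / (_W1 @ direction)
true_ts = np.sort(true_ts[np.abs(true_ts) < 5])
print("recovered critical points:", np.round(critical_ts, 2))
print("true crossing points:     ", np.round(true_ts, 2))
```

Each recovered critical point tells the attacker where one neuron's hyperplane crosses the probe line; repeating the probe along many directions is what allows the weight vectors themselves to be reconstructed.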

Confidence Extraction

Confidence extraction specifically targets the probability distributions or confidence scores that many machine learning APIs return alongside their primary predictions. These confidence values provide a rich source of information about the model’s internal decision-making process. By examining how confidence scores change across different inputs, attackers can infer details about decision boundaries, feature importance, and internal thresholds that would not be apparent from discrete predictions alone. The technique involves systematically querying the model with strategically designed inputs that probe different regions of the input space, then analyzing the patterns and gradients in the resulting confidence distributions. This information allows attackers to map out decision boundaries with high precision and understand how the model weighs different features in its decision process. Confidence extraction is particularly effective because many APIs return these scores by default, offering significantly more information than would be available from class labels alone. The granularity of confidence scores provides a continuous signal that reveals subtle details about the model’s behavior near decision boundaries, where the most informative points for extraction lie. Defending against this attack often involves limiting the precision of confidence scores or adding noise to them, creating a trade-off between model transparency for legitimate users and security against extraction attacks.
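
The sketch below shows one concrete payoff of confidence scores: for a logistic-style model (simulated here), the log-odds are linear along any segment, so the exact boundary crossing between two inputs falls out of just two scored queries instead of a long label-only bisection.

```python
# Sketch of confidence extraction: when the API returns class probabilities,
# the decision boundary between two inputs can be pinpointed from just two
# queries by interpolating the log-odds, rather than dozens of label-only
# bisection steps. The logistic oracle below is simulated; for non-logistic
# models the interpolation is only approximate near the boundary.
import numpy as np

_w_true = np.array([1.2, -0.7])     # hidden weights (simulation only)
_b_true = 0.3                       # hidden bias (simulation only)

def query_confidence(x):
    """Black-box oracle returning P(class = 1 | x)."""
    return float(1.0 / (1.0 + np.exp(-(_w_true @ x + _b_true))))

def logit(p):
    return np.log(p / (1.0 - p))

# Two inputs known to sit on opposite sides of the boundary.
x_neg = np.array([-2.0, 1.0])
x_pos = np.array([2.0, -1.0])

# Log-odds vary linearly along the segment x_neg + t * (x_pos - x_neg), so
# the t where they reach zero (p = 0.5) follows from two confidence values.
l_neg = logit(query_confidence(x_neg))
l_pos = logit(query_confidence(x_pos))
t_star = l_neg / (l_neg - l_pos)
boundary_point = x_neg + t_star * (x_pos - x_neg)

print("estimated boundary point:", np.round(boundary_point, 3))
print("confidence at that point:", round(query_confidence(boundary_point), 6))
```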

Equation-Solving Attacks

Equation-solving attacks represent a powerful and direct approach to model extraction when the architecture of the target model is known. The fundamental idea is elegantly simple: each input-output pair obtained from querying the model can be expressed as an equation where the model’s parameters are the unknowns. By collecting enough distinct input-output pairs, attackers can construct a system of equations and solve for these parameters. This method is particularly effective against models with clearly defined mathematical structures, such as logistic regression models, support vector machines with linear or quadratic kernels, and shallow multi-layer perceptrons. For example, in a linear model, each output is a weighted sum of inputs plus a bias term, forming a linear equation where the weights and bias are the parameters to be extracted. Once enough equations are gathered, standard matrix algebra techniques can solve for the exact values of these parameters. Equation-solving attacks are remarkably efficient, often requiring only 1-4 queries per parameter, and they achieve perfect extraction scores when successful. However, they become computationally intensive for models with large numbers of parameters and are less effective against highly non-linear models with complex architectures. The attack also requires the architecture to be known in advance, though this information might be deducible through other extraction techniques or from public documentation.
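
The sketch below carries out this attack against a simulated logistic regression model whose API returns confidence scores: applying the log-odds transform turns each query into a linear equation in the unknown weights and bias, and d + 1 well-chosen queries recover them essentially exactly, consistent with the 1-4 queries per parameter noted above.

```python
# Sketch of an equation-solving attack on a logistic regression model whose
# API returns class probabilities. Each scored query yields one linear
# equation in the unknown weights and bias, so d + 1 queries suffice.
# The hidden parameters exist only to simulate the oracle.
import numpy as np

rng = np.random.default_rng(2)
_w_true = rng.normal(size=5)        # hidden weights (simulation only)
_b_true = rng.normal()              # hidden bias (simulation only)

def query_confidence(x):
    """Black-box oracle returning P(class = 1 | x)."""
    return float(1.0 / (1.0 + np.exp(-(_w_true @ x + _b_true))))

d = 5
X = rng.normal(size=(d + 1, d))                      # d + 1 distinct queries
probs = np.array([query_confidence(x) for x in X])
logits = np.log(probs / (1.0 - probs))               # invert the sigmoid

# Each row states w . x + b = logit(p); stack into A @ [w, b] = logits.
A = np.hstack([X, np.ones((d + 1, 1))])
params, *_ = np.linalg.lstsq(A, logits, rcond=None)
w_est, b_est = params[:-1], params[-1]

print("max weight error:", float(np.max(np.abs(w_est - _w_true))))
print("bias error:      ", float(abs(b_est - _b_true)))
```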

Thanks for reading!
