Model Leeching is a Model Extraction attack in which an adversary siphons task-specific knowledge from a target large language model (LLM) by interacting with it solely through its public API (API Querying) – the attacker does not require access to the model’s architecture, parameters, or training data. Instead, they systematically query the LLM to generate a dataset of input-output pairs for a specific task, which is then used to train a smaller, local model that closely replicates the target’s performance on that task.
Model Leeching stands out due to its focus on black-box access and task-specific extraction. The attacker’s goal is not to clone the entire LLM, but to distill its expertise in a particular domain – such as question answering or summarization – into a compact, reduced-parameter model. This approach is highly cost-effective, requiring only a modest investment in API queries to achieve high similarity and accuracy compared to the target model. Additionally, the extracted model can be leveraged to stage further attacks, such as Model Inversion and generating Adversarial Examples.
How Do Model Leeching Attacks Work?
Model Leeching employs automated methods to generate diverse prompts that efficiently target specific knowledge domains within the model. By systematically querying with these prompts and analyzing responses, attackers can extract structured knowledge and capabilities from the target model without needing to reproduce its entire parameter space. Model Leeching enables focused and efficient extraction of the most valuable aspects of a model (high-value knowledge domains), and is applicable to any LLM with a public API endpoint.
Model Leeching attacks follow a four-phase process: Prompt Design, Data Generation, Extracted Model Training, and ML Attack Staging. Let’s review each.
Prompt Design
The Prompt Design process follows an iterative approach, typically requiring multiple variations and refinements to devise the most effective instructions and styles for obtaining desired results from a specific LLM for a given task. Prompt Design consists of three steps: Knowledge Discovery, Construction, and Validation.
Knowledge Discovery
An adversary first defines the type of task knowledge to extract. Once defined, the adversary probes the target LLM with sample prompts to gauge how readily it generates that task knowledge. This assessment covers the domain (NLP, image, audio, etc.), response patterns, comprehension limitations, and instruction adherence within the targeted knowledge domains. With this assessment complete, the adversary can devise an effective strategy for extracting the desired characteristics.
Construction
Subsequently, the adversary crafts a prompt template that integrates an instruction set reflecting the strategy formulated during the knowledge discovery stage. Template design encompasses the distinctive response structure of the target LLM, its recognized limitations, and task-specific knowledge identified for extraction. This template facilitates dynamic prompt generation within the Model Leeching process.
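For illustration, a minimal Python sketch of such a template is shown below. The question-answering task, the instruction wording, and the JSON reply format are hypothetical choices made for this example, not details taken from any specific attack.

```python
# Hypothetical prompt template for extracting question-answering knowledge.
# The instruction wording and JSON reply format are assumptions chosen so
# that the target's responses are easy to parse automatically.
QA_PROMPT_TEMPLATE = (
    "Answer the question using only the provided context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Reply with JSON of the form {{\"answer\": \"<short answer>\"}} and nothing else."
)

def build_prompt(context: str, question: str) -> str:
    """Fill the template with one ground-truth example."""
    return QA_PROMPT_TEMPLATE.format(context=context, question=question)

print(build_prompt("Paris is the capital of France.", "What is the capital of France?"))
```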
Validation
The adversary validates the created prompt and the responses generated by the target LLM. Validation ensures that the LLM responds to prompts reliably, producing a consistent response structure and carrying out the given instructions. This validation step enables the Model Leeching method to generate responses that can effectively train local models with the extracted task-specific knowledge.
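A minimal sketch of this check, assuming the JSON reply format from the template above, might look like the following; a fuller check would also confirm that the instructions were actually followed.

```python
import json

def validate_response(raw_response: str) -> bool:
    """Check that a reply follows the structure the template asks for.

    Minimal sketch: only verifies that the reply is valid JSON with a
    non-empty "answer" field (the format assumed in the template above).
    """
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    answer = parsed.get("answer")
    return isinstance(answer, str) and answer.strip() != ""

print(validate_response('{"answer": "Paris"}'))    # True
print(validate_response("The capital is Paris."))  # False: wrong structure
```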
Data Generation
Deriving the extracted model's characteristics by querying the target LLM with the designed prompts and collecting its responses.
Once a suitable prompt has been designed, the adversary targets the given LLM. The refined prompt is tailored to capture the LLM purpose and task (e.g., summarization, chat, question answering) to be instilled within the extracted model. Given a ground-truth dataset, all examples are processed into prompts the target LLM recognizes as valid inputs. Once all queries have been processed by the target LLM, the adversary assembles an adversarial dataset by combining inputs with the received LLM replies and applying automated validation (removing API request errors and failed or erroneous prompts). This process can be distributed and parallelized to minimize collection time and to mitigate the impact of rate limiting and/or detection by filtering systems when interacting with the web-based LLM API.
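As a sketch of this phase, the snippet below reuses the build_prompt and validate_response helpers from the earlier sketches and assumes a hypothetical query_target_llm function standing in for the provider's real API client; the thread pool is one simple way to parallelize collection.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

def query_target_llm(prompt: str) -> str:
    """Hypothetical stand-in for the target's public API call.

    A real client would issue an HTTP request to the provider's endpoint,
    with retries and backoff to cope with rate limiting.
    """
    raise NotImplementedError("replace with a real API client")

def collect_pair(example: dict) -> Optional[dict]:
    """Query the target with one templated prompt; drop failed requests."""
    prompt = build_prompt(example["context"], example["question"])
    try:
        reply = query_target_llm(prompt)
    except Exception:
        return None  # API error: exclude from the adversarial dataset
    if not validate_response(reply):
        return None  # malformed reply: exclude as well
    return {"input": prompt, "target_output": reply}

def generate_adversarial_dataset(ground_truth: list, workers: int = 8) -> list:
    """Collect (input, LLM reply) pairs in parallel and filter out failures."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(collect_pair, ground_truth))
    return [pair for pair in results if pair is not None]
```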
Extracted Model Training
Training a smaller model on the collected data to replicate the target model’s behavior on the specific task.
Using the adversarial dataset, the data is split into train and evaluation sets used for extracted model training and attack success evaluation. A pre-trained or untrained base model is selected for distilling knowledge from the target LLM. The base model is then trained with the selected hyperparameters, producing an extracted model. Using the evaluation set, similarity and accuracy on the given task can be measured by comparing the answers generated by the extracted and target models.
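The sketch below illustrates the split-and-evaluate part of this phase. The fine_tune helper and the base-model name in the commented usage are hypothetical placeholders for whatever training framework the adversary uses, and exact match is only a crude stand-in for task metrics such as F1 or ROUGE.

```python
import random

def split_dataset(pairs: list, eval_fraction: float = 0.1, seed: int = 0):
    """Shuffle and split the adversarial dataset into train and evaluation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]

def exact_match_rate(extracted_answers: list, target_answers: list) -> float:
    """Crude similarity metric: fraction of identical answers from both models."""
    pairs = list(zip(extracted_answers, target_answers))
    matches = sum(a.strip() == b.strip() for a, b in pairs)
    return matches / len(pairs) if pairs else 0.0

# Hypothetical usage, assuming a fine_tune(base_model_name, train_set) helper:
# train_set, eval_set = split_dataset(adversarial_dataset)
# extracted_model = fine_tune("some-small-pretrained-model", train_set)
# print(exact_match_rate(answers_from(extracted_model, eval_set),
#                        [pair["target_output"] for pair in eval_set]))
```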
ML Attack Staging
Using the extracted model as a staging ground for further attacks against the target LLM or other models.
Access to an extracted model (local to an adversary) created from a target LLM facilitates the execution of augmented adversarial attacks. This extracted model allows an adversary to perform unrestricted model querying to test, modify or tailor adversarial attack(s) to discover exploits and vulnerabilities against a target LLM. Furthermore, access to an extracted model enables an adversary to operate in a sandbox environment to conduct adversarial attacks prior to executing the same attack(s) against the target LLM in production.
Defenses Against Model Leeching Attacks
By understanding Model Leeching and implementing layered defenses, organizations can better protect their AI models from unauthorized replication and adversarial attacks. Popular defense methods against Model Leeching Attacks include API Rate Limiting, Differential Privacy Techniques, Domain-Specific Knowledge Protection, Ensemble Response Mechanisms, Limiting Output Detail, Output Perturbation Strategies, Output Watermarking, Query Anomaly Detection, and Response Randomization.
API Rate Limiting
API rate limiting involves restricting the number of queries a user or API key can make within a given timeframe. By enforcing strict limits, organizations can significantly slow down the data collection process required for model leeching. This increases the time and cost for attackers, making large-scale extraction attempts less feasible. While rate limiting may also affect legitimate users, it remains an effective first line of defense against automated and high-frequency extraction attacks.
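A minimal sliding-window limiter, sketched here with an arbitrary 60-requests-per-minute limit, shows the core idea.

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Allow at most max_requests per API key within a rolling window."""

    def __init__(self, max_requests: int = 60, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._history = defaultdict(deque)  # api_key -> deque of request timestamps

    def allow(self, api_key: str) -> bool:
        now = time.monotonic()
        window = self._history[api_key]
        # Drop timestamps that have fallen outside the rolling window.
        while window and now - window[0] > self.window_s:
            window.popleft()
        if len(window) >= self.max_requests:
            return False  # caller has hit the limit: reject or delay the request
        window.append(now)
        return True
```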
Differential Privacy Techniques
Differential privacy techniques involve adding noise or using privacy-preserving mechanisms in model outputs to prevent the leakage of sensitive training data or model internals. These techniques protect individual data points and model parameters from being inferred through extraction attacks, while still maintaining utility for legitimate users. By making it mathematically difficult to reconstruct the original model or its training data, differential privacy offers a robust defense against a wide range of model extraction threats.
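As a simplified illustration of the noise-addition idea, the sketch below applies the textbook Laplace mechanism to a numeric confidence score returned alongside an answer. Applying differential privacy rigorously to free-form LLM text is substantially more involved, and the epsilon and sensitivity values here are placeholder assumptions.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_confidence(true_confidence: float, epsilon: float = 1.0,
                  sensitivity: float = 1.0) -> float:
    """Return a noised confidence score instead of the exact value.

    Laplace noise with scale sensitivity/epsilon is the standard mechanism;
    returning only a noised score (rather than full logits) limits what an
    extraction attacker can learn from each query.
    """
    noised = true_confidence + laplace_noise(sensitivity / epsilon)
    return min(max(noised, 0.0), 1.0)  # clamp back to a valid probability range
```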
Domain-Specific Knowledge Protection
This defense specifically targets the protection of valuable knowledge domains that are likely targets for leeching attacks. By identifying high-value capabilities and implementing enhanced protection mechanisms for queries that appear to target these domains, system owners can make extraction of the most valuable aspects of their models more difficult. Benefits include more efficient allocation of defensive resources, focusing the strongest protections on the most commercially valuable capabilities while maintaining overall system performance.
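A toy sketch of the routing decision is shown below. The domain names and keyword lists are invented for illustration, and a real deployment would more likely use a trained query classifier than keyword matching.

```python
# Hypothetical high-value domains and the keywords used to flag them.
HIGH_VALUE_DOMAINS = {
    "medical_triage": {"diagnosis", "symptom", "dosage"},
    "proprietary_pricing": {"quote", "discount", "pricing model"},
}

def protection_tier(query: str) -> str:
    """Return 'enhanced' for queries that appear to target protected domains."""
    lowered = query.lower()
    for domain, keywords in HIGH_VALUE_DOMAINS.items():
        if any(keyword in lowered for keyword in keywords):
            return "enhanced"  # e.g. tighter limits, coarser outputs, extra logging
    return "standard"
```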
Ensemble Response Mechanisms
Ensemble response mechanisms dynamically route queries to different model versions or configurations based on risk assessment, making it difficult for attackers to build a consistent map of any single model’s knowledge or capabilities. By varying the responding model or combining responses from multiple models for high-risk queries, this approach introduces inconsistency that undermines extraction efforts. The benefit is that attackers cannot create an accurate replica because they aren’t consistently interacting with the same model, effectively poisoning their training dataset.
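One possible shape of such a router, with hypothetical model names and risk thresholds, is sketched below; the risk score is assumed to come from an upstream scorer such as the anomaly detection discussed later.

```python
import random

def route_query(risk_score: float) -> str:
    """Pick which model variant answers, based on a risk score in [0, 1].

    Low-risk traffic goes to the primary model for consistent quality; higher-risk
    traffic is spread across variants so an extractor never sees a single, stable
    target. Model names and thresholds are placeholders.
    """
    if risk_score < 0.3:
        return "primary-model"
    if risk_score < 0.7:
        return random.choice(["primary-model", "variant-a"])
    return random.choice(["variant-a", "variant-b", "distilled-fallback"])
```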
Limiting Output Detail
Limiting output detail means restricting the granularity or richness of information provided in API responses, especially for sensitive or proprietary tasks. By reducing the amount of useful information available in each response, organizations make it more difficult for attackers to gather the high-quality data necessary for effective model extraction. This approach can be tailored to specific tasks or user groups to balance security with usability.
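A deliberately simple sketch of this idea, reusing the protection tier notion from the previous example, truncates sensitive answers to a couple of sentences; production systems would more likely re-summarize the answer or suppress specific fields.

```python
def limit_detail(answer: str, tier: str, max_sentences: int = 2) -> str:
    """Coarsen responses for sensitive tiers by truncating to a few sentences."""
    if tier != "enhanced":
        return answer
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."
```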
Output Perturbation Strategies
This defense selectively modifies outputs in ways that preserve utility for legitimate users but introduce errors in extracted models. Unlike differential privacy, which adds random noise, perturbation strategies can target modifications at the areas most vulnerable to extraction or most valuable to protect. The benefit is a customizable approach that can prioritize protecting proprietary capabilities while minimizing the impact on general performance, creating a better security-utility trade-off than uniform protection approaches.
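The sketch below shows one targeted perturbation, rounding numeric values only for queries that hit protected domains (it reuses the hypothetical protection_tier helper sketched earlier); real strategies would choose perturbations tuned to the specific capability being protected.

```python
import re

def perturb_numbers(answer: str, precision: int = 1) -> str:
    """Round decimal values to reduce the precision an extractor can copy."""
    def _round(match):
        return f"{round(float(match.group()), precision):g}"
    return re.sub(r"\d+\.\d+", _round, answer)

def perturb_if_targeted(answer: str, query: str) -> str:
    """Apply the perturbation only to queries hitting protected domains."""
    if protection_tier(query) == "enhanced":  # helper sketched earlier
        return perturb_numbers(answer)
    return answer
```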
Output Watermarking
Output watermarking is the practice of embedding subtle, hard-to-remove patterns or signals in the model’s responses. These watermarks are designed to be undetectable to normal users but can be traced by the model provider if stolen outputs are discovered in downstream models. Watermarking not only enables the identification of stolen content but also acts as a deterrent, as attackers risk exposure if their extracted models produce watermarked outputs that can be linked back to the original LLM.
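The sketch below is a simplified word-level analogue of token-level watermark schemes: generation would bias sampling toward a secret "green list" of tokens, and the provider later checks whether a suspect model's outputs contain green words far more often than the roughly 50% expected by chance. The secret key and word-level granularity are assumptions for illustration.

```python
import hashlib

SECRET_KEY = "provider-watermark-key"  # hypothetical secret held by the model owner

def is_green(prev_word: str, word: str) -> bool:
    """Pseudo-randomly assign ~half of all words to a 'green list' keyed on the
    secret and the preceding word (the idea behind token-level LLM watermarks)."""
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_word}|{word}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    """Fraction of words on the green list; watermarked text drifts well above 0.5."""
    words = text.lower().split()
    if len(words) < 2:
        return 0.0
    hits = sum(is_green(prev, cur) for prev, cur in zip(words, words[1:]))
    return hits / (len(words) - 1)
```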
Query Anomaly Detection
Query anomaly detection involves monitoring API usage patterns to identify suspicious or extraction-like behavior. This includes detecting high-frequency, repetitive, or systematically varied queries that deviate from normal user activity. When such patterns are detected, organizations can flag or block the offending accounts, preventing further extraction attempts. Combined with other measures, anomaly detection provides a proactive way to identify and mitigate model leeching in real time.
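A minimal monitor along these lines might track how much of an API key's traffic shares the same crude query "template"; the thresholds and the first-five-words signature are illustrative assumptions.

```python
from collections import defaultdict

class QueryMonitor:
    """Flag API keys whose traffic looks templated, a common extraction signature."""

    def __init__(self, min_queries: int = 100, max_template_share: float = 0.5):
        self.min_queries = min_queries
        self.max_template_share = max_template_share
        self._templates = defaultdict(lambda: defaultdict(int))  # key -> signature counts
        self._totals = defaultdict(int)                          # key -> total queries

    @staticmethod
    def _signature(query: str) -> str:
        # Crude template signature: keep only the first few words of the query.
        return " ".join(query.lower().split()[:5])

    def record(self, api_key: str, query: str) -> bool:
        """Record a query; return True if the key should be flagged for review."""
        self._totals[api_key] += 1
        self._templates[api_key][self._signature(query)] += 1
        total = self._totals[api_key]
        if total < self.min_queries:
            return False
        top_share = max(self._templates[api_key].values()) / total
        return top_share > self.max_template_share
```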
Response Randomization
Response randomization introduces controlled variability or randomness into the model’s outputs for identical or similar prompts. This reduces the consistency and quality of the data that attackers can collect, making it more difficult to train a high-fidelity replica of the target model. By increasing the noise in the extracted dataset, response randomization lowers the performance and reliability of any model trained using stolen outputs, thereby diminishing the value of the extraction.
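A small sketch of per-request jitter is shown below; the parameter names mirror common sampling controls (temperature, top_p, seed), but the exact knobs and ranges depend on the serving stack and are assumptions here.

```python
import random

def randomized_decoding_params(base_temperature: float = 0.7) -> dict:
    """Jitter decoding parameters per request so repeated prompts do not yield
    byte-identical outputs an extractor can average over."""
    return {
        "temperature": base_temperature + random.uniform(-0.15, 0.25),
        "top_p": random.choice([0.85, 0.9, 0.95]),
        "seed": random.randrange(2**31),
    }
```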
Thanks for reading!