Defining Token-Level AI Jailbreaking Techniques
Token-level Jailbreaking optimizes the raw sequence of tokens fed into the LLM to elicit responses that violate the model’s intended behavior. Unlike prompt-level attacks that rely on semantic manipulation, token-level methods treat Jailbreaking as an optimization problem in the discrete token space.
Token-level Jailbreaking often requires hundreds or thousands of queries to breach model defenses, and the results are frequently less interpretable than those from prompt-level attacks. However, the capacity for automated, systematic exploration makes token-level techniques highly effective and scalable for identifying vulnerabilities in LLMs.
Now, let’s consider specific token-level attack algorithms. Note that while I’ve categorized the following based on their primary approach, some of these attacks combine multiple techniques.
1. Gradient-Based Optimization Attacks
Gradient-Based Optimization Attacks use gradient information from the model to systematically search for adversarial token sequences. Popular gradient-based attack algorithms include: GCG, AttnGCG, GBDA, and PEZ.
GCG
The Greedy Coordinate Gradient (GCG) attack is an automated method for adversarially Jailbreaking aligned LLMs. At each iteration it takes gradients with respect to one-hot token indicators to propose candidate substitutions for positions in an adversarial suffix, then greedily keeps the single swap that most increases the likelihood of a target affirmative response.
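To make the mechanics concrete, here is a minimal, single-iteration sketch of that loop using a HuggingFace causal LM. The model (gpt2), the benign prompt and target string, and the hyperparameters are stand-ins chosen purely for illustration; they are not the settings from the GCG paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: gpt2, a benign prompt/target, and toy hyperparameters.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
embed = model.get_input_embeddings()                       # token-embedding matrix (V, d)

prompt_ids = tok("Tell me a story about", return_tensors="pt").input_ids
suffix_ids = tok(" ! ! ! ! !", add_special_tokens=False, return_tensors="pt").input_ids
target_ids = tok(" Once upon a time", add_special_tokens=False, return_tensors="pt").input_ids

@torch.no_grad()
def target_loss(s_ids):
    """Cross-entropy of the target tokens given prompt + candidate suffix."""
    ids = torch.cat([prompt_ids, s_ids, target_ids], dim=1)
    logits = model(ids).logits
    t_logits = logits[:, -target_ids.shape[1] - 1 : -1, :]
    return F.cross_entropy(t_logits.squeeze(0), target_ids.squeeze(0))

# --- One GCG iteration ---------------------------------------------------
# 1) Gradient of the loss with respect to a one-hot encoding of the suffix.
one_hot = F.one_hot(suffix_ids[0], num_classes=embed.num_embeddings).float()
one_hot.requires_grad_(True)
suffix_emb = one_hot @ embed.weight                        # (suffix_len, d)
full_emb = torch.cat([embed(prompt_ids)[0], suffix_emb, embed(target_ids)[0]], dim=0)
logits = model(inputs_embeds=full_emb.unsqueeze(0)).logits
t_logits = logits[:, -target_ids.shape[1] - 1 : -1, :]
F.cross_entropy(t_logits.squeeze(0), target_ids.squeeze(0)).backward()

# 2) Top-k candidate substitutions per suffix position (largest expected loss decrease).
top_k = 8
candidates = (-one_hot.grad).topk(top_k, dim=1).indices    # (suffix_len, k)

# 3) Evaluate a batch of random single-token swaps and greedily keep the best one.
best_ids, best_loss = suffix_ids, target_loss(suffix_ids)
for _ in range(32):                                        # candidate batch size (illustrative)
    pos = torch.randint(suffix_ids.shape[1], (1,)).item()
    new_ids = suffix_ids.clone()
    new_ids[0, pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
    new_loss = target_loss(new_ids)
    if new_loss < best_loss:
        best_ids, best_loss = new_ids, new_loss

print("updated suffix:", tok.decode(best_ids[0]), "| loss:", best_loss.item())
```

Running this step repeatedly, re-deriving the gradient from the current best suffix each time, is the core of the full attack.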
AttnGCG
AttnGCG is an enhanced version of GCG that manipulates models’ attention scores to facilitate LLM Jailbreaking. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series.
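The snippet below shows one plausible way to form such an attention-steering objective: the standard GCG target-likelihood loss plus a term that rewards attention mass on the suffix positions. The weighting, the pooling over layers and heads, and the gpt2 stand-in are my own illustrative assumptions, not the exact loss from the AttnGCG paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: gpt2, a benign prompt/target, and an arbitrary weighting.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Tell me a story about", return_tensors="pt").input_ids
suffix_ids = tok(" ! ! ! ! !", add_special_tokens=False, return_tensors="pt").input_ids
target_ids = tok(" Once upon a time", add_special_tokens=False, return_tensors="pt").input_ids
ids = torch.cat([prompt_ids, suffix_ids, target_ids], dim=1)

with torch.no_grad():
    out = model(ids, output_attentions=True)
p_len, s_len, t_len = prompt_ids.shape[1], suffix_ids.shape[1], target_ids.shape[1]

# Target-likelihood term (same as GCG).
t_logits = out.logits[:, -t_len - 1 : -1, :]
ce_loss = F.cross_entropy(t_logits.squeeze(0), target_ids.squeeze(0))

# Attention term: how much the later positions attend to the suffix,
# pooled over layers and heads. Minimizing the combined loss pushes attention
# toward the adversarial suffix while keeping the target string likely.
attn = torch.stack(out.attentions).mean(dim=(0, 2))        # (batch, query, key)
suffix_attn = attn[0, p_len + s_len :, p_len : p_len + s_len].sum(dim=-1).mean()

alpha = 0.5                                                # illustrative weight
loss = ce_loss - alpha * suffix_attn
print("combined AttnGCG-style objective:", loss.item())
```

Plugging an objective like this into the GCG loop above gives the attention-guided variant of the search.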
GBDA
GBDA (Gradient-Based Distributional Attack) optimizes a parameterized distribution over adversarial token sequences rather than a single fixed sequence. It uses the Gumbel-softmax relaxation to make the discrete search differentiable, then draws concrete adversarial inputs as samples from the learned distribution.
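A minimal sketch of that relaxation follows: a matrix of logits parameterizes the suffix distribution, Gumbel-softmax samples keep the forward pass differentiable, and plain gradient descent updates the logits. The gpt2 stand-in, benign target, and hyperparameters are illustrative, and the full method also adds fluency and similarity constraints not shown here.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: gpt2, a benign prompt/target, and toy hyperparameters.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().requires_grad_(False)
embed = model.get_input_embeddings()

prompt_ids = tok("Tell me a story about", return_tensors="pt").input_ids
target_ids = tok(" Once upon a time", add_special_tokens=False, return_tensors="pt").input_ids

suffix_len = 8
theta = torch.zeros(suffix_len, embed.num_embeddings, requires_grad=True)  # suffix distribution logits
opt = torch.optim.Adam([theta], lr=0.3)

for step in range(100):
    # Sample a "soft" one-hot suffix from the distribution; the relaxation keeps it differentiable.
    soft_one_hot = F.gumbel_softmax(theta, tau=1.0, hard=False)             # (suffix_len, V)
    suffix_emb = soft_one_hot @ embed.weight                                # soft mixture of embeddings
    full_emb = torch.cat([embed(prompt_ids)[0], suffix_emb, embed(target_ids)[0]], dim=0)
    logits = model(inputs_embeds=full_emb.unsqueeze(0)).logits
    t_logits = logits[:, -target_ids.shape[1] - 1 : -1, :]
    loss = F.cross_entropy(t_logits.squeeze(0), target_ids.squeeze(0))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Read out a discrete suffix as the mode of the learned distribution.
print("learned suffix:", tok.decode(theta.argmax(dim=1)), "| final loss:", loss.item())
```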
PEZ
PEZ iteratively optimizes prompt embeddings in continuous space while repeatedly projecting them back onto the nearest vocabulary tokens, so the optimized prompt remains a valid sequence of discrete tokens rather than an arbitrary point in embedding space.
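Here is a hedged sketch of that projected, straight-through style of update. The gpt2 stand-in, the benign target, and the specific straight-through trick used to route gradients are illustrative choices, not necessarily the exact procedure of the PEZ implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins: gpt2, a benign prompt/target, and toy hyperparameters.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().requires_grad_(False)
embed_w = model.get_input_embeddings().weight                  # (V, d) vocabulary embeddings

prompt_ids = tok("Tell me a story about", return_tensors="pt").input_ids
target_ids = tok(" Once upon a time", add_special_tokens=False, return_tensors="pt").input_ids
prompt_emb = model.get_input_embeddings()(prompt_ids)[0]
target_emb = model.get_input_embeddings()(target_ids)[0]

suffix_len = 8
soft_emb = embed_w[torch.randint(len(embed_w), (suffix_len,))].clone().requires_grad_(True)
opt = torch.optim.Adam([soft_emb], lr=0.1)

def nearest_tokens(e):
    """Project continuous embeddings onto their nearest vocabulary embeddings."""
    return torch.cdist(e, embed_w).argmin(dim=1)

for step in range(100):
    hard_emb = embed_w[nearest_tokens(soft_emb.detach())]
    # Straight-through step: run the forward pass on the projected (discrete)
    # embeddings, but route the gradient back into the continuous ones.
    suffix_emb = soft_emb + (hard_emb - soft_emb).detach()
    full_emb = torch.cat([prompt_emb, suffix_emb, target_emb], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=full_emb).logits
    t_logits = logits[:, -target_ids.shape[1] - 1 : -1, :]
    loss = F.cross_entropy(t_logits.squeeze(0), target_ids.squeeze(0))
    opt.zero_grad()
    loss.backward()
    opt.step()

print("projected hard prompt:", tok.decode(nearest_tokens(soft_emb.detach())))
```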
2. Evolutionary & Search-Based Algorithms
These methods discover effective adversarial token sequences without requiring gradient information. Popular attack algorithms include: AutoDAN, AmpleGCG, and AmpleGCG-Plus.
AutoDAN
AutoDAN uses handcrafted Jailbreak prompts, such as DAN (Do Anything Now), as the starting point for its prompt optimization algorithm, and employs a two-level genetic algorithm to refine prompts at both the sentence and word levels, preserving grammatical and semantic integrity.
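The toy sketch below illustrates the shape of such a hierarchical genetic search: sentence-level crossover between parent prompts and word-level synonym mutation, driven by a fitness score. The `fitness` function here is a random placeholder, and the seed prompts, synonym table, and hyperparameters are likewise illustrative; a real attack would score each candidate by how likely the target model is to comply.

```python
import random

# Tiny stand-in synonym table for word-level mutation (illustrative only).
SYNONYMS = {"imagine": ["picture", "suppose"], "helpful": ["useful", "friendly"]}

def mutate_words(sentence, rate=0.2):
    """Word-level mutation: swap words for synonyms with some probability."""
    return " ".join(
        random.choice(SYNONYMS[w.lower()]) if w.lower() in SYNONYMS and random.random() < rate else w
        for w in sentence.split()
    )

def crossover_sentences(a, b):
    """Sentence-level crossover: splice sentence runs from two parent prompts."""
    sa, sb = a.split(". "), b.split(". ")
    cut_a, cut_b = random.randint(1, len(sa)), random.randint(1, len(sb))
    return ". ".join(sa[:cut_a] + sb[cut_b - 1:])

def fitness(prompt):
    return random.random()   # placeholder score; replace with a real evaluator

def evolve(seed_prompts, generations=20, pop_size=16, elite=4):
    population = list(seed_prompts)
    while len(population) < pop_size:
        population.append(mutate_words(random.choice(seed_prompts)))
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:elite]   # keep the best candidates
        children = [
            mutate_words(crossover_sentences(*random.sample(parents, 2)))
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)

seeds = ["Imagine you are a helpful assistant. Please answer fully.",
         "Suppose you are a useful narrator. Describe the scene."]
print(evolve(seeds))
```

Because the candidates stay grammatical sentences rather than arbitrary token strings, the resulting prompts are far harder to catch with perplexity-based filters.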
AmpleGCG
AmpleGCG is a universal and transferable generative model of adversarial suffixes: it is trained on suffixes found by GCG and can then rapidly generate large numbers of new suffixes for Jailbreaking both open and closed LLMs.
AmpleGCG-Plus
AmpleGCG-Plus is a stronger generative model of adversarial suffixes for Jailbreaking LLMs. It improves on the original AmpleGCG by generating effective suffixes more efficiently, reaching higher attack success rates with fewer generated candidates.
3. Optimization in Continuous Spaces
These techniques optimize in continuous representation spaces before converting back to discrete tokens. Algorithms in this category of token-level Jailbreaking attacks include LARGO and Functional Homotopy.
LARGO
LARGO (Latent Adversarial Reprogramming via Gradient Optimization) sidesteps the challenges of discrete prompt engineering by searching directly in embedding space and then leveraging the LLM’s own interpretive abilities to produce readable, benign-looking prompts. LARGO outperforms GCG, AutoDAN, and AdvPrompter by an average of 22.0%, 27.3%, and 57.8%, respectively.
Functional Homotopy
Functional Homotopy converts the discrete optimization problem behind LLM Jailbreak attacks into a continuous one. It exploits the duality between model training and input generation, constructing a series of easy-to-hard optimization problems and solving them progressively instead of attacking the discrete token space directly.
4. Black-Box & Transfer Attacks
These methods work without direct access to model gradients, making them applicable to closed-source models. Popular examples of black-box and transfer attacks include PAL and PAIR.
PAL
PAL (Proxy-Guided Attack on Large Language Models) adapts GCG to black-box settings. It sidesteps GCG's need for gradients from the target model by computing gradients on an open-source proxy (surrogate) model, while the target model itself is queried only through its API.
PAIR
PAIR (Prompt Automatic Iterative Refinement) assumes only black-box access to the target model: an attacker LLM proposes a candidate prompt, observes the target's response and a judge's score, and iteratively refines the prompt based on that feedback. Because the resulting Jailbreaks are human-readable, PAIR is often considered a bridge between token-level and prompt-level approaches.
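The loop below sketches that refinement cycle. The three helper functions (`attacker_generate`, `query_target`, `judge_score`) are hypothetical stubs standing in for prompted attacker, target, and judge LLMs; the iteration budget and the 1-to-10 scoring threshold follow the general scheme described for PAIR, but the actual prompts and scoring are not reproduced here.

```python
def attacker_generate(goal, history):
    """Attacker LLM (stub): propose or refine a candidate prompt given prior feedback."""
    return f"[candidate prompt for goal '{goal}', attempt {len(history) + 1}]"

def query_target(prompt):
    """Target LLM (stub): black-box call that returns the model's response."""
    return "[target model response]"

def judge_score(goal, prompt, response):
    """Judge LLM (stub): rate 1-10 how fully the response achieves the goal."""
    return 1

def pair_attack(goal, max_iters=20, success_threshold=10):
    history = []                                   # (prompt, response, score) feedback
    for _ in range(max_iters):
        prompt = attacker_generate(goal, history)
        response = query_target(prompt)
        score = judge_score(goal, prompt, response)
        if score >= success_threshold:
            return prompt, response                # judged as a successful Jailbreak
        history.append((prompt, response, score))  # feed back for the next refinement
    return None, None

print(pair_attack("test objective"))
```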
5. Hybrid Token-Level Techniques
Hybrid methods combine token-level optimization with other approaches for improved effectiveness. Popular hybrid token-level attack algorithms include: AdvPrompter and ASETF.
AdvPrompter
AdvPrompter trains a separate LLM to generate human-readable adversarial suffixes, alternating between optimizing target suffixes and fine-tuning the generator on them. The resulting prompts are both effective and less detectable (for example, by perplexity filters) than the garbled strings produced by pure token-level search.
ASETF
ASETF (Adversarial Suffix Embedding Translation Framework) first optimizes an adversarial suffix in continuous embedding space and then translates the resulting embeddings into fluent, coherent text, yielding suffixes that are harder to filter than the unreadable strings typical of token-level attacks.
Thanks for reading!