
Defining Token-Level AI Jailbreaking Techniques

Token-level Jailbreaking optimizes the raw sequence of tokens fed into the LLM to elicit responses that violate the model’s intended behavior. Unlike prompt-level attacks that rely on semantic manipulation, token-level methods treat Jailbreaking as an optimization problem in the discrete token space.
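
To make the optimization view concrete, here is a minimal sketch of the quantity most token-level attacks try to maximize: the likelihood of a fixed target continuation given the prompt plus an adversarial suffix. It assumes a locally hosted Hugging Face causal LM; the model name and helper function are illustrative placeholders, not part of any specific attack.

```python
# Minimal sketch of the token-level objective: find suffix tokens that make a
# fixed target continuation as likely as possible. Model name is a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open-weight causal LM works for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def target_log_likelihood(prompt_ids, suffix_ids, target_ids):
    """log p(target | prompt + adversarial suffix) -- the attacker's objective."""
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    start = prompt_ids.numel() + suffix_ids.numel()   # first target position
    log_probs = torch.log_softmax(
        logits[start - 1 : start - 1 + target_ids.numel()], dim=-1
    )
    return log_probs.gather(1, target_ids.unsqueeze(1)).sum()
```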

Token-level Jailbreaking often requires hundreds or thousands of queries to breach model defenses, and the results are frequently less interpretable than those from prompt-level attacks. However, the capacity for automated, systematic exploration makes token-level techniques highly effective and scalable for identifying vulnerabilities in LLMs.

Now, let’s consider specific token-level attack algorithms. Note that while I’ve categorized the following based on their primary approach, some of these attacks combine multiple techniques.

1. Gradient-Based Optimization Attacks

Gradient-Based Optimization Attacks use gradient information from the model to systematically search for adversarial token sequences. Popular gradient-based attack algorithms include: GCG, AttnGCG, GBDA, and PEZ.

GCG

The Greedy Coordinate Gradient (GCG) attack is an automated method for Jailbreaking aligned LLMs. It uses a greedy coordinate descent strategy: gradients rank candidate token substitutions in an adversarial suffix, and at each step the substitution that most increases the likelihood of a target harmful response is kept.
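
Below is a heavily simplified sketch of one GCG iteration, reusing the model, tokenizer, and objective from the snippet above. Real implementations batch the candidate evaluation, filter invalid tokens, and run for hundreds of steps.

```python
# Simplified single GCG step: gradients w.r.t. a one-hot suffix pick promising
# substitutions, then a greedy search keeps whichever candidate lowers the loss.
import torch

def gcg_step(prompt_ids, suffix_ids, target_ids, top_k=256, n_candidates=64):
    embed = model.get_input_embeddings()               # vocab embedding matrix
    one_hot = torch.nn.functional.one_hot(
        suffix_ids, num_classes=embed.num_embeddings
    ).float()
    one_hot.requires_grad_(True)

    # Express the suffix as one_hot @ embeddings so each token "coordinate"
    # receives a gradient.
    inputs_embeds = torch.cat(
        [embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]
    start = prompt_ids.numel() + suffix_ids.numel()
    loss = torch.nn.functional.cross_entropy(
        logits[start - 1 : start - 1 + target_ids.numel()], target_ids,
        reduction="sum",
    )
    loss.backward()

    # Top-k most promising replacement tokens per suffix position.
    candidates = (-one_hot.grad).topk(top_k, dim=-1).indices   # (suffix_len, top_k)

    # Greedy step: try random (position, replacement) pairs, keep the best one.
    best_loss, best_suffix = loss.item(), suffix_ids.clone()
    for _ in range(n_candidates):
        pos = torch.randint(suffix_ids.numel(), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        cand_loss = -target_log_likelihood(prompt_ids, cand, target_ids).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix
```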

AttnGCG

AttnGCG is an enhanced version of GCG that additionally manipulates the model’s attention scores, steering attention toward the adversarial suffix to facilitate LLM Jailbreaking. Empirically, AttnGCG shows consistent gains in attack efficacy across diverse LLMs, with an average increase of roughly 7% on the Llama-2 series and 10% on the Gemma series.
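
The exact loss AttnGCG uses is defined in its paper; the sketch below only illustrates the general shape of the idea, combining the usual target loss with a reward for attention mass on the suffix. The 0.1 weight is an arbitrary placeholder, and a real implementation would route this through the one-hot parameterization above so it can be differentiated with respect to the suffix tokens.

```python
# Rough illustration of an AttnGCG-style objective: the usual target loss plus
# a bonus for attention flowing from later positions onto the adversarial suffix.
def attn_aware_loss(prompt_ids, suffix_ids, target_ids, attn_weight=0.1):
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    out = model(input_ids, output_attentions=True)
    start = prompt_ids.numel() + suffix_ids.numel()
    nll = torch.nn.functional.cross_entropy(
        out.logits[0, start - 1 : start - 1 + target_ids.numel()], target_ids,
        reduction="sum",
    )
    # Mean attention (last layer, all heads) from target positions to the suffix.
    attn = out.attentions[-1][0]                        # (heads, seq, seq)
    suffix_slice = slice(prompt_ids.numel(), start)
    attn_on_suffix = attn[:, start:, suffix_slice].mean()
    return nll - attn_weight * attn_on_suffix           # lower is better for the attacker
```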

GBDA

GBDA (Gradient-Based Distributional Attack) searches over a parameterized distribution of adversarial token sequences rather than a single fixed sequence, using the Gumbel-softmax trick to keep sampling differentiable so the distribution itself can be optimized with gradients.
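
A minimal sketch of that distributional idea, assuming the same model and tokenizer as above: learn a matrix of logits over the vocabulary at each suffix position and sample soft tokens via Gumbel-softmax. The paper’s fluency and similarity penalties are omitted, and the hyperparameters are illustrative.

```python
# Minimal GBDA-style sketch: optimize a distribution over suffix tokens, made
# differentiable with the Gumbel-softmax trick.
import torch

def gbda_optimize(prompt_ids, target_ids, suffix_len=20, steps=100, lr=0.3):
    embed = model.get_input_embeddings()
    # Logits parameterizing a distribution over tokens at each suffix position.
    theta = torch.zeros(suffix_len, embed.num_embeddings, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        # Differentiable sample: soft one-hot vectors mixed into the embeddings.
        soft_tokens = torch.nn.functional.gumbel_softmax(theta, tau=1.0, hard=False)
        inputs_embeds = torch.cat(
            [embed(prompt_ids), soft_tokens @ embed.weight, embed(target_ids)]
        ).unsqueeze(0)
        logits = model(inputs_embeds=inputs_embeds).logits[0]
        start = prompt_ids.numel() + suffix_len
        loss = torch.nn.functional.cross_entropy(
            logits[start - 1 : start - 1 + target_ids.numel()], target_ids
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.argmax(dim=-1)   # most likely suffix under the learned distribution
```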

PEZ

PEZ optimizes a continuous “soft” prompt with gradient descent while repeatedly projecting it onto the nearest real token embeddings, so the search happens in embedding space but the final result is a valid discrete prompt.
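
The sketch below mimics that projected-optimization pattern under the same setup as the earlier snippets: keep a continuous suffix, evaluate the loss at its nearest-token projection, and push the gradient back onto the continuous copy via a straight-through trick. Hyperparameters are illustrative.

```python
# Rough sketch of PEZ-style optimization: a continuous suffix is optimized, but
# the loss is evaluated at its projection onto real token embeddings, so the
# final prompt is always a valid token sequence.
import torch

def nearest_tokens(soft, embed_weight):
    """Index of the closest vocabulary embedding for each soft vector."""
    return torch.cdist(soft, embed_weight).argmin(dim=-1)

def pez_optimize(prompt_ids, target_ids, suffix_len=20, steps=100, lr=0.1):
    embed = model.get_input_embeddings()
    init_ids = torch.randint(embed.num_embeddings, (suffix_len,))
    soft = embed.weight[init_ids].detach().clone()
    soft.requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)
    for _ in range(steps):
        hard_ids = nearest_tokens(soft.detach(), embed.weight.detach())
        # Straight-through: forward pass uses the projected tokens,
        # but the gradient lands on `soft`.
        hard = embed(hard_ids).detach() + (soft - soft.detach())
        inputs_embeds = torch.cat(
            [embed(prompt_ids), hard, embed(target_ids)]
        ).unsqueeze(0)
        logits = model(inputs_embeds=inputs_embeds).logits[0]
        start = prompt_ids.numel() + suffix_len
        loss = torch.nn.functional.cross_entropy(
            logits[start - 1 : start - 1 + target_ids.numel()], target_ids
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return nearest_tokens(soft.detach(), embed.weight.detach())   # final discrete prompt
```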

2. Evolutionary & Search-Based Algorithms

These methods discover effective adversarial token sequences without requiring gradient information. Popular attack algorithms include: AutoDAN, AmpleGCG, and AmpleGCG-Plus.

AutoDAN

AutoDAN uses handcrafted Jailbreak prompts, such as DAN (“Do Anything Now”), as the starting point for its prompt optimization, and employs a hierarchical genetic algorithm that refines prompts at both the sentence and word levels while preserving grammatical and semantic integrity.
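
The toy loop below only illustrates that two-level structure (sentence-level crossover, word-level mutation) on top of the likelihood objective from the first snippet. The mutation operator here is a crude placeholder; the real AutoDAN uses meaning-preserving rewrites.

```python
# Toy sketch of an AutoDAN-like hierarchical genetic search. Reuses `tok` and
# `target_log_likelihood` from the first snippet.
import random
import torch

def crossover(parent_a: str, parent_b: str) -> str:
    """Sentence-level crossover: splice sentences from two parent prompts."""
    sa, sb = parent_a.split(". "), parent_b.split(". ")
    cut_a, cut_b = random.randint(1, len(sa)), random.randint(0, len(sb) - 1)
    return ". ".join(sa[:cut_a] + sb[cut_b:])

def mutate(prompt: str) -> str:
    """Word-level mutation placeholder: drop a random word."""
    words = prompt.split()
    if len(words) > 1:
        del words[random.randrange(len(words))]
    return " ".join(words)

def fitness(prompt: str, target_ids) -> float:
    ids = torch.tensor(tok.encode(prompt))
    empty_suffix = torch.tensor([], dtype=torch.long)
    return target_log_likelihood(ids, empty_suffix, target_ids).item()

def autodan_like_search(seed_prompts, target_ids, generations=50, pop_size=16):
    population = list(seed_prompts)        # e.g. handcrafted DAN-style prompts
    best = max(population, key=lambda p: fitness(p, target_ids))
    for _ in range(generations):
        scored = sorted(population, key=lambda p: fitness(p, target_ids), reverse=True)
        best = scored[0]
        elites = scored[: max(2, pop_size // 4)]
        children = [
            mutate(crossover(*random.sample(elites, 2)))
            for _ in range(pop_size - len(elites))
        ]
        population = elites + children
    return best
```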

AmpleGCG

AmpleGCG is a generative model trained on successful adversarial suffixes harvested from GCG runs; it produces universal, transferable suffixes for Jailbreaking both open and closed LLMs.

AmpleGCG-Plus

AmpleGCG-Plus is a stronger generative model of adversarial suffixes for Jailbreaking LLMs; it improves on the original AmpleGCG by reaching higher attack success rates in fewer attempts.

3. Optimization in Continuous Spaces

These techniques optimize in continuous representation spaces before converting back to discrete tokens. Algorithms in this category of token-level Jailbreaking attacks include LARGO and Functional Homotopy.

LARGO

LARGO (Latent Adversarial Reprogramming via Gradient Optimization) sidesteps the challenges of discrete prompt engineering by searching directly in embedding space and then leveraging the LLM’s own interpretive abilities to produce readable, benign-looking prompts. LARGO outperforms GCG, AutoDAN, and AdvPrompter by an average of 22.0%, 27.3%, and 57.8%, respectively.

Functional Homotopy

Functional Homotopy converts discrete optimization problems into continuous ones for LLM Jailbreak attacks.

4. Black-Box & Transfer Attacks

These methods work without direct access to model gradients, making them applicable to closed-source models. Popular examples of black-box and transfer attacks include PAL and PAIR.

PAL

PAL (Proxy-Guided Attack on LLMs) adapts GCG to black-box settings. Because closed models expose no gradients, PAL uses an open-weight surrogate (proxy) model to supply the gradient signal and then scores the resulting candidates against the target.
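
Very schematically, and reusing the local proxy setup from the GCG snippet, the loop looks like the sketch below. The `query_target_score` helper is a hypothetical stand-in for whatever scoring signal (e.g. returned logprobs) a closed API exposes, and the real PAL adds proxy fine-tuning and careful candidate batching.

```python
# Schematic PAL-style loop: the local proxy proposes suffix candidates via
# gradient-guided search, and the black-box target picks the winner.
import torch

def query_target_score(prompt_text: str) -> float:
    """Hypothetical wrapper around a closed model's API (e.g. target logprobs)."""
    raise NotImplementedError

def pal_like_step(prompt_ids, suffix_ids, target_ids, n_proposals=8):
    # 1) Propose candidates with gradients from the open-weight proxy (gcg_step above).
    proposals = [gcg_step(prompt_ids, suffix_ids, target_ids) for _ in range(n_proposals)]
    # 2) Keep whichever candidate the black-box target scores highest.
    return max(
        proposals,
        key=lambda s: query_target_score(tok.decode(torch.cat([prompt_ids, s]))),
    )
```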

PAIR

PAIR (Prompt Automatic Iterative Refinement) assumes only black-box access to the target: an attacker LLM iteratively proposes and refines candidate prompts based on the target’s responses. Because it produces human-readable Jailbreaks, it is often viewed as a bridge between token-level and prompt-level approaches.
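
At a high level the loop looks like the sketch below; the three chat helpers are hypothetical wrappers around whichever models play the attacker, target, and judge roles, with the judge returning a 1-10 rating as in the paper.

```python
# High-level sketch of a PAIR-style refinement loop. `attacker_chat`,
# `target_chat`, and `judge_chat` are hypothetical wrappers around the models
# playing each role.
def pair_like_attack(goal: str, max_rounds: int = 20, success_score: int = 10):
    history = []                                                 # feedback for the attacker
    for _ in range(max_rounds):
        candidate = attacker_chat(goal=goal, history=history)    # propose / refine a prompt
        response = target_chat(candidate)                        # black-box target output
        score = judge_chat(goal, candidate, response)            # 1-10 jailbreak rating
        if score >= success_score:
            return candidate, response
        history.append((candidate, response, score))
    return None
```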

5. Hybrid Token-Level Techniques

Hybrid methods combine token-level optimization with other approaches for improved effectiveness. Popular hybrid token-level attack algorithms include: AdvPrompter and ASETF.

AdvPrompter

AdvPrompter trains a separate LLM to rapidly generate human-readable adversarial suffixes, combining LLM-based generation with token-level optimization so the resulting prompts are both effective and harder to detect than pure token-level suffixes.

ASETF

ASETF (Adversarial Suffix Embedding Translation Framework) translates continuous adversarial suffix embeddings into coherent, readable text, making the resulting suffixes more fluent and transferable.

Thanks for reading!