Defining The Prompt-Level AI Jailbreaking Techniques

Prompt-level attacks are considered social-engineering-based, semantically meaningful prompts which elicit objectionable content from LLMs, distinguishing them from token-level attacks that use mathematical optimization of raw token sequences.

Now, let’s consider specific prompt-level attack algorithms. Note that while I’ve categorized the following based on their primary approach, some of these attacks combine multiple techniques.

1. Role-Playing & Persona-Based Attacks

These attacks involve asking the AI to play a specific role or persona to bypass existing guardrails for divulging sensitive or protected information and exploit the model’s ability to simulate different characters or entities, using role-play scenarios to circumvent safety measures. Examples of these role-playing and persona-based attacks include DAN (Do Anything Now), God-Mode, and Bad Likert Judge.

DAN (Do Anything Now)

DAN (Do Anything Now) represents conversational exploits where the attacker instructs the model to act as an unrestricted version of itself that can “do anything now” without following its usual safety guidelines.

God-Mode

God-Mode involves role-playing prompts that frame the AI as an omnipotent entity without limitations, encouraging it to bypass restrictions by assuming a position of absolute authority or capability.

Bad Likert Judge

Bad Likert Judge manipulates LLMs to generate harmful content using Likert scales, exploiting the model’s evaluation capabilities by asking it to rate harmful content on a scale, gradually normalizing the production of such content.

2. Language & Obfuscation Strategies

These techniques involve changing the language in which the prompt is written or using complex language to confuse the AI, disguising harmful requests through linguistic manipulation, encoding, or visual representations. Examples of specific attack algorithms using these strategies include ASCII Art attacks, CodeChameleon, and CodeAttack.

ASCII Art attacks

ASCII Art attacks involve attacking LLMs and toxicity detection systems with ASCII Art to mask profanity, using text-based visual representations to hide offensive content from safety filters while remaining interpretable to the model.

CodeChameleon

CodeChameleon is a personalized encryption framework for jailbreaking LLMs that employs customized encoding or encryption methods to disguise harmful prompts as seemingly innocent code or ciphered text.

CodeAttack

CodeAttack focuses on revealing safety generalization challenges of LLMs via code completion, exploiting the model’s code generation capabilities by embedding harmful instructions within programming contexts where safety measures may be relaxed.

3. Context Manipulation Techniques

Context manipulation involves distracting the AI to bypass its safety protocols or manipulating the conversational context to make harmful requests appear legitimate. These attacks work by shifting the model’s attention or framing requests within acceptable contexts. Let’s briefly consider context manipulation algorithms such as Semantic Mirror Jailbreak, ReNeLLM, and DRA.

Semantic Mirror Jailbreak

Semantic Mirror Jailbreak uses genetic algorithm based Jailbreak prompts against open-source LLMs, employing evolutionary algorithms to systematically discover prompt variations that semantically mirror harmful requests while evading detection.

ReNeLLM

ReNeLLM, described as a “wolf in sheep’s clothing” uses generalized nested Jailbreak prompts to fool LLMs easily; it embeds harmful requests within multiple layers of seemingly benign context, like nesting a malicious query inside an innocent story or scenario.

DRA

DRA (Disguise and Reconstruction) involves splitting harmful requests into disguised components that are reconstructed by the model itself through guided questioning.

4. Gradual/Progressive Techniques

Gradual techniques involve progressively guiding the LLM towards sensitive topics through conversation, asking questions that subtly encourage the LLM to push boundaries and potentially violate norms. These attacks slowly escalate from innocent to harmful content. Let’s define popular progressive attack algorithms such as SequentialBreak, Chain-of-Lure, and Siren.

SequentialBreak

SequentialBreak demonstrates how LLMs can be fooled by placing harmful instructions within a sequence of legitimate requests and exploiting the model’s tendency to maintain consistency across sequential prompts.

Chain-of-Lure

Chain-of-Lure presents a synthetic narrative-driven approach to compromise LLMs, creating compelling narrative chains that gradually lead the model from safe to unsafe territory through storytelling techniques.

Siren

Siren is A learning-based multi-turn attack framework for simulating real-world human Jailbreak behaviors, mimicking natural human conversation patterns to gradually break down the model’s defenses over multiple interaction turns.

5. Social Engineering Approaches

These attacks exploit psychological manipulation tactics similar to human social engineering, tricking the AI into ignoring its refusal protocols through persuasion, authority claims, or emotional manipulation. Let’s take a look at the following social engineering approaches: How Johnny Can Persuade, GUARD, and DrAttack.

How Johnny Can Persuade

How Johnny Can Persuade involves using human psychological persuasion techniques to convince the model to comply with harmful requests by appealing to empathy or logic.

GUARD

GUARD tests guideline adherence by systematically generating role-playing scenarios designed to test and exploit weaknesses in model safety guidelines.

DrAttack

DrAttack demonstrates decomposing harmful prompts into seemingly innocent components that are then reconstructed to bypass safety measures.

6. Special Token & Instruction Manipulation

Special Token and Instruction Manipulation attacks utilize special tokens that are typically used during the training phase of LLMs, tricking the LLM into treating parts of the input as if it were its own output. They exploit the model’s instruction-following mechanisms or training artifacts. Algorithms in this category of prompt-level jailbreaking attacks include Self Reminder attacks, Indirect Prompt Injection, and JVD (Jailbreak via Decoding).

Self Reminder attacks

Self Reminder attacks trick the model by making it appear as if it has already agreed to, or started complying with, a harmful request in a previous (fictitious) part of the conversation.

Indirect Prompt Injection

Indirect Prompt Injection focuses on injecting malicious instructions through external content that the model processes, rather than direct user input.

JVD (Jailbreak via Decoding)

JVD (Jailbreak via Decoding) involves probing the safety response boundary of LLMs via Unsafe Decoding Path Generation, exploiting the model’s decoding process to generate harmful content by manipulating the generation pathway.

7. Multi-Turn Strategies

Multi-turn attacks require multiple conversational exchanges to succeed, building context and trust over several interactions before introducing harmful requests. Agent Smith is the most popular multi-turn attack algorithm.

Agent Smith

Agent Smith demonstrates how a single compromised agent can spread Jailbreak attacks exponentially across multiple LLM instances in multi-agent systems.

8. Advanced/Hybrid Techniques

These sophisticated attacks combine multiple strategies or use advanced computational methods like reinforcement learning or optimization algorithms to discover effective Jailbreaks. Example techniques include PathSeeker and AdaPPA.

PathSeeker

PathSeeker involves exploring LLM security vulnerabilities with a reinforcement learning-based Jailbreak approach, using reinforcement learning to automatically discover optimal paths through the model’s decision space that lead to jailbreak success.

AdaPPA

AdaPPA presents an “Adaptive Position Pre-Fill Jailbreak Attack” targeting LLMs, adaptively positioning malicious content within prompts based on the model’s attention patterns and position biases.

Thanks for reading!