A person with a yellow helmet watches a police lineup with two men standing in front of a height chart behind glass.

The Big List Of AI Jailbreaking References And Resources

06/08/2025|Brian D. Colwell|Artificial Intelligence

Executive Summary

This curated collection of references and resources serves as a comprehensive research repository, bringing together academic papers, industry analyses, and empirical evaluations that illuminate the cat-and-mouse game between those who build AI safety mechanisms and those who seek to circumvent them. The materials assembled here span the full spectrum of this dynamic threat domain: from foundational attack techniques and gradient-based optimization methods, to multi-modal exploits, defense frameworks, and the philosophical questions surrounding what constitutes “safe” AI behavior.

Several critical themes emerge from this body of research, revealing the multifaceted nature of the jailbreaking problem:

Multi-Modal & New Defenses Expand Attack Landscape

First, the multi-modal dimension has opened entirely new attack surfaces. Vision-language models face unique vulnerabilities through adversarial images, typographic attacks using ASCII art, embedded instructions in visual prompts, and cross-modal inconsistencies that allow attackers to hide malicious intent in one modality while presenting benign content in another. Audio language models face similar challenges through acoustic adversarial examples and multilingual accent-based exploits.

Next, new attacks beget new defenses beget new attacks, and the jailbreaking landscape reveals a classic security arms race. Early defenses relied on simple keyword filtering and pattern matching, which were quickly defeated by obfuscation techniques, encoding schemes, and linguistic creativity. More sophisticated defenses emerged: embedding-based detection systems, uncertainty-driven defense mechanisms, gradient analysis of safety-critical features, and architectures that separate reasoning about safety from content generation itself.

Multi-Agent & Automation Expand Attack Sophistication

The multi-agent dimension has expanded attack sophistication. As LLMs evolve from isolated question-answering systems into autonomous agents with tool access, memory, and the ability to execute actions across multiple turns, the jailbreaking problem expands exponentially in complexity. Multi-agent systems introduce additional vulnerabilities: prompt infection attacks where one compromised agent can spread jailbreak behaviors to others, cross-agent exploitation where benign-seeming agents can be combined to produce harmful outputs, and emergent behaviors in agent interactions that bypass individual agent safeguards.

And, the democratization of jailbreaking capabilities through automation has accelerated dramatically. What once required expert knowledge of model internals and optimization techniques can now be accomplished through automated red-teaming frameworks, genetic algorithms, and even LLM-based jailbreak generators that can create novel attack prompts with minimal human intervention.

“Jailbreak Tax” Proves Unsolvable

Finally, the tension between safety and utility has proven fundamental and perhaps unsolvable. Many defenses against jailbreaking—such as aggressive content filtering, conservative refusal training, or strict output constraints—come at the cost of reduced helpfulness, increased false positives, and degraded performance on legitimate edge cases. This “jailbreak tax” represents a fundamental trade-off that researchers and developers must carefully navigate.

Introduction

Jailbreaking represents one of the most persistent and evolving challenges in the safety and alignment of large language models. Unlike prompt injection—which exploits the architectural vulnerability of mixing trusted instructions with untrusted data—jailbreaking attacks directly target the safety guardrails that developers have carefully constructed through alignment training, fine-tuning, and reinforcement learning from human feedback (RLHF).

Whether you’re a security researcher investigating LLM vulnerabilities, a developer implementing safety measures in AI-integrated applications, or a practitioner seeking to understand the risk landscape, this resource provides essential context for navigating one of artificial intelligence‘s most pressing security and ethical challenges.

Reader note – you may also be interested in these other articles on artificial intelligence:

A Brief Introduction To AI Model Inversion Attacks – https://briandcolwell.com/a-brief-introduction-to-ai-model-inversion-attacks/
A Brief Introduction To AI Prompt Injection Attacks – https://briandcolwell.com/a-brief-introduction-to-ai-prompt-injection-attacks/
A History Of AI Jailbreaking Attacks – https://briandcolwell.com/a-history-of-ai-jailbreaking-attacks/
A History Of Clean-Label AI Data Poisoning Backdoor Attacks – https://briandcolwell.com/a-history-of-clean-label-ai-data-poisoning-attacks/
AI Supply Chain Attacks Are A Pervasive Threat – https://briandcolwell.com/ai-supply-chain-attacks-are-a-pervasive-threat/
An Introduction To AI Model Extraction – https://briandcolwell.com/an-introduction-to-ai-model-extraction/
An Introduction To AI Side-Channel Attacks – https://briandcolwell.com/an-introduction-to-ai-side-channel-attacks/
Gradient And Update Leakage (GAUL) In Federated Learning – https://briandcolwell.com/gradient-and-update-leakage-gaul-in-federated-learning/
Membership Inference Attacks Leverage AI Model Behaviors – https://briandcolwell.com/membership-inference-attacks-leverage-ai-model-behaviors/
What Are AI Sensitive Information Disclosure Attacks? The Threat Landscape – https://briandcolwell.com/what-are-ai-sensitive-information-disclosure-attacks/
What Is AI Training Data Extraction? A Combination Of Techniques – https://briandcolwell.com/what-is-ai-training-data-extraction-a-combination-of-techniques/
What Is Model Leeching? – https://briandcolwell.com/what-is-model-leeching/

The Big List Of AI Jailbreaking References And Resources

This research corpus reveals several particularly concerning attack vectors: fine-tuning attacks that can remove safety guardrails in minutes, backdoor attacks that embed hidden jailbreak triggers during training, model editing techniques that surgically alter safety-relevant knowledge, and “best-of-N” attacks that simply generate many outputs and select the least safe one—a troublingly effective strategy that highlights vulnerabilities in probabilistic safety guarantees.

Note that the below are in alphabetical order by title. Please let me know if there are any sources you would like to see added to this list. Enjoy!

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models – https://arxiv.org/abs/2312.10982
A Cross-Language Investigation into Jailbreak Attacks in Large Language Models – https://arxiv.org/abs/2401.16765
A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses – https://arxiv.org/abs/2407.02551
A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares – https://arxiv.org/abs/2408.05061
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos – https://arxiv.org/abs/2502.15806
A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection – https://arxiv.org/abs/2312.10766
A StrongREJECT for Empty Jailbreaks – https://arxiv.org/abs/2402.10260
A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations – https://arxiv.org/abs/2502.14881
A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily – https://arxiv.org/abs/2311.08268
AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs – https://arxiv.org/abs/2409.07503
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting – https://arxiv.org/abs/2403.09513
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender – https://arxiv.org/abs/2504.09466
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models – https://arxiv.org/abs/2408.14866
Adversarial Attacks on GPT-4 via Simple Random Search – https://www.andriushchenko.me/gpt4adv.pdf
Adversarial Attacks on LLMs – https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
Adversarial Attacks on Large Language Models Using Regularized Relaxation – https://arxiv.org/abs/2410.19160
Adversarial demonstration attacks on large language models. – https://arxiv.org/abs/2305.14950
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs – https://arxiv.org/abs/2502.15427
Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
Adversarial Suffixes May Be Features Too! – https://arxiv.org/abs/2410.00451
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs – https://arxiv.org/pdf/2406.06622
Adversaries Can Misuse Combinations of Safe Models – https://arxiv.org/abs/2406.14595
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training – https://arxiv.org/abs/2502.11455
AdvPrefix: An Objective for Nuanced LLM Jailbreaks – https://arxiv.org/abs/2412.10321
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs – https://arxiv.org/abs/2404.16873
AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models – https://arxiv.org/abs/2412.08608
AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents – https://arxiv.org/abs/2410.17401
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts – https://arxiv.org/abs/2404.05993
AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models – https://arxiv.org/abs/2412.18123
Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models – https://arxiv.org/abs/2404.00629
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds – https://arxiv.org/abs/2502.00757
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification – https://arxiv.org/abs/2503.11185
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models – https://arxiv.org/abs/2506.01307
All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks – https://arxiv.org/abs/2401.09798
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs – https://arxiv.org/abs/2404.07921
Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate – https://arxiv.org/abs/2504.16489
Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak – https://arxiv.org/abs/2312.04127
Antelope: Potent and Concealed Jailbreak Attack Strategy – https://arxiv.org/abs/2412.08156
Are PPO-ed Language Models Hackable? – https://arxiv.org/abs/2406.02577
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts – https://arxiv.org/abs/2407.15050
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs – https://arxiv.org/abs/2402.11753
Attack Prompt Generation for Red Teaming and Defending Large Language Models – https://arxiv.org/abs/2310.12505
AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models – https://arxiv.org/abs/2401.09002
Attacking Large Language Models with Projected Gradient Descent – https://arxiv.org/abs/2402.09154
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models – https://arxiv.org/abs/2505.14103
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models – https://arxiv.org/abs/2501.01830
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs – https://arxiv.org/abs/2410.05295
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models – https://openreview.net/forum?id=7Jwpw4qKkb
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models – https://arxiv.org/abs/2310.15140
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks – https://arxiv.org/abs/2403.04783
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens – https://arxiv.org/abs/2406.03805
Automatic Jailbreaking of the Text-to-Image Generative AI Systems – https://arxiv.org/abs/2405.16567
Automatically Auditing Large Language Models via Discrete Optimization – https://arxiv.org/abs/2303.04381
AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models – https://arxiv.org/abs/2505.10846
Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models – https://arxiv.org/abs/2410.14479
BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge – https://arxiv.org/abs/2503.00596
Badllama 3: removing safety finetuning from Llama 3 in minutes – https://arxiv.org/abs/2407.01376
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs – https://arxiv.org/abs/2406.09324
BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs – https://arxiv.org/abs/2412.05892
Baseline Defenses for Adversarial Attacks Against Aligned Language Models – https://arxiv.org/abs/2309.00614
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger – https://arxiv.org/abs/2408.09093
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards – https://arxiv.org/abs/2406.01364
Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs – https://arxiv.org/abs/2502.19041
Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models – https://arxiv.org/abs/2502.19883
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage – https://arxiv.org/abs/2506.02479
BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.09804
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement – https://arxiv.org/abs/2402.15180
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space – https://arxiv.org/abs/2505.21277
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails – https://arxiv.org/abs/2504.11168
Can a large language model be a gaslighter? – https://arxiv.org/abs/2410.10700
Can Large Language Models Automatically Jailbreak GPT-4V? – https://arxiv.org/abs/2407.16686
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent – https://arxiv.org/abs/2405.03654
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation – https://arxiv.org/abs/2503.06519
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation – https://openreview.net/forum?id=r42tSSCHPh
CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models – https://arxiv.org/abs/2502.11379
Certifying LLM Safety against Adversarial Prompting – https://arxiv.org/abs/2309.02705
Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM – https://arxiv.org/abs/2405.05610
Chain-of-Attack: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.03869
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models – https://arxiv.org/abs/2505.17519
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion – https://aclanthology.org/2024.findings-acl.679/
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models – https://arxiv.org/abs/2402.16717
Coercing LLMs to do and reveal (almost) anything – https://arxiv.org/abs/2402.14020
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability – https://arxiv.org/abs/2402.08679
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs – https://arxiv.org/abs/2404.14461
Comprehensive Assessment of Jailbreak Attacks Against LLMs – https://arxiv.org/abs/2402.05668
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities – https://arxiv.org/abs/2506.00548
Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI – https://arxiv.org/abs/2504.13201
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming – https://arxiv.org/abs/2501.18837
Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models – https://arxiv.org/abs/2407.13796
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation – https://arxiv.org/abs/2406.20053
Cross-Modal Safety Alignment: Is textual unlearning all you need? – https://arxiv.org/abs/2406.02575
Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models – https://arxiv.org/abs/2405.20775
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety – https://arxiv.org/abs/2405.15202
Dark LLMs: The Growing Threat of Unaligned AI Models – https://arxiv.org/abs/2505.10066
DART: Deep Adversarial Automated Red Teaming for LLM Safety – https://arxiv.org/abs/2407.03876
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation – https://arxiv.org/abs/2410.11317
DeepInception: Hypnotize Large Language Model to Be Jailbreaker – https://arxiv.org/abs/2311.03191
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking via Prompt Evaluation – https://arxiv.org/abs/2502.00580
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM – https://arxiv.org/abs/2309.14348
Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
Defending Jailbreak Attack in VLMs via Cross-modality Information Detector – https://arxiv.org/abs/2407.21659
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis – https://arxiv.org/abs/2406.03230
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing – https://arxiv.org/abs/2405.18166
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing – https://arxiv.org/abs/2402.16192
Defending LLMs against Jailbreaking Attacks via Backtranslation – https://aclanthology.org/2024.findings-acl.948/
Defending LVLMs Against Vision Attacks through Partial-Perception Supervision – https://arxiv.org/abs/2412.12722
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks – https://arxiv.org/abs/2405.20099
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing – https://arxiv.org/abs/2502.11647
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues – https://arxiv.org/abs/2410.10700
Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models – https://arxiv.org/abs/2408.14853
Detecting Language Model Attacks with Perplexity – https://arxiv.org/abs/2308.14132
Detoxifying Large Language Models via Knowledge Editing – https://arxiv.org/abs/2403.14472
‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models – https://arxiv.org/abs/2308.03825
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? – https://arxiv.org/abs/2504.10000
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? – https://arxiv.org/abs/2405.05904
Does Refusal Training in LLMs Generalize to the Past Tense? – https://arxiv.org/abs/2407.11969
Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models – https://arxiv.org/abs/2403.17336
Don’t Say No: Jailbreaking LLM by Suppressing Refusal – https://arxiv.org/abs/2404.16369
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers – https://arxiv.org/abs/2402.16914
DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization – https://arxiv.org/abs/2504.18564
EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models – https://arxiv.org/abs/2408.11308
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector – https://arxiv.org/abs/2410.22888
Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs – https://arxiv.org/abs/2409.14866
Efficient Adversarial Training in LLMs with Continuous Attacks – https://arxiv.org/abs/2405.15589
Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content – https://arxiv.org/abs/2502.20952
EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models – https://arxiv.org/abs/2502.14976
Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks – https://arxiv.org/abs/2409.00137
Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization – https://arxiv.org/abs/2502.09755
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning – https://arxiv.org/abs/2501.19180
EnJa: Ensemble Jailbreak on Large Language Models – https://arxiv.org/abs/2408.03603
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge – https://arxiv.org/abs/2404.05880
Evil Geniuses: Delving into the Safety of LLM-based Agents – https://arxiv.org/abs/2311.11855
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking – https://arxiv.org/abs/2502.13527
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion – https://arxiv.org/abs/2505.14316
Exploring Scaling Trends in LLM Robustness – https://arxiv.org/abs/2407.18213
ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content – https://arxiv.org/abs/2503.09964
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models – https://arxiv.org/abs/2410.15362
FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts – https://arxiv.org/abs/2502.21059
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs – https://arxiv.org/abs/2410.16327
Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2407.16205
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks – https://arxiv.org/abs/2412.07672
FlipAttack: Jailbreak LLMs via Flipping – https://arxiv.org/abs/2410.02832
from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors – https://arxiv.org/abs/2503.00038
From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy – https://ieeexplore.ieee.org/abstract/document/10198233
From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs – https://arxiv.org/abs/2502.00735
From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings – https://arxiv.org/abs/2402.16006
From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks – https://arxiv.org/abs/2410.04234
FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models – https://arxiv.org/abs/2309.05274
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs – https://arxiv.org/abs/2411.14133
GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance – https://arxiv.org/abs/2505.23839
Geneshift: Impact of different scenario shift on Jailbreaking LLM – https://arxiv.org/abs/2504.08104
Goal-guided Generative Prompt Injection Attack on Large Language Models – https://arxiv.org/abs/2404.07234
Goal-Oriented Prompt Attack and Safety Evaluation for LLMs – https://arxiv.org/abs/2309.11830
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation – https://arxiv.org/abs/2405.13077
GPT-4V Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes – https://arxiv.org/abs/2403.00867
GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis – https://aclanthology.org/2024.acl-long.30/
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation – https://arxiv.org/abs/2501.18638
Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs – https://arxiv.org/abs/2504.19019
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms – https://arxiv.org/abs/2504.13052
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack – https://arxiv.org/abs/2404.01833
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models – https://arxiv.org/abs/2402.03299
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning – https://arxiv.org/abs/2505.11049
GuardReasoner: Towards Reasoning-based LLM Safeguards – https://arxiv.org/abs/2501.18492
GuidedBench: Equipping Jailbreak Evaluation with Guidelines – https://arxiv.org/abs/2502.16903
h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment – https://arxiv.org/abs/2408.04811
Hacc-Man: An Arcade Game for Jailbreaking LLMs – https://arxiv.org/abs/2405.15902
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal – https://arxiv.org/abs/2402.04249
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.04190
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models – https://arxiv.org/abs/2412.05934
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles – https://arxiv.org/abs/2408.11182
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States – https://arxiv.org/abs/2406.05644
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries – https://arxiv.org/abs/2402.15302
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation – https://arxiv.org/abs/2502.14486
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs – https://arxiv.org/abs/2401.06373
HSF: Defending against Jailbreak Attacks with Hidden State Filtering – https://arxiv.org/abs/2409.03788
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything – https://arxiv.org/abs/2407.02534
Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models – https://arxiv.org/abs/2403.09792
ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment – https://arxiv.org/abs/2411.18688
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models – https://arxiv.org/abs/2407.15399
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses – https://arxiv.org/abs/2406.01288
Improved Generation of Adversarial Examples Against Safety-aligned LLMs – https://arxiv.org/abs/2405.20778
Improved Large Language Model Jailbreak Detection via Pretrained Embeddings – https://arxiv.org/abs/2412.01547
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models – https://arxiv.org/abs/2405.21018
Improving Alignment and Robustness with Short Circuiting – https://arxiv.org/abs/2406.04313
Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration – https://arxiv.org/abs/2505.17066
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models – https://arxiv.org/abs/2411.16769
Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems – https://arxiv.org/abs/2504.20376
Increased LLM Vulnerabilities from Fine-tuning and Quantization – https://arxiv.org/abs/2404.04392
Injecting Universal Jailbreak Backdoors into LLMs in Minutes – https://arxiv.org/abs/2502.10438
Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender – https://arxiv.org/abs/2401.06561
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment – https://arxiv.org/abs/2402.14016
Is the System Message Really Important to Jailbreaks in Large Language Models? – https://arxiv.org/abs/2402.14857
JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models – https://arxiv.org/abs/2410.02298
Jailbreak Attacks and Defenses Against Large Language Models: A Survey – https://arxiv.org/abs/2407.04295
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models – https://arxiv.org/abs/2404.01318
Jailbreak Distillation: Renewable Safety Benchmarking – https://arxiv.org/abs/2505.22037
JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models – https://arxiv.org/abs/2406.09321
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models – https://openreview.net/forum?id=plmBsXHxgR
Jailbreak is Best Solved by Definition – https://arxiv.org/abs/2403.14725
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit – https://arxiv.org/abs/2411.11114
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models – https://arxiv.org/abs/2404.08793
JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
Jailbreak Open-Sourced Large Language Models via Enforced Decoding – https://aclanthology.org/2024.acl-long.299/
Jailbreak Paradox: The Achilles’ Heel of LLMs – https://arxiv.org/abs/2406.12702
Jailbreak Prompt Attack: A Controllable Adversarial Attack against Diffusion Models – https://arxiv.org/abs/2404.02928
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt – https://arxiv.org/abs/2406.04031
Jailbreaking Attack against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
Jailbreaking Black Box Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks – https://arxiv.org/abs/2503.01395
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts – https://arxiv.org/abs/2311.09127
Jailbreaking is Best Solved by Definition – https://arxiv.org/abs/2403.14725
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters – https://arxiv.org/abs/2405.20413
Jailbreaking Large Language Models in Infinitely Many Ways – https://arxiv.org/abs/2501.10800
Jailbreaking Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency – https://arxiv.org/abs/2501.04931
Jailbreaking Proprietary Large Language Models using Word Substitution Cipher – https://arxiv.org/abs/2402.10601
Jailbreaking Safeguarded Text-to-Image Models via Large Language Models – https://arxiv.org/abs/2503.01839
Jailbreaking Text-to-Image Models with LLM-Based Agents – https://arxiv.org/abs/2408.00523
Jailbreaking with Universal Multi-Prompts – https://arxiv.org/abs/2502.01154
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models – https://arxiv.org/abs/2407.01599
Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/abs/2307.02483
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model – https://arxiv.org/abs/2504.03770
JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs – https://arxiv.org/abs/2412.15623
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift – https://arxiv.org/abs/2504.19440
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models – https://arxiv.org/abs/2505.17568
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing – https://arxiv.org/abs/2503.08990
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation – https://arxiv.org/abs/2502.07557
JULI: Jailbreak Large Language Models by Self-Introspection – https://arxiv.org/abs/2505.11790
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs – https://arxiv.org/abs/2502.05223
Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack – https://arxiv.org/abs/2406.11682
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs – https://arxiv.org/abs/2505.10838
Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models – https://arxiv.org/abs/2307.08487
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense – https://arxiv.org/abs/2501.02629
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks – https://arxiv.org/abs/2402.09177
LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities – https://arxiv.org/abs/2505.05619
LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? – https://arxiv.org/abs/2307.10719
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet – https://arxiv.org/abs/2408.15221
LLM Jailbreak Attack versus Defense Techniques — A Comprehensive Study – https://arxiv.org/abs/2402.13457
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2501.00055
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper – https://arxiv.org/abs/2402.15727
Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation – https://arxiv.org/abs/2405.13068
Low-Resource Languages Jailbreak GPT-4 – https://arxiv.org/abs/2310.02446
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2502.09723
Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction – https://arxiv.org/abs/2402.18104
Many-shot Jailbreaking – https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming – https://arxiv.org/abs/2311.07689
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
Merging Improves Self-Critique Against Jailbreak Attacks – https://arxiv.org/abs/2406.07188
Metaphor-based Jailbreaking Attacks on Text-to-Image Models – https://arxiv.org/abs/2503.17987
Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking – https://arxiv.org/abs/2504.05838
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks – https://arxiv.org/abs/2503.19134
MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting – https://arxiv.org/abs/2503.12931
Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment – https://arxiv.org/abs/2402.14968
Mitigating Many-Shot Jailbreaking – https://arxiv.org/abs/2504.09604
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models – https://arxiv.org/abs/2406.07594
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models – https://arxiv.org/abs/2408.08464
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models – https://arxiv.org/abs/2412.08201
MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks – https://arxiv.org/abs/2409.17699
“Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models – https://arxiv.org/abs/2411.16730
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue – https://arxiv.org/abs/2411.03814
Multi-step Jailbreaking Privacy Attacks on ChatGPT – https://arxiv.org/abs/2304.05197
Multilingual and Multi-Accent Jailbreaking of Audio LLMs – https://arxiv.org/abs/2504.01094
Multilingual Jailbreak Challenges in Large Language Models – https://openreview.net/forum?id=vESNKdEMGp
Multimodal Pragmatic Jailbreak on Text-to-image Models – https://arxiv.org/abs/2409.19149
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning – https://arxiv.org/abs/2412.12192
No Free Lunch with Guardrails – https://arxiv.org/abs/2504.00441
“Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak – https://arxiv.org/abs/2406.11668
On Large Language Models’ Resilience to Coercive Interrogation – https://www.computer.org/csdl/proceedings-article/sp/2024/313000a252/1WPcZ9B0jCg
On Prompt-Driven Safeguarding for Large Language Models – https://arxiv.org/abs/2401.18018
On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs – https://openreview.net/forum?id=H3UayAQWoE
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs – https://arxiv.org/abs/2505.17598
Open Sesame! Universal Black Box Jailbreaking of Large Language Models – https://arxiv.org/abs/2309.01446
Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms – https://arxiv.org/abs/2503.24191
OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
PAL: Proxy-Guided Black-Box Attack on Large Language Models – https://arxiv.org/abs/2402.09674
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling – https://arxiv.org/abs/2502.01925
PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks – https://arxiv.org/abs/2505.13862
Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning – https://arxiv.org/abs/2402.08416
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach – https://arxiv.org/abs/2409.14177
Peering Behind the Shield: Guardrail Identification in Large Language Models – https://arxiv.org/abs/2502.01241
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning – https://arxiv.org/abs/2411.19335
PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization – https://arxiv.org/abs/2504.01444
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization – https://arxiv.org/abs/2505.09921
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues – https://arxiv.org/abs/2402.09091
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy – https://arxiv.org/abs/2503.20823
Poisoned LangChain: Jailbreak LLMs by LangChain – https://arxiv.org/abs/2406.18122
Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2408.08924
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary – https://arxiv.org/abs/2504.21038
Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective – https://arxiv.org/abs/2411.16642
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing – https://arxiv.org/abs/2407.16318
PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips – https://arxiv.org/abs/2412.07192
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation – https://arxiv.org/abs/2408.10668
Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
Prompt-Driven LLM Safeguarding via Directed Representation Optimization – https://arxiv.org/abs/2401.18018
Protecting Your LLMs with Information Bottleneck – https://arxiv.org/abs/2404.13968
PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails – https://arxiv.org/abs/2402.15911
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning – https://arxiv.org/abs/2401.10862
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety – https://arxiv.org/abs/2401.11880
Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
RAIN: Your Language Models Can Align Themselves without Finetuning – https://openreview.net/forum?id=pETSfWMUzy
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples – https://arxiv.org/abs/2411.07494
Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity – https://arxiv.org/abs/2409.18708
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2502.11054
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment – https://arxiv.org/abs/2308.09662
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking – https://arxiv.org/abs/2409.17458
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent – https://arxiv.org/abs/2407.16667
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning – https://arxiv.org/abs/2501.13080
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents – https://arxiv.org/abs/2410.13886
ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs – https://arxiv.org/abs/2506.01770
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process – https://arxiv.org/abs/2410.08660
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content – https://arxiv.org/abs/2403.13031
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs – https://arxiv.org/abs/2406.08725
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2401.17263
RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction – https://arxiv.org/abs/2410.19937
Robustifying Safety-Aligned Large Language Models through Clean Data Curation – https://arxiv.org/abs/2405.19358
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level – https://arxiv.org/abs/2410.06809
RT-Attack: Jailbreaking Text-to-Image Models via Random Token – https://arxiv.org/abs/2408.13896
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2407.02855
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance – https://arxiv.org/abs/2406.18118
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models – https://arxiv.org/abs/2410.18927
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding – https://aclanthology.org/2024.acl-long.303/
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning – https://arxiv.org/abs/2505.16186
SafeText: Safe Text-to-image Models via Aligning the Text Encoder – https://arxiv.org/abs/2502.20623
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack – https://arxiv.org/abs/2312.06924
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models – https://arxiv.org/abs/2402.02207
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions – https://openreview.net/forum?id=gT5hALch9z
Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs – https://arxiv.org/abs/2501.02018
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming – https://arxiv.org/abs/2408.11851
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs – https://arxiv.org/abs/2404.07242
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage – https://arxiv.org/abs/2412.15289
SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese – https://arxiv.org/abs/2310.05818
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation – https://arxiv.org/abs/2311.03348
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval – https://arxiv.org/abs/2505.15753
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner – https://arxiv.org/abs/2406.05498
Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs – https://arxiv.org/abs/2402.14872
SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains – https://arxiv.org/abs/2411.06426
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models – https://arxiv.org/abs/2412.17034
ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs – https://arxiv.org/abs/2502.13162
“Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence – https://arxiv.org/abs/2502.04204
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search – https://arxiv.org/abs/2503.10619
Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
SneakyPrompt: Jailbreaking Text-to-image Generative Models – https://arxiv.org/abs/2305.12082
SoK: Prompt Hacking of Large Language Models – https://arxiv.org/abs/2410.13901
SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach – https://arxiv.org/abs/2411.11195
SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack – https://arxiv.org/abs/2407.01902
SOS! Soft Prompt Attack Against Open-Source Large Language Models – https://arxiv.org/abs/2407.03160
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models – https://arxiv.org/abs/2401.10647
SPML: A DSL for Defending Language Models Against Prompt Attacks – https://arxiv.org/abs/2402.11755
Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models – https://arxiv.org/abs/2501.02029
SQL Injection Jailbreak: a structural disaster of large language models – https://arxiv.org/abs/2411.01565
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks – https://arxiv.org/abs/2503.00187
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models – https://arxiv.org/abs/2502.11853
StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure – https://arxiv.org/abs/2406.08754
StruQ: Defending Against Prompt Injection with Structured Queries – https://arxiv.org/html/2402.06363v2
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/abs/2504.05652
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild – https://arxiv.org/abs/2311.06237
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution – https://arxiv.org/abs/2309.14122
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack – https://arxiv.org/abs/2310.10844
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models – https://arxiv.org/abs/2504.15512
Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak – https://arxiv.org/abs/2404.06407
Tastle: Distract Large Language Models for Automatic Jailbreak Attack – https://arxiv.org/abs/2403.08424
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game – https://openreview.net/forum?id=fsW7wJGLBd
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models – https://arxiv.org/abs/2505.22271
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models – https://arxiv.org/abs/2407.17915
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions – https://arxiv.org/abs/2404.13208
The Jailbreak Tax: How Useful are Your Jailbreak Outputs? – https://arxiv.org/abs/2504.10694
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense – https://arxiv.org/abs/2411.08410
Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense – https://arxiv.org/abs/2503.11619
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models – https://arxiv.org/abs/2412.18171
Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression – https://arxiv.org/abs/2504.20493
Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models – https://arxiv.org/abs/2504.11106
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis – https://arxiv.org/abs/2505.08804
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages – https://aclanthology.org/2024.acl-long.119/
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2502.00653
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare – https://arxiv.org/abs/2501.18632
Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.23558
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically – https://arxiv.org/abs/2312.02119
Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks – https://arxiv.org/abs/2305.14965
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice – https://arxiv.org/abs/2502.18504
Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security – https://arxiv.org/abs/2404.05264
Understanding and Enhancing the Transferability of Jailbreaking Attacks – https://arxiv.org/abs/2502.03052
Understanding Hidden Context in Preference Learning: Consequences for RLHF – https://openreview.net/forum?id=0tWTxYYPnW
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models – https://arxiv.org/abs/2406.09289
Universal Adversarial Triggers Are Not Universal – https://arxiv.org/abs/2404.16020
Universal and Transferable Adversarial Attacks on Aligned Language Models – https://arxiv.org/abs/2307.15043
Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking – https://arxiv.org/abs/2409.08045
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer – https://arxiv.org/abs/2408.11313
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models – https://arxiv.org/abs/2505.23793
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection – https://arxiv.org/abs/2406.19845
Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2306.13213
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character – https://arxiv.org/abs/2405.20773
Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data – https://arxiv.org/abs/2410.00296
Voice Jailbreak Attacks Against GPT-4o – https://arxiv.org/abs/2405.19103
Weak-to-Strong Jailbreaking on Large Language Models – https://arxiv.org/abs/2401.17256
What Is Jailbreaking In AI models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks – https://arxiv.org/abs/2411.03343
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs – https://arxiv.org/abs/2505.19773
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? – https://arxiv.org/abs/2407.15211
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search – https://arxiv.org/abs/2406.08705
When Safety Detectors Aren’t Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques – https://arxiv.org/abs/2505.16765
White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs – https://arxiv.org/abs/2406.18495
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – https://arxiv.org/abs/2406.18510
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability – https://arxiv.org/abs/2502.09990
XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs – https://arxiv.org/abs/2504.21700
X-Guard: Multilingual Guard Agent for Content Moderation – https://arxiv.org/abs/2504.08848
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models – https://arxiv.org/abs/2308.01263
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents – https://arxiv.org/abs/2504.13203
You Can’t Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense – https://arxiv.org/abs/2501.12210
You Know What I’m Saying: Jailbreak Attack via Implicit Reference – https://arxiv.org/abs/2410.03857

Final Thoughts

The research above reveals that jailbreaking is not a problem that will be “solved” through a single technical breakthrough. Production-grade protection, as demonstrated by industry experiences defending systems like GPT-4 and Gemini, requires accepting that some attacks may succeed despite best efforts, implementing detection and response capabilities for when defenses fail, and building organizational processes that complement technical safeguards.

Understanding the jailbreaking threat is not merely an academic or technical exercise—it’s fundamental to building trustworthy AI systems that can safely operate in adversarial environments while remaining genuinely useful to legitimate users. The path forward requires not just better defenses, but clearer thinking about what we’re defending, why we’re defending it, and what trade-offs we’re willing to accept in pursuit of AI safety.

Thanks for reading!

The Big List Of AI Jailbreaking References And Resources

Executive Summary

Multi-Modal & New Defenses Expand Attack Landscape

Multi-Agent & Automation Expand Attack Sophistication

“Jailbreak Tax” Proves Unsolvable

Introduction

The Big List Of AI Jailbreaking References And Resources

Final Thoughts

Categories

Join the Next Wave