The Big List Of AI Jailbreaking References And Resources
Executive Summary
This curated collection of references and resources serves as a comprehensive research repository, bringing together academic papers, industry analyses, and empirical evaluations that illuminate the cat-and-mouse game between those who build AI safety mechanisms and those who seek to circumvent them. The materials assembled here span the full spectrum of this dynamic threat domain: from foundational attack techniques and gradient-based optimization methods, to multi-modal exploits, defense frameworks, and the philosophical questions surrounding what constitutes “safe” AI behavior.
Several critical themes emerge from this body of research, revealing the multifaceted nature of the jailbreaking problem:
Multi-Modal & New Defenses Expand Attack Landscape
First, the multi-modal dimension has opened entirely new attack surfaces. Vision-language models face unique vulnerabilities through adversarial images, typographic attacks using ASCII art, embedded instructions in visual prompts, and cross-modal inconsistencies that allow attackers to hide malicious intent in one modality while presenting benign content in another. Audio language models face similar challenges through acoustic adversarial examples and multilingual accent-based exploits.
Next, new attacks beget new defenses beget new attacks, and the jailbreaking landscape reveals a classic security arms race. Early defenses relied on simple keyword filtering and pattern matching, which were quickly defeated by obfuscation techniques, encoding schemes, and linguistic creativity. More sophisticated defenses emerged: embedding-based detection systems, uncertainty-driven defense mechanisms, gradient analysis of safety-critical features, and architectures that separate reasoning about safety from content generation itself.
Multi-Agent & Automation Expand Attack Sophistication
The multi-agent dimension has expanded attack sophistication. As LLMs evolve from isolated question-answering systems into autonomous agents with tool access, memory, and the ability to execute actions across multiple turns, the jailbreaking problem expands exponentially in complexity. Multi-agent systems introduce additional vulnerabilities: prompt infection attacks where one compromised agent can spread jailbreak behaviors to others, cross-agent exploitation where benign-seeming agents can be combined to produce harmful outputs, and emergent behaviors in agent interactions that bypass individual agent safeguards.
And, the democratization of jailbreaking capabilities through automation has accelerated dramatically. What once required expert knowledge of model internals and optimization techniques can now be accomplished through automated red-teaming frameworks, genetic algorithms, and even LLM-based jailbreak generators that can create novel attack prompts with minimal human intervention.
“Jailbreak Tax” Proves Unsolvable
Finally, the tension between safety and utility has proven fundamental and perhaps unsolvable. Many defenses against jailbreaking—such as aggressive content filtering, conservative refusal training, or strict output constraints—come at the cost of reduced helpfulness, increased false positives, and degraded performance on legitimate edge cases. This “jailbreak tax” represents a fundamental trade-off that researchers and developers must carefully navigate.
Introduction
Jailbreaking represents one of the most persistent and evolving challenges in the safety and alignment of large language models. Unlike prompt injection—which exploits the architectural vulnerability of mixing trusted instructions with untrusted data—jailbreaking attacks directly target the safety guardrails that developers have carefully constructed through alignment training, fine-tuning, and reinforcement learning from human feedback (RLHF).
Whether you’re a security researcher investigating LLM vulnerabilities, a developer implementing safety measures in AI-integrated applications, or a practitioner seeking to understand the risk landscape, this resource provides essential context for navigating one of artificial intelligence‘s most pressing security and ethical challenges.
The Big List Of AI Jailbreaking References And Resources
This research corpus reveals several particularly concerning attack vectors: fine-tuning attacks that can remove safety guardrails in minutes, backdoor attacks that embed hidden jailbreak triggers during training, model editing techniques that surgically alter safety-relevant knowledge, and “best-of-N” attacks that simply generate many outputs and select the least safe one—a troublingly effective strategy that highlights vulnerabilities in probabilistic safety guarantees.
Note that the below are in alphabetical order by title. Please let me know if there are any sources you would like to see added to this list. Enjoy!
- A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models – https://arxiv.org/abs/2312.10982
- A Cross-Language Investigation into Jailbreak Attacks in Large Language Models – https://arxiv.org/abs/2401.16765
- A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses – https://arxiv.org/abs/2407.02551
- A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares – https://arxiv.org/abs/2408.05061
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos – https://arxiv.org/abs/2502.15806
- A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection – https://arxiv.org/abs/2312.10766
- A StrongREJECT for Empty Jailbreaks – https://arxiv.org/abs/2402.10260
- A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations – https://arxiv.org/abs/2502.14881
- A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily – https://arxiv.org/abs/2311.08268
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs – https://arxiv.org/abs/2409.07503
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting – https://arxiv.org/abs/2403.09513
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender – https://arxiv.org/abs/2504.09466
- Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models – https://arxiv.org/abs/2408.14866
- Adversarial Attacks on GPT-4 via Simple Random Search – https://www.andriushchenko.me/gpt4adv.pdf
- Adversarial Attacks on LLMs – https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
- Adversarial Attacks on Large Language Models Using Regularized Relaxation – https://arxiv.org/abs/2410.19160
- Adversarial demonstration attacks on large language models. – https://arxiv.org/abs/2305.14950
- Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs – https://arxiv.org/abs/2502.15427
- Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
- Adversarial Suffixes May Be Features Too! – https://arxiv.org/abs/2410.00451
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs – https://arxiv.org/pdf/2406.06622
- Adversaries Can Misuse Combinations of Safe Models – https://arxiv.org/abs/2406.14595
- Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training – https://arxiv.org/abs/2502.11455
- AdvPrefix: An Objective for Nuanced LLM Jailbreaks – https://arxiv.org/abs/2412.10321
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs – https://arxiv.org/abs/2404.16873
- AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models – https://arxiv.org/abs/2412.08608
- AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents – https://arxiv.org/abs/2410.17401
- AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts – https://arxiv.org/abs/2404.05993
- AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models – https://arxiv.org/abs/2412.18123
- Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models – https://arxiv.org/abs/2404.00629
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
- AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds – https://arxiv.org/abs/2502.00757
- Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification – https://arxiv.org/abs/2503.11185
- Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models – https://arxiv.org/abs/2506.01307
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks – https://arxiv.org/abs/2401.09798
- AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs – https://arxiv.org/abs/2404.07921
- Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate – https://arxiv.org/abs/2504.16489
- Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak – https://arxiv.org/abs/2312.04127
- Antelope: Potent and Concealed Jailbreak Attack Strategy – https://arxiv.org/abs/2412.08156
- Are PPO-ed Language Models Hackable? – https://arxiv.org/abs/2406.02577
- Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts – https://arxiv.org/abs/2407.15050
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs – https://arxiv.org/abs/2402.11753
- Attack Prompt Generation for Red Teaming and Defending Large Language Models – https://arxiv.org/abs/2310.12505
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models – https://arxiv.org/abs/2401.09002
- Attacking Large Language Models with Projected Gradient Descent – https://arxiv.org/abs/2402.09154
- AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models – https://arxiv.org/abs/2505.14103
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models – https://arxiv.org/abs/2501.01830
- AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs – https://arxiv.org/abs/2410.05295
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models – https://openreview.net/forum?id=7Jwpw4qKkb
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models – https://arxiv.org/abs/2310.15140
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks – https://arxiv.org/abs/2403.04783
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens – https://arxiv.org/abs/2406.03805
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems – https://arxiv.org/abs/2405.16567
- Automatically Auditing Large Language Models via Discrete Optimization – https://arxiv.org/abs/2303.04381
- AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models – https://arxiv.org/abs/2505.10846
- Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models – https://arxiv.org/abs/2410.14479
- BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge – https://arxiv.org/abs/2503.00596
- Badllama 3: removing safety finetuning from Llama 3 in minutes – https://arxiv.org/abs/2407.01376
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs – https://arxiv.org/abs/2406.09324
- BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs – https://arxiv.org/abs/2412.05892
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models – https://arxiv.org/abs/2309.00614
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger – https://arxiv.org/abs/2408.09093
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards – https://arxiv.org/abs/2406.01364
- Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
- Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs – https://arxiv.org/abs/2502.19041
- Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models – https://arxiv.org/abs/2502.19883
- BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage – https://arxiv.org/abs/2506.02479
- BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.09804
- Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement – https://arxiv.org/abs/2402.15180
- Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space – https://arxiv.org/abs/2505.21277
- Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails – https://arxiv.org/abs/2504.11168
- Can a large language model be a gaslighter? – https://arxiv.org/abs/2410.10700
- Can Large Language Models Automatically Jailbreak GPT-4V? – https://arxiv.org/abs/2407.16686
- Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent – https://arxiv.org/abs/2405.03654
- Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation – https://arxiv.org/abs/2503.06519
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation – https://openreview.net/forum?id=r42tSSCHPh
- CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models – https://arxiv.org/abs/2502.11379
- Certifying LLM Safety against Adversarial Prompting – https://arxiv.org/abs/2309.02705
- Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM – https://arxiv.org/abs/2405.05610
- Chain-of-Attack: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.03869
- Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models – https://arxiv.org/abs/2505.17519
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion – https://aclanthology.org/2024.findings-acl.679/
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models – https://arxiv.org/abs/2402.16717
- Coercing LLMs to do and reveal (almost) anything – https://arxiv.org/abs/2402.14020
- COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability – https://arxiv.org/abs/2402.08679
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs – https://arxiv.org/abs/2404.14461
- Comprehensive Assessment of Jailbreak Attacks Against LLMs – https://arxiv.org/abs/2402.05668
- Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities – https://arxiv.org/abs/2506.00548
- Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI – https://arxiv.org/abs/2504.13201
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming – https://arxiv.org/abs/2501.18837
- Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models – https://arxiv.org/abs/2407.13796
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation – https://arxiv.org/abs/2406.20053
- Cross-Modal Safety Alignment: Is textual unlearning all you need? – https://arxiv.org/abs/2406.02575
- Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models – https://arxiv.org/abs/2405.20775
- Cross-Task Defense: Instruction-Tuning LLMs for Content Safety – https://arxiv.org/abs/2405.15202
- Dark LLMs: The Growing Threat of Unaligned AI Models – https://arxiv.org/abs/2505.10066
- DART: Deep Adversarial Automated Red Teaming for LLM Safety – https://arxiv.org/abs/2407.03876
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation – https://arxiv.org/abs/2410.11317
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker – https://arxiv.org/abs/2311.03191
- Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking via Prompt Evaluation – https://arxiv.org/abs/2502.00580
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM – https://arxiv.org/abs/2309.14348
- Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
- Defending Jailbreak Attack in VLMs via Cross-modality Information Detector – https://arxiv.org/abs/2407.21659
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis – https://arxiv.org/abs/2406.03230
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing – https://arxiv.org/abs/2405.18166
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing – https://arxiv.org/abs/2402.16192
- Defending LLMs against Jailbreaking Attacks via Backtranslation – https://aclanthology.org/2024.findings-acl.948/
- Defending LVLMs Against Vision Attacks through Partial-Perception Supervision – https://arxiv.org/abs/2412.12722
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks – https://arxiv.org/abs/2405.20099
- DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing – https://arxiv.org/abs/2502.11647
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues – https://arxiv.org/abs/2410.10700
- Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models – https://arxiv.org/abs/2408.14853
- Detecting Language Model Attacks with Perplexity – https://arxiv.org/abs/2308.14132
- Detoxifying Large Language Models via Knowledge Editing – https://arxiv.org/abs/2403.14472
- ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
- “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models – https://arxiv.org/abs/2308.03825
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? – https://arxiv.org/abs/2504.10000
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? – https://arxiv.org/abs/2405.05904
- Does Refusal Training in LLMs Generalize to the Past Tense? – https://arxiv.org/abs/2407.11969
- Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models – https://arxiv.org/abs/2403.17336
- Don’t Say No: Jailbreaking LLM by Suppressing Refusal – https://arxiv.org/abs/2404.16369
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers – https://arxiv.org/abs/2402.16914
- DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization – https://arxiv.org/abs/2504.18564
- EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models – https://arxiv.org/abs/2408.11308
- Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector – https://arxiv.org/abs/2410.22888
- Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs – https://arxiv.org/abs/2409.14866
- Efficient Adversarial Training in LLMs with Continuous Attacks – https://arxiv.org/abs/2405.15589
- Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content – https://arxiv.org/abs/2502.20952
- EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models – https://arxiv.org/abs/2502.14976
- Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks – https://arxiv.org/abs/2409.00137
- Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization – https://arxiv.org/abs/2502.09755
- Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning – https://arxiv.org/abs/2501.19180
- EnJa: Ensemble Jailbreak on Large Language Models – https://arxiv.org/abs/2408.03603
- Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge – https://arxiv.org/abs/2404.05880
- Evil Geniuses: Delving into the Safety of LLM-based Agents – https://arxiv.org/abs/2311.11855
- Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking – https://arxiv.org/abs/2502.13527
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
- Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion – https://arxiv.org/abs/2505.14316
- Exploring Scaling Trends in LLM Robustness – https://arxiv.org/abs/2407.18213
- ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content – https://arxiv.org/abs/2503.09964
- Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models – https://arxiv.org/abs/2410.15362
- FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts – https://arxiv.org/abs/2502.21059
- Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs – https://arxiv.org/abs/2410.16327
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2407.16205
- FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks – https://arxiv.org/abs/2412.07672
- FlipAttack: Jailbreak LLMs via Flipping – https://arxiv.org/abs/2410.02832
- from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors – https://arxiv.org/abs/2503.00038
- From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy – https://ieeexplore.ieee.org/abstract/document/10198233
- From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs – https://arxiv.org/abs/2502.00735
- From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings – https://arxiv.org/abs/2402.16006
- From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
- Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks – https://arxiv.org/abs/2410.04234
- FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models – https://arxiv.org/abs/2309.05274
- GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs – https://arxiv.org/abs/2411.14133
- GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance – https://arxiv.org/abs/2505.23839
- Geneshift: Impact of different scenario shift on Jailbreaking LLM – https://arxiv.org/abs/2504.08104
- Goal-guided Generative Prompt Injection Attack on Large Language Models – https://arxiv.org/abs/2404.07234
- Goal-Oriented Prompt Attack and Safety Evaluation for LLMs – https://arxiv.org/abs/2309.11830
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
- GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation – https://arxiv.org/abs/2405.13077
- GPT-4V Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes – https://arxiv.org/abs/2403.00867
- GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis – https://aclanthology.org/2024.acl-long.30/
- Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation – https://arxiv.org/abs/2501.18638
- Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs – https://arxiv.org/abs/2504.19019
- GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms – https://arxiv.org/abs/2504.13052
- Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack – https://arxiv.org/abs/2404.01833
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models – https://arxiv.org/abs/2402.03299
- GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning – https://arxiv.org/abs/2505.11049
- GuardReasoner: Towards Reasoning-based LLM Safeguards – https://arxiv.org/abs/2501.18492
- GuidedBench: Equipping Jailbreak Evaluation with Guidelines – https://arxiv.org/abs/2502.16903
- h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment – https://arxiv.org/abs/2408.04811
- Hacc-Man: An Arcade Game for Jailbreaking LLMs – https://arxiv.org/abs/2405.15902
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal – https://arxiv.org/abs/2402.04249
- Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.04190
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models – https://arxiv.org/abs/2412.05934
- Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles – https://arxiv.org/abs/2408.11182
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States – https://arxiv.org/abs/2406.05644
- How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries – https://arxiv.org/abs/2402.15302
- How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation – https://arxiv.org/abs/2502.14486
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs – https://arxiv.org/abs/2401.06373
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering – https://arxiv.org/abs/2409.03788
- Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything – https://arxiv.org/abs/2407.02534
- Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models – https://arxiv.org/abs/2403.09792
- ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment – https://arxiv.org/abs/2411.18688
- Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models – https://arxiv.org/abs/2407.15399
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses – https://arxiv.org/abs/2406.01288
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs – https://arxiv.org/abs/2405.20778
- Improved Large Language Model Jailbreak Detection via Pretrained Embeddings – https://arxiv.org/abs/2412.01547
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models – https://arxiv.org/abs/2405.21018
- Improving Alignment and Robustness with Short Circuiting – https://arxiv.org/abs/2406.04313
- Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration – https://arxiv.org/abs/2505.17066
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models – https://arxiv.org/abs/2411.16769
- Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems – https://arxiv.org/abs/2504.20376
- Increased LLM Vulnerabilities from Fine-tuning and Quantization – https://arxiv.org/abs/2404.04392
- Injecting Universal Jailbreak Backdoors into LLMs in Minutes – https://arxiv.org/abs/2502.10438
- Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender – https://arxiv.org/abs/2401.06561
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment – https://arxiv.org/abs/2402.14016
- Is the System Message Really Important to Jailbreaks in Large Language Models? – https://arxiv.org/abs/2402.14857
- JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
- Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models – https://arxiv.org/abs/2410.02298
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey – https://arxiv.org/abs/2407.04295
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models – https://arxiv.org/abs/2404.01318
- Jailbreak Distillation: Renewable Safety Benchmarking – https://arxiv.org/abs/2505.22037
- JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models – https://arxiv.org/abs/2406.09321
- Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models – https://openreview.net/forum?id=plmBsXHxgR
- Jailbreak is Best Solved by Definition – https://arxiv.org/abs/2403.14725
- JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit – https://arxiv.org/abs/2411.11114
- JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models – https://arxiv.org/abs/2404.08793
- JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
- Jailbreak Open-Sourced Large Language Models via Enforced Decoding – https://aclanthology.org/2024.acl-long.299/
- Jailbreak Paradox: The Achilles’ Heel of LLMs – https://arxiv.org/abs/2406.12702
- Jailbreak Prompt Attack: A Controllable Adversarial Attack against Diffusion Models – https://arxiv.org/abs/2404.02928
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt – https://arxiv.org/abs/2406.04031
- Jailbreaking Attack against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
- Jailbreaking Black Box Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
- Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks – https://arxiv.org/abs/2503.01395
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts – https://arxiv.org/abs/2311.09127
- Jailbreaking is Best Solved by Definition – https://arxiv.org/abs/2403.14725
- Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters – https://arxiv.org/abs/2405.20413
- Jailbreaking Large Language Models in Infinitely Many Ways – https://arxiv.org/abs/2501.10800
- Jailbreaking Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency – https://arxiv.org/abs/2501.04931
- Jailbreaking Proprietary Large Language Models using Word Substitution Cipher – https://arxiv.org/abs/2402.10601
- Jailbreaking Safeguarded Text-to-Image Models via Large Language Models – https://arxiv.org/abs/2503.01839
- Jailbreaking Text-to-Image Models with LLM-Based Agents – https://arxiv.org/abs/2408.00523
- Jailbreaking with Universal Multi-Prompts – https://arxiv.org/abs/2502.01154
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models – https://arxiv.org/abs/2407.01599
- Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/abs/2307.02483
- JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model – https://arxiv.org/abs/2504.03770
- JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs – https://arxiv.org/abs/2412.15623
- JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift – https://arxiv.org/abs/2504.19440
- JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models – https://arxiv.org/abs/2505.17568
- JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing – https://arxiv.org/abs/2503.08990
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation – https://arxiv.org/abs/2502.07557
- JULI: Jailbreak Large Language Models by Self-Introspection – https://arxiv.org/abs/2505.11790
- KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs – https://arxiv.org/abs/2502.05223
- Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
- Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack – https://arxiv.org/abs/2406.11682
- LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs – https://arxiv.org/abs/2505.10838
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models – https://arxiv.org/abs/2307.08487
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense – https://arxiv.org/abs/2501.02629
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks – https://arxiv.org/abs/2402.09177
- LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
- LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities – https://arxiv.org/abs/2505.05619
- LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
- LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? – https://arxiv.org/abs/2307.10719
- LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet – https://arxiv.org/abs/2408.15221
- LLM Jailbreak Attack versus Defense Techniques — A Comprehensive Study – https://arxiv.org/abs/2402.13457
- LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2501.00055
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper – https://arxiv.org/abs/2402.15727
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation – https://arxiv.org/abs/2405.13068
- Low-Resource Languages Jailbreak GPT-4 – https://arxiv.org/abs/2310.02446
- Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
- Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2502.09723
- Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction – https://arxiv.org/abs/2402.18104
- Many-shot Jailbreaking – https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming – https://arxiv.org/abs/2311.07689
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
- Merging Improves Self-Critique Against Jailbreak Attacks – https://arxiv.org/abs/2406.07188
- Metaphor-based Jailbreaking Attacks on Text-to-Image Models – https://arxiv.org/abs/2503.17987
- Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking – https://arxiv.org/abs/2504.05838
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks – https://arxiv.org/abs/2503.19134
- MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting – https://arxiv.org/abs/2503.12931
- Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment – https://arxiv.org/abs/2402.14968
- Mitigating Many-Shot Jailbreaking – https://arxiv.org/abs/2504.09604
- MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models – https://arxiv.org/abs/2406.07594
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
- MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models – https://arxiv.org/abs/2408.08464
- Model-Editing-Based Jailbreak against Safety-aligned Large Language Models – https://arxiv.org/abs/2412.08201
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks – https://arxiv.org/abs/2409.17699
- “Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models – https://arxiv.org/abs/2411.16730
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue – https://arxiv.org/abs/2411.03814
- Multi-step Jailbreaking Privacy Attacks on ChatGPT – https://arxiv.org/abs/2304.05197
- Multilingual and Multi-Accent Jailbreaking of Audio LLMs – https://arxiv.org/abs/2504.01094
- Multilingual Jailbreak Challenges in Large Language Models – https://openreview.net/forum?id=vESNKdEMGp
- Multimodal Pragmatic Jailbreak on Text-to-image Models – https://arxiv.org/abs/2409.19149
- No Free Lunch for Defending Against Prefilling Attack by In-Context Learning – https://arxiv.org/abs/2412.12192
- No Free Lunch with Guardrails – https://arxiv.org/abs/2504.00441
- “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak – https://arxiv.org/abs/2406.11668
- On Large Language Models’ Resilience to Coercive Interrogation – https://www.computer.org/csdl/proceedings-article/sp/2024/313000a252/1WPcZ9B0jCg
- On Prompt-Driven Safeguarding for Large Language Models – https://arxiv.org/abs/2401.18018
- On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs – https://openreview.net/forum?id=H3UayAQWoE
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs – https://arxiv.org/abs/2505.17598
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models – https://arxiv.org/abs/2309.01446
- Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms – https://arxiv.org/abs/2503.24191
- OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
- PAL: Proxy-Guided Black-Box Attack on Large Language Models – https://arxiv.org/abs/2402.09674
- PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling – https://arxiv.org/abs/2502.01925
- PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks – https://arxiv.org/abs/2505.13862
- Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning – https://arxiv.org/abs/2402.08416
- PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach – https://arxiv.org/abs/2409.14177
- Peering Behind the Shield: Guardrail Identification in Large Language Models – https://arxiv.org/abs/2502.01241
- PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning – https://arxiv.org/abs/2411.19335
- PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization – https://arxiv.org/abs/2504.01444
- PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization – https://arxiv.org/abs/2505.09921
- Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues – https://arxiv.org/abs/2402.09091
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy – https://arxiv.org/abs/2503.20823
- Poisoned LangChain: Jailbreak LLMs by LangChain – https://arxiv.org/abs/2406.18122
- Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2408.08924
- Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary – https://arxiv.org/abs/2504.21038
- Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective – https://arxiv.org/abs/2411.16642
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing – https://arxiv.org/abs/2407.16318
- PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips – https://arxiv.org/abs/2412.07192
- Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation – https://arxiv.org/abs/2408.10668
- Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
- Prompt-Driven LLM Safeguarding via Directed Representation Optimization – https://arxiv.org/abs/2401.18018
- Protecting Your LLMs with Information Bottleneck – https://arxiv.org/abs/2404.13968
- PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails – https://arxiv.org/abs/2402.15911
- Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning – https://arxiv.org/abs/2401.10862
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety – https://arxiv.org/abs/2401.11880
- Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
- RAIN: Your Language Models Can Align Themselves without Finetuning – https://openreview.net/forum?id=pETSfWMUzy
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples – https://arxiv.org/abs/2411.07494
- Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity – https://arxiv.org/abs/2409.18708
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2502.11054
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment – https://arxiv.org/abs/2308.09662
- RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking – https://arxiv.org/abs/2409.17458
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
- RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent – https://arxiv.org/abs/2407.16667
- Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning – https://arxiv.org/abs/2501.13080
- Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents – https://arxiv.org/abs/2410.13886
- ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs – https://arxiv.org/abs/2506.01770
- RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process – https://arxiv.org/abs/2410.08660
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content – https://arxiv.org/abs/2403.13031
- RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs – https://arxiv.org/abs/2406.08725
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2401.17263
- RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction – https://arxiv.org/abs/2410.19937
- Robustifying Safety-Aligned Large Language Models through Clean Data Curation – https://arxiv.org/abs/2405.19358
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level – https://arxiv.org/abs/2410.06809
- RT-Attack: Jailbreaking Text-to-Image Models via Random Token – https://arxiv.org/abs/2408.13896
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2407.02855
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance – https://arxiv.org/abs/2406.18118
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models – https://arxiv.org/abs/2410.18927
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding – https://aclanthology.org/2024.acl-long.303/
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning – https://arxiv.org/abs/2505.16186
- SafeText: Safe Text-to-image Models via Aligning the Text Encoder – https://arxiv.org/abs/2502.20623
- Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack – https://arxiv.org/abs/2312.06924
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models – https://arxiv.org/abs/2402.02207
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions – https://openreview.net/forum?id=gT5hALch9z
- Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs – https://arxiv.org/abs/2501.02018
- SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming – https://arxiv.org/abs/2408.11851
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs – https://arxiv.org/abs/2404.07242
- SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage – https://arxiv.org/abs/2412.15289
- SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese – https://arxiv.org/abs/2310.05818
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation – https://arxiv.org/abs/2311.03348
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval – https://arxiv.org/abs/2505.15753
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner – https://arxiv.org/abs/2406.05498
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs – https://arxiv.org/abs/2402.14872
- SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains – https://arxiv.org/abs/2411.06426
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models – https://arxiv.org/abs/2412.17034
- ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs – https://arxiv.org/abs/2502.13162
- “Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence – https://arxiv.org/abs/2502.04204
- Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search – https://arxiv.org/abs/2503.10619
- Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
- SneakyPrompt: Jailbreaking Text-to-image Generative Models – https://arxiv.org/abs/2305.12082
- SoK: Prompt Hacking of Large Language Models – https://arxiv.org/abs/2410.13901
- SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach – https://arxiv.org/abs/2411.11195
- SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack – https://arxiv.org/abs/2407.01902
- SOS! Soft Prompt Attack Against Open-Source Large Language Models – https://arxiv.org/abs/2407.03160
- Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models – https://arxiv.org/abs/2401.10647
- SPML: A DSL for Defending Language Models Against Prompt Attacks – https://arxiv.org/abs/2402.11755
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models – https://arxiv.org/abs/2501.02029
- SQL Injection Jailbreak: a structural disaster of large language models – https://arxiv.org/abs/2411.01565
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks – https://arxiv.org/abs/2503.00187
- StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models – https://arxiv.org/abs/2502.11853
- StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure – https://arxiv.org/abs/2406.08754
- StruQ: Defending Against Prompt Injection with Structured Queries – https://arxiv.org/html/2402.06363v2
- Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/abs/2504.05652
- Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild – https://arxiv.org/abs/2311.06237
- SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution – https://arxiv.org/abs/2309.14122
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack – https://arxiv.org/abs/2310.10844
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models – https://arxiv.org/abs/2504.15512
- Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak – https://arxiv.org/abs/2404.06407
- Tastle: Distract Large Language Models for Automatic Jailbreak Attack – https://arxiv.org/abs/2403.08424
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game – https://openreview.net/forum?id=fsW7wJGLBd
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models – https://arxiv.org/abs/2505.22271
- The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models – https://arxiv.org/abs/2407.17915
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions – https://arxiv.org/abs/2404.13208
- The Jailbreak Tax: How Useful are Your Jailbreak Outputs? – https://arxiv.org/abs/2504.10694
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense – https://arxiv.org/abs/2411.08410
- Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense – https://arxiv.org/abs/2503.11619
- Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models – https://arxiv.org/abs/2412.18171
- Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression – https://arxiv.org/abs/2504.20493
- Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models – https://arxiv.org/abs/2504.11106
- TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis – https://arxiv.org/abs/2505.08804
- ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages – https://aclanthology.org/2024.acl-long.119/
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2502.00653
- Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare – https://arxiv.org/abs/2501.18632
- Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.23558
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically – https://arxiv.org/abs/2312.02119
- Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks – https://arxiv.org/abs/2305.14965
- TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice – https://arxiv.org/abs/2502.18504
- Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security – https://arxiv.org/abs/2404.05264
- Understanding and Enhancing the Transferability of Jailbreaking Attacks – https://arxiv.org/abs/2502.03052
- Understanding Hidden Context in Preference Learning: Consequences for RLHF – https://openreview.net/forum?id=0tWTxYYPnW
- Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models – https://arxiv.org/abs/2406.09289
- Universal Adversarial Triggers Are Not Universal – https://arxiv.org/abs/2404.16020
- Universal and Transferable Adversarial Attacks on Aligned Language Models – https://arxiv.org/abs/2307.15043
- Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking – https://arxiv.org/abs/2409.08045
- Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer – https://arxiv.org/abs/2408.11313
- Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
- USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models – https://arxiv.org/abs/2505.23793
- Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
- Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection – https://arxiv.org/abs/2406.19845
- Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2306.13213
- Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character – https://arxiv.org/abs/2405.20773
- Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
- VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data – https://arxiv.org/abs/2410.00296
- Voice Jailbreak Attacks Against GPT-4o – https://arxiv.org/abs/2405.19103
- Weak-to-Strong Jailbreaking on Large Language Models – https://arxiv.org/abs/2401.17256
- What Is Jailbreaking In AI models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks – https://arxiv.org/abs/2411.03343
- What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs – https://arxiv.org/abs/2505.19773
- When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? – https://arxiv.org/abs/2407.15211
- When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search – https://arxiv.org/abs/2406.08705
- When Safety Detectors Aren’t Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques – https://arxiv.org/abs/2505.16765
- White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs – https://arxiv.org/abs/2406.18495
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – https://arxiv.org/abs/2406.18510
- X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability – https://arxiv.org/abs/2502.09990
- XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs – https://arxiv.org/abs/2504.21700
- X-Guard: Multilingual Guard Agent for Content Moderation – https://arxiv.org/abs/2504.08848
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models – https://arxiv.org/abs/2308.01263
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents – https://arxiv.org/abs/2504.13203
- You Can’t Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense – https://arxiv.org/abs/2501.12210
- You Know What I’m Saying: Jailbreak Attack via Implicit Reference – https://arxiv.org/abs/2410.03857
Final Thoughts
The research above reveals that jailbreaking is not a problem that will be “solved” through a single technical breakthrough. Production-grade protection, as demonstrated by industry experiences defending systems like GPT-4 and Gemini, requires accepting that some attacks may succeed despite best efforts, implementing detection and response capabilities for when defenses fail, and building organizational processes that complement technical safeguards.
Understanding the jailbreaking threat is not merely an academic or technical exercise—it’s fundamental to building trustworthy AI systems that can safely operate in adversarial environments while remaining genuinely useful to legitimate users. The path forward requires not just better defenses, but clearer thinking about what we’re defending, why we’re defending it, and what trade-offs we’re willing to accept in pursuit of AI safety.
Thanks for reading!