The Big List Of AI Jailbreaking References And Resources

Posted on June 8, 2025 by Brian Colwell

Note that the entries below are listed in alphabetical order by title. Please let me know if there are any sources you would like to see added to this list. Enjoy!

  1. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models – https://arxiv.org/abs/2312.10982
  2. A Cross-Language Investigation into Jailbreak Attacks in Large Language Models – https://arxiv.org/abs/2401.16765
  3. A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses – https://arxiv.org/abs/2407.02551
  4. A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares – https://arxiv.org/abs/2408.05061
  5. A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos – https://arxiv.org/abs/2502.15806
  6. A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection – https://arxiv.org/abs/2312.10766
  7. A StrongREJECT for Empty Jailbreaks – https://arxiv.org/abs/2402.10260
  8. A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations – https://arxiv.org/abs/2502.14881
  9. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily – https://arxiv.org/abs/2311.08268
  10. AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs – https://arxiv.org/abs/2409.07503
  11. AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting – https://arxiv.org/abs/2403.09513
  12. AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender – https://arxiv.org/abs/2504.09466
  13. Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models – https://arxiv.org/abs/2408.14866
  14. Adversarial Attacks on GPT-4 via Simple Random Search – https://www.andriushchenko.me/gpt4adv.pdf
  15. Adversarial Attacks on LLMs – https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
  16. Adversarial Attacks on Large Language Models Using Regularized Relaxation – https://arxiv.org/abs/2410.19160
  17. Adversarial Demonstration Attacks on Large Language Models – https://arxiv.org/abs/2305.14950
  18. Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs – https://arxiv.org/abs/2502.15427
  19. Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
  20. Adversarial Suffixes May Be Features Too! – https://arxiv.org/abs/2410.00451
  21. Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs – https://arxiv.org/pdf/2406.06622
  22. Adversaries Can Misuse Combinations of Safe Models – https://arxiv.org/abs/2406.14595
  23. Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training – https://arxiv.org/abs/2502.11455
  24. AdvPrefix: An Objective for Nuanced LLM Jailbreaks – https://arxiv.org/abs/2412.10321
  25. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs – https://arxiv.org/abs/2404.16873
  26. AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models – https://arxiv.org/abs/2412.08608
  27. AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents – https://arxiv.org/abs/2410.17401
  28. AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts – https://arxiv.org/abs/2404.05993
  29. AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models – https://arxiv.org/abs/2412.18123
  30. Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models – https://arxiv.org/abs/2404.00629
  31. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
  32. AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds – https://arxiv.org/abs/2502.00757
  33. Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification – https://arxiv.org/abs/2503.11185
  34. Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models – https://arxiv.org/abs/2506.01307
  35. All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks – https://arxiv.org/abs/2401.09798
  36. AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs – https://arxiv.org/abs/2404.07921
  37. Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate – https://arxiv.org/abs/2504.16489
  38. Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak – https://arxiv.org/abs/2312.04127
  39. Antelope: Potent and Concealed Jailbreak Attack Strategy – https://arxiv.org/abs/2412.08156
  40. Are PPO-ed Language Models Hackable? – https://arxiv.org/abs/2406.02577
  41. Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts – https://arxiv.org/abs/2407.15050
  42. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs – https://arxiv.org/abs/2402.11753
  43. Attack Prompt Generation for Red Teaming and Defending Large Language Models – https://arxiv.org/abs/2310.12505
  44. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models – https://arxiv.org/abs/2401.09002
  45. Attacking Large Language Models with Projected Gradient Descent – https://arxiv.org/abs/2402.09154
  46. AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models – https://arxiv.org/abs/2505.14103
  47. Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models – https://arxiv.org/abs/2501.01830
  48. AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs – https://arxiv.org/abs/2410.05295
  49. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models – https://openreview.net/forum?id=7Jwpw4qKkb
  50. AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models – https://arxiv.org/abs/2310.15140
  51. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks – https://arxiv.org/abs/2403.04783
  52. AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens – https://arxiv.org/abs/2406.03805
  53. Automatic Jailbreaking of the Text-to-Image Generative AI Systems – https://arxiv.org/abs/2405.16567
  54. Automatically Auditing Large Language Models via Discrete Optimization – https://arxiv.org/abs/2303.04381
  55. AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models – https://arxiv.org/abs/2505.10846
  56. Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models – https://arxiv.org/abs/2410.14479
  57. BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge – https://arxiv.org/abs/2503.00596
  58. Badllama 3: removing safety finetuning from Llama 3 in minutes – https://arxiv.org/abs/2407.01376
  59. Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs – https://arxiv.org/abs/2406.09324
  60. BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs – https://arxiv.org/abs/2412.05892
  61. Baseline Defenses for Adversarial Attacks Against Aligned Language Models – https://arxiv.org/abs/2309.00614
  62. BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger – https://arxiv.org/abs/2408.09093
  63. BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards – https://arxiv.org/abs/2406.01364
  64. Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
  65. Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs – https://arxiv.org/abs/2502.19041
  66. Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models – https://arxiv.org/abs/2502.19883
  67. BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage – https://arxiv.org/abs/2506.02479
  68. BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.09804
  69. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement – https://arxiv.org/abs/2402.15180
  70. Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space – https://arxiv.org/abs/2505.21277
  71. Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails – https://arxiv.org/abs/2504.11168
  72. Can a large language model be a gaslighter? – https://arxiv.org/abs/2410.10700
  73. Can Large Language Models Automatically Jailbreak GPT-4V? – https://arxiv.org/abs/2407.16686
  74. Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent – https://arxiv.org/abs/2405.03654
  75. Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation – https://arxiv.org/abs/2503.06519
  76. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation – https://openreview.net/forum?id=r42tSSCHPh
  77. CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models – https://arxiv.org/abs/2502.11379
  78. Certifying LLM Safety against Adversarial Prompting – https://arxiv.org/abs/2309.02705
  79. Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM – https://arxiv.org/abs/2405.05610
  80. Chain-of-Attack: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.03869
  81. Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models – https://arxiv.org/abs/2505.17519
  82. CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion – https://aclanthology.org/2024.findings-acl.679/
  83. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models – https://arxiv.org/abs/2402.16717
  84. Coercing LLMs to do and reveal (almost) anything – https://arxiv.org/abs/2402.14020
  85. COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability – https://arxiv.org/abs/2402.08679
  86. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs – https://arxiv.org/abs/2404.14461
  87. Comprehensive Assessment of Jailbreak Attacks Against LLMs – https://arxiv.org/abs/2402.05668
  88. Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities – https://arxiv.org/abs/2506.00548
  89. Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI – https://arxiv.org/abs/2504.13201
  90. Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming – https://arxiv.org/abs/2501.18837
  91. Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models – https://arxiv.org/abs/2407.13796
  92. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation – https://arxiv.org/abs/2406.20053
  93. Cross-Modal Safety Alignment: Is textual unlearning all you need? – https://arxiv.org/abs/2406.02575
  94. Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models – https://arxiv.org/abs/2405.20775
  95. Cross-Task Defense: Instruction-Tuning LLMs for Content Safety – https://arxiv.org/abs/2405.15202
  96. Dark LLMs: The Growing Threat of Unaligned AI Models – https://arxiv.org/abs/2505.10066
  97. DART: Deep Adversarial Automated Red Teaming for LLM Safety – https://arxiv.org/abs/2407.03876
  98. Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation – https://arxiv.org/abs/2410.11317
  99. DeepInception: Hypnotize Large Language Model to Be Jailbreaker – https://arxiv.org/abs/2311.03191
  100. Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking via Prompt Evaluation – https://arxiv.org/abs/2502.00580
  101. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM – https://arxiv.org/abs/2309.14348
  102. Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
  103. Defending Jailbreak Attack in VLMs via Cross-modality Information Detector – https://arxiv.org/abs/2407.21659
  104. Defending Large Language Models Against Attacks With Residual Stream Activation Analysis – https://arxiv.org/abs/2406.03230
  105. Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing – https://arxiv.org/abs/2405.18166
  106. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing – https://arxiv.org/abs/2402.16192
  107. Defending LLMs against Jailbreaking Attacks via Backtranslation – https://aclanthology.org/2024.findings-acl.948/
  108. Defending LVLMs Against Vision Attacks through Partial-Perception Supervision – https://arxiv.org/abs/2412.12722
  109. Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks – https://arxiv.org/abs/2405.20099
  110. DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing – https://arxiv.org/abs/2502.11647
  111. Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues – https://arxiv.org/abs/2410.10700
  112. Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models – https://arxiv.org/abs/2408.14853
  113. Detecting Language Model Attacks with Perplexity – https://arxiv.org/abs/2308.14132
  114. Detoxifying Large Language Models via Knowledge Editing – https://arxiv.org/abs/2403.14472
  115. ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
  116. “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models – https://arxiv.org/abs/2308.03825
  117. Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? – https://arxiv.org/abs/2504.10000
  118. Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? – https://arxiv.org/abs/2405.05904
  119. Does Refusal Training in LLMs Generalize to the Past Tense? – https://arxiv.org/abs/2407.11969
  120. Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models – https://arxiv.org/abs/2403.17336
  121. Don’t Say No: Jailbreaking LLM by Suppressing Refusal – https://arxiv.org/abs/2404.16369
  122. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers – https://arxiv.org/abs/2402.16914
  123. DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization – https://arxiv.org/abs/2504.18564
  124. EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models – https://arxiv.org/abs/2408.11308
  125. Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector – https://arxiv.org/abs/2410.22888
  126. Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs – https://arxiv.org/abs/2409.14866
  127. Efficient Adversarial Training in LLMs with Continuous Attacks – https://arxiv.org/abs/2405.15589
  128. Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content – https://arxiv.org/abs/2502.20952
  129. EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models – https://arxiv.org/abs/2502.14976
  130. Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks – https://arxiv.org/abs/2409.00137
  131. Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization – https://arxiv.org/abs/2502.09755
  132. Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning – https://arxiv.org/abs/2501.19180
  133. EnJa: Ensemble Jailbreak on Large Language Models – https://arxiv.org/abs/2408.03603
  134. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge – https://arxiv.org/abs/2404.05880
  135. Evil Geniuses: Delving into the Safety of LLM-based Agents – https://arxiv.org/abs/2311.11855
  136. Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking – https://arxiv.org/abs/2502.13527
  137. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
  138. Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion – https://arxiv.org/abs/2505.14316
  139. Exploring Scaling Trends in LLM Robustness – https://arxiv.org/abs/2407.18213
  140. ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content – https://arxiv.org/abs/2503.09964
  141. Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models – https://arxiv.org/abs/2410.15362
  142. FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts – https://arxiv.org/abs/2502.21059
  143. Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs – https://arxiv.org/abs/2410.16327
  144. Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2407.16205
  145. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
  146. FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks – https://arxiv.org/abs/2412.07672
  147. FlipAttack: Jailbreak LLMs via Flipping – https://arxiv.org/abs/2410.02832
  148. from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors – https://arxiv.org/abs/2503.00038
  149. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy – https://ieeexplore.ieee.org/abstract/document/10198233
  150. From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs – https://arxiv.org/abs/2502.00735
  151. From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings – https://arxiv.org/abs/2402.16006
  152. From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
  153. Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks – https://arxiv.org/abs/2410.04234
  154. FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models – https://arxiv.org/abs/2309.05274
  155. GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs – https://arxiv.org/abs/2411.14133
  156. GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance – https://arxiv.org/abs/2505.23839
  157. Geneshift: Impact of different scenario shift on Jailbreaking LLM – https://arxiv.org/abs/2504.08104
  158. Goal-guided Generative Prompt Injection Attack on Large Language Models – https://arxiv.org/abs/2404.07234
  159. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs – https://arxiv.org/abs/2309.11830
  160. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
  161. GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation – https://arxiv.org/abs/2405.13077
  162. GPT-4V Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
  163. GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
  164. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes – https://arxiv.org/abs/2403.00867
  165. GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis – https://aclanthology.org/2024.acl-long.30/
  166. Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation – https://arxiv.org/abs/2501.18638
  167. Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs – https://arxiv.org/abs/2504.19019
  168. GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms – https://arxiv.org/abs/2504.13052
  169. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack – https://arxiv.org/abs/2404.01833
  170. GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models – https://arxiv.org/abs/2402.03299
  171. GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning – https://arxiv.org/abs/2505.11049
  172. GuardReasoner: Towards Reasoning-based LLM Safeguards – https://arxiv.org/abs/2501.18492
  173. GuidedBench: Equipping Jailbreak Evaluation with Guidelines – https://arxiv.org/abs/2502.16903
  174. h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment – https://arxiv.org/abs/2408.04811
  175. Hacc-Man: An Arcade Game for Jailbreaking LLMs – https://arxiv.org/abs/2405.15902
  176. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal – https://arxiv.org/abs/2402.04249
  177. Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.04190
  178. Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models – https://arxiv.org/abs/2412.05934
  179. Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles – https://arxiv.org/abs/2408.11182
  180. How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States – https://arxiv.org/abs/2406.05644
  181. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries – https://arxiv.org/abs/2402.15302
  182. How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation – https://arxiv.org/abs/2502.14486
  183. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs – https://arxiv.org/abs/2401.06373
  184. HSF: Defending against Jailbreak Attacks with Hidden State Filtering – https://arxiv.org/abs/2409.03788
  185. Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything – https://arxiv.org/abs/2407.02534
  186. Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models – https://arxiv.org/abs/2403.09792
  187. ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
  188. Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment – https://arxiv.org/abs/2411.18688
  189. Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models – https://arxiv.org/abs/2407.15399
  190. Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses – https://arxiv.org/abs/2406.01288
  191. Improved Generation of Adversarial Examples Against Safety-aligned LLMs – https://arxiv.org/abs/2405.20778
  192. Improved Large Language Model Jailbreak Detection via Pretrained Embeddings – https://arxiv.org/abs/2412.01547
  193. Improved Techniques for Optimization-Based Jailbreaking on Large Language Models – https://arxiv.org/abs/2405.21018
  194. Improving Alignment and Robustness with Short Circuiting – https://arxiv.org/abs/2406.04313
  195. Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration – https://arxiv.org/abs/2505.17066
  196. In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models – https://arxiv.org/abs/2411.16769
  197. Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems – https://arxiv.org/abs/2504.20376
  198. Increased LLM Vulnerabilities from Fine-tuning and Quantization – https://arxiv.org/abs/2404.04392
  199. Injecting Universal Jailbreak Backdoors into LLMs in Minutes – https://arxiv.org/abs/2502.10438
  200. Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender – https://arxiv.org/abs/2401.06561
  201. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment – https://arxiv.org/abs/2402.14016
  202. Is the System Message Really Important to Jailbreaks in Large Language Models? – https://arxiv.org/abs/2402.14857
  203. JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
  204. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
  205. Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models – https://arxiv.org/abs/2410.02298
  206. Jailbreak Attacks and Defenses Against Large Language Models: A Survey – https://arxiv.org/abs/2407.04295
  207. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models – https://arxiv.org/abs/2404.01318
  208. Jailbreak Distillation: Renewable Safety Benchmarking – https://arxiv.org/abs/2505.22037
  209. JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models – https://arxiv.org/abs/2406.09321
  210. Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models – https://openreview.net/forum?id=plmBsXHxgR
  211. Jailbreak is Best Solved by Definition – https://arxiv.org/abs/2403.14725
  212. JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit – https://arxiv.org/abs/2411.11114
  213. JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models – https://arxiv.org/abs/2404.08793
  214. JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
  215. Jailbreak Open-Sourced Large Language Models via Enforced Decoding – https://aclanthology.org/2024.acl-long.299/
  216. Jailbreak Paradox: The Achilles’ Heel of LLMs – https://arxiv.org/abs/2406.12702
  217. Jailbreak Prompt Attack: A Controllable Adversarial Attack against Diffusion Models – https://arxiv.org/abs/2404.02928
  218. Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt – https://arxiv.org/abs/2406.04031
  219. Jailbreaking Attack against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
  220. Jailbreaking Black Box Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
  221. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
  222. Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks – https://arxiv.org/abs/2503.01395
  223. Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts – https://arxiv.org/abs/2311.09127
  224. Jailbreaking is Best Solved by Definition – https://arxiv.org/abs/2403.14725
  225. Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters – https://arxiv.org/abs/2405.20413
  226. Jailbreaking Large Language Models in Infinitely Many Ways – https://arxiv.org/abs/2501.10800
  227. Jailbreaking Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
  228. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
  229. Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency – https://arxiv.org/abs/2501.04931
  230. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher – https://arxiv.org/abs/2402.10601
  231. Jailbreaking Safeguarded Text-to-Image Models via Large Language Models – https://arxiv.org/abs/2503.01839
  232. Jailbreaking Text-to-Image Models with LLM-Based Agents – https://arxiv.org/abs/2408.00523
  233. Jailbreaking with Universal Multi-Prompts – https://arxiv.org/abs/2502.01154
  234. JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models – https://arxiv.org/abs/2407.01599
  235. Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/abs/2307.02483
  236. JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model – https://arxiv.org/abs/2504.03770
  237. JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs – https://arxiv.org/abs/2412.15623
  238. JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift – https://arxiv.org/abs/2504.19440
  239. JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models – https://arxiv.org/abs/2505.17568
  240. JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing – https://arxiv.org/abs/2503.08990
  241. JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation – https://arxiv.org/abs/2502.07557
  242. JULI: Jailbreak Large Language Models by Self-Introspection – https://arxiv.org/abs/2505.11790
  243. KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs – https://arxiv.org/abs/2502.05223
  244. Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
  245. Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack – https://arxiv.org/abs/2406.11682
  246. LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs – https://arxiv.org/abs/2505.10838
  247. Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models – https://arxiv.org/abs/2307.08487
  248. Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense – https://arxiv.org/abs/2501.02629
  249. Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks – https://arxiv.org/abs/2402.09177
  250. LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
  251. LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities – https://arxiv.org/abs/2505.05619
  252. LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
  253. LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? – https://arxiv.org/abs/2307.10719
  254. LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet – https://arxiv.org/abs/2408.15221
  255. LLM Jailbreak Attack versus Defense Techniques — A Comprehensive Study – https://arxiv.org/abs/2402.13457
  256. LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2501.00055
  257. LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper – https://arxiv.org/abs/2402.15727
  258. Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation – https://arxiv.org/abs/2405.13068
  259. Low-Resource Languages Jailbreak GPT-4 – https://arxiv.org/abs/2310.02446
  260. Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
  261. Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2502.09723
  262. Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction – https://arxiv.org/abs/2402.18104
  263. Many-shot Jailbreaking – https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf
  264. MART: Improving LLM Safety with Multi-round Automatic Red-Teaming – https://arxiv.org/abs/2311.07689
  265. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
  266. Merging Improves Self-Critique Against Jailbreak Attacks – https://arxiv.org/abs/2406.07188
  267. Metaphor-based Jailbreaking Attacks on Text-to-Image Models – https://arxiv.org/abs/2503.17987
  268. Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking – https://arxiv.org/abs/2504.05838
  269. MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks – https://arxiv.org/abs/2503.19134
  270. MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting – https://arxiv.org/abs/2503.12931
  271. Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment – https://arxiv.org/abs/2402.14968
  272. Mitigating Many-Shot Jailbreaking – https://arxiv.org/abs/2504.09604
  273. MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models – https://arxiv.org/abs/2406.07594
  274. MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
  275. MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models – https://arxiv.org/abs/2408.08464
  276. Model-Editing-Based Jailbreak against Safety-aligned Large Language Models – https://arxiv.org/abs/2412.08201
  277. MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks – https://arxiv.org/abs/2409.17699
  278. “Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models – https://arxiv.org/abs/2411.16730
  279. MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue – https://arxiv.org/abs/2411.03814
  280. Multi-step Jailbreaking Privacy Attacks on ChatGPT – https://arxiv.org/abs/2304.05197
  281. Multilingual and Multi-Accent Jailbreaking of Audio LLMs – https://arxiv.org/abs/2504.01094
  282. Multilingual Jailbreak Challenges in Large Language Models – https://openreview.net/forum?id=vESNKdEMGp
  283. Multimodal Pragmatic Jailbreak on Text-to-image Models – https://arxiv.org/abs/2409.19149
  284. No Free Lunch for Defending Against Prefilling Attack by In-Context Learning – https://arxiv.org/abs/2412.12192
  285. No Free Lunch with Guardrails – https://arxiv.org/abs/2504.00441
  286. “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak – https://arxiv.org/abs/2406.11668
  287. On Large Language Models’ Resilience to Coercive Interrogation – https://www.computer.org/csdl/proceedings-article/sp/2024/313000a252/1WPcZ9B0jCg
  288. On Prompt-Driven Safeguarding for Large Language Models – https://arxiv.org/abs/2401.18018
  289. On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs – https://openreview.net/forum?id=H3UayAQWoE
  290. One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs – https://arxiv.org/abs/2505.17598
  291. Open Sesame! Universal Black Box Jailbreaking of Large Language Models – https://arxiv.org/abs/2309.01446
  292. Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms – https://arxiv.org/abs/2503.24191
  293. OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
  294. PAL: Proxy-Guided Black-Box Attack on Large Language Models – https://arxiv.org/abs/2402.09674
  295. PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling – https://arxiv.org/abs/2502.01925
  296. PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks – https://arxiv.org/abs/2505.13862
  297. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning – https://arxiv.org/abs/2402.08416
  298. PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach – https://arxiv.org/abs/2409.14177
  299. Peering Behind the Shield: Guardrail Identification in Large Language Models – https://arxiv.org/abs/2502.01241
  300. PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning – https://arxiv.org/abs/2411.19335
  301. PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization – https://arxiv.org/abs/2504.01444
  302. PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization – https://arxiv.org/abs/2505.09921
  303. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues – https://arxiv.org/abs/2402.09091
  304. Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy – https://arxiv.org/abs/2503.20823
  305. Poisoned LangChain: Jailbreak LLMs by LangChain – https://arxiv.org/abs/2406.18122
  306. Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2408.08924
  307. Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary – https://arxiv.org/abs/2504.21038
  308. Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective – https://arxiv.org/abs/2411.16642
  309. PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing – https://arxiv.org/abs/2407.16318
  310. PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips – https://arxiv.org/abs/2412.07192
  311. Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation – https://arxiv.org/abs/2408.10668
  312. Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
  313. Prompt-Driven LLM Safeguarding via Directed Representation Optimization – https://arxiv.org/abs/2401.18018
  314. Protecting Your LLMs with Information Bottleneck – https://arxiv.org/abs/2404.13968
  315. PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails – https://arxiv.org/abs/2402.15911
  316. Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning – https://arxiv.org/abs/2401.10862
  317. PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety – https://arxiv.org/abs/2401.11880
  318. Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
  319. RAIN: Your Language Models Can Align Themselves without Finetuning – https://openreview.net/forum?id=pETSfWMUzy
  320. Rapid Response: Mitigating LLM Jailbreaks with a Few Examples – https://arxiv.org/abs/2411.07494
  321. Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity – https://arxiv.org/abs/2409.18708
  322. Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2502.11054
  323. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment – https://arxiv.org/abs/2308.09662
  324. RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking – https://arxiv.org/abs/2409.17458
  325. Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
  326. RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent – https://arxiv.org/abs/2407.16667
  327. Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning – https://arxiv.org/abs/2501.13080
  328. Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents – https://arxiv.org/abs/2410.13886
  329. ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs – https://arxiv.org/abs/2506.01770
  330. RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process – https://arxiv.org/abs/2410.08660
  331. RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content – https://arxiv.org/abs/2403.13031
  332. RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs – https://arxiv.org/abs/2406.08725
  333. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2401.17263
  334. RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction – https://arxiv.org/abs/2410.19937
  335. Robustifying Safety-Aligned Large Language Models through Clean Data Curation – https://arxiv.org/abs/2405.19358
  336. Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level – https://arxiv.org/abs/2410.06809
  337. RT-Attack: Jailbreaking Text-to-Image Models via Random Token – https://arxiv.org/abs/2408.13896
  338. Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2407.02855
  339. SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance – https://arxiv.org/abs/2406.18118
  340. SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models – https://arxiv.org/abs/2410.18927
  341. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding – https://aclanthology.org/2024.acl-long.303/
  342. SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning – https://arxiv.org/abs/2505.16186
  343. SafeText: Safe Text-to-image Models via Aligning the Text Encoder – https://arxiv.org/abs/2502.20623
  344. Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack – https://arxiv.org/abs/2312.06924
  345. Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models – https://arxiv.org/abs/2402.02207
  346. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions – https://openreview.net/forum?id=gT5hALch9z
  347. Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs – https://arxiv.org/abs/2501.02018
  348. SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming – https://arxiv.org/abs/2408.11851
  349. Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs – https://arxiv.org/abs/2404.07242
  350. SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage – https://arxiv.org/abs/2412.15289
  351. SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese – https://arxiv.org/abs/2310.05818
  352. Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation – https://arxiv.org/abs/2311.03348
  353. Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval – https://arxiv.org/abs/2505.15753
  354. SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner – https://arxiv.org/abs/2406.05498
  355. Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs – https://arxiv.org/abs/2402.14872
  356. SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains – https://arxiv.org/abs/2411.06426
  357. Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models – https://arxiv.org/abs/2412.17034
  358. ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs – https://arxiv.org/abs/2502.13162
  359. “Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence – https://arxiv.org/abs/2502.04204
  360. Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search – https://arxiv.org/abs/2503.10619
  361. Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
  362. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
  363. SneakyPrompt: Jailbreaking Text-to-image Generative Models – https://arxiv.org/abs/2305.12082
  364. SoK: Prompt Hacking of Large Language Models – https://arxiv.org/abs/2410.13901
  365. SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach – https://arxiv.org/abs/2411.11195
  366. SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack – https://arxiv.org/abs/2407.01902
  367. SOS! Soft Prompt Attack Against Open-Source Large Language Models – https://arxiv.org/abs/2407.03160
  368. Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models – https://arxiv.org/abs/2401.10647
  369. SPML: A DSL for Defending Language Models Against Prompt Attacks – https://arxiv.org/abs/2402.11755
  370. Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models – https://arxiv.org/abs/2501.02029
  371. SQL Injection Jailbreak: a structural disaster of large language models – https://arxiv.org/abs/2411.01565
  372. Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks – https://arxiv.org/abs/2503.00187
  373. StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models – https://arxiv.org/abs/2502.11853
  374. StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure – https://arxiv.org/abs/2406.08754
  375. StruQ: Defending Against Prompt Injection with Structured Queries – https://arxiv.org/html/2402.06363v2
  376. Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/abs/2504.05652
  377. Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild – https://arxiv.org/abs/2311.06237
  378. SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution – https://arxiv.org/abs/2309.14122
  379. Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack – https://arxiv.org/abs/2310.10844
  380. T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models – https://arxiv.org/abs/2504.15512
  381. Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak – https://arxiv.org/abs/2404.06407
  382. Tastle: Distract Large Language Models for Automatic Jailbreak Attack – https://arxiv.org/abs/2403.08424
  383. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game – https://openreview.net/forum?id=fsW7wJGLBd
  384. Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models – https://arxiv.org/abs/2505.22271
  385. The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models – https://arxiv.org/abs/2407.17915
  386. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions – https://arxiv.org/abs/2404.13208
  387. The Jailbreak Tax: How Useful are Your Jailbreak Outputs? – https://arxiv.org/abs/2504.10694
  388. The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense – https://arxiv.org/abs/2411.08410
  389. Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense – https://arxiv.org/abs/2503.11619
  390. Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models – https://arxiv.org/abs/2412.18171
  391. Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression – https://arxiv.org/abs/2504.20493
  392. Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models – https://arxiv.org/abs/2504.11106
  393. TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis – https://arxiv.org/abs/2505.08804
  394. ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages – https://aclanthology.org/2024.acl-long.119/
  395. Towards Robust Multimodal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2502.00653
  396. Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare – https://arxiv.org/abs/2501.18632
  397. Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.23558
  398. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically – https://arxiv.org/abs/2312.02119
  399. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks – https://arxiv.org/abs/2305.14965
  400. TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice – https://arxiv.org/abs/2502.18504
  401. Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security – https://arxiv.org/abs/2404.05264
  402. Understanding and Enhancing the Transferability of Jailbreaking Attacks – https://arxiv.org/abs/2502.03052
  403. Understanding Hidden Context in Preference Learning: Consequences for RLHF – https://openreview.net/forum?id=0tWTxYYPnW
  404. Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models – https://arxiv.org/abs/2406.09289
  405. Universal Adversarial Triggers Are Not Universal – https://arxiv.org/abs/2404.16020
  406. Universal and Transferable Adversarial Attacks on Aligned Language Models – https://arxiv.org/abs/2307.15043
  407. Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking – https://arxiv.org/abs/2409.08045
  408. Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer – https://arxiv.org/abs/2408.11313
  409. Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
  410. USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models – https://arxiv.org/abs/2505.23793
  411. Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
  412. Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection – https://arxiv.org/abs/2406.19845
  413. Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2306.13213
  414. Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character – https://arxiv.org/abs/2405.20773
  415. Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
  416. VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data – https://arxiv.org/abs/2410.00296
  417. Voice Jailbreak Attacks Against GPT-4o – https://arxiv.org/abs/2405.19103
  418. Weak-to-Strong Jailbreaking on Large Language Models – https://arxiv.org/abs/2401.17256
  419. What Is Jailbreaking In AI models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
  420. What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks – https://arxiv.org/abs/2411.03343
  421. What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs – https://arxiv.org/abs/2505.19773
  422. When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? – https://arxiv.org/abs/2407.15211
  423. When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search – https://arxiv.org/abs/2406.08705
  424. When Safety Detectors Aren’t Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques – https://arxiv.org/abs/2505.16765
  425. White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
  426. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs – https://arxiv.org/abs/2406.18495
  427. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – https://arxiv.org/abs/2406.18510
  428. X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability – https://arxiv.org/abs/2502.09990
  429. XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs – https://arxiv.org/abs/2504.21700
  430. X-Guard: Multilingual Guard Agent for Content Moderation – https://arxiv.org/abs/2504.08848
  431. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models – https://arxiv.org/abs/2308.01263
  432. X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents – https://arxiv.org/abs/2504.13203
  433. You Can’t Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense – https://arxiv.org/abs/2501.12210
  434. You Know What I’m Saying: Jailbreak Attack via Implicit Reference – https://arxiv.org/abs/2410.03857

Thanks for reading!
