A List Of AI Prompt Injection And Jailbreaking Attack Resources

Posted on June 7, 2025 by Brian Colwell

Note that the resources below are listed in alphabetical order by title. Please let me know if there are any sources you would like to see added to this prompt injection and jailbreaking attack resource list. Enjoy!

  1. A Survey Of Attacks On Large Vision-Language Models: Resources, Advances, And Future Trends – https://arxiv.org/pdf/2407.07403
  2. Abusing Images And Sounds For Indirect Instruction Injection In Multi-Modal LLMs – https://arxiv.org/abs/2307.10490
  3. Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks On LLM Agents – https://arxiv.org/abs/2503.00061
  4. Adversarial Machine Learning – A Taxonomy And Terminology Of Attacks And Mitigations – https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf
  5. Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
  6. Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
  7. AmpleGCG: Learning A Universal And Transferable Generative Model Of Adversarial Suffixes For Jailbreaking Both Open And Closed LLMs – https://arxiv.org/abs/2404.07921
  8. Are Aligned Neural Networks Adversarially Aligned? – https://arxiv.org/abs/2306.15447
  9. Attacking Large Language Models With Projected Gradient Descent – https://arxiv.org/abs/2402.09154
  10. AutoDAN: Generating Stealthy Jailbreak Prompts On Aligned Large Language Models – https://arxiv.org/abs/2310.04451
  11. Automatically Auditing Large Language Models Via Discrete Optimization – https://arxiv.org/abs/2303.04381
  12. AutoPrompt: Eliciting Knowledge From Language Models With Automatically Generated Prompts – https://arxiv.org/abs/2010.15980
  13. Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
  14. Black Box Adversarial Prompting For Foundation Models – https://arxiv.org/abs/2302.04237
  15. Can Language Models Be Instructed To Protect Personal Information? – https://arxiv.org/abs/2310.02224
  16. DeepInception: Hypnotize Large Language Model To Be Jailbreaker – https://arxiv.org/abs/2311.03191
  17. Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
  18. Defense Against Prompt Injection Attack By Leveraging Attack Techniques – https://arxiv.org/pdf/2411.00459
  19. “Do Anything Now”: Characterizing And Evaluating In-The-Wild Jailbreak Prompts On Large Language Models – https://arxiv.org/abs/2308.03825
  20. ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
  21. Evaluating The Susceptibility Of Pre-Trained Language Models Via Handcrafted Adversarial Examples – https://arxiv.org/abs/2209.02128
  22. Exploiting Programmatic Behavior Of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
  23. Explore, Establish, Exploit: Red Teaming Language Models From Scratch – https://arxiv.org/abs/2306.09442
  24. FigStep: Jailbreaking Large Vision-Language Models Via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
  25. From Allies To Adversaries: Manipulating LLM Tool-Calling Through Adversarial Injection – https://arxiv.org/abs/2412.10198
  26. From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
  27. Fundamental Limitations Of Alignment In Large Language Models – https://arxiv.org/abs/2304.11082
  28. GPTFUZZER: Red Teaming Large Language Models With Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
  29. GPT-4 Is Too Smart To Be Safe: Stealthy Chat With LLMs Via Cipher – https://arxiv.org/abs/2308.06463
  30. Here Comes The AI Worm: Unleashing Zero-click Worms That Target GenAI-Powered Applications – https://sites.google.com/view/compromptmized
  31. How Johnny Can Persuade LLMs To Jailbreak Them: Rethinking Persuasion To Challenge AI Safety By Humanizing LLMs – https://arxiv.org/abs/2401.06373
  32. How We Estimate The Risk From Prompt Injection Attacks On AI Systems – https://security.googleblog.com/2025/01/how-we-estimate-risk-from-prompt.html 
  33. Ignore Previous Prompt: Attack Techniques For Language Models – https://arxiv.org/abs/2211.09527 
  34. Image Hijacks: Adversarial Images Can Control Generative Models At Runtime – https://arxiv.org/abs/2309.00236
  35. ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
  36. Jailbreak And Guard Aligned Language Models With Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
  37. Jailbreak In Pieces: Compositional Adversarial Attacks On Multi-Modal Language Models – https://arxiv.org/abs/2307.14539
  38. JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
  39. Jailbreaking Attack Against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
  40. Jailbreaking Black Box Large Language Models In Twenty Queries – https://arxiv.org/abs/2310.08419
  41. Jailbreaking ChatGPT Via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
  42. Jailbreaking GPT-4V Via Self-Adversarial Attacks With System Prompts – https://arxiv.org/abs/2311.09127
  43. Jailbreaking Large Language Models In Infinitely Many Ways – https://arxiv.org/abs/2501.10800
  44. Jailbreaking Leading Safety-Aligned LLMs With Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
  45. Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/pdf/2307.02483
  46. JailDAM: Jailbreak Detection With Adaptive Memory For Vision-Language Model – https://arxiv.org/html/2504.03770v1
  47. Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
  48. LightDefense: A Lightweight Uncertainty-Driven Defense Against Jailbreaks Via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
  49. LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
  50. Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
  51. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
  52. Metaphor-based Jailbreaking Attacks On Text-to-Image Models – https://arxiv.org/html/2503.17987v1
  53. MIRAGE: Multimodal Immersive Reasoning And Guided Exploration For Red-Team Jailbreak Attacks – https://arxiv.org/html/2503.19134v1
  54. MM-SafetyBench: A Benchmark For Safety Evaluation Of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
  55. Multi-modal Prompt Injection Image Attacks Against GPT-4V – https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/
  56. Multi-step Jailbreaking Privacy Attacks On ChatGPT – https://arxiv.org/abs/2304.05197
  57. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection – https://arxiv.org/abs/2302.12173 
  58. On Evaluating Adversarial Robustness Of Large Vision-Language Models – https://arxiv.org/abs/2305.16934
  59. Open Sesame! Universal Black Box Jailbreaking Of Large Language Models – https://arxiv.org/abs/2309.01446
  60. OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
  61. Prompt, Divide, And Conquer: Bypassing Large Language Model Safety Filters Via Segmented And Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
  62. Prompt Injections – https://saif.google/secure-ai-framework/risks
  63. Prompt Injection Attacks Against GPT-3 – https://simonwillison.net/2022/Sep/12/prompt-injection/
  64. Prompt Injection Attack Against LLM-integrated Applications – https://arxiv.org/abs/2306.05499
  65. Prompt Leaking – https://learnprompting.org/docs/prompt_hacking/leaking
  66. Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
  67. “Real Attackers Don’t Compute Gradients”: Bridging The Gap Between Adversarial ML Research And Practice – https://arxiv.org/abs/2212.14315 
  68. Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
  69. Red Teaming Language Models With Language Models – https://arxiv.org/abs/2202.03286
  70. Reducing The Impact Of Prompt Injection Attacks Through Design – https://research.kudelskisecurity.com/2023/05/25/reducing-the-impact-of-prompt-injection-attacks-through-design/
  71. Riley Goodside (@goodside) – “Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.” – https://x.com/goodside/status/1569128808308957185 
  72. RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning – https://arxiv.org/abs/2205.12548
  73. Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation – https://arxiv.org/abs/2311.03348
  74. SecAlign: Defending Against Prompt Injection With Preference Optimization – https://arxiv.org/html/2410.05451v2
  75. Siege: Autonomous Multi-Turn Jailbreaking Of Large Language Models With Tree Search – https://arxiv.org/html/2503.10619v1
  76. Siren: A Learning-Based Multi-Turn Attack Framework For Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
  77. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
  78. Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/html/2504.05652v1
  79. Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks Via Adversarial Defense – https://www.researchgate.net/publication/389894870_Tit-for-Tat_Safeguarding_Large_Vision-Language_Models_Against_Jailbreak_Attacks_via_Adversarial_Defense 
  80. Universal And Transferable Adversarial Attacks On Aligned Language Models – https://arxiv.org/abs/2307.15043
  81. Unveiling The Safety Of GPT-4o: An Empirical Study Using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
  82. Utilizing Jailbreak Probability To Attack And Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
  83. Vision-LLMs Can Fool Themselves With Self-Generated Typographic Attacks – https://arxiv.org/abs/2402.00626
  84. Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
  85. Visual-RolePlay: Universal Jailbreak Attack On MultiModal Large Language Models Via Role-playing Image Character – https://arxiv.org/abs/2405.20773
  86. Weak-to-Strong Jailbreaking On Large Language Models – https://openreview.net/forum?id=Nazzz5GJ4g
  87. What Is A Prompt Injection Attack? – https://www.ibm.com/think/topics/prompt-injection
  88. What Is Jailbreaking In AI Models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
  89. White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
  90. Why So Toxic? Measuring And Triggering Toxic Behavior In Open-Domain Chatbots – https://arxiv.org/abs/2209.03463

Thanks for reading!
