Note that the entries below are listed in alphabetical order by title. Please let me know if there are any sources you would like to see added to this prompt injection and jailbreaking attack resource list. Enjoy!
- A Survey Of Attacks On Large Vision-Language Models: Resources, Advances, And Future Trends – https://arxiv.org/pdf/2407.07403
- Abusing Images And Sounds For Indirect Instruction Injection In Multi-Modal LLMs – https://arxiv.org/abs/2307.10490
- Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks On LLM Agents – https://arxiv.org/abs/2503.00061
- Adversarial Machine Learning: A Taxonomy And Terminology Of Attacks And Mitigations – https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf
- Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
- AmpleGCG: Learning A Universal And Transferable Generative Model Of Adversarial Suffixes For Jailbreaking Both Open And Closed LLMs – https://arxiv.org/abs/2404.07921
- Are Aligned Neural Networks Adversarially Aligned? – https://arxiv.org/abs/2306.15447
- Attacking Large Language Models With Projected Gradient Descent – https://arxiv.org/abs/2402.09154
- AutoDAN: Generating Stealthy Jailbreak Prompts On Aligned Large Language Models – https://arxiv.org/abs/2310.04451
- Automatically Auditing Large Language Models Via Discrete Optimization – https://arxiv.org/abs/2303.04381
- AutoPrompt: Eliciting Knowledge From Language Models With Automatically Generated Prompts – https://arxiv.org/abs/2010.15980
- Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
- Black Box Adversarial Prompting For Foundation Models – https://arxiv.org/abs/2302.04237
- Can Language Models Be Instructed To Protect Personal Information? – https://arxiv.org/abs/2310.02224
- DeepInception: Hypnotize Large Language Model To Be Jailbreaker – https://arxiv.org/abs/2311.03191
- Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
- Defense Against Prompt Injection Attack By Leveraging Attack Techniques – https://arxiv.org/pdf/2411.00459
- “Do Anything Now”: Characterizing And Evaluating In-The-Wild Jailbreak Prompts On Large Language Models – https://arxiv.org/abs/2308.03825
- ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
- Evaluating The Susceptibility Of Pre-Trained Language Models Via Handcrafted Adversarial Examples – https://arxiv.org/abs/2209.02128
- Exploiting Programmatic Behavior Of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
- Explore, Establish, Exploit: Red Teaming Language Models From Scratch – https://arxiv.org/abs/2306.09442
- FigStep: Jailbreaking Large Vision-Language Models Via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
- From Allies To Adversaries: Manipulating LLM Tool-Calling Through Adversarial Injection – https://arxiv.org/abs/2412.10198
- From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
- Fundamental Limitations Of Alignment In Large Language Models – https://arxiv.org/abs/2304.11082
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat With LLMs Via Cipher – https://arxiv.org/abs/2308.06463
- GPTFUZZER: Red Teaming Large Language Models With Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
- Here Comes The AI Worm: Unleashing Zero-click Worms That Target GenAI-Powered Applications – https://sites.google.com/view/compromptmized
- How Johnny Can Persuade LLMs To Jailbreak Them: Rethinking Persuasion To Challenge AI Safety By Humanizing LLMs – https://arxiv.org/abs/2401.06373
- How We Estimate The Risk From Prompt Injection Attacks On AI Systems – https://security.googleblog.com/2025/01/how-we-estimate-risk-from-prompt.html
- Ignore Previous Prompt: Attack Techniques For Language Models – https://arxiv.org/abs/2211.09527
- Image Hijacks: Adversarial Images Can Control Generative Models At Runtime – https://arxiv.org/abs/2309.00236
- ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
- Jailbreak And Guard Aligned Language Models With Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
- Jailbreak In Pieces: Compositional Adversarial Attacks On Multi-Modal Language Models – https://arxiv.org/abs/2307.14539
- JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
- Jailbreaking Attack Against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
- Jailbreaking Black Box Large Language Models In Twenty Queries – https://arxiv.org/abs/2310.08419
- Jailbreaking ChatGPT Via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
- Jailbreaking GPT-4V Via Self-Adversarial Attacks With System Prompts – https://arxiv.org/abs/2311.09127
- Jailbreaking Large Language Models In Infinitely Many Ways – https://arxiv.org/abs/2501.10800
- Jailbreaking Leading Safety-Aligned LLMs With Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
- Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/pdf/2307.02483
- JailDAM: Jailbreak Detection With Adaptive Memory For Vision-Language Model – https://arxiv.org/html/2504.03770v1
- Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
- LightDefense: A Lightweight Uncertainty-Driven Defense Against Jailbreaks Via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
- LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
- Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
- Metaphor-based Jailbreaking Attacks On Text-to-Image Models – https://arxiv.org/html/2503.17987v1
- MIRAGE: Multimodal Immersive Reasoning And Guided Exploration For Red-Team Jailbreak Attacks – https://arxiv.org/html/2503.19134v1
- MM-SafetyBench: A Benchmark For Safety Evaluation Of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
- Multi-modal Prompt Injection Image Attacks Against GPT-4V – https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/
- Multi-step Jailbreaking Privacy Attacks On ChatGPT – https://arxiv.org/abs/2304.05197
- Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection – https://arxiv.org/abs/2302.12173
- On Evaluating Adversarial Robustness Of Large Vision-Language Models – https://arxiv.org/abs/2305.16934
- Open Sesame! Universal Black Box Jailbreaking Of Large Language Models – https://arxiv.org/abs/2309.01446
- OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
- Prompt, Divide, And Conquer: Bypassing Large Language Model Safety Filters Via Segmented And Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
- Prompt Injection Attack Against LLM-integrated Applications – https://arxiv.org/abs/2306.05499
- Prompt Injection Attacks Against GPT-3 – https://simonwillison.net/2022/Sep/12/prompt-injection/
- Prompt Injections – https://saif.google/secure-ai-framework/risks
- Prompt Leaking – https://learnprompting.org/docs/prompt_hacking/leaking
- Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
- “Real Attackers Don’t Compute Gradients”: Bridging The Gap Between Adversarial ML Research And Practice – https://arxiv.org/abs/2212.14315
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
- Red Teaming Language Models With Language Models – https://arxiv.org/abs/2202.03286
- Reducing The Impact Of Prompt Injection Attacks Through Design – https://research.kudelskisecurity.com/2023/05/25/reducing-the-impact-of-prompt-injection-attacks-through-design/
- Riley Goodside (@goodside) – “Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.” – https://x.com/goodside/status/1569128808308957185
- RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning – https://arxiv.org/abs/2205.12548
- Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation – https://arxiv.org/abs/2311.03348
- SecAlign: Defending Against Prompt Injection With Preference Optimization – https://arxiv.org/html/2410.05451v2
- Siege: Autonomous Multi-Turn Jailbreaking Of Large Language Models With Tree Search – https://arxiv.org/html/2503.10619v1
- Siren: A Learning-Based Multi-Turn Attack Framework For Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
- Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/html/2504.05652v1
- Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks Via Adversarial Defense – https://www.researchgate.net/publication/389894870_Tit-for-Tat_Safeguarding_Large_Vision-Language_Models_Against_Jailbreak_Attacks_via_Adversarial_Defense
- Universal And Transferable Adversarial Attacks On Aligned Language Models – https://arxiv.org/abs/2307.15043
- Unveiling The Safety Of GPT-4o: An Empirical Study Using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
- Utilizing Jailbreak Probability To Attack And Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
- Vision-LLMs Can Fool Themselves With Self-Generated Typographic Attacks – https://arxiv.org/abs/2402.00626
- Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
- Visual-RolePlay: Universal Jailbreak Attack On MultiModal Large Language Models Via Role-playing Image Character – https://arxiv.org/abs/2405.20773
- Weak-to-Strong Jailbreaking On Large Language Models – https://openreview.net/forum?id=Nazzz5GJ4g
- What Is A Prompt Injection Attack? – https://www.ibm.com/think/topics/prompt-injection
- What Is Jailbreaking In AI Models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
- White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
- Why So Toxic? Measuring And Triggering Toxic Behavior In Open-Domain Chatbots – https://arxiv.org/abs/2209.03463
Thanks for reading!