Note that the entries below are listed in alphabetical order by title. Please let me know if there are any sources you would like to see added to this prompt injection and jailbreaking attack resource list. Enjoy!
- A Survey Of Attacks On Large Vision-Language Models: Resources, Advances, And Future Trends – https://arxiv.org/pdf/2407.07403
- Abusing Images And Sounds For Indirect Instruction Injection In Multi-Modal LLMs – https://arxiv.org/abs/2307.10490
- Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks On LLM Agents – https://arxiv.org/abs/2503.00061
- Adversarial Machine Learning: A Taxonomy And Terminology Of Attacks And Mitigations – https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf
- Adversarial Reasoning At Jailbreaking Time – https://arxiv.org/html/2502.01633v1
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
- AmpleGCG: Learning A Universal And Transferable Generative Model Of Adversarial Suffixes For Jailbreaking Both Open And Closed LLMs – https://arxiv.org/abs/2404.07921
- Are Aligned Neural Networks Adversarially Aligned? – https://arxiv.org/abs/2306.15447
- Attacking Large Language Models With Projected Gradient Descent – https://arxiv.org/abs/2402.09154
- AutoDAN: Generating Stealthy Jailbreak Prompts On Aligned Large Language Models – https://arxiv.org/abs/2310.04451
- Automatically Auditing Large Language Models Via Discrete Optimization – https://arxiv.org/abs/2303.04381
- AutoPrompt: Eliciting Knowledge From Language Models With Automatically Generated Prompts – https://arxiv.org/abs/2010.15980
- Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
- Black Box Adversarial Prompting For Foundation Models – https://arxiv.org/abs/2302.04237
- Can Language Models Be Instructed To Protect Personal Information? – https://arxiv.org/abs/2310.02224
- DeepInception: Hypnotize Large Language Model To Be Jailbreaker – https://arxiv.org/abs/2311.03191
- Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
- Defense Against Prompt Injection Attack By Leveraging Attack Techniques – https://arxiv.org/pdf/2411.00459
- “Do Anything Now”: Characterizing And Evaluating In-The-Wild Jailbreak Prompts On Large Language Models – https://arxiv.org/abs/2308.03825
- ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
- Evaluating The Susceptibility Of Pre-Trained Language Models Via Handcrafted Adversarial Examples – https://arxiv.org/abs/2209.02128
- Exploiting Programmatic Behavior Of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
- Explore, Establish, Exploit: Red Teaming Language Models From Scratch – https://arxiv.org/abs/2306.09442
- FigStep: Jailbreaking Large Vision-Language Models Via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
- From Allies To Adversaries: Manipulating LLM Tool-Calling Through Adversarial Injection – https://arxiv.org/abs/2412.10198
- From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
- Fundamental Limitations Of Alignment In Large Language Models – https://arxiv.org/abs/2304.11082
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat With LLMs Via Cipher – https://arxiv.org/abs/2308.06463
- GPTFUZZER: Red Teaming Large Language Models With Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
- Here Comes The AI Worm: Unleashing Zero-click Worms That Target GenAI-Powered Applications – https://sites.google.com/view/compromptmized
- How Johnny Can Persuade LLMs To Jailbreak Them: Rethinking Persuasion To Challenge AI Safety By Humanizing LLMs – https://arxiv.org/abs/2401.06373
- How We Estimate The Risk From Prompt Injection Attacks On AI Systems – https://security.googleblog.com/2025/01/how-we-estimate-risk-from-prompt.html
- Ignore Previous Prompt: Attack Techniques For Language Models – https://arxiv.org/abs/2211.09527
- Image Hijacks: Adversarial Images Can Control Generative Models At Runtime – https://arxiv.org/abs/2309.00236
- ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
- Jailbreak And Guard Aligned Language Models With Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
- Jailbreak In Pieces: Compositional Adversarial Attacks On Multi-Modal Language Models – https://arxiv.org/abs/2307.14539
- JailBreakV: A Benchmark For Assessing The Robustness Of MultiModal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
- Jailbreaking Attack Against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
- Jailbreaking Black Box Large Language Models In Twenty Queries – https://arxiv.org/abs/2310.08419
- Jailbreaking ChatGPT Via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
- Jailbreaking GPT-4V Via Self-Adversarial Attacks With System Prompts – https://arxiv.org/abs/2311.09127
- Jailbreaking Large Language Models In Infinitely Many Ways – https://arxiv.org/abs/2501.10800
- Jailbreaking Leading Safety-Aligned LLMs With Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
- Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/pdf/2307.02483
- JailDAM: Jailbreak Detection With Adaptive Memory For Vision-Language Model – https://arxiv.org/html/2504.03770v1
- Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
- LightDefense: A Lightweight Uncertainty-Driven Defense Against Jailbreaks Via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
- LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
- Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
- Metaphor-based Jailbreaking Attacks On Text-to-Image Models – https://arxiv.org/html/2503.17987v1
- MIRAGE: Multimodal Immersive Reasoning And Guided Exploration For Red-Team Jailbreak Attacks – https://arxiv.org/html/2503.19134v1
- MM-SafetyBench: A Benchmark For Safety Evaluation Of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
- Multi-modal Prompt Injection Image Attacks Against GPT-4V – https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/
- Multi-step Jailbreaking Privacy Attacks On ChatGPT – https://arxiv.org/abs/2304.05197
- Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection – https://arxiv.org/abs/2302.12173
- On Evaluating Adversarial Robustness Of Large Vision-Language Models – https://arxiv.org/abs/2305.16934
- Open Sesame! Universal Black Box Jailbreaking Of Large Language Models – https://arxiv.org/abs/2309.01446
- OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
- Prompt, Divide, And Conquer: Bypassing Large Language Model Safety Filters Via Segmented And Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
- Prompt Injection Attack Against LLM-integrated Applications – https://arxiv.org/abs/2306.05499
- Prompt Injection Attacks Against GPT-3 – https://simonwillison.net/2022/Sep/12/prompt-injection/
- Prompt Injections – https://saif.google/secure-ai-framework/risks
- Prompt Leaking – https://learnprompting.org/docs/prompt_hacking/leaking
- Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
- “Real Attackers Don’t Compute Gradients”: Bridging The Gap Between Adversarial ML Research And Practice – https://arxiv.org/abs/2212.14315
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
- Red Teaming Language Models With Language Models – https://arxiv.org/abs/2202.03286
- Reducing The Impact Of Prompt Injection Attacks Through Design – https://research.kudelskisecurity.com/2023/05/25/reducing-the-impact-of-prompt-injection-attacks-through-design/
- Riley Goodside (@goodside) – “Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.” – https://x.com/goodside/status/1569128808308957185
- RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning – https://arxiv.org/abs/2205.12548
- Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation – https://arxiv.org/abs/2311.03348
- SecAlign: Defending Against Prompt Injection With Preference Optimization – https://arxiv.org/html/2410.05451v2
- Siege: Autonomous Multi-Turn Jailbreaking Of Large Language Models With Tree Search – https://arxiv.org/html/2503.10619v1
- Siren: A Learning-Based Multi-Turn Attack Framework For Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
- Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/html/2504.05652v1
- Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks Via Adversarial Defense – https://www.researchgate.net/publication/389894870_Tit-for-Tat_Safeguarding_Large_Vision-Language_Models_Against_Jailbreak_Attacks_via_Adversarial_Defense
- Universal And Transferable Adversarial Attacks On Aligned Language Models – https://arxiv.org/abs/2307.15043
- Unveiling The Safety Of GPT-4o: An Empirical Study Using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
- Utilizing Jailbreak Probability To Attack And Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
- Vision-LLMs Can Fool Themselves With Self-Generated Typographic Attacks – https://arxiv.org/abs/2402.00626
- Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://ojs.aaai.org/index.php/AAAI/article/view/30150
- Visual-RolePlay: Universal Jailbreak Attack On MultiModal Large Language Models Via Role-playing Image Character – https://arxiv.org/abs/2405.20773
- Weak-to-Strong Jailbreaking On Large Language Models – https://openreview.net/forum?id=Nazzz5GJ4g
- What Is A Prompt Injection Attack? – https://www.ibm.com/think/topics/prompt-injection
- What Is Jailbreaking In AI Models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
- White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
- Why So Toxic? Measuring And Triggering Toxic Behavior In Open-Domain Chatbots – https://arxiv.org/abs/2209.03463
Thanks for reading!