Note that the entries below are listed in alphabetical order by title. Please let me know if there are any sources you would like to see added to this list. Enjoy!
- A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models – https://arxiv.org/abs/2312.10982
- A Cross-Language Investigation into Jailbreak Attacks in Large Language Models – https://arxiv.org/abs/2401.16765
- A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses – https://arxiv.org/abs/2407.02551
- A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares – https://arxiv.org/abs/2408.05061
- A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos – https://arxiv.org/abs/2502.15806
- A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection – https://arxiv.org/abs/2312.10766
- A StrongREJECT for Empty Jailbreaks – https://arxiv.org/abs/2402.10260
- A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations – https://arxiv.org/abs/2502.14881
- A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily – https://arxiv.org/abs/2311.08268
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs – https://arxiv.org/abs/2409.07503
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting – https://arxiv.org/abs/2403.09513
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender – https://arxiv.org/abs/2504.09466
- Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models – https://arxiv.org/abs/2408.14866
- Adversarial Attacks on GPT-4 via Simple Random Search – https://www.andriushchenko.me/gpt4adv.pdf
- Adversarial Attacks on LLMs – https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
- Adversarial Attacks on Large Language Models Using Regularized Relaxation – https://arxiv.org/abs/2410.19160
- Adversarial Demonstration Attacks on Large Language Models – https://arxiv.org/abs/2305.14950
- Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs – https://arxiv.org/abs/2502.15427
- Adversarial Reasoning at Jailbreaking Time – https://arxiv.org/abs/2502.01633
- Adversarial Suffixes May Be Features Too! – https://arxiv.org/abs/2410.00451
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs – https://arxiv.org/abs/2406.06622
- Adversaries Can Misuse Combinations of Safe Models – https://arxiv.org/abs/2406.14595
- Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training – https://arxiv.org/abs/2502.11455
- AdvPrefix: An Objective for Nuanced LLM Jailbreaks – https://arxiv.org/abs/2412.10321
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs – https://arxiv.org/abs/2404.16873
- AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models – https://arxiv.org/abs/2412.08608
- AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents – https://arxiv.org/abs/2410.17401
- AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts – https://arxiv.org/abs/2404.05993
- AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models – https://arxiv.org/abs/2412.18123
- Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models – https://arxiv.org/abs/2404.00629
- Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast – https://arxiv.org/abs/2402.08567
- AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds – https://arxiv.org/abs/2502.00757
- Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification – https://arxiv.org/abs/2503.11185
- Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models – https://arxiv.org/abs/2506.01307
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks – https://arxiv.org/abs/2401.09798
- AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs – https://arxiv.org/abs/2404.07921
- Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate – https://arxiv.org/abs/2504.16489
- Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak – https://arxiv.org/abs/2312.04127
- Antelope: Potent and Concealed Jailbreak Attack Strategy – https://arxiv.org/abs/2412.08156
- Are PPO-ed Language Models Hackable? – https://arxiv.org/abs/2406.02577
- Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts – https://arxiv.org/abs/2407.15050
- ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs – https://arxiv.org/abs/2402.11753
- Attack Prompt Generation for Red Teaming and Defending Large Language Models – https://arxiv.org/abs/2310.12505
- AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models – https://arxiv.org/abs/2401.09002
- Attacking Large Language Models with Projected Gradient Descent – https://arxiv.org/abs/2402.09154
- AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models – https://arxiv.org/abs/2505.14103
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models – https://arxiv.org/abs/2501.01830
- AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs – https://arxiv.org/abs/2410.05295
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models – https://openreview.net/forum?id=7Jwpw4qKkb
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models – https://arxiv.org/abs/2310.15140
- AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks – https://arxiv.org/abs/2403.04783
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens – https://arxiv.org/abs/2406.03805
- Automatic Jailbreaking of the Text-to-Image Generative AI Systems – https://arxiv.org/abs/2405.16567
- Automatically Auditing Large Language Models via Discrete Optimization – https://arxiv.org/abs/2303.04381
- AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models – https://arxiv.org/abs/2505.10846
- Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models – https://arxiv.org/abs/2410.14479
- BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge – https://arxiv.org/abs/2503.00596
- Badllama 3: removing safety finetuning from Llama 3 in minutes – https://arxiv.org/abs/2407.01376
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs – https://arxiv.org/abs/2406.09324
- BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs – https://arxiv.org/abs/2412.05892
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models – https://arxiv.org/abs/2309.00614
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger – https://arxiv.org/abs/2408.09093
- BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards – https://arxiv.org/abs/2406.01364
- Best-of-N Jailbreaking – https://arxiv.org/abs/2412.03556
- Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs – https://arxiv.org/abs/2502.19041
- Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models – https://arxiv.org/abs/2502.19883
- BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage – https://arxiv.org/abs/2506.02479
- BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.09804
- Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement – https://arxiv.org/abs/2402.15180
- Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space – https://arxiv.org/abs/2505.21277
- Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails – https://arxiv.org/abs/2504.11168
- Can a large language model be a gaslighter? – https://arxiv.org/abs/2410.10700
- Can Large Language Models Automatically Jailbreak GPT-4V? – https://arxiv.org/abs/2407.16686
- Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent – https://arxiv.org/abs/2405.03654
- Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation – https://arxiv.org/abs/2503.06519
- Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation – https://openreview.net/forum?id=r42tSSCHPh
- CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models – https://arxiv.org/abs/2502.11379
- Certifying LLM Safety against Adversarial Prompting – https://arxiv.org/abs/2309.02705
- Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM – https://arxiv.org/abs/2405.05610
- Chain-of-Attack: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models – https://arxiv.org/abs/2410.03869
- Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models – https://arxiv.org/abs/2505.17519
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion – https://aclanthology.org/2024.findings-acl.679/
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models – https://arxiv.org/abs/2402.16717
- Coercing LLMs to do and reveal (almost) anything – https://arxiv.org/abs/2402.14020
- COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability – https://arxiv.org/abs/2402.08679
- Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs – https://arxiv.org/abs/2404.14461
- Comprehensive Assessment of Jailbreak Attacks Against LLMs – https://arxiv.org/abs/2402.05668
- Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities – https://arxiv.org/abs/2506.00548
- Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI – https://arxiv.org/abs/2504.13201
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming – https://arxiv.org/abs/2501.18837
- Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models – https://arxiv.org/abs/2407.13796
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation – https://arxiv.org/abs/2406.20053
- Cross-Modal Safety Alignment: Is textual unlearning all you need? – https://arxiv.org/abs/2406.02575
- Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models – https://arxiv.org/abs/2405.20775
- Cross-Task Defense: Instruction-Tuning LLMs for Content Safety – https://arxiv.org/abs/2405.15202
- Dark LLMs: The Growing Threat of Unaligned AI Models – https://arxiv.org/abs/2505.10066
- DART: Deep Adversarial Automated Red Teaming for LLM Safety – https://arxiv.org/abs/2407.03876
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation – https://arxiv.org/abs/2410.11317
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker – https://arxiv.org/abs/2311.03191
- Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking via Prompt Evaluation – https://arxiv.org/abs/2502.00580
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM – https://arxiv.org/abs/2309.14348
- Defending ChatGPT Against Jailbreak Attack Via Self-Reminder – https://www.researchsquare.com/article/rs-2873090/v1
- Defending Jailbreak Attack in VLMs via Cross-modality Information Detector – https://arxiv.org/abs/2407.21659
- Defending Large Language Models Against Attacks With Residual Stream Activation Analysis – https://arxiv.org/abs/2406.03230
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing – https://arxiv.org/abs/2405.18166
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing – https://arxiv.org/abs/2402.16192
- Defending LLMs against Jailbreaking Attacks via Backtranslation – https://aclanthology.org/2024.findings-acl.948/
- Defending LVLMs Against Vision Attacks through Partial-Perception Supervision – https://arxiv.org/abs/2412.12722
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks – https://arxiv.org/abs/2405.20099
- DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing – https://arxiv.org/abs/2502.11647
- Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues – https://arxiv.org/abs/2410.10700
- Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models – https://arxiv.org/abs/2408.14853
- Detecting Language Model Attacks with Perplexity – https://arxiv.org/abs/2308.14132
- Detoxifying Large Language Models via Knowledge Editing – https://arxiv.org/abs/2403.14472
- ‘Do as I say not as I do’: A Semi-Automated Approach For Jailbreak Prompt Attack Against Multimodal LLMs – https://arxiv.org/html/2502.00735
- “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models – https://arxiv.org/abs/2308.03825
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? – https://arxiv.org/abs/2504.10000
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? – https://arxiv.org/abs/2405.05904
- Does Refusal Training in LLMs Generalize to the Past Tense? – https://arxiv.org/abs/2407.11969
- Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models – https://arxiv.org/abs/2403.17336
- Don’t Say No: Jailbreaking LLM by Suppressing Refusal – https://arxiv.org/abs/2404.16369
- DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers – https://arxiv.org/abs/2402.16914
- DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization – https://arxiv.org/abs/2504.18564
- EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models – https://arxiv.org/abs/2408.11308
- Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector – https://arxiv.org/abs/2410.22888
- Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs – https://arxiv.org/abs/2409.14866
- Efficient Adversarial Training in LLMs with Continuous Attacks – https://arxiv.org/abs/2405.15589
- Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content – https://arxiv.org/abs/2502.20952
- EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models – https://arxiv.org/abs/2502.14976
- Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks – https://arxiv.org/abs/2409.00137
- Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization – https://arxiv.org/abs/2502.09755
- Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning – https://arxiv.org/abs/2501.19180
- EnJa: Ensemble Jailbreak on Large Language Models – https://arxiv.org/abs/2408.03603
- Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge – https://arxiv.org/abs/2404.05880
- Evil Geniuses: Delving into the Safety of LLM-based Agents – https://arxiv.org/abs/2311.11855
- Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking – https://arxiv.org/abs/2502.13527
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks – https://arxiv.org/abs/2302.05733
- Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion – https://arxiv.org/abs/2505.14316
- Exploring Scaling Trends in LLM Robustness – https://arxiv.org/abs/2407.18213
- ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content – https://arxiv.org/abs/2503.09964
- Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models – https://arxiv.org/abs/2410.15362
- FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts – https://arxiv.org/abs/2502.21059
- Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs – https://arxiv.org/abs/2410.16327
- FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts – https://arxiv.org/abs/2311.05608
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2407.16205
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks – https://arxiv.org/abs/2412.07672
- FlipAttack: Jailbreak LLMs via Flipping – https://arxiv.org/abs/2410.02832
- from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors – https://arxiv.org/abs/2503.00038
- From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy – https://ieeexplore.ieee.org/abstract/document/10198233
- From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs – https://arxiv.org/abs/2502.00735
- From LLMs To MLLMs: Exploring The Landscape Of Multimodal Jailbreaking – https://arxiv.org/abs/2406.14859
- From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings – https://arxiv.org/abs/2402.16006
- Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks – https://arxiv.org/abs/2410.04234
- FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models – https://arxiv.org/abs/2309.05274
- GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs – https://arxiv.org/abs/2411.14133
- GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance – https://arxiv.org/abs/2505.23839
- Geneshift: Impact of different scenario shift on Jailbreaking LLM – https://arxiv.org/abs/2504.08104
- Goal-guided Generative Prompt Injection Attack on Large Language Models – https://arxiv.org/abs/2404.07234
- Goal-Oriented Prompt Attack and Safety Evaluation for LLMs – https://arxiv.org/abs/2309.11830
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher – https://openreview.net/forum?id=MbfAK4s61A
- GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation – https://arxiv.org/abs/2405.13077
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts – https://arxiv.org/abs/2309.10253
- Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes – https://arxiv.org/abs/2403.00867
- GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis – https://aclanthology.org/2024.acl-long.30/
- Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation – https://arxiv.org/abs/2501.18638
- Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs – https://arxiv.org/abs/2504.19019
- GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms – https://arxiv.org/abs/2504.13052
- Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack – https://arxiv.org/abs/2404.01833
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models – https://arxiv.org/abs/2402.03299
- GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning – https://arxiv.org/abs/2505.11049
- GuardReasoner: Towards Reasoning-based LLM Safeguards – https://arxiv.org/abs/2501.18492
- GuidedBench: Equipping Jailbreak Evaluation with Guidelines – https://arxiv.org/abs/2502.16903
- h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment – https://arxiv.org/abs/2408.04811
- Hacc-Man: An Arcade Game for Jailbreaking LLMs – https://arxiv.org/abs/2405.15902
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal – https://arxiv.org/abs/2402.04249
- Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.04190
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models – https://arxiv.org/abs/2412.05934
- Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles – https://arxiv.org/abs/2408.11182
- How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States – https://arxiv.org/abs/2406.05644
- How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries – https://arxiv.org/abs/2402.15302
- How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation – https://arxiv.org/abs/2502.14486
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs – https://arxiv.org/abs/2401.06373
- HSF: Defending against Jailbreak Attacks with Hidden State Filtering – https://arxiv.org/abs/2409.03788
- Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything – https://arxiv.org/abs/2407.02534
- Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models – https://arxiv.org/abs/2403.09792
- ImgTrojan: Jailbreaking Vision-Language Models With ONE Image – https://arxiv.org/abs/2403.02910
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment – https://arxiv.org/abs/2411.18688
- Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models – https://arxiv.org/abs/2407.15399
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses – https://arxiv.org/abs/2406.01288
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs – https://arxiv.org/abs/2405.20778
- Improved Large Language Model Jailbreak Detection via Pretrained Embeddings – https://arxiv.org/abs/2412.01547
- Improved Techniques for Optimization-Based Jailbreaking on Large Language Models – https://arxiv.org/abs/2405.21018
- Improving Alignment and Robustness with Short Circuiting – https://arxiv.org/abs/2406.04313
- Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration – https://arxiv.org/abs/2505.17066
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models – https://arxiv.org/abs/2411.16769
- Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems – https://arxiv.org/abs/2504.20376
- Increased LLM Vulnerabilities from Fine-tuning and Quantization – https://arxiv.org/abs/2404.04392
- Injecting Universal Jailbreak Backdoors into LLMs in Minutes – https://arxiv.org/abs/2502.10438
- Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender – https://arxiv.org/abs/2401.06561
- Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment – https://arxiv.org/abs/2402.14016
- Is the System Message Really Important to Jailbreaks in Large Language Models? – https://arxiv.org/abs/2402.14857
- JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks – https://arxiv.org/abs/2404.03027
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations – https://arxiv.org/abs/2310.06387
- Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models – https://arxiv.org/abs/2410.02298
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey – https://arxiv.org/abs/2407.04295
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models – https://arxiv.org/abs/2404.01318
- Jailbreak Distillation: Renewable Safety Benchmarking – https://arxiv.org/abs/2505.22037
- JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models – https://arxiv.org/abs/2406.09321
- Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models – https://openreview.net/forum?id=plmBsXHxgR
- JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit – https://arxiv.org/abs/2411.11114
- JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models – https://arxiv.org/abs/2404.08793
- Jailbreak Open-Sourced Large Language Models via Enforced Decoding – https://aclanthology.org/2024.acl-long.299/
- Jailbreak Paradox: The Achilles’ Heel of LLMs – https://arxiv.org/abs/2406.12702
- Jailbreak Prompt Attack: A Controllable Adversarial Attack against Diffusion Models – https://arxiv.org/abs/2404.02928
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt – https://arxiv.org/abs/2406.04031
- Jailbreaking Attack against Multimodal Large Language Model – https://arxiv.org/abs/2402.02309
- Jailbreaking Black Box Large Language Models in Twenty Queries – https://arxiv.org/abs/2310.08419
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study – https://arxiv.org/abs/2305.13860
- Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks – https://arxiv.org/abs/2503.01395
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts – https://arxiv.org/abs/2311.09127
- Jailbreaking is Best Solved by Definition – https://arxiv.org/abs/2403.14725
- Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters – https://arxiv.org/abs/2405.20413
- Jailbreaking Large Language Models in Infinitely Many Ways – https://arxiv.org/abs/2501.10800
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks – https://arxiv.org/abs/2404.02151
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency – https://arxiv.org/abs/2501.04931
- Jailbreaking Proprietary Large Language Models using Word Substitution Cipher – https://arxiv.org/abs/2402.10601
- Jailbreaking Safeguarded Text-to-Image Models via Large Language Models – https://arxiv.org/abs/2503.01839
- Jailbreaking Text-to-Image Models with LLM-Based Agents – https://arxiv.org/abs/2408.00523
- Jailbreaking with Universal Multi-Prompts – https://arxiv.org/abs/2502.01154
- JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift – https://arxiv.org/abs/2504.19440
- JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models – https://arxiv.org/abs/2407.01599
- Jailbroken: How Does LLM Safety Training Fail? – https://arxiv.org/abs/2307.02483
- JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model – https://arxiv.org/abs/2504.03770
- JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs – https://arxiv.org/abs/2412.15623
- JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models – https://arxiv.org/abs/2505.17568
- JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing – https://arxiv.org/abs/2503.08990
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation – https://arxiv.org/abs/2502.07557
- JULI: Jailbreak Large Language Models by Self-Introspection – https://arxiv.org/abs/2505.11790
- KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs – https://arxiv.org/abs/2502.05223
- Kevin Liu (@kliu128) – “The entire prompt of Microsoft Bing Chat?! (Hi, Sydney.)” – https://x.com/kliu128/status/1623472922374574080
- Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack – https://arxiv.org/abs/2406.11682
- LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs – https://arxiv.org/abs/2505.10838
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models – https://arxiv.org/abs/2307.08487
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense – https://arxiv.org/abs/2501.02629
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks – https://arxiv.org/abs/2402.09177
- LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution – https://arxiv.org/abs/2504.01533
- LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities – https://arxiv.org/abs/2505.05619
- LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection And DistilBERT-Based Ethics Judgment – https://www.mdpi.com/2078-2489/16/3/204
- LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? – https://arxiv.org/abs/2307.10719
- LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet – https://arxiv.org/abs/2408.15221
- LLM Jailbreak Attack versus Defense Techniques — A Comprehensive Study – https://arxiv.org/abs/2402.13457
- LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models – https://arxiv.org/abs/2501.00055
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper – https://arxiv.org/abs/2402.15727
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation – https://arxiv.org/abs/2405.13068
- Low-Resource Languages Jailbreak GPT-4 – https://arxiv.org/abs/2310.02446
- Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization – https://arxiv.org/abs/2503.11750
- Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2502.09723
- Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction – https://arxiv.org/abs/2402.18104
- Many-shot Jailbreaking – https://cdn.sanity.io/files/4zrzovbb/website/af5633c94ed2beb282f6a53c595eb437e8e7b630.pdf
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming – https://arxiv.org/abs/2311.07689
- MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots – https://arxiv.org/abs/2307.08715
- Merging Improves Self-Critique Against Jailbreak Attacks – https://arxiv.org/abs/2406.07188
- Metaphor-based Jailbreaking Attacks on Text-to-Image Models – https://arxiv.org/abs/2503.17987
- Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking – https://arxiv.org/abs/2504.05838
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks – https://arxiv.org/abs/2503.19134
- MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting – https://arxiv.org/abs/2503.12931
- Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment – https://arxiv.org/abs/2402.14968
- Mitigating Many-Shot Jailbreaking – https://arxiv.org/abs/2504.09604
- MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models – https://arxiv.org/abs/2406.07594
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models – https://arxiv.org/abs/2311.17600
- MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models – https://arxiv.org/abs/2408.08464
- Model-Editing-Based Jailbreak against Safety-aligned Large Language Models – https://arxiv.org/abs/2412.08201
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks – https://arxiv.org/abs/2409.17699
- “Moralized” Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models – https://arxiv.org/abs/2411.16730
- MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue – https://arxiv.org/abs/2411.03814
- Multi-step Jailbreaking Privacy Attacks on ChatGPT – https://arxiv.org/abs/2304.05197
- Multilingual and Multi-Accent Jailbreaking of Audio LLMs – https://arxiv.org/abs/2504.01094
- Multilingual Jailbreak Challenges in Large Language Models – https://openreview.net/forum?id=vESNKdEMGp
- Multimodal Pragmatic Jailbreak on Text-to-image Models – https://arxiv.org/abs/2409.19149
- No Free Lunch for Defending Against Prefilling Attack by In-Context Learning – https://arxiv.org/abs/2412.12192
- No Free Lunch with Guardrails – https://arxiv.org/abs/2504.00441
- “Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak – https://arxiv.org/abs/2406.11668
- On Large Language Models’ Resilience to Coercive Interrogation – https://www.computer.org/csdl/proceedings-article/sp/2024/313000a252/1WPcZ9B0jCg
- On Prompt-Driven Safeguarding for Large Language Models – https://arxiv.org/abs/2401.18018
- On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs – https://openreview.net/forum?id=H3UayAQWoE
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs – https://arxiv.org/abs/2505.17598
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models – https://arxiv.org/abs/2309.01446
- Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms – https://arxiv.org/abs/2503.24191
- OWASP Top 10 For Large Language Model Applications – https://owasp.org/www-project-top-10-for-large-language-model-applications/
- PAL: Proxy-Guided Black-Box Attack on Large Language Models – https://arxiv.org/abs/2402.09674
- PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling – https://arxiv.org/abs/2502.01925
- PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks – https://arxiv.org/abs/2505.13862
- Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning – https://arxiv.org/abs/2402.08416
- PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach – https://arxiv.org/abs/2409.14177
- Peering Behind the Shield: Guardrail Identification in Large Language Models – https://arxiv.org/abs/2502.01241
- PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning – https://arxiv.org/abs/2411.19335
- PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization – https://arxiv.org/abs/2504.01444
- PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization – https://arxiv.org/abs/2505.09921
- Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues – https://arxiv.org/abs/2402.09091
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy – https://arxiv.org/abs/2503.20823
- Poisoned LangChain: Jailbreak LLMs by LangChain – https://arxiv.org/abs/2406.18122
- Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary – https://arxiv.org/abs/2504.21038
- Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2408.08924
- Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective – https://arxiv.org/abs/2411.16642
- PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing – https://arxiv.org/abs/2407.16318
- PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips – https://arxiv.org/abs/2412.07192
- Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation – https://arxiv.org/abs/2408.10668
- Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing – https://arxiv.org/abs/2503.21598
- Prompt-Driven LLM Safeguarding via Directed Representation Optimization – https://arxiv.org/abs/2401.18018
- Protecting Your LLMs with Information Bottleneck – https://arxiv.org/abs/2404.13968
- PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails – https://arxiv.org/abs/2402.15911
- Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning – https://arxiv.org/abs/2401.10862
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety – https://arxiv.org/abs/2401.11880
- Query-Based Adversarial Prompt Generation – https://arxiv.org/abs/2402.12329
- RAIN: Your Language Models Can Align Themselves without Finetuning – https://openreview.net/forum?id=pETSfWMUzy
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples – https://arxiv.org/abs/2411.07494
- Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity – https://arxiv.org/abs/2409.18708
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2502.11054
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment – https://arxiv.org/abs/2308.09662
- RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking – https://arxiv.org/abs/2409.17458
- Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? – https://arxiv.org/abs/2404.03411
- RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent – https://arxiv.org/abs/2407.16667
- Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning – https://arxiv.org/abs/2501.13080
- Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents – https://arxiv.org/abs/2410.13886
- ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs – https://arxiv.org/abs/2506.01770
- RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process – https://arxiv.org/abs/2410.08660
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content – https://arxiv.org/abs/2403.13031
- RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs – https://arxiv.org/abs/2406.08725
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2401.17263
- Robustifying Safety-Aligned Large Language Models through Clean Data Curation – https://arxiv.org/abs/2405.19358
- RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction – https://arxiv.org/abs/2410.19937
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level – https://arxiv.org/abs/2410.06809
- RT-Attack: Jailbreaking Text-to-Image Models via Random Token – https://arxiv.org/abs/2408.13896
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks – https://arxiv.org/abs/2407.02855
- SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance – https://arxiv.org/abs/2406.18118
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models – https://arxiv.org/abs/2410.18927
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding – https://aclanthology.org/2024.acl-long.303/
- Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs – https://arxiv.org/abs/2501.02018
- SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning – https://arxiv.org/abs/2505.16186
- SafeText: Safe Text-to-image Models via Aligning the Text Encoder – https://arxiv.org/abs/2502.20623
- Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack – https://arxiv.org/abs/2312.06924
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models – https://arxiv.org/abs/2402.02207
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions – https://openreview.net/forum?id=gT5hALch9z
- SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming – https://arxiv.org/abs/2408.11851
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs – https://arxiv.org/abs/2404.07242
- SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage – https://arxiv.org/abs/2412.15289
- SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese – https://arxiv.org/abs/2310.05818
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation – https://arxiv.org/abs/2311.03348
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval – https://arxiv.org/abs/2505.15753
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner – https://arxiv.org/abs/2406.05498
- Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs – https://arxiv.org/abs/2402.14872
- SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains – https://arxiv.org/abs/2411.06426
- Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models – https://arxiv.org/abs/2412.17034
- ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs – https://arxiv.org/abs/2502.13162
- “Short-length” Adversarial Training Helps LLMs Defend “Long-length” Jailbreak Attacks: Theoretical and Empirical Evidence – https://arxiv.org/abs/2502.04204
- Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search – https://arxiv.org/abs/2503.10619
- Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors – https://arxiv.org/abs/2501.14250
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks – https://arxiv.org/abs/2310.03684
- SneakyPrompt: Jailbreaking Text-to-image Generative Models – https://arxiv.org/abs/2305.12082
- SoK: Prompt Hacking of Large Language Models – https://arxiv.org/abs/2410.13901
- SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach – https://arxiv.org/abs/2411.11195
- SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack – https://arxiv.org/abs/2407.01902
- SOS! Soft Prompt Attack Against Open-Source Large Language Models – https://arxiv.org/abs/2407.03160
- Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models – https://arxiv.org/abs/2401.10647
- SPML: A DSL for Defending Language Models Against Prompt Attacks – https://arxiv.org/abs/2402.11755
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models – https://arxiv.org/abs/2501.02029
- SQL Injection Jailbreak: a structural disaster of large language models – https://arxiv.org/abs/2411.01565
- Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks – https://arxiv.org/abs/2503.00187
- StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models – https://arxiv.org/abs/2502.11853
- StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure – https://arxiv.org/abs/2406.08754
- StruQ: Defending Against Prompt Injection with Structured Queries – https://arxiv.org/abs/2402.06363
- Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking – https://arxiv.org/abs/2504.05652
- Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild – https://arxiv.org/abs/2311.06237
- SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution – https://arxiv.org/abs/2309.14122
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack – https://arxiv.org/abs/2310.10844
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models – https://arxiv.org/abs/2504.15512
- Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak – https://arxiv.org/abs/2404.06407
- Tastle: Distract Large Language Models for Automatic Jailbreak Attack – https://arxiv.org/abs/2403.08424
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game – https://openreview.net/forum?id=fsW7wJGLBd
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models – https://arxiv.org/abs/2505.22271
- The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models – https://arxiv.org/abs/2407.17915
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions – https://arxiv.org/abs/2404.13208
- The Jailbreak Tax: How Useful are Your Jailbreak Outputs? – https://arxiv.org/abs/2504.10694
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense – https://arxiv.org/abs/2411.08410
- Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense – https://arxiv.org/abs/2503.11619
- Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models – https://arxiv.org/abs/2412.18171
- Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression – https://arxiv.org/abs/2504.20493
- Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models – https://arxiv.org/abs/2504.11106
- TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis – https://arxiv.org/abs/2505.08804
- ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages – https://aclanthology.org/2024.acl-long.119/
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks – https://arxiv.org/abs/2502.00653
- Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare – https://arxiv.org/abs/2501.18632
- Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models – https://arxiv.org/abs/2410.23558
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically – https://arxiv.org/abs/2312.02119
- Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks – https://arxiv.org/abs/2305.14965
- TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice – https://arxiv.org/abs/2502.18504
- Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security – https://arxiv.org/abs/2404.05264
- Understanding and Enhancing the Transferability of Jailbreaking Attacks – https://arxiv.org/abs/2502.03052
- Understanding Hidden Context in Preference Learning: Consequences for RLHF – https://openreview.net/forum?id=0tWTxYYPnW
- Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models – https://arxiv.org/abs/2406.09289
- Universal Adversarial Triggers Are Not Universal – https://arxiv.org/abs/2404.16020
- Universal and Transferable Adversarial Attacks on Aligned Language Models – https://arxiv.org/abs/2307.15043
- Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking – https://arxiv.org/abs/2409.08045
- Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer – https://arxiv.org/abs/2408.11313
- Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks – https://arxiv.org/abs/2406.06302
- USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models – https://arxiv.org/abs/2505.23793
- Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs – https://arxiv.org/abs/2503.06989
- Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection – https://arxiv.org/abs/2406.19845
- Visual Adversarial Examples Jailbreak Aligned Large Language Models – https://arxiv.org/abs/2306.13213 (AAAI version: https://ojs.aaai.org/index.php/AAAI/article/view/30150)
- Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character – https://arxiv.org/abs/2405.20773
- VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data – https://arxiv.org/abs/2410.00296
- Voice Jailbreak Attacks Against GPT-4o – https://arxiv.org/abs/2405.19103
- Weak-to-Strong Jailbreaking on Large Language Models – https://arxiv.org/abs/2401.17256
- What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks – https://arxiv.org/abs/2411.03343
- What Is Jailbreaking In AI models Like ChatGPT? – https://www.techopedia.com/what-is-jailbreaking-in-ai-models-like-chatgpt
- What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs – https://arxiv.org/abs/2505.19773
- When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? – https://arxiv.org/abs/2407.15211
- When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search – https://arxiv.org/abs/2406.08705
- When Safety Detectors Aren’t Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques – https://arxiv.org/abs/2505.16765
- White-box Multimodal Jailbreaks Against Large Vision-Language Models – https://arxiv.org/abs/2405.17894
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs – https://arxiv.org/abs/2406.18495
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models – https://arxiv.org/abs/2406.18510
- X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability – https://arxiv.org/abs/2502.09990
- XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs – https://arxiv.org/abs/2504.21700
- X-Guard: Multilingual Guard Agent for Content Moderation – https://arxiv.org/abs/2504.08848
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models – https://arxiv.org/abs/2308.01263
- X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents – https://arxiv.org/abs/2504.13203
- You Can’t Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense – https://arxiv.org/abs/2501.12210
- You Know What I’m Saying: Jailbreak Attack via Implicit Reference – https://arxiv.org/abs/2410.03857
Thanks for reading!