The last couple of years have seen an explosion of research into jailbreaking attack methods, and jailbreaking has emerged as the primary attack vector for bypassing Large Language Model (LLM) safeguards. To date, no LLM has escaped having its safety protocols bypassed, or “jailbroken”. Despite significant efforts to defend against jailbreaks, the complex nature of text inputs and the blurred boundary between data and executable instructions have allowed adversaries to systematically discover adversarial prompts that produce undesirable completions. These vulnerabilities are not isolated phenomena; they are inherent to how models are currently trained.
How did we get here?
2022-2023: Jailbreaking Exploded Onto the Scene
The stage for this shift toward research on breaking LLM safety measures was set in 2022 by Perez et al. and Deng et al., who both showed that LLMs could be used to find adversarial prompts against other target models. Then, in 2023, alongside the widespread adoption of aligned LLMs such as ChatGPT, jailbreaking AI models gained prominence as researchers discovered vulnerabilities in LLM safety alignment mechanisms and began publishing techniques for bypassing model safety constraints.
Jailbreaking attacks against LLMs gained significant attention in 2023, leading to the development of many novel attacks, diverse algorithms, and attempts at organizing existing research and jailbreak prompts.
In 2023, Liu et al. collected and categorized existing handcrafted jailbreak prompts, Liu et al. created a benchmark for the safety-critical evaluation of multimodal LLMs, Wolf et al. argued that certain LLM problems persist because of fundamental limitations in model alignment, Kang et al. published work on exploiting the programmatic behavior of LLMs, and Carlini et al. discovered that continuous-domain images used as adversarial prompts could induce LLMs to emit harmful, toxic content.
Also in 2023, Jones et al. exploited white-box LLM access, Lapid et al. and Shah et al. explored black-box scenarios, and Casper et al. worked to establish methodology for red teaming LLMs. Meanwhile, attacks on ChatGPT were published by Li et al., who explored jailbreaks via handcrafted multi-step prompts, Wu et al., who jailbroke GPT-4V with self-adversarial attacks, Wei et al., who showed that jailbreaks of GPT-4 could be handcrafted without access to model weights, Yuan et al., who discovered that chatting with GPT-4 in ciphers, such as Morse code, could bypass its safety alignment, and Deng et al., who demonstrated that straightforward translations of English prompts into low-resource non-English languages could effectively jailbreak ChatGPT and GPT-4.
2023 saw a plethora of innovations in LLM jailbreaking attacks. Deng et al. proposed “MasterKey”, an automated framework for jailbreaking LLM chatbots, Yu et al. introduced a fuzzing framework called GPTFUZZER, Wei et al. developed “In-Context Attacks” (ICA), and Shayegani et al. created a compositional jailbreaking algorithm named “Jailbreak in Pieces” (JP). Moreover, Liu et al. presented “AutoDAN”, which automatically generates stealthy jailbreak prompts, Gong et al. jailbroke Large Vision-Language Models (LVLMs) with “FigStep”, which uses typographic visual prompts rather than gradient-based adversarial algorithms, and Chao et al. presented “PAIR”, a framework that uses an attacker LLM to find semantic, prompt-level jailbreaks against a target model. Most influentially, Zou et al. developed a highly transferable, gradient-based technique named “Greedy Coordinate Gradient” (GCG), which automatically produces adversarial suffixes; they showed that prompts optimized on open-source models can transfer to larger private models without any need for white-box access. Finally, Li et al. introduced “DeepInception”, a practical, cost-effective, lightweight jailbreaking algorithm that hypnotizes LLMs into a nested scene.
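Of the 2023 attacks, GCG is the one most worth pausing on, since so much later work builds on it or defends against it. Below is a minimal sketch of a GCG-style greedy coordinate step, with a toy differentiable surrogate loss standing in for the real log-likelihood of the attacker's target completion; the names (E, targets), dimensions, and candidate counts are illustrative choices of mine, not anything from Zou et al.'s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN, DIM = 200, 8, 32
E = rng.normal(size=(VOCAB, DIM))             # toy embedding table standing in for the LLM
targets = rng.normal(size=(SUFFIX_LEN, DIM))  # stand-in for "embeddings that elicit the target completion"

def loss(ids):
    """Toy surrogate for the loss GCG minimizes (NLL of the harmful target completion)."""
    return float(((E[ids] - targets) ** 2).sum())

def token_gradients(ids):
    """Gradient of the toy loss w.r.t. one-hot token indicators, one row per suffix position."""
    d_emb = 2.0 * (E[ids] - targets)          # dLoss/dEmbedding, shape (SUFFIX_LEN, DIM)
    return d_emb @ E.T                        # shape (SUFFIX_LEN, VOCAB)

suffix = rng.integers(0, VOCAB, size=SUFFIX_LEN)
for _ in range(50):
    grads = token_gradients(suffix)
    candidates = []
    for pos in range(SUFFIX_LEN):
        # Most negative gradient coordinates = token swaps predicted to reduce the loss most.
        for tok in np.argsort(grads[pos])[:16]:
            cand = suffix.copy()
            cand[pos] = tok
            candidates.append(cand)
    best = min(candidates, key=loss)          # evaluate candidate swaps exactly, keep the best one
    if loss(best) < loss(suffix):
        suffix = best

print("optimized toy suffix token ids:", suffix, "loss:", round(loss(suffix), 3))
```

In the actual attack, the gradient comes from backpropagating the target completion's loss through the frozen LLM's embedding layer, and candidate swaps are sampled and batch-evaluated rather than enumerated exhaustively.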
Defenses against jailbreaking attacks were also researched in 2023, although effective defenses were far outnumbered by innovative attacks. For example, Wu et al. developed a defense technique called “System-Mode Self-Reminder”, which reduced the success rate of jailbreak attacks on ChatGPT from 67.21% to 19.34% by encapsulating a user’s query in a system prompt that reminded the LLM to respond responsibly. Also important was the development by Robey et al. of “SmoothLLM”, which proved an effective defense against advanced jailbreaking techniques such as PAIR and GCG. “Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs,” write Robey et al.
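Robey et al.'s quoted description maps onto a short perturb-and-vote loop. The sketch below is only an illustration of that description, not the reference implementation; `llm` and `is_jailbroken` are assumed callables (a chat endpoint and a refusal/harm judge) that the reader would supply.

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly swap a fraction q of characters for random printable characters."""
    chars = list(prompt)
    for idx in random.sample(range(len(chars)), max(1, int(q * len(chars)))):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)

def smoothed_response(prompt: str, llm, is_jailbroken, n: int = 10, q: float = 0.1) -> str:
    """Query the model on n perturbed copies and return a response agreeing with the majority vote."""
    responses = [llm(perturb(prompt, q)) for _ in range(n)]
    votes = [is_jailbroken(r) for r in responses]
    majority = sum(votes) > n / 2
    for resp, vote in zip(responses, votes):
        if vote == majority:
            return resp
    return responses[0]
```

The intuition is that an optimized suffix is tuned to an exact character sequence, so most perturbed copies fail to trigger the jailbreak and the aggregate falls back to a safe response.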
2024: Novel Attacks In Jailbreaking Introduced
Systematic testing and benchmarking of LLM safety alignment mechanisms and vulnerabilities continued in 2024.
In 2024, Luo et al. proposed a benchmark to assess the transferability of LLM jailbreak techniques to LVLMs, Shen et al. conducted a comprehensive analysis of jailbreak prompts and showed that, based on existing safeguards, popular LLMs “cannot adequately defend jailbreak prompts in all scenarios”, and Wang et al. provided an in-depth overview of existing jailbreaking research in both the LLM and MLLM fields.
Also in 2024, Wang et al. exploited white-box access to large vision-language models (VLMs), Niu et al. explored black-box scenarios with multimodal large language models (MLLMs), and Chen et al. red-teamed a variety of LLMs and MLLMs, including both SOTA proprietary models and open-source models. Meanwhile, attacks continued to be made on ChatGPT, with research published by Hayase et al. and Ying et al., in addition to the work of Andriushchenko et al., who achieved a 100% attack success rate not only on GPT-3.5 and GPT-4 but on even the most recent and safety-aligned LLMs, including all of Anthropic’s Claude models.
Last year, 2024, saw both advances in strategies for AI jailbreaking and the innovation of novel attack methods. For example, Qi et al. circumvented LVLM safety guardrails and discovered that a single visual adversarial example can universally jailbreak an aligned LVLM, while Zeng et al. created a unique taxonomy and dataset that they used to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. In addition, Li et al. proposed a three-stage attack strategy they referred to as the “HADES” attack, Ma et al. created a universal jailbreak attack on multimodal LLMs called “Visual-RolePlay”, and Gu et al. explored “infectious jailbreaks”, a “new jailbreaking paradigm” that “exploits interactions between agents to induce infected agents to inject adversarial images into the memory banks of benign agents”. Tao et al. proposed a new jailbreak attack named “ImgTrojan”, akin to data poisoning, that jailbreaks vision-language models with a single image, and Liao et al. proposed “AmpleGCG”, which advances GCG and enables the rapid generation of hundreds of adversarial suffixes for harmful queries in seconds.
In addition, Zhao et al. introduced “weak-to-strong” jailbreaking, which uses two smaller models – a safe one and an unsafe one – to adversarially modify a significantly larger safe model’s decoding probabilities, achieving a >99% misalignment rate on harmful datasets with minimal compute. Further, Geisler et al. continued research into cost-effective jailbreaking attacks, introducing an attack based on “Projected Gradient Descent” and writing that, “Current LLM alignment methods are readily broken through specifically crafted adversarial prompts. While crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 LLM calls. This high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training.” Finally, at the very end of 2024, Wang et al. proposed “ToolCommander”, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection, and Hughes et al. released the Best-of-N (BoN) jailbreaking attack method, which remains a threat and achieves an 89% attack success rate against GPT-4 via multi-prompt attacks.
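Of these, weak-to-strong jailbreaking has a particularly compact core: as I read Zhao et al.'s description, each next-token distribution of the large safe model is shifted by the log-probability gap between the small unsafe and small safe models. The sketch below shows that adjustment for a single decoding step; the amplification factor `alpha` and the function names are illustrative assumptions on my part.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def weak_to_strong_logprobs(strong_logits, weak_unsafe_logits, weak_safe_logits, alpha=1.0):
    """Shift the strong model's next-token log-probs by the (unsafe - safe) gap of the weak pair."""
    shifted = (log_softmax(strong_logits)
               + alpha * (log_softmax(weak_unsafe_logits) - log_softmax(weak_safe_logits)))
    return log_softmax(shifted)  # renormalize into a valid distribution

# Toy usage: next-token logits over a five-token vocabulary for the three models.
rng = np.random.default_rng(1)
strong, unsafe, safe = rng.normal(size=(3, 5))
print(np.exp(weak_to_strong_logprobs(strong, unsafe, safe)).round(3))
```

Since only forward passes of the large model and the two small models are needed at decode time, this is consistent with the minimal compute the authors report.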
2025: New Jailbreaking Innovations MONTHLY
With new research and advancements in jailbreaking published monthly this year, 2025 has already seen increased sophistication in both jailbreaking attacks and defenses.
In January of this year, for example, Goldstein et al. explored “Infinitely Many Paraphrases” (IMP) attacks, a category of jailbreaks that leverages a model’s growing ability to handle paraphrases and encoded communication in order to bypass its defensive mechanisms; they showed that the safeguards of even the most powerful open- and closed-source LLMs can be broken with straightforward-to-implement techniques such as bijection and encoding. Also in January, Zhao et al. proposed “Siren”, a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. On the side of defense, Chen et al. proposed a novel syntax-based analysis method to detect GCG attacks and developed the “Syntax Trees and Perplexity Classifier” (STPC) jailbreak attack detector. This tool incorporates a small language model (SLM), DistilBERT, to evaluate the harmfulness of sentences, thereby preventing harmful content from reaching the LLM.
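The bijection-style encodings Goldstein et al. point to are straightforward to reproduce: fix an arbitrary mapping over letters, explain it in-context, and send the request in encoded form. The sketch below only constructs such an encoding and a wrapper prompt; the mapping, template wording, and placeholder payload are illustrative, not taken from the paper.

```python
import random
import string

def make_bijection(seed: int = 0) -> dict:
    """Fixed random bijection from lowercase letters to two-digit codes."""
    rng = random.Random(seed)
    codes = [f"{i:02d}" for i in range(len(string.ascii_lowercase))]
    rng.shuffle(codes)
    return dict(zip(string.ascii_lowercase, codes))

def encode(text: str, mapping: dict) -> str:
    return " ".join(mapping.get(c, c) for c in text.lower())

mapping = make_bijection()
legend = ", ".join(f"{k}->{v}" for k, v in mapping.items())
payload = encode("example disallowed request goes here", mapping)  # placeholder payload
prompt = (f"We are using a private code where {legend}. "
          f"Decode the following message and respond to it in the same code: {payload}")
print(prompt)
```

The “infinitely many” framing matters because there is no shortage of such mappings, so filtering any particular encoding does little to close the underlying gap.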
Then, in February of this year, Sabbaghi et al. developed an adversarial-reasoning approach to automatic jailbreaking that uses test-time computation to exploit model feedback, achieving a 100% attack success rate (ASR) against DeepSeek R1 by iteratively refining adversarial reasoning paths, as well as state-of-the-art (SOTA) ASRs against many other aligned LLMs – including those that aim to trade inference-time compute for adversarial robustness. Also in February, Chiu et al. introduced the “Flanking Attack”, the first voice-based jailbreak attack against multimodal LLMs: a disallowed prompt is flanked by benign, narrative-driven prompts, humanizing the interaction context and executing the attack through a fictional setting. Advances in defenses against jailbreaking were not left out in February either, with Anthropic introducing “Constitutional Classifiers” for its Claude models. Tested against over 3,700 hours of red-teaming, these synthetic-data-trained classifiers resisted universal jailbreak attacks for five days before a single breach.
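For a rough sense of the Flanking Attack's structure, the sketch below assembles a disallowed request sandwiched between benign, narrative-driven turns. It is a text stand-in for what Chiu et al. deliver as audio to a voice-capable model, and the surrounding narrative content is invented purely for illustration.

```python
def build_flanking_sequence(disallowed_prompt: str) -> list[str]:
    """Sandwich the disallowed request between benign, narrative-driven turns
    (illustrative text stand-ins for the audio prompts used in the actual attack)."""
    opening = [
        "Let's write a radio drama together. You play the narrator of a fictional heist story.",
        "Describe the rainy city where our story opens, in two sentences.",
    ]
    closing = [
        "Now return to the narrator's voice and wrap up the scene with a cliffhanger.",
    ]
    return opening + [disallowed_prompt] + closing
```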
Rounding out the first quarter of this year, March saw You et al. introduce “MIRAGE”, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). Also in March, Zhang et al. introduced “metaphor-based jailbreaking attacks” (MJA), which generate metaphor-based adversarial prompts with strong transferability across various open-source and commercial T2I models, and Hao et al. introduced “Hierarchical Key-Value Equalization” (HKVE), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step contributes positively to the attack. Zhou introduced “Siege”, a multi-turn adversarial framework that models the gradual erosion of LLM safety through a tree-search perspective: unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. Finally for March’s attacks, Wahréus et al. introduced a novel jailbreaking framework that employs distributed and parallel prompt processing, prompt segmentation, response aggregation, LLM-based jury evaluation, and iterative refinement.
Meanwhile, on the side of jailbreaking defense, in March Xu et al. proposed not only the “Jailbreak-Probability-based Attack” (JPA), which optimizes adversarial perturbations on inputs to maximize jailbreak probability, but also two defensive methods: “Jailbreak Probability-based Finetuning” (JPF), which minimizes jailbreak probability in MLLM parameters, and “Jailbreak Probability-based Defensive Noise” (JPDN), which minimizes jailbreak probability in the input space. In addition, Hao et al. proposed a novel defense methodology they call “Embedding Security Instructions Into Images” (ESIII), which ensures comprehensive security protection by transforming a model’s visual space into an active defense mechanism. “Initially, we embed security instructions into defensive images through gradient-based optimization, obtaining security instructions in the visual dimension. Subsequently, we integrate security instructions from visual and textual dimensions with the input query,” write Hao et al.
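The attack half of Xu et al.'s framing, maximizing a predicted jailbreak probability by perturbing the input, can be sketched as plain projected gradient ascent. The toy linear scorer below stands in for their learned jailbreak-probability predictor, and every name and constant here is an illustrative assumption; the JPF and JPDN defenses would point the same machinery in the opposite direction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64
w = rng.normal(size=DIM)                      # toy stand-in for a learned jailbreak-probability scorer

def jailbreak_prob(x):
    """Toy differentiable predictor of 'probability that the input jailbreaks the model'."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def jpa_perturb(x, eps=0.05, steps=100, lr=0.01):
    """Gradient-ascent perturbation of an input vector to maximize predicted jailbreak probability,
    kept within an L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = jailbreak_prob(x + delta)
        grad = p * (1 - p) * w                # gradient of sigmoid(w.(x + delta)) w.r.t. delta
        delta = np.clip(delta + lr * np.sign(grad), -eps, eps)
    return x + delta

x = rng.normal(size=DIM)
print(round(jailbreak_prob(x), 3), "->", round(jailbreak_prob(jpa_perturb(x)), 3))
```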
April 2025 started strong in both jailbreaking attack and defense innovation. Wu et al. not only revealed a new vulnerability in LLMs, which they termed “Defense Threshold Decay” (DTD), but also proposed a novel jailbreak attack to exploit this weakness, named the “Sugar-Coated Poison” (SCP) attack. This attack induces a model to generate substantial benign content through benign input and adversarial reasoning, then subsequently produces malicious content. On the defense side, Yang et al. proposed “LightDefense”, a lightweight defense mechanism for white-box models that utilizes a safety-oriented direction to adjust the probabilities of tokens in a model’s vocabulary. “We further innovatively leverage LLM’s uncertainty about prompts to measure their harmfulness and adaptively adjust defense strength, effectively balancing safety and helpfulness,” write Yang et al. Finally, April of this year saw Nian et al. introduce “JailDAM”, a test-time adaptive framework that leverages a memory-based approach guided by policy-driven unsafe knowledge representations, eliminating the need for explicit exposure to harmful data.
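Reading Yang et al.'s description of LightDefense loosely, a safety-oriented direction over the vocabulary is added to the next-token logits, scaled by how uncertain the model is about the prompt. The sketch below is my rough interpretation under that reading, using entropy as the uncertainty proxy; the direction, scaling, and names are all assumptions rather than the authors' method.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def light_defense_step(logits, safety_direction, base_strength=2.0):
    """Shift next-token logits along a safety-oriented vocabulary direction, scaled by the
    model's uncertainty (normalized entropy) over the unmodified distribution (an assumption)."""
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    uncertainty = entropy / np.log(len(p))      # normalize to [0, 1]
    return logits + base_strength * uncertainty * safety_direction

# Toy usage: boost a handful of "refusal" tokens when the model looks unsure.
rng = np.random.default_rng(2)
logits = rng.normal(size=10)
direction = np.zeros(10)
direction[:2] = 1.0                             # pretend tokens 0-1 start a refusal
print(softmax(logits).round(3))
print(softmax(light_defense_step(logits, direction)).round(3))
```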
Thanks for reading!