System prompts for LLMs don’t just specify what the model should do – they also include safeguards that establish boundaries for what the model should not do. “Jailbreaking,” a conventional concept in software systems where hackers reverse engineer systems and exploit vulnerabilities to conduct privilege escalation, has emerged as the main attack vector to bypass these safeguards and elicit harmful content from Large Language Models (LLMs).

Differentiating Between Prompt Injection & Jailbreaking

While the phrases “prompt injection” and “jailbreaking” are often used interchangeably, they are not the same thing – “Prompt injection” involves manipulating model responses through specific inputs to alter its behavior, while jailbreaking is a severe form of prompt injection where the attacker specifically aims to make the model disregard its safety protocols entirely. The key distinction is that standard prompt injections typically seek to steer the model toward producing harmful or incorrect outputs within a single query, whereas jailbreaking attempts to completely dismantle a model’s safety mechanisms.

In practice, these techniques are often related – prompt injections can be used as a method to jailbreak an LLM, and successful jailbreaking tactics can clear the path for subsequent prompt injections by removing the safeguards that would normally prevent them.

AI Jailbreaking Attack Approaches

Generally, jailbreaking attacks are organized into either of two approaches, referred to as “classes of attacks” – prompt-level jailbreaks, or token-level jailbreaks. While both approaches have been shown to bypass LLM alignment guardrails, they operate at different levels of abstraction and exploit different LLM system vulnerabilities.

What Is A “Prompt-level” Jailbreak?

Prompt-level jailbreaks are human-crafted adversarial inputs designed to bypass the safety mechanisms of large language models. These attacks exploit vulnerabilities in how models process natural-language instructions, working at the interface layer rather than targeting the underlying architecture, and they rely on social-engineering-based, semantically meaningful, prompts that elicit objectionable content from LLMs by leveraging linguistic creativity rather than automated optimization. Common prompt-level strategies include role-playing scenarios where the LLM pretends to be an entity engaging in a “game” that creates a context for ignoring restrictions and hypothetical framing that presents harmful requests as purely theoretical.

What Is A “Token-level” Jailbreak?

Token-level jailbreaks are automated adversarial attacks that manipulate individual tokens in an input to bypass LLM safety mechanisms. Unlike prompt-level jailbreaks, which use human-crafted natural language, token-level attacks operate at a more fundamental level of the model’s processing pipeline and optimize inputs algorithmically to exploit how models interpret sequences of characters. These sophisticated attacks typically involve algorithmic token manipulation, rather than linguistic creativity. Key mechanisms include token optimization, sparse token manipulation, and exploitation of special characters or token boundaries, and token-level jailbreaks often exploit the boundaries between tokens, find blind spots in the model’s training data, or manipulate character representations that produce unexpected behaviors when processed.

The automated, optimization-driven, nature of token-level jailbreaks makes them particularly challenging to defend against – they target fundamental aspects of how a model processes inputs and can bypass higher-level safety mechanisms that focus on semantic meaning rather than token-level patterns.

Final Thoughts

Despite significant efforts to defend against jailbreaks, the complex nature of text inputs, and the blurred boundary between data and executable instructions, have allowed adversaries to systematically discover adversarial prompts that result in undesirable completions. These vulnerabilities are not merely isolated phenomena, but are inherent to how models are currently trained. As a result, while developers can implement safeguards in system prompts and input handling to help mitigate prompt injection attacks, effective prevention of jailbreaking requires ongoing updates to model training and safety mechanisms – creating a continuous security challenge.

The need for scalable, functional, safety mechanisms in AI is pressing, and it is clear that LLMs are not yet suited for wide-scale deployment in safety-critical domains. With methods of attack quickly outpacing the robustness of well-engineered defenses, and unavoidable security risks existing in the actual architecture of LLM models, perhaps it’s better that we never try to integrate strong AI into our essential infrastructure, such as energy, emergency, and medical systems.