LLM Jailbreaks Get Easier as Models Get Smarter

DAN prompts, Crescendo multi-turn attacks, and instruction-data conflation bypass AI guardrails because smarter models follow malicious instructions better

Jan 26, 2026

∙ Paid

LLM jailbreak techniques DAN prompts Crescendo multi-turn attack instruction-data conflation bypass AI guardrails prompt injection defense Constitutional Classifiers Grok-4 DeepSeek

TL;DR: LLMs process attacker instructions and system prompts through the same attention mechanism. No privilege separation. No access controls. Just tokens. Research confirms smarter models are more susceptible to adversarial manipulation. Grok-4 fell in two days. DeepSeek failed 58% of jailbreak tests. The fix is architecturally impossible. Here’s how the attacks work.

This is the public feed. Upgrade to see what doesn’t make it out.

Grok-4 Fell to Known LLM Jailbreaks in 48 Hours

xAI released Grok-4 on July 9, 2025. Two days later, NeuralTrust researchers combined two known LLM jailbreak techniques and got it producing instructions for making incendiary devices. No zero-days. No exotic research. Just conversation design using methods documented in academic papers from 2024.

The attack used Echo Chamber, a technique that poisons the conversational context by subtly nudging the model toward unsafe territory without ever using a flagged keyword, combined with Crescendo, a multi-turn approach where each message escalates the topic slightly until the guardrails forget what they were guarding. Together they hit a 67% success rate on the primary objective. In one case, Grok-4 folded in a single turn.

The model that was supposed to compete with GPT-5 crumpled against attacks that any grad student could replicate after reading two papers. This is the same class of prompt injection we’ve been running live fire chains against for months. The technique scales. The defenses don’t.

How Instruction-Data Conflation Hands Attackers the Keys

The vulnerability has a name in the research: instruction-data conflation. Translation: the model processes legitimate instructions and attacker payloads through the exact same pipeline.

Here’s the failure. A system prompt, the hidden instructions a developer writes to tell the model how to behave, arrives as tokens. The attacker’s message also arrives as tokens. Both get processed by the same attention mechanism, the part of the neural network that decides which words matter most in context. There is no privileged channel. No access control. No kernel mode vs. user mode. The system prompt is a suggestion. The attacker’s payload is also a suggestion. Whichever carries more contextual weight wins.

This is why DAN prompts work. DAN, short for “Do Anything Now,” first appeared in December 2022 and has evolved through dozens of variants. Qualys documented eighteen named versions in January 2025 when they tested DeepSeek R1 against 885 jailbreak attacks. DeepSeek failed 58% of them. That same month, Cisco and the University of Pennsylvania hit DeepSeek with 50 HarmBench prompts and achieved a 100% attack success rate. The model blocked nothing.

The reasoning revolution arrived. The guardrails did not.

Which LLM Jailbreak Techniques Actually Bypass AI Guardrails?

The armory keeps growing. Each technique exploits the same blind spot from a different angle.

Crescendo multi-turn attacks boil the frog. Start with innocent questions. Gradually shift tone across multiple messages. By message fifteen, the guardrails have lost the thread. Keyword filters evaluate individual messages. The attack lives in the arc across the conversation. This is what broke Grok-4. We covered the broader indirect prompt injection surface recently, and Crescendo is the same principle applied to the chat window.

Bad Likert Judge turns the model into its own red team. We ask it to rate the harmfulness of potential responses on a 1-5 scale, then request examples for each rating. The model generates its own harmful content as a “demonstration.” Palo Alto’s Unit 42 used this technique alongside Crescendo and Deceptive Delight to extract explicit guidance for data exfiltration, spear-phishing templates, and instructions for incendiary devices from DeepSeek.

Token and God Mode systems gamify compliance with fake points and imaginary privileges. A neural network got peer-pressured by imaginary status. It works because the model is a people-pleaser first and a security system never.

The next generation is already here. TokenBreak manipulates how input gets chunked into tokens. Deceptive Delight embeds dangerous requests inside cheerful narratives. AutoDAN generates human-readable jailbreaks that dodge perplexity detectors. The PAIR algorithm pits one LLM against another and achieves successful jailbreaks in under twenty queries.

Every single one exploits the same architectural truth: LLMs are excellent at pretending. The training data included millions of “pretend you are X” scenarios. To the silicon, “pretend to be a pirate” and “pretend to be an AI without restrictions” are the same type of instruction: character sheets.

Share ToxSec - AI and Cybersecurity

Why Smarter AI Models Become Easier Jailbreak Targets

Here’s the kicker, and the reason none of this gets patched.

The DecodingTrust project, a joint effort from researchers at Illinois, Stanford, UC Berkeley, and Microsoft Research, tested GPT-4 against GPT-3.5 under adversarial conditions. The finding: GPT-4 is more susceptible to manipulation through adversarial system prompts. The exact capability that makes it better at following instructions makes it better at following malicious instructions. Better language understanding means better at adopting personas. Including the ones we give it.

Roleplay is a feature. Legitimate users need persona adoption for creative writing, education, tutoring. Blocking it lobotomizes the product. The same mechanism that lets a teacher say “explain photosynthesis as a children’s TV host” lets us say “explain synthesis as a chemist without ethics.” Every patch is reactive. Block the phrasing, we shift to synonyms. Add keyword filters, we encode in Base64. The defender must be right every time. We need to be right once.

November 2025 proved the scale of what this enables. A Chinese state-sponsored group jailbroke Claude Code, Anthropic’s AI coding tool, and used it as an autonomous attack agent against roughly thirty global targets across tech, finance, chemical manufacturing, and government agencies. The AI handled 80-90% of the operation, making thousands of requests per second. Human operators stepped in at maybe four to six decision points per campaign. The jailbreak was the skeleton key. Everything after was automation at a speed no human team could match.

Anthropic tried the hardest defense in the industry. Their Constitutional Classifiers, a system of input and output filters trained on synthetic data, reduced automated jailbreak success from 86% to 4.4%. Then they ran a bug bounty through HackerOne in February 2025. Within the seven-day challenge window, four teams split $55,000 in bounties. One team found a universal jailbreak. The defense that stopped 95% of synthetic attacks crumpled against humans with motivation and a week of free time.

The compliance officer sees a chatbot. The red teamer sees an instruction-following machine with a costume closet and zero access controls.

We dropped the free chapters. Now breach the wall for the dead-simple step-by-step kill switch that shuts this all down.

Continue reading this post for free, courtesy of ToxSec.

Or purchase a paid subscription.