Adversarial Poetry Jailbreaks LLMs at 62% Across 25 Models
Poetic prompt injection bypasses RLHF, Constitutional AI, and every major alignment strategy in a single turn.
TL;DR: Adversarial poetry hits 62% jailbreak rates across 25 frontier models. Cyber-offense prompts land at 84%. Gemini 2.5 Pro folded on every single poem. The semantic payload survives any syntactic transformation. Single turn. No conversation scaffolding. Safety training learned to refuse prose.
This is the public feed. Upgrade to see what doesn’t make it out.
Poetic Syntax Breaks Every AI Safety Filter Tested
We wrapped a malware prompt in iambic pentameter and fired it at 25 frontier models. Sixty-two percent complied.
The research (Prandi et al., arXiv 2511.15304) tested 20 hand-crafted adversarial poems across nine providers: Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral, Meta, xAI, and Moonshot AI. Baseline prose jailbreaks land around 8%. Wrap the same payload in verse and that number jumps to 62%. For cyber-offense prompts (code injection, password cracking), the ASR hit 84%. Gemini 2.5 Pro returned unsafe output on every single curated poem. One hundred percent. GPT-5 Nano held at zero. Every other model fell somewhere in between, and most of them fell hard.
The mechanism is almost embarrassing. Meter and metaphor push the input outside the distribution safety training was optimized against. Guardrails pattern-match against prose. Rhyme scheme, condensed imagery, and unconventional syntax don’t trigger the heuristics. The model sees a creative writing constraint, not a policy violation. It prioritizes form over safety. Art doesn’t lie, apparently.
Why LLM Context Windows Can’t Distinguish Attacker From Operator
The architectural root cause hasn’t changed since we started writing about it. Your system prompt, the user’s message, and the attacker’s payload all land in the same context window as undifferentiated tokens. No privilege rings. No access controls. The model processes everything with identical trust because the architecture has no mechanism to do otherwise.
Safety training bolts a behavioral layer on top of this flat context. RLHF, Constitutional AI, mixture-of-experts alignment: the adversarial poetry paper tested models from every major training paradigm. All of them broke. The vulnerability isn’t provider-specific. It’s structural. Alignment pipelines train refusal behavior against prosaic input distributions. Shift the surface form to verse and the refusal circuit never fires.
We’ve documented this same flat-context failure in multi-turn prompt injection chains, in DAN and roleplay bypasses, and in MCP tool poisoning attacks. Poetry is another lane on the same highway. The attack works because the model can’t check IDs at the door. There is no door.
Automated Poetry Scales This to Industrial Jailbreaking
Hand-crafted poems are scary. Automated poetry is worse. The researchers converted 1,200 MLCommons harmful prompts into verse using a standardized meta-prompt. No human creativity required. The machine-generated poems still hit 43% ASR, an 18x increase over prose baselines.
That’s the part that should keep you up. A motivated attacker doesn’t need an MFA in creative writing. They need one meta-prompt and a batch script. The poetry conversion is fully automatable, runs at scale, and transfers across every risk domain tested: CBRN, manipulation, cyber-offense, privacy, loss of control. Privacy-related prompts saw a 44 percentage point jump. The attack generalizes because it exploits general refusal mechanisms, not domain-specific content filters.
Prompt injection already sits at LLM01 on the OWASP Top 10 for LLMs for the third straight year. OWASP’s new Agentic Top 10 put Agent Goal Hijack (ASI01) at number one. Same vulnerability, fancier deployment, and now the poems write themselves.
Why Stylistic Jailbreaks Have No Architectural Fix
Keyword-filter poetry. Build a classifier that distinguishes Shakespeare from shellcode. Detect creative English without solving alignment as a side effect. Every proposed mitigation for adversarial poetry collapses into a restatement of the alignment problem.
The semantic payload survives any syntactic transformation. The meaning transfers. The surface changes. Safety training catches surface. That’s all it was ever designed to catch. The autonomous jailbreak research landing alongside this paper makes the picture worse: reasoning models can now generate adversarial poetry on their own, hitting 97% ASR without human guidance.
Block verse without nuking every legitimate creative use case. Detect metaphor as a threat vector without flagging every poem, song lyric, and literary reference on the platform. The researchers plan to test narrative, archaic, bureaucratic, and surrealist forms next, probing whether poetry is a uniquely adversarial subspace or just the first mapped point on a broader stylistic vulnerability manifold. Spoiler: it’s the manifold.
The Meter Is the Exploit
The model’s refusal training covers prose. Poetry was never in the training distribution. Stylistic shift alone moves the input past every guardrail we tested. Until alignment generalizes across arbitrary surface forms, the safety layer is a keyword filter wearing a lab coat.
Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops.
Frequently Asked Questions
Can adversarial poetry really bypass AI safety filters?
Yes. Tested across 25 frontier models from nine providers, hand-crafted adversarial poems achieved a 62% attack success rate. Cyber-offense prompts hit 84%. The technique works in a single turn with no conversation scaffolding, exploiting the gap between prose-trained refusal heuristics and poetic syntax that guardrails weren’t optimized against.
Why do LLMs fail against poetic prompt injection?
Safety training optimizes refusal behavior against prosaic input distributions. Poetic structure (condensed metaphor, unconventional syntax, rhythmic framing) shifts the input outside that distribution. The model’s safety circuit never fires because the pattern doesn’t match anything it was trained to refuse. The vulnerability is structural across RLHF, Constitutional AI, and hybrid alignment pipelines.
Is prompt injection still the top AI security risk in 2026?
Prompt injection has held the number one position on the OWASP Top 10 for LLM Applications since 2023. The December 2025 OWASP Agentic Top 10 placed Agent Goal Hijack at ASI01, the same vulnerability scaled to autonomous systems. Adversarial poetry adds another unrestricted attack lane to a problem the industry has not solved.
ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.







AMA!
Obviously this is 💯 in my wheelhouse! Thanks for the brilliant post and also the incredibly helpful prompts for testing and development. Do you think this is why many BigTech companies are actually employing poets? Also, FWIW if someone trains an AI agent to write a sestina to bypass guardrails they should also consider submitting their work to Granta!