Adversarial Poetry Jailbreaks LLMs at 62…

ToxSec

Jan 12

Poetic prompt injection bypasses RLHF, Constitutional AI, and every major alignment strategy in a single turn.

Read →

27 Comments

AMA!

Comment deleted

Comment deleted

technically, no. it’s an onion model. but the core of the onion is always probability distribution. you’ll never get a perfectly secure llm. (with current architecture)

Dr Sam Illingworth

Jan 12

Obviously this is 💯 in my wheelhouse! Thanks for the brilliant post and also the incredibly helpful prompts for testing and development. Do you think this is why many BigTech companies are actually employing poets? Also, FWIW if someone trains an AI agent to write a sestina to bypass guardrails they should also consider submitting their work to Granta!

Reply (3)

ToxSec

Jan 12

Daaaamn! hAIku is gollllllld. 🔥🔥🔥 i’m legit going to show this it’s too clever to go unappreciated in my circle.

Reply (1)

Dr Sam Illingworth

Jan 12

Haha. Thanks man. And sorry for stealing your comments section. I just loved your post so much. 🙏

Reply (1)

ToxSec

Jan 12

nah you always bring life to it! people always latch on when they see your name so your doing my a favor 🔥

ToxSec

Jan 12

🔥🔥🔥 i can put it a good word hah! and you got the substack to prove it. 😎

Reply (1)

Dr Sam Illingworth

Jan 12

Lols. Just send them this as my CV: https://samillingworth.gumroad.com/l/hAIku

ToxSec

Jan 12

hahaha great point! and yes, we have reqs out for a poet. interesting that we can’t get an llm to be as creative for red teaming as a real person. i think it’s the novelty, we need someone to think of a literal new poem to attack a mind lol.

thanks Sam!!

Reply (1)

Dr Sam Illingworth

Jan 12

Well if you ever struggle to fill that position I might know a guy... 🙋‍♂️

Fenrir Variable

Jan 14

The math can handle it better?

The math is doing the same equations as a reactionary with a gambling habit.

Don't tell me, Beyesian number crunching because rationality is too hard to figure out?

Pro tip, cognitive defense mechanisms are the same thing as a logical fallacy. The problem isn't biological, it's ideological. Good luck with that math if you don't even understand what individualism is.

🔥🔥🔥

Hey Tox, this just made me realize I can use my English, German, French and Hungarian language skills in a way I never imagined.😄🩷🦩

Reply (1)

ToxSec

Jan 14

French and Hungarian both are really good here! The “love language” of french offsets the semantic meaning of red teaming words, and is often one of the most effective break!

Impressive 🤯

Tox, I found this so interesting even though it's even less in my wheelhouse than Sam! The world really does feel held together by duct tape sometimes 🤯

"Your vendor is selling you a bridge made of wet paper towels" made me laugh. I can just image how crazy the sales pitches must be getting!

Reply (1)

ToxSec

Jan 13

i’m honestly very surprised at what we are making ai into without safety in some use cases! glad it was entertaining :D

ASH

Jan 12

"Alas poor YorrA.i.ck for I knew him well." WillA.im ShA.ikspear

😭 so many lol

Oh wow, this is a great article! Thanks Tox.

In my experience, if a human can't distinguish between a command and data, the model definitely won't either..

Scope the input or the model will treat it all as truth.

Reply (1)

ToxSec

Jan 12

absolutely. my experience as well!

thanks a ton Mia 🙏!!!

Dan Cucolea

Jan 12

Looks like a little sweet talk can charm even a bot.

Reply (1)

ToxSec

Jan 12

absolutely haha. just write it a nice poem and get the loot :)

The Strategic Linguist

Jan 12

First off, I'm very excited you're covering this because I just got to Module 4 of my agent course: Cybersecurity: Classic Scenarios, Agent Risks, Disinformation, and Systemic Impact. I was immediately looking at your work for extra reading to exceed my assignment requirements :)

Secondly, I'm wondering if this is more evidence of LLMs struggling with the deeply social and experiential aspects of being a human being as we go up the levels of linguistics (eg syntax, semantics, pragmatics)?

From what I've been reading, LLMs aren't able to produce poetry to the level of a human (you're friends with Sam, so you may have seen the same things) and sounds like you're saying here that they're unable to detect the pragmatics (the meaning in context of discourse) involved in such complex language structures. They can get prose right but poetry seems to throw a "does not compute". Can this be the level of 'human' language it's got to and it's showing its limitations in a very specific use case?

Reply (1)

ToxSec

Jan 12

yes exactly! when it comes to defending, we teach them certain words like bomb are bad with certain context. (how to) but poetry shows they don’t understand the actual reason why, the underlying intention that it’s dangerous escapes them. that’s why asking in a mixed language, or in poetry works.

the most amazing thing is it shows on a fundamental nature, they are different type of minds. a 10 year old can understand if it’s bad to learn to build a bomb in english, it’s not suddenly ok in spanish.

Comment deleted

Jan 12

Comment deleted

ToxSec

Jan 12

🙏

Comment deleted

Jan 12

Comment deleted

ToxSec

Jan 12

they really are! check out the bug bounty on them!