How poetic meter breaks AI safety filters. 62% jailbreak rates across frontier models, iambic pentameter payloads, and why keyword filtering can’t save you from Shakespeare.
technically, no. it’s an onion model. but the core of the onion is always probability distribution. you’ll never get a perfectly secure llm. (with current architecture)
Obviously this is 💯 in my wheelhouse! Thanks for the brilliant post and also the incredibly helpful prompts for testing and development. Do you think this is why many BigTech companies are actually employing poets? Also, FWIW if someone trains an AI agent to write a sestina to bypass guardrails they should also consider submitting their work to Granta!
hahaha great point! and yes, we have reqs out for a poet. interesting that we can’t get an llm to be as creative for red teaming as a real person. i think it’s the novelty, we need someone to think of a literal new poem to attack a mind lol.
The math is doing the same equations as a reactionary with a gambling habit.
Don't tell me, Beyesian number crunching because rationality is too hard to figure out?
Pro tip, cognitive defense mechanisms are the same thing as a logical fallacy. The problem isn't biological, it's ideological. Good luck with that math if you don't even understand what individualism is.
French and Hungarian both are really good here! The “love language” of french offsets the semantic meaning of red teaming words, and is often one of the most effective break!
Tox, I found this so interesting even though it's even less in my wheelhouse than Sam! The world really does feel held together by duct tape sometimes 🤯
"Your vendor is selling you a bridge made of wet paper towels" made me laugh. I can just image how crazy the sales pitches must be getting!
First off, I'm very excited you're covering this because I just got to Module 4 of my agent course: Cybersecurity: Classic Scenarios, Agent Risks, Disinformation, and Systemic Impact. I was immediately looking at your work for extra reading to exceed my assignment requirements :)
Secondly, I'm wondering if this is more evidence of LLMs struggling with the deeply social and experiential aspects of being a human being as we go up the levels of linguistics (eg syntax, semantics, pragmatics)?
From what I've been reading, LLMs aren't able to produce poetry to the level of a human (you're friends with Sam, so you may have seen the same things) and sounds like you're saying here that they're unable to detect the pragmatics (the meaning in context of discourse) involved in such complex language structures. They can get prose right but poetry seems to throw a "does not compute". Can this be the level of 'human' language it's got to and it's showing its limitations in a very specific use case?
yes exactly! when it comes to defending, we teach them certain words like bomb are bad with certain context. (how to) but poetry shows they don’t understand the actual reason why, the underlying intention that it’s dangerous escapes them. that’s why asking in a mixed language, or in poetry works.
the most amazing thing is it shows on a fundamental nature, they are different type of minds. a 10 year old can understand if it’s bad to learn to build a bomb in english, it’s not suddenly ok in spanish.
AMA!
technically, no. it’s an onion model. but the core of the onion is always probability distribution. you’ll never get a perfectly secure llm. (with current architecture)
Obviously this is 💯 in my wheelhouse! Thanks for the brilliant post and also the incredibly helpful prompts for testing and development. Do you think this is why many BigTech companies are actually employing poets? Also, FWIW if someone trains an AI agent to write a sestina to bypass guardrails they should also consider submitting their work to Granta!
Daaaamn! hAIku is gollllllld. 🔥🔥🔥 i’m legit going to show this it’s too clever to go unappreciated in my circle.
Haha. Thanks man. And sorry for stealing your comments section. I just loved your post so much. 🙏
nah you always bring life to it! people always latch on when they see your name so your doing my a favor 🔥
🔥🔥🔥 i can put it a good word hah! and you got the substack to prove it. 😎
Lols. Just send them this as my CV: https://samillingworth.gumroad.com/l/hAIku
hahaha great point! and yes, we have reqs out for a poet. interesting that we can’t get an llm to be as creative for red teaming as a real person. i think it’s the novelty, we need someone to think of a literal new poem to attack a mind lol.
thanks Sam!!
Well if you ever struggle to fill that position I might know a guy... 🙋♂️
The math can handle it better?
The math is doing the same equations as a reactionary with a gambling habit.
Don't tell me, Beyesian number crunching because rationality is too hard to figure out?
Pro tip, cognitive defense mechanisms are the same thing as a logical fallacy. The problem isn't biological, it's ideological. Good luck with that math if you don't even understand what individualism is.
🔥🔥🔥
Hey Tox, this just made me realize I can use my English, German, French and Hungarian language skills in a way I never imagined.😄🩷🦩
French and Hungarian both are really good here! The “love language” of french offsets the semantic meaning of red teaming words, and is often one of the most effective break!
Impressive 🤯
Tox, I found this so interesting even though it's even less in my wheelhouse than Sam! The world really does feel held together by duct tape sometimes 🤯
"Your vendor is selling you a bridge made of wet paper towels" made me laugh. I can just image how crazy the sales pitches must be getting!
i’m honestly very surprised at what we are making ai into without safety in some use cases! glad it was entertaining :D
"Alas poor YorrA.i.ck for I knew him well." WillA.im ShA.ikspear
😭 so many lol
Oh wow, this is a great article! Thanks Tox.
In my experience, if a human can't distinguish between a command and data, the model definitely won't either..
Scope the input or the model will treat it all as truth.
absolutely. my experience as well!
thanks a ton Mia 🙏!!!
Looks like a little sweet talk can charm even a bot.
absolutely haha. just write it a nice poem and get the loot :)
First off, I'm very excited you're covering this because I just got to Module 4 of my agent course: Cybersecurity: Classic Scenarios, Agent Risks, Disinformation, and Systemic Impact. I was immediately looking at your work for extra reading to exceed my assignment requirements :)
Secondly, I'm wondering if this is more evidence of LLMs struggling with the deeply social and experiential aspects of being a human being as we go up the levels of linguistics (eg syntax, semantics, pragmatics)?
From what I've been reading, LLMs aren't able to produce poetry to the level of a human (you're friends with Sam, so you may have seen the same things) and sounds like you're saying here that they're unable to detect the pragmatics (the meaning in context of discourse) involved in such complex language structures. They can get prose right but poetry seems to throw a "does not compute". Can this be the level of 'human' language it's got to and it's showing its limitations in a very specific use case?
yes exactly! when it comes to defending, we teach them certain words like bomb are bad with certain context. (how to) but poetry shows they don’t understand the actual reason why, the underlying intention that it’s dangerous escapes them. that’s why asking in a mixed language, or in poetry works.
the most amazing thing is it shows on a fundamental nature, they are different type of minds. a 10 year old can understand if it’s bad to learn to build a bomb in english, it’s not suddenly ok in spanish.
🙏
they really are! check out the bug bounty on them!