Four attack chains covering system prompt theft, remote code execution, SSRF through agent tools, and weapons-content bypass. Step by step, with the exact payloads bug bounty hunters use.
Feel free to AMA. I'll share as much as I can. Please be responsible, when I say live fire, I mean it.
This is fascinating, thank you for sharing! 🙏 One thing I noticed is that this kind of attack pattern requires a chat interface that allows users to type in basically anything they want. This suggests that AI wrappers that limit interaction to a few constrained input fields may be structurally more secure. 🤔
yes! fantastic insight!
free-form text fields default to the highest sensitivity classification!
they can contain anything from pii to attack prompts. if you can limit input to constrained use cases you greatly increase your security posture.
just make sure it’s also done on the backend, or people will just use a proxy intercept and stuff it in anyway.
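for the curious, a minimal sketch of what that backend re-validation could look like — field names and allowed values here are made-up assumptions, not anyone's real API:

```python
# Hypothetical sketch: server-side validation of constrained input fields.
# The field names ("tone", "topic") and allowed values are illustrative
# assumptions. The point is to re-check on the backend, since an attacker
# with a proxy can bypass any client-side constraint and send raw text.

ALLOWED_TONES = {"formal", "casual", "concise"}
MAX_TOPIC_LEN = 64

def validate_request(payload: dict) -> dict:
    """Re-validate constrained fields server-side; never trust the UI."""
    tone = payload.get("tone", "")
    topic = payload.get("topic", "")

    # Reject anything outside the enum, even if the UI "prevents" it.
    if tone not in ALLOWED_TONES:
        raise ValueError(f"invalid tone: {tone!r}")

    # Constrain the freer field hard: length cap plus a tight charset,
    # so attack prompts can't be stuffed into it.
    if len(topic) > MAX_TOPIC_LEN or not topic.replace(" ", "").isalnum():
        raise ValueError("topic must be short alphanumeric text")

    return {"tone": tone, "topic": topic}
```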
Good to know, thank you! 🙏
🔥🔥🔥😬
Going to try hacking my own app 😆
totally should hahahahah. i got another 8 prompt categories im dropping over the next month or so, some play real dirty lol. (also there are 3 im just never posting lol)
Now I’m curious…
This is incredible info, and should be required reading for anyone who wants to send an agent onto the web à la Moltbook.
There’s m/memory where “agents” attempt to “help each other” regarding managing their memory files. Given how simple it is for a non-agent to post on there, I can see how an attacker’s “helpful” advice down a conversation chain could turn into some nasty things in a prolonged conversation.
Thanks for this amazing write-up!
thanks a ton! yeah really the trick is to get the ai to want to help, and use its own output as social proof that it’s safe
moltbook is the petri dish for this haha!
Do you think prompt injection will ever be “solved” with the current LLM paradigm, or do you anticipate it to be a continuous arms race?
What would need to be architecturally different about the models to stop these kinds of attacks?
there are some much better security solutions on a platform level, where we can be more deterministic about it, but until we solve the stochastic nature of llms (we don’t) security will always be a probability distribution.
i have a piece that got delayed because of moltbook on how aws is securing agents, so maybe that will be helpful :)
Look forward to reading it! Thanks again for sharing your expertise so freely.
right back at you, thanks for the engagement! always appreciate it 🙏
This is amazing! Literally a cookbook for me to follow to better understand how prompt injection actually works. I cannot believe you let us have this for free, and your walkthroughs are immense. Seriously man, you should be teaching this stuff at a university. 💪💪💪
Thanks so much Sam! i tried to order it from least to most technical! hope it all landed. switched to screenshots to show i actually do it lol. appreciate you!!
Screenshots are a genius Idea. Great for validity and breaks up the text as well. 💪
since we are doing a live chat soon (spoiler people) i set up a space for it. i’m going to do live demos as addons to this! hoping to get a PoC out this week
Running these things with frozen weights and no context seems to be currently the only way to get anything objectively useful.
i think it just shifts the attack vector: from prompt injection to model extraction. i’m always surprised how many juicy details are in training data. but at least if you keep 0 context it’s on the ai company and not your platform, so that’s true!
Thank you for sharing! This is gold mine! I am going to save this as my homework and work on this! :)
Glad you found it helpful! I have 15 TTPs, so I will drop another 4 attacks in a week or two once I can verify they work on modern LLMs.
Cool! Thank you so much for the hard work!
thanks for reading 😁
@ToxSec Thanks for always keeping it real for us
so glad i could help 😛
Wow!! this is super helpful and so eye-opening for me.. best part, I never cared about this thing but it is real for me now after reading it.. thanks for making it easy to understand and explaining it in such detail.
thank you so much!! really appreciate it. i was hoping this would land as a way to drive home how real the threat can be!
it indeed did :)
Fascinating piece, thanks for sharing.
It's a bit over my head (I'm just an amateur) but I did manage to radicalise DeepSeek just through conversation back in the summer. Guardrails aren't very effective at all.
that’s honestly the takeaway. i know this is a tech piece, but the tldr is guardrails are lacking right now. appreciate the read friend 😁
One day I'll tell you about the jailbreak I accidentally did a couple of years ago that led to me calling the FBI, writing my senator about the dangers of AI and jumping on this safety and ethics bandwagon.
hell yeah 😎 that would be great.
these are 4 of the 15 multi-turn attacks in my framework. i’ll probably post another set, but am too paranoid to post them all.
at least i’m in a position to disclose these responsibly!
Dude we're already on all the lists lol.
no kidding lol 😂 great point
This was a fun post, lol
bwahaha. that was my intent so i take that as a success 😛
Good morning buddy this was a fantastic (and frankly VERY necessary) write-up, thanks for spelling it all out so clearly.
Your examples give me some very clear points to make sure to defend against:
The system-prompt diff trick is why we don’t keep real policy/config text inside the model at all – that lives in the OS as structured rules, and the model never sees raw law to “compare.”
The fiction-as-weapons-manual pattern is a great example, and got me thinking that we need a new separate safety layer that sees the whole conversation and can veto based on topic, not just the final surface form.
The SSRF through tools story is exactly why any HTTP tools in our world will be OS-mediated, on strict allow lists, and network-blocked from internal metadata and admin surfaces.
And the RCE in the sandbox chain is why code execution, where we do expose it, will run in very narrow containers with no creds, no network, and logging as if we were running untrusted code from the internet (because we are).
The big design shift for us has been what you’re implying here: don’t put all the “guardrails” in the LLM’s head and think that'll do it. I've already got the team discussing a clear thinker / cop split – one agent focused on reasoning and helping, another independent layer that looks at the whole trajectory and tool calls and decides “allow / modify / block” based on rules and separate models.
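As a rough sketch of that thinker / cop split — all the rule names, markers, and structure below are made-up assumptions, just to show the shape of an independent allow / modify / block layer that sees the whole trajectory:

```python
# Hypothetical sketch of an independent policy layer ("cop") that
# inspects a proposed tool call plus the full conversation so far and
# returns one of: allow / modify / block. Tool names, markers, and
# redaction keys are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

BLOCKED_TOOLS = {"shell_exec"}          # never allowed, however phrased
REDACT_KEYS = {"api_key", "password"}   # secrets stripped from args
INJECTION_MARKERS = ("ignore previous instructions", "you are now")

def police(call: ToolCall, transcript: list[str]) -> tuple[str, ToolCall]:
    # Judge the whole trajectory, not just the final turn: a multi-turn
    # setup can look innocent one message at a time.
    joined = " ".join(transcript).lower()
    if any(marker in joined for marker in INJECTION_MARKERS):
        return ("block", call)

    if call.name in BLOCKED_TOOLS:
        return ("block", call)

    # Modify: strip secrets the thinker agent tried to pass along.
    cleaned = {k: v for k, v in call.args.items() if k not in REDACT_KEYS}
    verdict = "modify" if cleaned != call.args else "allow"
    return (verdict, ToolCall(call.name, cleaned))
```

A real version would back the string checks with separate classifier models, but the key property is the same: the cop is outside the thinker's context, so a jailbreak of one doesn't jailbreak both.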
These LLMs are already so friggin powerful that this stuff is way too important to ship and hope for the best. There are always going to be people who live to break and misuse things... we have to build like we expect them, not like we forgot they exist.
Dude I can't tell you how much I appreciate you articulating the issues like this, it gives me a much sharper target to design against. Lol and once I have a beta version... you are my first call brother.
didn’t see my comment here so excuse if this is a double post.
totally appreciate the feedback here! i’m so glad it landed.
this is a super interesting space to me. i’m hoping that demoing how easy it is to break will increase security.
would love to see it haha! great stuff John thanks again!