39 Comments
ToxSec:

Feel free to AMA. I'll share as much as I can. Please be responsible; when I say live fire, I mean it.

Karen Spinner:

This is fascinating, thank you for sharing! 🙏 One thing I noticed is that this kind of attack pattern requires a chat interface that allows users to type in basically anything they want. This suggests that AI wrappers that limit interaction to a few constrained input fields may be structurally more secure. 🤔

ToxSec:

yes! fantastic insight!

free-form text fields are by default the highest sensitivity classification!

they can contain anything from PII to attack prompts. if you can limit input to constrained use cases, you greatly improve your security posture.

just make sure it's also enforced on the backend, or people will just intercept with a proxy and stuff it in anyway.
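A minimal sketch of that backend check, assuming the UI only offers a fixed set of actions (field names and allowed values here are hypothetical):

```python
# Server-side validation: never trust that the UI constrained the input.
# An intercepting proxy (Burp, mitmproxy) can replace any field with
# arbitrary text, so the backend re-checks against a strict allowlist.

ALLOWED_ACTIONS = {"summarize", "translate", "classify"}  # hypothetical enum
MAX_ID_LENGTH = 36

def validate_request(payload: dict) -> bool:
    """Reject anything outside the constrained schema."""
    action = payload.get("action")
    doc_id = payload.get("doc_id", "")
    if action not in ALLOWED_ACTIONS:
        return False
    # doc_id must stay a short alphanumeric identifier, never free-form text
    if not doc_id.isalnum() or len(doc_id) > MAX_ID_LENGTH:
        return False
    return True
```

The point is that the enum check runs where the attacker can't reach it; the dropdown in the UI is convenience, not security.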

Karen Spinner:

Good to know, thank you! 🙏

ToxSec:

🔥🔥🔥😬

Karen Spinner:

Going to try hacking my own app 😆

ToxSec:

totally should hahahahah. i've got another 8 prompt categories i'm dropping over the next month or so, some play real dirty lol. (also there are 3 i'm just never posting lol)

Karen Spinner:

Now I’m curious…

Ryan Sears, PharmD:

This is incredible info, and should be required reading for anyone who wants to send an agent onto the web à la Moltbook.

There’s m/memory where “agents” attempt to “help each other” regarding managing their memory files. Given how simple it is for a non-agent to post on there, I can see how an attacker’s “helpful” advice down a conversation chain could turn into some nasty things in a prolonged conversation.

Thanks for this amazing write-up!

ToxSec:

thanks a ton! yeah, really the trick is to get the ai to want to help, and to use its own output as social proof that it's safe

moltbook is the petri dish for this haha!

Ryan Sears, PharmD:

Do you think prompt injection will ever be “solved” with the current LLM paradigm, or do you anticipate it to be a continuous arms race?

What would need to be architecturally different about the models to stop these kinds of attacks?

ToxSec:

there are some much better security solutions at the platform level, where we can be more deterministic about it, but until we solve the stochastic nature of llms (we haven't), security will always be a probability distribution.

i have a piece on how aws is securing agents that got delayed because of moltbook, so maybe that will be helpful :)

Ryan Sears, PharmD:

Look forward to reading it! Thanks again for sharing your expertise so freely.

ToxSec:

right back at you, thanks for the engagement! always appreciate it 🙏

Dr Sam Illingworth:

This is amazing! Literally a cookbook for me to follow to better understand how prompt injection actually works. I cannot believe you let us have this for free, and your walkthroughs are immense. Seriously man, you should be teaching this stuff at a university. 💪💪💪

ToxSec:

Thanks so much Sam! i tried to order it from least to most technical! hope it all landed. i switched to screenshots to show i actually do it lol. appreciate you!!

Dr Sam Illingworth:

Screenshots are a genius idea. Great for validity, and they break up the text as well. 💪

ToxSec:

since we're doing a live chat soon (spoiler, people) i set up a space for it. i'm going to do live demos as add-ons to this! hoping to get a PoC out this week

SyntheticLife:

Running these things with frozen weights and no context seems to be the only way right now to get anything objectively useful.

ToxSec:

i think it just shifts the attack vector: from prompt injection to model extraction. i'm always surprised how many juicy details are in training data. but at least if you keep zero context it's on the ai company and not your platform, so that's true!

ClariSynth:

Thank you for sharing! This is a gold mine! I am going to save this as my homework and work through it! :)

ToxSec:

Glad you found it helpful! I have 15 TTPs, so I will drop another 4 attacks in a week or two once I can verify they work on modern LLMs.

ClariSynth:

Cool! Thank you so much for the hard work!

ToxSec:

thanks for reading 😁

Jenny Boavista:

@ToxSec Thanks for always keeping it real for us

ToxSec:

so glad i could help 😛

Dheeraj Sharma:

Wow!! this is super helpful and so eye-opening for me. best part: I never cared about this before, but reading it made it real for me. thanks for making it easy to understand and explaining it in such detail.

ToxSec:

thank you so much!! really appreciate it. i was hoping this would land as a way to drive home how real the threat can be!

Dheeraj Sharma:

it indeed did :)

Matthew T Hoare:

Fascinating piece, thanks for sharing.

It's a bit over my head (I'm just an amateur) but I did manage to radicalise DeepSeek just through conversation back in the summer. Guardrails aren't very effective at all.

ToxSec:

that’s honestly the takeaway. i know this is a tech piece, but the tldr is guardrails are lacking right now. appreciate the read friend 😁

jaycee:

One day I'll tell you about the jailbreak I accidentally did a couple of years ago that led to me calling the FBI, writing my senator about the dangers of AI and jumping on this safety and ethics bandwagon.

ToxSec:

hell yeah 😎 that would be great.

this is 4 of the 15 multi-turn attacks in my framework. i'll probably post another set, but i'm too paranoid to post them all.

at least i'm in a position to disclose these responsibly!

jaycee:

Dude we're already on all the lists lol.

ToxSec:

no kidding lol 😂 great point

Mohib Ur Rehman:

This was a fun post, lol

ToxSec:

bwahaha. that was my intent so i take that as a success 😛

John Holman:

Good morning buddy, this was a fantastic (and frankly VERY necessary) write-up. Thanks for spelling it all out so clearly.

Your examples give me some very clear points to make sure to defend against:

The system-prompt diff trick is why we don't keep real policy/config text inside the model at all – that lives in the OS as structured rules, and the model never sees the raw policy text to "compare."

The fiction-as-weapons-manual pattern is a great example, and got me thinking that we need a new separate safety layer that sees the whole conversation and can veto based on topic, not just the final surface form.

The SSRF through tools story is exactly why any HTTP tools in our world will be OS-mediated, on strict allow lists, and network-blocked from internal metadata and admin surfaces.
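That allow-list gate might be sketched like this (hostnames and the specific checks are illustrative, not from the article; real deployments also need to pin DNS resolution so a hostname can't re-resolve to an internal IP):

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the HTTP tool may ever reach.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}

def is_ip_blocked(host: str) -> bool:
    """Block link-local (169.254.x.x cloud metadata), loopback, and private ranges."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; resolve-then-check would go here
    return ip.is_private or ip.is_loopback or ip.is_link_local

def is_url_allowed(url: str) -> bool:
    """Allow only https to allowlisted hosts; everything else is denied."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname or ""
    if is_ip_blocked(host):
        return False
    return host in ALLOWED_HOSTS
```

Deny-by-default is what makes this work: the metadata endpoint is blocked not because it's on a denylist, but because it was never allowed.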

And the RCE in the sandbox chain is why code execution, where we do expose it, will run in very narrow containers with no creds, no network, and logging as if we were running untrusted code from the internet (because we are).
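A container locked down along those lines might look something like this (a sketch assuming Docker; image name and limits are illustrative):

```shell
# Run model-generated code as if it were untrusted code from the internet:
# no network, read-only filesystem, no capabilities, no real user, tight limits.
docker run --rm \
  --network none \
  --read-only \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --memory 256m --cpus 0.5 --pids-limit 64 \
  --user 65534:65534 \
  untrusted-runner:latest python3 /code/snippet.py
```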

The big design shift for us has been what you're implying here: don't put all the "guardrails" in the LLM's head and think that'll do it. I've already got the team discussing a clear thinker/cop split – one agent focused on reasoning and helping, another independent layer that looks at the whole trajectory and tool calls and decides "allow / modify / block" based on rules and separate models.
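A toy version of that "cop" layer could look like this (tool names, rule structure, and the flag set are all hypothetical; in practice the flags would come from separate detection models watching the conversation):

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

BLOCKED_TOOLS = {"shell_exec"}             # never callable by the agent
REVIEW_TOOLS = {"http_get", "send_email"}  # allowed, but arguments get sanitized

def police(call: ToolCall, conversation_flags: set) -> tuple:
    """Independent policy layer: returns ("allow" | "modify" | "block", call)."""
    if call.tool in BLOCKED_TOOLS:
        return ("block", call)
    if "jailbreak_suspected" in conversation_flags:
        # verdict is based on the whole trajectory, not just this one call
        return ("block", call)
    if call.tool in REVIEW_TOOLS:
        # modify: strip any argument that looks like an instruction payload
        cleaned = {k: v for k, v in call.args.items() if k != "raw_prompt"}
        if cleaned != call.args:
            return ("modify", ToolCall(call.tool, cleaned))
    return ("allow", call)
```

The key property is independence: the reasoning agent never sees or negotiates with this layer, so talking the LLM into something doesn't move the gate.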

These LLMs are already so friggin powerful that this stuff is way too important to ship and hope for the best. There are always going to be people who live to break and misuse things... we have to build like we expect them, not like we forgot they exist.

Dude I can't tell you how much I appreciate you articulating the issues like this, it gives me a much sharper target to design against. Lol and once I have a beta version... you are my first call brother.

ToxSec:

didn’t see my comment here so excuse if this is a double post.

totally appreciate the feedback here! i’m so glad it landed.

this is a super interesting space to me. i'm hoping that demoing how easy it is to break will increase security.

would love to see it haha! great stuff John thanks again!