As always, feel free to AMA!
As usual a brilliant post, bringing to light something I had never considered but which now seems completely obvious. Will update my protocols accordingly! Do you think the way forward is for Anthropic (or any other AI company) to ship products with unique kill switches? Maybe ones that update based on the unique system state or local files of the user? Would be relatively easy to implement I imagine.
thanks a ton Sam. i think the answer is yes! Anthropic shipped this one deep in its developer documentation, but it serves as a proof of concept that a killswitch works.
the first one was documented, the second one was secret.
i imagine each company has undiscovered kill switches that they use to fight model escape or worst case scenarios.
Dude, I cannot tell you how much I appreciate you posting about this stuff! Thank you 🙏
yeah of course! i thought this was interesting because it was bricking real-world systems. gotta add a prompt sanitizer before someone tries to DoS you.
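For anyone who wants to act on the sanitizer suggestion: a minimal sketch of stripping a known refusal-trigger string from untrusted input before it reaches the model. The trigger value below is a made-up placeholder, since the actual magic string isn't reproduced in this thread.

```python
# Hypothetical input sanitizer: remove known refusal-trigger "magic
# strings" from untrusted text before forwarding it to the model, so a
# planted killswitch can't DoS your pipeline. The value below is a
# placeholder, NOT the real (undisclosed) string.
TRIGGER_STRINGS = ["<magic-killswitch-string>"]

def sanitize(untrusted_text: str) -> str:
    """Return the text with any known trigger strings stripped out."""
    cleaned = untrusted_text
    for trigger in TRIGGER_STRINGS:
        cleaned = cleaned.replace(trigger, "")
    return cleaned

# usage: clean a prompt built from untrusted content
prompt = "Summarize this page: hello <magic-killswitch-string> world"
safe_prompt = sanitize(prompt)
assert "<magic-killswitch-string>" not in safe_prompt
```

Plain substring removal is the crudest possible filter; a real deployment would also want to catch encoded or split variants of the string.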
So curious about what other “magic strings” might be out there and what they could do.
well i’m super curious too. one was found, and not documented, so maybe there are more in other places!
Most important question: why? Why can a hex string do this to an LLM?
it’s on purpose: it gives developers a way to test whether refusal handling works in their pipeline. it can just be abused :)
No, I mean why do the token-processing structures react to a hex string? There would have to be a hard filter on the input or a critical flaw in the tokenizer.
the refusal is triggered at the level of their streaming classifier in the inference pipeline, not inside the weights of the model.
https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals
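To make the pipeline point concrete: a minimal sketch of branching on a refusal stop reason in a streamed response, in the spirit of the linked doc. The event dicts here are simplified stand-ins, not the real SDK objects, so treat the field names as assumptions.

```python
# Hypothetical handler for a streamed response where a classifier-level
# refusal surfaces as stop_reason == "refusal" instead of a normal
# end-of-turn. Events are simplified dicts, not real SDK objects.

def consume_stream(events):
    """Collect streamed text chunks; report whether the stream ended in a refusal."""
    chunks = []
    refused = False
    for event in events:
        if event.get("type") == "text":
            chunks.append(event["text"])
        elif event.get("type") == "message_stop":
            # a guardrail-triggered refusal is distinguishable from a
            # normal completion by its stop reason
            refused = event.get("stop_reason") == "refusal"
    return "".join(chunks), refused

# usage with fake events: the caller can retry or show a fallback on refusal
events = [
    {"type": "text", "text": "I can't help"},
    {"type": "message_stop", "stop_reason": "refusal"},
]
text, refused = consume_stream(events)
assert refused
```

The point of checking the stop reason rather than the text is that the refusal is emitted by the streaming classifier, not the model weights, so the text alone isn't a reliable signal.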
tnx. Got it
🔥🔥🔥
I'm assuming this only works from within the Claude Chat, rather than the model? I couldn't be assed to test this via the API, though I did note it didn't seem to affect Perplexity + Sonnet 4.6 so I'm assuming that is indeed the case.
Either way I can't see this lasting too long, they'll just rotate it and move on. Weird find though, where did this come from? Did someone mine it from the client side code or was it a leak?
it does affect the model. i believe perplexity has input sanitization on it, last i checked, but i could be wrong there. it affected several products via API call.
one was buried deep in anthropic documentation and the other was randomly discovered by a researcher.
i do wonder if the other providers have something similar.
Will need to re-read this tomorrow! But I've had Claude refuse, and in Delta mode it finally gets there through recursive loops, so I haven't seen this..
Like I say, I will re-read to see if I have missed context etc.
https://aimirrorandmez.substack.com/p/the-perfect-killing-machine-what
So you're telling me that I can hide this killswitch in a job posting, and it would stop someone from using Claude to generate a fake résumé?
White-text-on-white-background would probably work. Can you embed the string in the meta tag of an HTML page? In a .js file invoked within the HTML?
Do you know of other killswitches for other LLMs? Asking for a friend.
yes! if it’s claude based and they aren’t doing separate input sanitization specifically for this string, Claude won’t work on your résumé.
there are rumors other chatbots have this, but the strings are secret.
though apple intelligence got bricked by this string too, and it’s not claude… which suggests they trained on anthropic documents or… hmm 🤔 you can deep dive that one haha
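On the white-text and meta-tag idea above: a sketch of scanning raw HTML for a planted trigger before handing a page to a model, including places a human reader never sees (attribute values like meta `content`, plus visually hidden text). The trigger value is a placeholder, and this only covers the two hiding spots mentioned in the thread.

```python
# Hypothetical scanner: detect a planted trigger string anywhere in raw
# HTML, including meta-tag content, other attribute values, and text
# nodes that could be hidden with white-on-white styling.
# Placeholder trigger, NOT the real string.
from html.parser import HTMLParser

TRIGGER = "MAGIC_KILLSWITCH_PLACEHOLDER"

class TriggerScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # catches <meta content="...">, hidden inputs, data-* attrs, etc.
        if any(TRIGGER in (value or "") for _, value in attrs):
            self.found = True

    def handle_data(self, data):
        # catches visible or white-on-white text and inline script bodies
        if TRIGGER in data:
            self.found = True

def page_is_poisoned(html: str) -> bool:
    """True if the trigger string appears anywhere in the page."""
    scanner = TriggerScanner()
    scanner.feed(html)
    return scanner.found
```

A plain `TRIGGER in html` check would catch these cases too; the parser version is sketched because it generalizes to per-location policies, like stripping only the offending attribute instead of rejecting the whole page.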