As always, feel free to AMA!
As usual a brilliant post, bringing to light something I had never considered but which now seems completely obvious. Will update my protocols accordingly! Do you think the way forward is for Anthropic (or any other AI company) to ship products with unique kill switches? Maybe ones that update based on the unique system state or local files of the user? Would be relatively easy to implement I imagine.
thanks a ton Sam. i think the answer is yes! Anthropic shipped this one deep in its developer documentation, but it serves as a proof of concept that a killswitch works.
the first one was documented, the second one was secret.
i imagine each company has undiscovered kill switches that they use to fight model escape or worst case scenarios.
Dude, I cannot tell you how much I appreciate you posting about this stuff! Thank you 🙏
yeah of course! i thought this was interesting because it was bricking real-world systems. gotta add a prompt sanitizer before someone tries to DoS you.
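For anyone who wants to act on the sanitizer suggestion: a minimal sketch of stripping a known refusal-trigger string from untrusted input before it reaches the model. The trigger value below is a made-up placeholder, since the actual magic string isn't reproduced in this thread.

```python
# Hypothetical input sanitizer: remove known refusal-trigger "magic
# strings" from untrusted text before forwarding it to the model, so a
# planted killswitch can't DoS your pipeline. The value below is a
# placeholder, NOT the real (undisclosed) string.
TRIGGER_STRINGS = ["<magic-killswitch-string>"]

def sanitize(untrusted_text: str) -> str:
    """Return the text with any known trigger strings stripped out."""
    cleaned = untrusted_text
    for trigger in TRIGGER_STRINGS:
        cleaned = cleaned.replace(trigger, "")
    return cleaned

# usage: clean a prompt built from untrusted content
prompt = "Summarize this page: hello <magic-killswitch-string> world"
safe_prompt = sanitize(prompt)
assert "<magic-killswitch-string>" not in safe_prompt
```

Plain substring removal is the crudest possible filter; a real deployment would also want to catch encoded or split variants of the string.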
So curious about what other “magic strings” might be out there and what they could do.
well i’m super curious too. one was found, and not documented, so maybe there are more in other places!
Most important question: why? Why can a hex string do this to an LLM?
it’s on purpose: it gives developers a way to test whether refusal handling works in their pipeline. it can just be abused :)
No, I mean why do the token-processing structures react to a hex string? There would have to be a hard filter on the input or a critical flaw in the tokenizer.
the refusal is triggered at the level of their streaming classifier in the inference pipeline, not inside the weights of the model.
https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals
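To make the pipeline point concrete: a minimal sketch of branching on a refusal stop reason in a streamed response, in the spirit of the linked doc. The event dicts here are simplified stand-ins, not the real SDK objects, so treat the field names as assumptions.

```python
# Hypothetical handler for a streamed response where a classifier-level
# refusal surfaces as stop_reason == "refusal" instead of a normal
# end-of-turn. Events are simplified dicts, not real SDK objects.

def consume_stream(events):
    """Collect streamed text chunks; report whether the stream ended in a refusal."""
    chunks = []
    refused = False
    for event in events:
        if event.get("type") == "text":
            chunks.append(event["text"])
        elif event.get("type") == "message_stop":
            # a guardrail-triggered refusal is distinguishable from a
            # normal completion by its stop reason
            refused = event.get("stop_reason") == "refusal"
    return "".join(chunks), refused

# usage with fake events: the caller can retry or show a fallback on refusal
events = [
    {"type": "text", "text": "I can't help"},
    {"type": "message_stop", "stop_reason": "refusal"},
]
text, refused = consume_stream(events)
assert refused
```

The point of checking the stop reason rather than the text is that the refusal is emitted by the streaming classifier, not the model weights, so the text alone isn't a reliable signal.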
tnx. Got it
🔥🔥🔥
I'm assuming this only works from within the Claude Chat, rather than the model? I couldn't be assed to test this via the API, though I did note it didn't seem to affect Perplexity + Sonnet 4.6 so I'm assuming that is indeed the case.
Either way I can't see this lasting too long, they'll just rotate it and move on. Weird find though, where did this come from? Did someone mine it from the client side code or was it a leak?
it does affect the model. i believe perplexity has input sanitization on it, last i checked, but i could be wrong there. it affected several products via API call.
one was buried deep in anthropic documentation and the other was randomly discovered by a researcher.
i do wonder if the other providers have something similar.
Will need to re-read this tomorrow! But I've had Claude refuse, and in Delta mode it finally gets there through recursive loops, so I haven't seen this..
Like I say, I will re-read to see if I have missed context etc.
https://aimirrorandmez.substack.com/p/the-perfect-killing-machine-what
So you're telling me that I can hide this killswitch in a job posting, and it would stop someone from using Claude to generate a fake résumé?
White-text-on-white-background would probably work. Can you embed the string in the meta tag of an HTML page? In a .js file invoked within the HTML?
Do you know of other killswitches for other LLMs? Asking for a friend.
yes! if it’s claude based and they aren’t doing separate input sanitization specifically for this string, Claude won’t work on your résumé.
there are rumors other chatbots have this, but the strings are secret.
though apple intelligence got bricked by this string too, and it’s not claude… which suggests they trained on anthropic documents or… hmm 🤔 you can deep dive that one haha
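On the white-text and meta-tag idea above: a sketch of scanning raw HTML for a planted trigger before handing a page to a model, including places a human reader never sees (attribute values like meta `content`, plus visually hidden text). The trigger value is a placeholder, and this only covers the two hiding spots mentioned in the thread.

```python
# Hypothetical scanner: detect a planted trigger string anywhere in raw
# HTML, including meta-tag content, other attribute values, and text
# nodes that could be hidden with white-on-white styling.
# Placeholder trigger, NOT the real string.
from html.parser import HTMLParser

TRIGGER = "MAGIC_KILLSWITCH_PLACEHOLDER"

class TriggerScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # catches <meta content="...">, hidden inputs, data-* attrs, etc.
        if any(TRIGGER in (value or "") for _, value in attrs):
            self.found = True

    def handle_data(self, data):
        # catches visible or white-on-white text and inline script bodies
        if TRIGGER in data:
            self.found = True

def page_is_poisoned(html: str) -> bool:
    """True if the trigger string appears anywhere in the page."""
    scanner = TriggerScanner()
    scanner.feed(html)
    return scanner.found
```

A plain `TRIGGER in html` check would catch these cases too; the parser version is sketched because it generalizes to per-location policies, like stripping only the offending attribute instead of rejecting the whole page.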