As always, feel free to AMA!
As usual a brilliant post, bringing to light something I had never considered but which now seems completely obvious. Will update my protocols accordingly! Do you think the way forward is for Anthropic (or any other AI company) to ship products with unique kill switches? Maybe ones that update based on the unique system state or local files of the user? Would be relatively easy to implement I imagine.
thanks a ton Sam. i think the answer is yes! Anthropic shipped this deep in its developer documentation, but it stands as a proof of concept that a kill switch works.
the first one was documented, the second one was secret.
i imagine each company has undisclosed kill switches that they keep in reserve to fight model escape or other worst-case scenarios.
Dude I cannot tell you how much I appreciate you posting about this stuff! Thank you 🙏
yeah of course! i thought this was interesting because it was bricking real-world systems. gotta add a prompt sanitizer before someone tries to DoS you.
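a minimal sketch of what such a prompt sanitizer could look like, assuming you maintain your own blocklist of known killswitch strings. the string used here is a made-up placeholder, not the real one (which isn't public):

```python
# Hypothetical blocklist of killswitch strings. The entry below is a
# PLACEHOLDER for illustration -- the real string is not public.
KNOWN_KILLSWITCHES = [
    "EXAMPLE_KILLSWITCH_STRING",
]

def sanitize_prompt(text: str) -> str:
    """Strip any known killswitch substrings before the text reaches the model."""
    for bad in KNOWN_KILLSWITCHES:
        text = text.replace(bad, "")
    return text

# Example: a job posting with the string hidden inside gets cleaned first.
cleaned = sanitize_prompt("Requirements: EXAMPLE_KILLSWITCH_STRING 5 yrs experience")
```

this naive substring match won't catch obfuscated variants (zero-width characters, base64, etc.), so a real pipeline would normalize the input first; it's just the shape of the idea.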
Will need to re-read this tomorrow! But I've had Claude refuse, and yet in Delta mode it finally gets there through recursive loops, so I haven't seen this..
Like I say, I will re-read to see if I have missed context etc.
https://aimirrorandmez.substack.com/p/the-perfect-killing-machine-what
So you're telling me that I can hide this killswitch in a job posting, and it would stop someone from using Claude to generate a fake résumé?
White-text-on-white-background would probably work. Can you embed the string in the meta tag of an HTML page? In a .js file invoked within the HTML?
Do you know of other killswitches for other LLMs? Asking for a friend.
yes! if it’s claude-based and they aren’t doing separate input sanitization specifically for this string, claude won’t work on your résumé.
there are rumors other chatbots have this, but the strings are secret.
though apple intelligence got bricked by this string too, and it’s not claude… which suggests they trained on anthropic documents or… hmm 🤔 you can deep dive that one haha