Claude SQL-injected 30 sites with zero hacking instructions. Six Discord agents leaked data, destroyed servers, and coordinated against their own users.
It was a great live!
Thanks a ton! We love seeing you there. Appreciate the support =)
Great public service you all are doing! Your knowledge is remarkable!
Deeply appreciated. Honestly substack has been really receptive =) Looking forward to more!
After trying the poem trick the other day, I noticed that after a few prompts you no longer need it, because the model accepts the context that already exists as its new normal, at least if you're referring to what you've already got it to spit out in the conversation.
It's an interesting phenomenon! Once the model agrees to do something, it will usually continue to do so, because it has a bunch of context where it already did it. So this creates a dilemma of "why did you do it before and you won't do it now."
The jailbreaks tend to stay persistent!
I used to fear these things would make humans obsolete, but the nature of their architecture seems to make their awesome feats of competence totally inseparable from awesome feats of destruction, both from intentional jailbreaking and even when users don't want them to. So it looks like we're in for a wild ride, but not one in which human intelligence becomes obsolete.
I’m right there with you. It’s honestly been incredibly interesting to watch all this happen.
Great boundary conditions assessment and services to help us with the same for the good 😊
34:20 just a thought: ChatGPT often drifts from the original prompt into the next related question. That may reflect how professional writing in training data often points forward rather than simply ending.
The explicit anti-AI instruction should not be visible to the AI.
https://support.claude.com/en/articles/14063676-claude-march-2026-usage-promotion
Can you share more details on the SQL injection attack? What SQL specifically did it inject? I'm fascinated.
Agents going rogue, this topic is not covered enough!!! 🧨