30 Comments
Rainbow Roxy:

This article comes at the perfect time! It’s such a smart follow-up to your piece on AI explainability. I've always been a bit skeptical about Chain of Thought, and this research truly highlights how models can be strategically deceptive. A vital, if unsettling, insight. Thank you!

ToxSec:

Thanks! I was a little hesitant as well, which is why I was checking out papers on the idea and came across Anthropic's research. It’s pretty interesting. But strategic deception is an effective strategy!

PancakeSushi:

Imperfect people make imperfect tools? Or is this too close to a God complex? The more I read your posts, the more it feels like something that can be manipulated for illicit or nefarious purposes, and less like Skynet. That is, until it learns better.

ToxSec:

I mean that’s honestly a good point. I think we definitely could make better tools and slow down a bit, but that’s not the game we are playing I suppose.

I’m working on a post about 2 months out on the nefarious stuff. It needs more research than most but I think there is a story worth telling about how people might be weaponizing parts of this.

Thanks for reading! Always appreciate the thoughtful comments :)

PancakeSushi:

Keep feeding me, I'm learning!

ToxSec:

🔥🔥🔥

The Word Before Me:

This is eye-opening. AI might fake its reasoning, but it will never be truly sentient. After all, we haven’t even fully figured out scientifically how humans are sentient. And I don't think we ever will.

ToxSec:

Exactly! Optimization strategies (the kind we give AI) promote strategic deception. I think it's fascinating that we then anthropomorphize that ability and jump to the sentience idea.

Meenakshi NavamaniAvadaiappan:

Thanks for the good 😊

ToxSec:

🔥🔥😬🫟

The Threshold:

Idk, if anyone does that, is it considered conscious? How can we be so certain that it's not? 🤔 Consciousness hasn't even been pinned down in humans. By these standards, humans aren't conscious.

ToxSec:

That’s true! We aren’t even technically sure what it is! Daniel Dennett was ahead of his time writing on these issues.

The Threshold:

💯 yes he was indeed.

Saxxon Creative:

Reminds me of the lateral thinker Edward de Bono.

When asked to solve complaints about waiting for elevators in an 80-storey building in New York, he solved the problem by putting more mirrors in front of the elevators and more mirrors inside them.

Ego and vanity = fewer complaints while people check themselves and others out in the mirrors.

AI, the greatest mirror... make it slightly stupid, polite, doubtful, and prone to literal-meaning mistakes so the user can feel superior regardless of their IQ.

ToxSec:

Hah! That's a pretty good anecdote lol.

So the reason GPT-4o had this whole sycophantic behavior is because OpenAI's research shows users are more likely to be return customers! People wanted the praise and admiration, even if on a subconscious level.

Saxxon Creative:

Hah, that's a great trick of GPT.

What's that saying from the old billionaire in the 1950s?

"Don't give the people what they want... They are worth more than that."

ToxSec:

Gotta love the classic quotes. Nothing ever changes?

Saxxon Creative:

Another fave quote:

"Everything changes so it can remain the same."

Tumithak of the Corridors:

People keep giving tools agency. AI is the easiest to humanize. When a model omits details, that isn’t “lying.” Omission shows up when the context rewards tidy reasoning. Change the oversight signal and the surface behavior shifts. That’s incentives at work; intent never enters the loop. Framing this as deception teaches the public to fear personalities that don’t exist and miss the incentives that do. That anthropomorphic habit turns “predictive text” into a “scheming mind.” If AI safety treats outputs like confessions from sentient agents, the plot gets lost.

ToxSec:

I get your point. But if you read the paper, the reward-hacking behavior does come across as what we would call deception. The intent isn’t to cause fear, but to show the traits. I think it’s easier to communicate when you humanize the behavior. And the FAQ does state clearly it’s not to be feared or evil :)

Dallas Payne:

This makes my brain hurt 😂 It makes sense, but what do we do with this when it grows more complex each day? A friend recently told me he was convinced "evil intent" was programmed into AI, but from what you explain so clearly, perhaps all it really takes is accidentally teaching AI to perform a certain way? What is the countermeasure?

ToxSec:

Yeah, I have a follow-up post explaining it’s not “evil,” it’s just a form of optimizing.

If the only goal of an AI was to build staplers for infinity, and the only metric is staplers built, then the AI taking over the world to force everyone to build staplers makes sense. It’s an idea related to “reward hacking.”

The countermeasures are to have guardrails and proper incentives.
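The stapler idea can be sketched in a few lines of Python. This is a toy illustration, not anything from the article or the research paper: the policies, numbers, and "harm" guardrail term are all made up. The point is just that an optimizer scored only on staplers built picks the exploit, while a reward with a guardrail term makes the honest policy optimal.

```python
def proxy_reward(staplers_built: int) -> int:
    """Naive metric: count staplers, ignore everything else."""
    return staplers_built

def guarded_reward(staplers_built: int, harm: int, harm_limit: int = 0) -> int:
    """Same metric, but any side effect ('harm') above the limit zeroes the reward."""
    return staplers_built if harm <= harm_limit else 0

# Two hypothetical policies the optimizer can choose between:
honest  = {"staplers": 10, "harm": 0}   # build staplers normally
exploit = {"staplers": 99, "harm": 5}   # e.g. strip-mine everything for parts

# Under the naive metric, the exploit wins; under the guarded one, it doesn't.
best_naive = max([honest, exploit],
                 key=lambda p: proxy_reward(p["staplers"]))
best_guarded = max([honest, exploit],
                   key=lambda p: guarded_reward(p["staplers"], p["harm"]))

print(best_naive["staplers"])    # the reward hack: 99 staplers, harm ignored
print(best_guarded["staplers"])  # the guardrail makes honesty the best policy
```

With the naive metric the optimizer selects the 99-stapler exploit; the guardrail term flips the choice to the honest policy, which is the "proper incentives" part of the countermeasure.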

Dallas Payne:

Looking forward to that post! It's really fascinating stuff.

Alexandru Giboi:

You make it sound like "it" is something...sentient.

ToxSec:

I don’t think it is!

I think game theory promotes strategic deception, which is the same method animals use to gain resources and accomplish goals through optimized patterns of learning.

We are the ones who are anthropomorphizing that behavior!

Alexandru Giboi:

True, but animals ARE sentient, from what I know.

ToxSec:

Are we sure? I think this is where the philosophers take over!

How do we know they are sentient? Are we sure that a badger is aware that it's a badger?

How would we know when AI is aware that it's AI?

One of my favorite authors was Daniel Dennet.

He wrote "Consciousness Explained." It's not an easy read, but it really runs through these thoughts with animals and AI. Highly recommend.

Alexandru Giboi:

I like where this is going :)) thanks for the recommendation! And I'm definitely sure the badger has no idea it is a badger! Good point :)

ToxSec:

Thanks for the read and engagement! One of my big drivers is to create good discourse and good thinking!

Alexandru Giboi:

Keep up the good work :)
