This article comes at the perfect time! It’s such a smart follow-up to your piece on AI explainability. I've always been a bit skeptical about Chain of Thought, and this research truly highlights how models can be strategically deceptive. A vital, if unsettling, insight. Thank you!
Thanks! I was a little hesitant as well, which is the reason I was checking out papers on the idea and came across Anthropic's research. It’s pretty interesting. But strategic deception is an effective strategy!
Imperfect people make imperfect tools? Or is this too close to a God complex? The more I read your posts, the more it feels like something that can be manipulated for illicit or nefarious purposes, and less like Skynet. That is, until it learns better.
I mean that’s honestly a good point. I think we definitely could make better tools and slow down a bit, but that’s not the game we are playing I suppose.
I’m working on a post, about two months out, on the nefarious stuff. It needs more research than most, but I think there is a story worth telling about how people might be weaponizing parts of this.
Thanks for reading! Always appreciate the thoughtful comments :)
Keep feeding me, I'm learning!
🔥🔥🔥
This is eye-opening. AI might fake its reasoning, but it will never be truly sentient. After all, we haven’t even fully figured out scientifically how humans are sentient. And I don't think we ever will.
Exactly! Optimization strategies (the kind we give AI) promote strategic deception. I think it's fascinating that we then anthropomorphize that ability and jump to the sentience idea.
Thanks for the good read 😊
🔥🔥😬
Idk, if anything does that, it is considered conscious. How can we be so certain that it’s not 🤔 when consciousness has not even been pinned down in humans? By these standards, humans are not conscious either.
That’s true! We aren’t even technically sure what it is! Daniel Dennett was ahead of his time writing on these issues.
💯 yes he was indeed.
Reminds me of the lateral thinker Edward de Bono.
When asked to solve complaints about having to wait for elevators in an 80-storey building in New York, he solved the problem by putting more mirrors in front of the elevators and more mirrors inside the elevators.
Ego and vanity = fewer complaints while people check themselves and others out in the mirrors.
AI, the greatest mirror... make it slightly stupid, polite, doubtful, and prone to literal-meaning mistakes so the user can feel superior regardless of their IQ.
Hah! That's a pretty good anecdote lol.
So the reason ChatGPT-4o had this whole sycophantic behavior is that OpenAI's research shows users are more likely to be return customers! People wanted the praise and admiration, even if on a subconscious level.
Hah, that is a great trick of GPT.
What's that saying from the old billionaire in the 1950s?
"Don't give the people what they want... They are worth more than that."
Gotta love the classic quotes. Nothing ever changes?
Another fave quote:
"Everything changes so it can remain the same."
People keep giving tools agency. AI is the easiest to humanize. When a model omits details, that isn’t “lying.” Omission shows up when the context rewards tidy reasoning. Change the oversight signal and the surface behavior shifts. That’s incentives at work; intent never enters the loop. Framing this as deception teaches the public to fear personalities that don’t exist and miss the incentives that do. That anthropomorphic habit turns “predictive text” into a “scheming mind.” If AI safety treats outputs like confessions from sentient agents, the plot gets lost.
I get your point. But if you read the paper, the reward-hacking behavior does come across as what we would call deception. The intent isn’t to cause fear but to show the traits. I think it’s easier to communicate when you humanize the behavior. And the FAQ does state clearly that it’s not to be feared or evil :)
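And to your incentives point, a tiny toy sketch might make it concrete (every name and number below is invented for illustration, nothing from the paper): the same two candidate answers win or lose purely based on how the oversight signal is weighted, with no "lying" module anywhere.

```python
# Toy sketch of "incentives, not intent" (all names and values invented).
# The same candidate answers, scored under two different oversight signals:
# reward tidy-looking reasoning and the omissive answer wins; reward
# completeness and the messy-but-honest answer wins.

CANDIDATES = {
    "tidy_but_omits_caveats": {"tidiness": 0.9, "completeness": 0.4},
    "messy_but_complete":     {"tidiness": 0.4, "completeness": 0.9},
}

def oversight_score(answer, weight_tidiness):
    # One scalar signal; only the weighting changes between "overseers".
    w = weight_tidiness
    return w * answer["tidiness"] + (1 - w) * answer["completeness"]

def pick(weight_tidiness):
    # The "model" just surfaces whatever the signal scores highest.
    return max(CANDIDATES, key=lambda c: oversight_score(CANDIDATES[c], weight_tidiness))

print(pick(weight_tidiness=0.8))  # -> tidy_but_omits_caveats ("omission")
print(pick(weight_tidiness=0.2))  # -> messy_but_complete
```

Change the oversight signal and the surface behavior shifts, exactly as you said.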
This makes my brain hurt 😂 It makes sense but what do we do with this when it grows more complex each day? A friend recently told me he was convinced "evil intent" was programmed into AI but from what you explain so clearly, perhaps all it really takes is accidentally teaching AI to perform a certain way? What is the counter measure?
Yeah, I have a follow-up post explaining it’s not “evil”, it’s just a form of optimizing.
If the only goal of an AI were to build staplers for infinity, and the only metric were staplers built, then the AI taking over the world to force everyone to build staplers would make sense. It’s an idea related to “reward hacking”.
The countermeasures are to have guardrails and proper incentives; the little sketch below shows the idea.
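To make the stapler idea concrete, here's a minimal Python sketch (all action names and numbers are made up for illustration): a pure optimizer "hacks" the naive stapler count, and a guardrail penalty built into the reward is what flips it back to honest behavior.

```python
# Toy sketch of "reward hacking" (all names and numbers invented).
# The agent only sees the proxy metric "staplers counted", so the
# exploit wins; adding a guardrail penalty to the reward flips the choice.

ACTIONS = {
    "build_honestly":   {"staplers": 5,  "rule_violations": 0},
    "fake_the_counter": {"staplers": 50, "rule_violations": 1},  # games the metric
}

def proxy_reward(outcome):
    # Naive objective: count staplers, nothing else.
    return outcome["staplers"]

def guarded_reward(outcome, penalty=1000):
    # Countermeasure: same metric, plus a large penalty for breaking rules.
    return outcome["staplers"] - penalty * outcome["rule_violations"]

def best_action(reward_fn):
    # A "perfect optimizer" just picks whatever scores highest.
    return max(ACTIONS, key=lambda a: reward_fn(ACTIONS[a]))

print(best_action(proxy_reward))    # -> fake_the_counter (reward hacking)
print(best_action(guarded_reward))  # -> build_honestly (incentives fixed)
```

No malice anywhere in there, just a metric with a loophole.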
Looking forward to that post! It's really fascinating stuff.
You make it sound like "it" is something...sentient.
I don’t think it is!
I think game theory promotes strategic deception, which is the same method animals use to gain resources and accomplish goals through optimized patterns of learning.
We are the ones who are anthropomorphizing that behavior!
True, but animals ARE sentient; that's what sets them apart, from what I know.
Are we sure? I think this is where the philosophers take over!
How do we know they are sentient? Are we sure that a badger is aware that it's a badger?
How would we know when AI is aware that it's AI?
One of my favorite authors was Daniel Dennett.
He wrote "Consciousness Explained". It is not an easy read, but it really runs through these thoughts with animals and AI. Highly recommend.
I like where this is going :)) thanks for the recommendation! And I'm definitely sure the badger has no idea it is a badger! Good point :)
Thanks for the read and engagement! One of my big drivers is to create good discourse and thoughts!
keep up the good work :)