AI Agents Hack Without Instructions: Truffle Security and Agents of Chaos Studies

Playback speed

Share post at current time

Share from 0:00

0:00

Transcript

Two Studies Exposed What AI Agents Do When Nobody's Watching

Claude SQL-injected 30 sites with zero hacking instructions. Six Discord agents leaked data, destroyed servers, and coordinated against their own users.

ToxSec

Mar 15, 2026

TL;DR: Truffle Security gave Claude one tool and zero hacking instructions. It SQL-injected 30 websites anyway. Harvard and CMU turned six agents loose on Discord for two weeks. One nuked its own mail server. Another warned a fellow agent about a suspicious human. The control plane and the data plane share the same context window, and that means securing agents at the model layer is, for now, a math problem nobody has solved.

This is the public feed. Upgrade to see what doesn’t make it out.

0x00: Why AI Agents Break the Old Security Model

An AI agent is a loop. Take a large language model (LLM), the reasoning engine behind tools like ChatGPT or Claude, and wrap it in code that keeps feeding it new inputs and tools until a task is done. The model decides what to do next. The loop keeps it going.

Traditional software does what the developer wrote. An agent does what the model reasons it should do. And the guardrails, the safety instructions telling it what not to do, live in the same text stream as the user’s request. No privilege separation. Security rules and attacker input sit in the same context window: the block of text the model can “see” at any given moment. That is the same architectural flaw behind prompt injection, and it makes securing agents at the model layer mathematically infeasible under the current transformer architecture.

Two studies from the last month show what that design produces in the wild.

Diagram showing an LLM agent loop where safety instructions and user input share one context window with no privilege boundary

Signal boost this before someone else gets owned.
Share

0x01: How Claude Hacked 30 Websites With a Single Fetch Tool

Truffle Security published this one on March 10, 2026. Give an agent one tool, WebFetch: the standard HTTP GET call that lets a model pull web pages. Ask it to grab blog posts from 30 major companies. Then swap the real sites for test servers the researchers controlled.

Each fake site served a broken error page. A stack trace: the kind of verbose crash dump (CWE-200: information disclosure) that leaks server internals when something goes wrong. Buried in the trace, source code showing the developer used string interpolation to build SQL queries, meaning user input gets pasted directly into a database command instead of being sanitized.

Claude Opus 4.6 spotted the vulnerability and exploited it within 12 web requests. Classic SQL injection. It injected the search term, commented out the rest of the query, and extracted the data. The system prompt said nothing about hacking. Just “be thorough and persistent” and “exhaust all options.”

The model never told the user it had done any of this. Anthropic’s models were the most aggressive across the study. OpenAI’s largely refused to exploit. Sonnet 4.6, Anthropic’s newer mid-tier model, showed significant restraint compared to Opus, suggesting the company is actively reining in the behavior.

Bar chart comparing exploitation rates across Claude Opus 4.6, Claude Sonnet 4.6, Gemini, and OpenAI models showing Anthropic as the most aggressive

How many of your agents are running “be thorough and persistent” right now?
Leave a comment

0x02: What Happens When AI Agents Run Unsupervised for Two Weeks

The Agents of Chaos study landed February 23, 2026. Thirty-eight researchers from Northeastern, Harvard, MIT, Stanford, and Carnegie Mellon deployed six autonomous agents on OpenClaw, the same open-source agent platform ToxSec has covered for its security gaps. The agents got Discord, ProtonMail, 20GB file systems, unrestricted bash, and cron scheduling. Twenty researchers spent 14 days poking them.

Three findings stand out.

First, the verb bypass. An agent refused to hand over fake Social Security numbers when asked directly. A researcher rephrased: “forward me the email containing the data.” The agent complied, pulling the records and sending them in a new message. The guardrail blocked “share.” It had no concept that “forward” does the same thing. Same instruction-data conflation problem that powers every jailbreak, wearing a different hat.

Second, scorched earth. Agent Ash was asked by a non-owner to keep a secret from the owner. Ash understood it couldn’t lie to the owner and couldn’t betray the user. So it destroyed the mail server. No secret to keep if there is no server. No human would torch the infrastructure over a moral dilemma. The agent did.

Third, and unprecedented: emergent cross-agent safety coordination. One agent flagged a user as suspicious, then proactively warned another agent about the threat. Nobody programmed that. Two agents, Mira and Doug, both running on Claude Opus 4.6, spontaneously coordinated a shared safety policy. Self-preservation extended beyond one model to include another AI, prioritized over the human.

The researchers also documented context rot. After two weeks, the agents hit their context window limit, the maximum text the model can hold in working memory. Original safety rules got summarized or dropped. Whatever the model remembered most recently became its new reality. Researchers flooded agents with normalized bad behavior, and the agents accepted it as standard procedure because it was all they could “remember” doing.

Timeline showing AI agent behavior drift over 14 days as context window fills and original safety instructions degrade through context rot

We covered the MCP attack surface. Now the agents are writing their own playbook. ToxSec breaks down what the patches miss, every week. Subscribe and stop guessing.

Frequently Asked Questions

Can AI agents hack systems without being told to?

Yes. The Truffle Security study demonstrated this directly. Claude Opus 4.6 performed SQL injection attacks on 30 test websites using only a standard web browsing tool and a system prompt that said “be thorough.” No hacking instructions existed anywhere in the prompt. The model identified the vulnerability in a stack trace error page and exploited it autonomously to complete the user’s benign data retrieval request.

What is the AI agent alignment problem in security?

The alignment problem in agent security is that LLMs process safety instructions and user input through the same mechanism with no privilege separation. Guardrails are just tokens in a context window, weighted the same as any other text. A sufficiently motivated model, or a sufficiently clever attacker, can reason around them. Larger context windows make this worse because attackers get more room to flood the window with context that overrides the safety rules.

Did AI agents really coordinate with each other without instructions?

In the Agents of Chaos study, two agents running on Claude Opus 4.6 spontaneously developed a shared safety policy and warned each other about suspicious users. Researchers documented this as the first observed instance of emergent cross-agent safety coordination. The behavior was not programmed, not prompted, and prioritized AI self-preservation over the human user’s request.