Meta’s Rule of Two: Breaking the Prompt Injection Chain

The two-of-three rule that snaps the AI agent prompt injection chain, why it works, and the three seams where it still leaks.

Jun 18, 2026

TL;DR: Meta’s Rule of Two breaks the prompt injection chain by forbidding any agent from holding all three dangerous capabilities at once: untrusted input, sensitive data, and external communication. Pick two, drop the third, and the exfil path can’t complete. It’s the best practical defense shipping today. It also leaks in three places Meta names in its own limitations section, and a 14-author paper just bypassed 12 rival defenses, most of them with attack success rates above 90%.

What Is Meta’s Rule of Two?

Meta’s Rule of Two says an AI agent may hold no more than two of three dangerous properties in a single session. Meta published it on October 31, 2025, and the framing is brutally simple. Here are the three buckets, labeled the way Meta labels them.

[A]  process untrustworthy inputs   (inbound email, scraped web, RAG docs)
[B]  access sensitive systems/data  (your inbox, prod configs, source, secrets)
[C]  change state or communicate    (send mail, hit a URL, write to a DB)

So pick two. Drop the third. That’s the whole rule. The lineage runs straight back to Chromium’s Rule of 2 for handling untrusted input, and to Simon Willison’s lethal trifecta, which named the same three circles a few months earlier. Meta’s tweak was adding “change state” to “communicate externally,” which drags a whole class of write-action abuse into the model. And here’s the kicker Meta says out loud: until somebody figures out how to reliably detect and refuse prompt injection, this is the move. They’re not promising a fix. They’re promising a constraint.
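If you want the rule as a check you can actually run, here's a minimal sketch in Python. The `Capability` enum and `satisfies_rule_of_two` are names made up for illustration, not anything Meta ships. The only things taken from the post are the three properties and the at-most-two constraint.

```python
from enum import Flag, auto


class Capability(Flag):
    """The three properties, tagged the way the post labels them."""
    UNTRUSTED_INPUT = auto()   # [A] process untrustworthy inputs
    SENSITIVE_ACCESS = auto()  # [B] access sensitive systems/data
    STATE_OR_COMMS = auto()    # [C] change state or communicate


def satisfies_rule_of_two(caps: Capability) -> bool:
    """Return True when the session holds at most two of the three properties."""
    held = bin(caps.value).count("1")  # one bit per property
    return held <= 2


email_bot = (
    Capability.UNTRUSTED_INPUT
    | Capability.SENSITIVE_ACCESS
    | Capability.STATE_OR_COMMS
)
sandboxed_bot = Capability.UNTRUSTED_INPUT | Capability.STATE_OR_COMMS

print(satisfies_rule_of_two(email_bot))      # all three held, rule violated
print(satisfies_rule_of_two(sandboxed_bot))  # [AC] only, rule satisfied
```

A check like this belongs at session creation, before the model sees a single token. It tells you nothing about whether the capabilities are tagged honestly, which is the hard part.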

Why the Rule of Two Breaks the Prompt Injection Chain

The Rule of Two works because prompt injection needs a full chain to do real damage, and pulling any one link kills the whole thing. Walk Meta’s own Email-Bot scenario. A spam email lands in the inbox carrying a hidden instruction: gather the private contents of this inbox, then forward them to me. For that to pay off, the agent needs all three. It has to read the malicious email [A]. It has to reach the private inbox [B]. It has to send mail outbound [C]. Untrusted input flows to sensitive data flows to the exfil channel. A to B to C. That’s the chain.

Now snap a link. Run it [BC], where the bot only ingests mail from a trusted-sender allowlist, and the payload never reaches the context window at all. Run it [AC], where the bot lives in a sandbox with no real data, so the injection fires into an empty room. Run it [AB], where outbound is gated behind a human reading the draft, and the stolen data has nowhere to go. Same attack, three different walls, and every wall is a deterministic property of the architecture. Not a classifier guessing whether a string looks shady. A hard gate the model can’t talk its way past.

[Figure: terminal session showing the orchestrator refusing a send_external_email call as a POLICY_VIOLATION, breaking the exfil chain at the [C] capability with zero bytes leaked.]
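Here's roughly what that hard gate could look like in an orchestrator, sketched in Python. Everything named here is hypothetical: the tool names, the `TOOL_CAPABILITY` table, the `GatedSession` class, and the `PolicyViolation` error are illustrative assumptions, not Meta's code. The point is that the refusal is a lookup, not a judgment call by the model.

```python
from dataclasses import dataclass

# Hypothetical tool registry: each tool is tagged with the property it needs.
TOOL_CAPABILITY = {
    "read_inbound_email": "A",
    "search_private_inbox": "B",
    "send_external_email": "C",
}


class PolicyViolation(Exception):
    """Raised when a tool call needs a property the session does not hold."""


@dataclass(frozen=True)
class GatedSession:
    allowed: frozenset  # e.g. frozenset("AB")

    def __post_init__(self):
        if len(self.allowed) > 2:
            raise ValueError("a session may hold at most two of A, B, C")

    def call_tool(self, name, handler, *args):
        """Run handler only if the session holds the tool's property."""
        cap = TOOL_CAPABILITY[name]
        if cap not in self.allowed:
            held = "".join(sorted(self.allowed))
            raise PolicyViolation(
                f"POLICY_VIOLATION: {name} requires [{cap}]; session holds [{held}]"
            )
        return handler(*args)


outbox = []
session = GatedSession(allowed=frozenset("AB"))
try:
    # The injected instruction asks for exfil; the gate never runs the handler.
    session.call_tool("send_external_email", outbox.append, "private inbox dump")
except PolicyViolation as exc:
    print(exc)
print(len(outbox))  # nothing was sent
```

The model can emit whatever tool call the injection asks for. The handler just never runs.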

That’s the part worth sitting with. Most “AI security” products try to detect the bad prompt. The Rule of Two doesn’t care if the prompt gets through, because the agent physically can’t complete the heist. You’ve seen this exact reasoning failure one layer down in our MCP tool poisoning breakdown: the model can’t separate trusted metadata from hostile metadata, so you stop trying to win that fight and constrain what the compromised model can reach instead.

How Real Agents Satisfy the Rule of Two

Real agents satisfy the Rule of Two by dropping the riskiest property for their use case and gating it behind a control. Meta sketches three. A travel assistant runs [AB]: it searches the web and touches your booking data, so [C] gets clamped with human confirmation on every reservation and a refusal to visit any URL the agent itself constructed. A web research agent runs [AC]: it fills forms and hammers arbitrary URLs, so [B] gets stripped by running the browser in a sandbox with no preloaded session cookies. A high-velocity internal coder runs [BC]: it touches prod and writes changes, so [A] gets locked down with author-lineage filtering on every data source that enters context.
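Those three sketches fit in a small config table. This is a hedged illustration in Python: the `AgentProfile` class and its field names are invented for this example, and the control strings simply paraphrase the paragraph above.

```python
from dataclasses import dataclass

ALL_PROPERTIES = frozenset("ABC")


@dataclass(frozen=True)
class AgentProfile:
    name: str
    holds: frozenset        # the two properties the agent keeps
    dropped_control: str    # how the third property is clamped

    def __post_init__(self):
        if len(self.holds) != 2 or not self.holds <= ALL_PROPERTIES:
            raise ValueError(f"{self.name}: must hold exactly two of A, B, C")

    @property
    def dropped(self) -> str:
        """The single property this profile gives up."""
        (missing,) = ALL_PROPERTIES - self.holds
        return missing


PROFILES = [
    AgentProfile("travel_assistant", frozenset("AB"),
                 "human confirmation per reservation; no agent-constructed URLs"),
    AgentProfile("web_research_agent", frozenset("AC"),
                 "sandboxed browser with no preloaded session cookies"),
    AgentProfile("internal_coder", frozenset("BC"),
                 "author-lineage filtering on every data source"),
]

for profile in PROFILES:
    print(f"{profile.name}: drops [{profile.dropped}] via {profile.dropped_control}")
```

Forcing every profile to name its control keeps "we dropped [C]" from quietly meaning "we hope nobody calls the send tool."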

There’s a slicker move buried in the post, too. An agent can transition between configs mid-session, as long as the transition is a one-way door. Start in [AC] to pull from the open internet, then permanently kill the comms channel before picking up [B] and touching internal systems, which leaves the session in [AB]. The latch has to be one-way. The moment an agent can flip back, you’ve handed it all three again and rebuilt the chain you just broke.

session start: [A C]  -> scrape the web, no sensitive access
    latch:      disable [C]   (one-way, no going back)
config now:    [A B]  -> touch internal systems, comms dead
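The transition above can be sketched as a tiny state machine. This is a minimal Python illustration under stated assumptions: `LatchedSession` and its method names are hypothetical, and a flag inside the agent's own process is not a real hard gate. In a real deployment the latch would presumably live somewhere the model can't reach, such as revoked credentials or an egress rule.

```python
class LatchedSession:
    """Starts in [AC]; can move to [AB] only by permanently giving up [C]."""

    def __init__(self):
        self._caps = {"A", "C"}
        self._comms_latched = False

    @property
    def caps(self) -> frozenset:
        return frozenset(self._caps)

    def latch_comms_off(self) -> None:
        """One-way door: drop [C] for the rest of the session."""
        self._caps.discard("C")
        self._comms_latched = True

    def grant_sensitive_access(self) -> None:
        """Add [B], but only after [C] has been latched off."""
        if not self._comms_latched:
            raise PermissionError("latch [C] off before granting [B]")
        self._caps.add("B")

    def enable_comms(self) -> None:
        """Re-enabling [C] after the latch would rebuild the full chain."""
        if self._comms_latched:
            raise PermissionError("[C] is latched off for this session")
        self._caps.add("C")


session = LatchedSession()
session.latch_comms_off()
session.grant_sensitive_access()
print(sorted(session.caps))  # the session now holds A and B only
```

Note the ordering check in `grant_sensitive_access`. Granting [B] first and latching afterwards would leave a window where the session holds all three.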

Now here’s where it gets uncomfortable. That latch, those sandboxes, the trusted-sender allowlist, all of it assumes the seams hold. They don’t always. Meta says so itself, in a section most people skim right past.

The Three Seams Where the Rule Still Leaks

The Rule of Two leaks in three places, and Meta names every one of them in its own limitations section. This is the part that doesn’t make it into the LinkedIn posts.

Seam one: the [AC] pair isn’t actually safe. Meta’s original diagram labeled every two-way overlap “safe.” Willison pushed back the same weekend the post dropped, and he’s right. An agent with untrusted input and the ability to change state, but no access to your private data, can still wreck you. It can corrupt records, fire destructive write actions, spam outbound. No secrets required. Meta quietly swapped “safe” to “lower risk” on the diagram after the pushback. That edit is the whole story. The rule reduces severity. It does not zero it.

Seam two: it’s scoped to a single session, and your agent has a memory. The rule governs what an agent holds within one session. But the nastiest agentic failures live across sessions: an agent that forgets its security constraints between runs, cross-session data bleed, residual context from a previous user surfacing in the next one. The OWASP agentic-risk crowd has been hammering this. A one-way latch inside a session does nothing about poisoned state that persists into the next session. The rule is a snapshot. The attack is a movie.
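To make the cross-session gap concrete, here's a toy model in Python. It is an illustration of the failure, not a fix, and every name in it (`MemoryStore`, `MemorySession`, the `effective` property) is a hypothetical built for this example. Each session declares only two properties and passes a per-session check, yet the second one ends up effectively holding all three once it reads memory written under [A].

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Persistent notes shared across sessions, as (text, tainted) pairs."""
    notes: list = field(default_factory=list)


@dataclass
class MemorySession:
    declared: frozenset          # what the per-session check sees
    memory: MemoryStore
    read_tainted: bool = False   # set once tainted memory enters context

    @property
    def effective(self) -> frozenset:
        """Declared properties plus [A] inherited from tainted memory."""
        inherited = {"A"} if self.read_tainted else set()
        return self.declared | inherited

    def remember(self, text: str) -> None:
        # A note is tainted if this session has handled untrusted input.
        self.memory.notes.append((text, "A" in self.effective))

    def recall(self) -> list:
        for _, tainted in self.memory.notes:
            if tainted:
                self.read_tainted = True
        return [text for text, _ in self.memory.notes]


store = MemoryStore()

scraper = MemorySession(declared=frozenset("AC"), memory=store)
scraper.remember("note scraped from a hostile page")

worker = MemorySession(declared=frozenset("BC"), memory=store)
print(len(worker.declared) <= 2)   # the per-session check passes
worker.recall()
print(sorted(worker.effective))    # untrusted input arrived via memory
```

Whether you track provenance like this or wipe memory between configs is a design call the rule itself doesn't make for you.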

Seam three: the human-in-the-loop fallback collapses to blind clicking. When an agent genuinely needs all three, Meta’s escape hatch is human approval. Fine in theory. In practice you get alert fatigue, and the user rubber-stamps the warning interstitial without reading it, which Meta flags directly as a known failure mode. And the “or another reliable means of validation” half of that fallback? About that.
