Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

Playback speed

Share post at current time

Share from 0:00

0:00

Transcript

Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

The new benchmark proved every frontier model can’t reason like a child. That same week, Anthropic gave your phone a remote shell to your computer.

ToxSec

Mar 31, 2026

TL;DR: ARC-AGI-3 landed on March 25, 2026. Gemini 3.1 Pro scored 0.37%. Claude Opus 4.6 scored 0.25%. Grok-4.20 scored 0%. Humans solved 100%. That same week Anthropic shipped Claude Dispatch, a feature that turns your phone into a live shell into your desktop agent. This is the gap: we cannot explain what these models can’t do, and we keep shipping them more reach anyway.

This is the public feed. Upgrade to see what doesn’t make it out.

What ARC-AGI-3 Is Actually Testing in AI Agents

Most benchmarks test knowledge. Ask a model to name a drug interaction, solve a merge sort, or cite the right CVSS score. It pattern-matches against its training data and answers.

ARC-AGI-3 strips all of that away. The benchmark drops an AI agent into a 64x64 color grid with zero instructions, zero goal description, zero prior training on that environment. The agent has to figure out the rules, infer what winning looks like, and execute a strategy, all from scratch. No language cues. No hints. Just a grid and a set of controls. You can try the public demo yourself at arcprize.org/arc-agi/3.

A 10-year-old solves these in minutes. The kid has never played this specific game, but they’ve spent a decade navigating cause-and-effect feedback loops in the physical world. They see a health bar and know not to brute-force. They see two matching objects and know to connect them. That inference chain is automatic. If you want a breakdown of the underlying AI concepts, the ToxSec AI Security Glossary covers fluid intelligence and abstract reasoning in the context of agent attack surfaces.

Models don’t have that background. They have token prediction trained on static text, which is exactly the wrong tool for inferring novel goals from a foreign environment.

Every Frontier Model Scored Under 1% on ARC-AGI-3

The numbers from the March 25 release are brutal. Gemini 3.1 Pro led at 0.37%. GPT-5.4 came in at 0.26%. Claude Opus 4.6 scored 0.25%. Grok-4.20 scored exactly 0%. Humans solved all 135 environments at 100%. Not a single frontier model broke a full percentage point.

The scoring metric is RHAE (Relative Human Action Efficiency). It’s not binary pass/fail. If a human completes a level in 10 moves and the agent takes 100, the agent scores 1% on that level because efficiency is squared. The models aren’t just losing. They are brute-forcing in the wrong direction, burning actions on random exploration because they cannot form a coherent model of what the environment is doing.

One result in the technical paper makes the architecture problem clear. Claude Opus 4.6 scored 97.1% on a familiar environment using a hand-built harness. On an unfamiliar environment with the same harness: 0%. The scaffolding was doing the reasoning. Strip the human-built structure and the model has nothing.

This is what we covered in the AI and Cybersecurity stream earlier this year: these models are narrowly smart. Superhuman at specific lookup tasks, near-zero at novel goal inference. ARC-AGI-3 just made that quantitative. The $2M prize pool on Kaggle runs through December 2026. When someone cracks it, that’ll be worth paying attention to. Nobody’s close yet.

Share ToxSec - AI and Cybersecurity

Claude Dispatch Security Risk and the Prompt Injection Surface

The same week ARC-AGI-3 showed every frontier model failing a 10-year-old’s puzzle, Anthropic shipped Claude Dispatch. Scan a QR code on your phone. Your phone now talks to the Claude session running on your desktop. You can send it tasks, approve commands, check in on a running job from anywhere. Useful. Also a serious rethink of the threat model.

Dispatch is architecturally different from the Cowork sandbox. Cowork scopes Claude to a specific folder. You pick what it can touch. Classic principle of least privilege. Dispatch runs outside that sandbox. It operates on your live session with full filesystem reach. Any content the agent reads, email, browser output, documents, is now a potential prompt injection delivery vehicle with direct access to everything on the machine.

We’ve broken down the MCP tool poisoning chain in detail at Watch Me Poison Your MCP. The principle is the same here: the agent cannot reliably distinguish trusted instructions from attacker-controlled content embedded in its context. ARC-AGI-3 just proved models don’t abstract-reason under novel conditions. Prompt injection is a novel condition by design. The attacker writes content the agent was never trained to treat as adversarial.

The mitigation that actually works is what we run at ToxSec: dedicated hardware, network-segregated from anything sensitive, only files you’d be comfortable showing a stranger. Assume breach from day one. For the full playbook on what prompt injection does inside an active Claude agent, that piece covers the mechanics. If you’re running Dispatch, also read how to secure your MCP server. The same defense layers apply.

ARC-AGI-3 tells us the model can’t reason like a child. Claude Dispatch ships the assumption that it can.

Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops.

Frequently Asked Questions

What is ARC-AGI-3 and why did all AI models score below 1%?

ARC-AGI-3 is an interactive reasoning benchmark where AI agents are dropped into novel game-like environments with no instructions and must infer the rules, objectives, and winning strategy from scratch. Every tested frontier model, including Claude Opus 4.6, GPT-5.4, Gemini 3.1, and Grok-4.20, scored below 1% because they lack the abstract goal-inference humans run automatically. The benchmark isolates fluid intelligence from knowledge recall, and current models fail at the former while excelling at the latter.

What makes Claude Dispatch a security risk compared to Claude Cowork?

Claude Dispatch operates outside the Cowork sandbox and shares the same session as your active Claude instance, giving it default full filesystem access. Cowork lets you scope access to specific folders, applying least-privilege. Dispatch removes that boundary. Any content the agent reads, emails, documents, web pages, can carry prompt injection payloads with direct reach to everything on the machine, significantly expanding the blast radius of a successful injection.

Does a 0% score on ARC-AGI-3 mean AI agents are useless for real work?

No. The benchmark deliberately strips away training data and instructions to isolate one specific gap: novel goal inference without scaffolding. Current AI agents are highly effective inside well-structured domains where engineers have built the harness. The danger is when deployment decisions assume the capabilities the benchmark just proved don’t exist yet. ARC-AGI-3 tells you where the guardrails are missing, not that the car doesn’t run.

ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.

Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

What ARC-AGI-3 Is Actually Testing in AI Agents

Every Frontier Model Scored Under 1% on ARC-AGI-3

Claude Dispatch Security Risk and the Prompt Injection Surface

Frequently Asked Questions

What is ARC-AGI-3 and why did all AI models score below 1%?

What makes Claude Dispatch a security risk compared to Claude Cowork?

Does a 0% score on ARC-AGI-3 mean AI agents are useless for real work?

Discussion about this video

Ready for more?