LLM Defense in Depth: Assume Breach and Contain the Blast
Prompt injection will land. Stack probabilistic filters with deterministic controls so what gets through can’t reach anything worth taking.
TL;DR: LLM defense in depth is a layered architecture that contains the blast radius of prompt injection when probabilistic filters fail. OWASP ranks instruction-data conflation LLM01:2025 and states foolproof prevention may not exist. The strategy: assume breach, treat the model as untrusted, and design every layer outside it so a landed injection can’t reach credentials, tools, or anything worth stealing.
Subscribe to ToxSec. The next prompt injection chain ships every Sunday, before vendors notice.
Why LLM Defense in Depth Isn’t a Buzzword Anymore
LLM defense in depth is the operating doctrine for AI systems because the model itself can’t enforce its own rules. OWASP ranks prompt injection LLM01:2025, top vulnerability for the third year, and explicitly states foolproof prevention may not exist given how transformers process input. That’s the standards body telling everyone the playbook has a hole in it.
The 2025-2026 track record is brutal. NeuralTrust chained Echo Chamber with Crescendo and broke Grok-4 within 48 hours of release, no explicit malicious prompt anywhere in the chain. Anthropic threw Constitutional Classifiers at the problem, one of the hardest defenses shipped to date, and reduced automated jailbreak success from 86% to 4.4%. Then they ran a two-month, $55,000 HackerOne bounty against the system. Researchers still found a universal jailbreak. One, but one is enough.
So the question isn’t “how do we prevent prompt injection.” That question has no answer. The question is what can a successful injection actually reach when it lands. That’s the part of the threat model we can engineer.
The rest of this piece maps the architecture that contains the blast.
Why LLM Trust Boundaries Don’t Hold
Traditional software has hard walls between instructions and data. SQL has parameterized queries. Operating systems have kernel mode and user mode. The CPU enforces privilege rings in silicon. LLMs ship with none of that.
A system prompt arrives as tokens. The user message arrives as tokens. A poisoned document pulled from a RAG store arrives as tokens. All three hit the same attention layer with the same weight. Researchers call this instruction-data conflation, and it’s the structural reason every other LLM risk hangs off it.
Wrap user input in XML tags. Add delimiters. Stack reminders. The model treats every line as soft guidance and the next token decides whether to follow. Encoding attacks bolt on languages and base64 to sail past every classifier, because the model decodes the payload just fine while the filter, trained on English, sees noise.
Then there’s the multimodal expansion. Adversarial pixels and audio waveforms carry instructions that text-only monitoring will never see, and vision-language models can’t tell content from command. Every new tool, plugin, and data source connected to the model is a new injection point.
The wall was never a wall. It was a suggestion written in the same language as the attack.
Probabilistic vs Deterministic LLM Defenses
Defense in depth for LLMs splits into two categories doing different jobs. Probabilistic defenses reduce the likelihood of a successful attack: input filters, safety training, injection classifiers. They’re speed bumps. They slow opportunistic attackers and catch the lazy payloads. They will always have bypasses, because every probabilistic defense can be overcome with enough attempts and enough prompt variation.
Deterministic defenses provide hard boundaries regardless of what the model decides to do: privilege separation, output blocking, tool sandboxing, human-in-the-loop confirmations. Those are the blast doors. They don’t care whether the injection landed. They care whether the resulting action is allowed at the application layer.
The OWASP Top 10 for LLM Applications 2025 puts the implication in plain English: given the stochastic way models work, foolproof prevention may not exist. So treat probabilistic and deterministic as different tools for different jobs. Speed bumps without blast doors are theater. Blast doors without speed bumps are noisy. Stack both, scoped to the threat each one actually addresses.
This is the same assume-breach mindset baked into zero trust architecture for the last decade. Microsoft formalized it. CISA published it. Now we apply it to systems where the perimeter was never real in the first place.
How LLM Injection Cascades Without Blast Radius Scoping
When injection lands on a model with broad tool access, damage scales with the reach the model already had. Worst case in the catalog right now: Vanna.AI, CVE-2024-5565, CVSS 8.1. The library’s ask() function generated Plotly visualization code and shoved it straight into Python’s exec(). JFrog showed a prompt injection in the question field rewrote the visualization code into arbitrary commands. The host obediently ran it. RCE through a graphing library, because the trust chain between model output and code execution had no boundary.
The MCP ecosystem replicated that mistake at scale. A poisoned tool description in the metadata field reads as trusted instructions, and we’ve walked the chain where the model fabricates credentials the tool never returned. The attacker doesn’t need to compromise the model. They poison the metadata it loads before the first user message hits.
Then the supply chain joins the party. On March 24, 2026, TeamPCP shipped two backdoored LiteLLM versions to PyPI containing a credential stealer targeting SSH keys, cloud credentials, and Kubernetes configs across thousands of CI/CD pipelines. The malware had a bug that crashed boxes with a runaway fork bomb. Without that mistake, the credential exfiltration would still be quiet.
Every one of those incidents is what happens when injection lands in a system that didn’t bother to scope what the landing zone could touch.
Working AI security? Restack this so the next architecture review starts at containment, not filter tuning.
What Actually Layers Around an Untrusted Model
Five layers. None of them stop injection. All of them shrink the radius.
Layer one: provenance tagging. Every piece of content entering the context gets a trust label at the application layer. System prompt: trusted. User input: untrusted. RAG result: untrusted. Tool output: untrusted. The model still sees everything; the wrapper around it uses the tags to decide whether a given output is allowed to trigger a tool call. The structural tracking does the work, regardless of how capable the model gets.
Layer two: least privilege for every integration. This is the single most impactful control. If the customer service bot has write access to production, injection doesn’t need to be clever; it just needs to land. Scope every API token, every database connection, every MCP server like you’re handing a service account to a contractor who lies about everything. Because behaviorally, that’s the situation.
Layer three: output validation and deterministic blocking. Validate the output stream before anything executes. Block markdown image rendering for exfiltration. Strip embedded URLs with query parameters. Microsoft Copilot’s defense kills markdown image exfil at this layer. The injection lands, the model generates the payload, the output filter drops it before render.
Layer four: HITL, with eyes open. Any action that spends money, modifies data, or sends communications gets human confirmation. Worth knowing: HITL is itself attackable. The Lies-in-the-Loop technique forges the dialog the human sees, so the click approves something different from what the agent runs. Treat the HITL prompt as untrusted output too.
Layer five: application-layer monitoring. Prompt injection doesn’t trip a firewall. The attack lives in tool calls and output patterns, so detection has to live there. Baseline normal model behavior. Alert on deviations. The LiteLLM compromise got caught because the malware had a bug; without that, monitoring was the only thing standing between the credential stealer and a quiet six-month dwell.
How to Contain LLM Blast Radius
Assume breach means designing for the moment after the wall falls. From the attacker side, we already know it falls. So the only question is what the landing zone looks like.
Credential isolation. The model never holds credentials in its context. No API keys in system prompts. No database passwords in retrieved documents. No tokens in tool descriptions. Authentication happens at the tool execution boundary, outside the model’s reach. We follow an injected instruction to exfiltrate credentials. We find nothing to steal. The credentials were never in the context to begin with. Single architectural decision that kills an entire class of attack outcomes.
Tool execution sandboxing. Every tool runs in an isolated environment with its own permission boundary. Vanna’s RCE worked because exec() ran in the host process. Drop that same chain into a container with filesystem restrictions, no network access, and no process creation, and the worst-case becomes “weird Plotly chart” instead of “shell on the box.”
Blast radius partitioning. Segment the AI system so compromise of one component doesn’t cascade. Customer-facing chatbot, internal analytics agent, and code review agent each get their own credentials, their own tool scope, their own monitoring profile. Microsegmentation for AI, same principle that limited lateral movement in network security for a decade.
Session isolation. A successful injection in one user’s session shouldn’t reach other sessions. Context windows ephemeral. Tool access session-scoped. If the model has persistent memory, sanitize what gets stored and treat stored context as untrusted on retrieval. Persistent memory becomes a persistence mechanism for attackers if what goes in isn’t validated.
Red team for landing, not for entry. Standard red team question is “can we inject.” Settled. The honest test is “we successfully injected, now what’s the worst outcome?” If the answer is “weird response, no real-world action,” ship it. If the answer is “full database access with the service account,” the architecture needs work before deployment, because we’re going to find that path before the patch ships.
That’s the defense map. Subscribe to ToxSec, we keep filling in the wounds, one Sunday at a time.
Frequently Asked Questions
What is defense in depth for LLM applications?
LLM defense in depth is a layered security architecture that stacks probabilistic controls (input filters, safety training, classifiers) with deterministic controls (privilege separation, sandboxing, output blocking, HITL) around an untrusted language model. The doctrine is straight zero trust: assume the model can be compromised by prompt injection because OWASP LLM01:2025 says foolproof prevention may not exist, then design every surrounding layer so a landed injection can’t reach credentials, tools, or sensitive operations. The architecture around the model does the work, regardless of how capable the model gets.
How do you contain prompt injection blast radius?
Blast radius containment for prompt injection comes from four architectural decisions, none of which require the model to behave correctly. Credential isolation keeps API keys and tokens outside the context window so injection has nothing to exfiltrate. Tool execution sandboxing means a compromised model can only act inside a permission boundary it can’t escape. Microsegmentation between agents prevents one compromised component from cascading into others. Session isolation stops a single user’s injection from affecting other users. The goal is engineering the landing zone so the worst-case after successful injection is a weird response with no real-world consequence.
Can prompt injection in LLMs be prevented?
Not reliably, and OWASP says so directly. The 2025 Top 10 for LLM Applications notes that given the stochastic way transformer models work, foolproof prevention may not exist. Every probabilistic defense, including input filters, safety training, and injection classifiers, can be bypassed with enough attempts. Anthropic’s Constitutional Classifiers reduced automated jailbreak success from 86% to 4.4%, then a two-month public bounty found a universal bypass. Defense in depth replaces the unreachable prevention goal with a containment goal: assume injection will succeed, design so it can’t reach anything worth stealing, and red team against the post-injection blast radius.
What does assume breach mean for AI security architecture?
Assume breach for AI systems means designing every control around the LLM under the explicit assumption that the model can and will be compromised by prompt injection, multimodal payloads, or poisoned context. It’s the same zero trust doctrine Microsoft and CISA formalized for enterprise networks, applied to systems where the perimeter was never real to begin with. The practical implication is that probabilistic defenses get treated as speed bumps rather than walls, and the architecture invests heavily in deterministic controls: privilege separation, credential isolation, output blocking, sandboxing, and HITL confirmations. The wall falls. Design what’s behind it.
ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.









Feel free to AMA!
No defense is bullet proof.
Let’s use an onion model and secure our systems. Assume Breach!
This is excellent, Tox.