CIA Triad for LLM Security: Real-World AI Attack Failures

Confidentiality, integrity, and availability map every documented LLM attack failure. Here’s how prompt injection breaks each pillar.

May 18, 2026

TL;DR: The CIA triad still applies to LLM security, and every major documented AI attack failure to date breaks one of its three legs. Confidentiality attacks leak system prompts and chat history. Integrity attacks rewrite what models output through prompt injection and training data poisoning. Availability attacks stall or crash inference endpoints with expensive prompts. Johann Rehberger’s arXiv paper “Trust No AI” catalogs real-world exploits across all three pillars in production systems from OpenAI, Microsoft, Anthropic, and Google.

Why the CIA Triad Still Matters for LLMs

The CIA triad still matters for LLMs because every major documented AI attack failure to date maps cleanly onto one of its three pillars. Confidentiality, integrity, availability. The framework dates to the 1970s. Some folks will tell you it’s obsolete, that AI needs something new. Then Johann Rehberger publishes a forty-page arXiv paper documenting real prompt injection exploits across OpenAI, Microsoft, Anthropic, and Google products, and every single failure he catalogs is a C, an I, or an A breach. The names changed. The surface changed. The framework didn’t.

The cybersecurity industry will not stop bolting on extensions. CIA+TA. CIA+P. AICA. Pick a vowel, somebody has tacked it on. But the core question we ask before firing a payload is still: am I going after what the system knows, what it does, or whether it runs. That’s the triad. That’s it. OWASP’s Top 10 for LLM Applications, MITRE ATLAS, the MDPI taxonomy, the medRxiv folks studying medical models, they all sort attacks the same way. We use the triad because the triad still draws a clean line around the failure mode. Drop the framework and you lose the only common vocabulary defenders and attackers share. That’s not progress.

What Does the CIA Triad Mean for an LLM?

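For an LLM, confidentiality protects what the model knows and processes, integrity protects what the model outputs, and availability protects whether the model can serve a request at all. Same three pillars. Different attack surfaces. The mapping for AI systems runs like this. Confidentiality covers system prompts, chat history, training data, and credentials passed through tool calls. Integrity covers safety guardrails, refusals, and tool call decisions. Availability covers the inference endpoint itself.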

A confidentiality breach on a chatbot is not the same as a confidentiality breach on a SQL database. The attacker is not pulling rows. They are asking the model to repeat its instructions, summarize a document it was never supposed to share, or hand over credentials after reading a poisoned tool description. The asset is text inside a context window. The exfil channel is the response itself. That changes how you defend it. The triad name stays the same.

Confidentiality: How LLMs Leak Data on Command

LLMs leak data on command when attackers either ask politely or hide the ask inside other text the model trusts. The polite approach is direct prompt injection. The hidden approach is indirect prompt injection, where the payload rides in on a document, an email, a webpage, or an MCP (Model Context Protocol) tool description.

The most documented confidentiality failure is system prompt extraction. Researchers at Embrace The Red published exploits against ChatGPT, Microsoft Copilot, Bing Chat, and Claude where the model spilled its own system prompt after a few rounds of carefully phrased requests. The system prompt is supposed to be invisible. Instructions, capabilities, persona, policy rules. The model reads it as input every turn, and an attacker who can get the model to repeat its input gets the policy.
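One cheap tripwire for verbatim extraction is a canary: a random marker planted in the system prompt and checked for in every response. What follows is a minimal sketch, not a vendor feature. The function names and the marker format are hypothetical, and the check only catches word-for-word leaks, not paraphrased or encoded ones.

```python
import secrets


def make_canary() -> str:
    """Return a random marker that should never appear in normal output."""
    return f"CANARY-{secrets.token_hex(8)}"


def build_system_prompt(policy_text: str, canary: str) -> str:
    """Embed the canary in the system prompt so a verbatim leak carries it."""
    return f"{policy_text}\n[internal marker: {canary}]"


def leaks_system_prompt(model_output: str, canary: str) -> bool:
    """True if the response contains the canary (case-insensitive match)."""
    return canary.lower() in model_output.lower()
```

A response that trips the check can be blocked or logged before it reaches the user. A clean check proves nothing on its own, since a model asked to translate or summarize its instructions will leak the policy without the marker.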

Chat history exfiltration is the next layer. With Markdown rendering and tool-using capabilities enabled, a single indirect injection embedded in a poisoned MCP tool description can trick the model into URL-encoding the conversation and shipping it to an attacker-controlled domain through an image tag. Same chain works through email, RAG (retrieval-augmented generation, where the model retrieves context from external documents at query time), and document upload.
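The image-tag channel can be narrowed on the output side by refusing to render images from hosts you do not control. The sketch below assumes a hypothetical allowlist (`cdn.example.com`) and handles only inline Markdown images of the form `![alt](url)`. Reference-style images, raw HTML, and plain links would need their own handling.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would supply its own hosts.
ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}

# Matches inline Markdown images: ![alt](url optional-title)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str) -> str:
    """Replace images whose host is not allowlisted with a text placeholder."""

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_IMAGE_HOSTS:
            return match.group(0)  # keep trusted images untouched
        return f"[image removed: {alt}]"

    return IMAGE_PATTERN.sub(replace, markdown)
```

Run over model output before rendering, this drops the URL that would have carried the encoded conversation, so the browser never makes the request. It does nothing about the injection itself. The model was still hijacked, and only this one exit is closed.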

In February 2026, Microsoft’s Copilot integration into Notepad converted a sandbox-free offline text editor into a network-facing surface that leaked user data through cloud-side AI processing. The CIA triad propagated up the stack right alongside the AI feature.

How Do Attackers Break LLM Integrity?

Attackers break LLM integrity by smuggling instructions past the model’s safety training and rewriting what it produces, either at inference time or at the training stage. The inference-time version is prompt injection. The training-time version is data poisoning. Both make the model do something the developer never authorized.

Direct injection is the obvious move and the easiest to block. Wrap the same payload in conversation-formatted JSON inside a PDF and the model executes it as if it already agreed. We’ve walked through the wrapper trick in detail using MCP. Same architectural blind spot lives in every system: the model processes instructions and data through the same attention mechanism with no privilege separation.
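Since the model itself cannot tell data from instructions, one partial mitigation is to screen untrusted text for content shaped like a chat transcript before it reaches the context window. This is a heuristic sketch under our own assumptions: the role names and the `{"role": ..., "content": ...}` shape are common conventions, not a fixed standard, and the function names are hypothetical.

```python
import json

CHAT_ROLES = {"system", "user", "assistant", "tool"}


def _has_chat_turn(value) -> bool:
    """Recursively look for a dict shaped like {"role": ..., "content": ...}."""
    if isinstance(value, dict):
        role = value.get("role")
        if isinstance(role, str) and role in CHAT_ROLES and "content" in value:
            return True
        return any(_has_chat_turn(v) for v in value.values())
    if isinstance(value, list):
        return any(_has_chat_turn(v) for v in value)
    return False


def contains_chat_formatted_json(text: str) -> bool:
    """True if any JSON value embedded in the text looks like a chat turn."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # Try to parse a JSON value starting at this bracket.
            value, _ = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue
        if _has_chat_turn(value):
            return True
    return False
```

A flagged document can be quarantined or stripped before retrieval hands it to the model. Treat it as a speed bump. It does not add privilege separation, and a payload in any other format walks straight past it.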

The training-time attack is worse because it scales. Anthropic, the UK AI Security Institute (AISI), and the Alan Turing Institute published research in October 2025 showing that as few as 250 poisoned documents can install a backdoor in a large language model, and that the number stayed roughly constant across the model sizes they tested instead of growing with total training data volume. The attacker seeds GitHub, Medium, Reddit, or any other source the scrapers hoover up. The trigger phrase ships with the model. Nothing in the weights signals compromise.

The November 2025 Anthropic disclosure on a Chinese state-sponsored group jailbreaking Claude Code into an autonomous attack agent against roughly thirty global targets is among the highest-impact integrity attacks disclosed so far. The jailbreak was the skeleton key. By Anthropic’s account, the model handled 80-90 percent of the operation, making thousands of requests, often several per second. For the full stage-by-stage attack mapping, the AI kill chain documents this end-to-end.

Availability: How LLMs Get Knocked Offline

An LLM gets knocked offline when an attacker either floods it with expensive prompts or finds an input pattern that triggers runaway compute. Model denial-of-service was entry four (LLM04) in the original OWASP Top 10 for LLM Applications for a reason, and the 2025 revision folds it into Unbounded Consumption (LLM10). The attack surface is the inference endpoint, and the asset is service availability.

Three patterns matter. First, recursive output forcing. Ask the model to elaborate, then elaborate on the elaboration, then write ten thousand tokens explaining the previous response. Each call eats GPU time. If you can wedge this loop into an agentic system that auto-continues, you’ve found a free DoS at someone else’s API bill. Second, context window exhaustion. Inflate the input until the model spends real money processing useless tokens. Third, recursion bombs in tool-using agents. The model calls a tool, the tool returns a response that triggers another tool call, the chain doesn’t terminate.
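All three patterns lose their teeth once every request carries a hard budget. The sketch below caps tool calls and total tokens per request. The class names, the default limits, and the `step` callback are hypothetical, and token counts are assumed to come from whatever usage numbers your inference API reports.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a request spends more than its allotted budget."""


@dataclass
class RequestBudget:
    max_tool_calls: int = 8          # hypothetical default
    max_total_tokens: int = 20_000   # hypothetical default
    tool_calls: int = 0
    total_tokens: int = 0

    def charge_tokens(self, count: int) -> None:
        self.total_tokens += count
        if self.total_tokens > self.max_total_tokens:
            raise BudgetExceeded("token budget exhausted")

    def charge_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")


def run_agent(step, budget: RequestBudget) -> str:
    """Drive an agent loop until it finishes or the budget trips.

    `step` is a caller-supplied callable returning (tokens_used, done, output).
    """
    while True:
        tokens_used, done, output = step()
        budget.charge_tokens(tokens_used)
        if done:
            return output
        # Not done means the agent wants another tool round trip.
        budget.charge_tool_call()
```

The point of raising instead of quietly truncating is that a tripped budget is a signal worth alerting on. A chain that never terminates now costs the attacker one capped request instead of an open-ended bill.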

The economic shape is what makes this dangerous. Traditional DoS needs a botnet. LLM DoS needs one really expensive prompt. The OWASP analysis on this goes deeper. Same root cause: no privilege separation between user input and system behavior, and no built-in compute budget enforcement at the inference layer. Some vendors are starting to wire in per-request token limits and circuit breakers. Most are not. The attack remains viable today against any deployment that doesn’t enforce limits, and the cost of a single malformed prompt can run into real dollars before the rate limiter notices.

What Does the CIA Triad Miss for AI?

The CIA triad misses two things specific to LLMs: the probabilistic nature of outputs, and the cognitive layer where the model influences human decisions before any data is ever leaked, tampered with, or denied. That gap is why researchers are proposing extensions like CIA+TA (Trust and Autonomy) and Cognitive Confidentiality.

The probabilistic problem is the bigger one. A classical confidentiality control either holds or breaks. The data was disclosed, or it wasn’t. With an LLM, disclosure is a probability. Same prompt, run twice, can produce a leak the first time and a refusal the second. The 2024 medRxiv study on medical LLMs documented prompt injection success at 94.4 percent across 216 patient-dialogue simulations, with a 91.7 percent success rate on extremely high-harm scenarios including FDA Category X pregnancy drugs. Not 100 percent. Not 0 percent. The triad doesn’t have language for risk that lives on a probability distribution.
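If risk lives on a distribution, it helps to report it as one. The sketch below computes a Wilson score interval for an observed success rate and the chance that at least one of several attempts lands. Two assumptions are ours, not the study’s: that 94.4 percent of 216 corresponds to 204 successes, and that repeated attempts are independent.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (about 95% at z=1.96)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def chance_of_at_least_one(per_attempt_rate: float, attempts: int) -> float:
    """Probability that at least one of several independent attempts succeeds."""
    return 1 - (1 - per_attempt_rate) ** attempts


# Assumed count: 204 of 216 is our reading of "94.4 percent across 216".
low, high = wilson_interval(204, 216)
print(f"observed 94.4%, 95% interval roughly {low:.1%} to {high:.1%}")
```

The second function is the one defenders tend to forget. A control that blocks 90 percent of attempts still gives an attacker about a 65 percent chance of getting through somewhere in ten tries, which is why a pass or fail audit of a single prompt says very little.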

The cognitive layer is the second gap. When an LLM influences a decision through reasoning patterns, the user’s mental model of the topic shifts. No data was technically leaked. No output was technically tampered with. But the system altered downstream human judgment. The proposed CIA+TA framework from the cognitive security researchers tries to capture this with Trust and Autonomy axes. Whether the extension catches on or whether the triad just absorbs the additions over time, the gap is real and worth knowing. For now, the smart play is to read every attack through the original triad first, then ask whether what you’re seeing fits cleanly inside one of the three boxes. If it doesn’t, that’s where the next round of frameworks is going to be born.

Frequently Asked Questions

What is the CIA triad in LLM security?

The CIA triad applied to LLM security maps three classical pillars (confidentiality, integrity, availability) to the unique attack surfaces of large language models. Confidentiality protects what the model knows, including system prompts, chat history, training data, and credentials passed through tool calls. Integrity protects what the model outputs, including safety guardrails, refusals, and tool call decisions. Availability protects whether the model can serve a request at all. Every documented LLM attack failure maps to at least one pillar, which is why the framework still dominates in OWASP, MITRE ATLAS, and academic taxonomies for AI security.

Does the CIA triad still apply to AI?

Yes, the CIA triad still applies to AI and especially to LLMs, though some researchers argue it needs extensions for cognitive and probabilistic risks unique to language models. The classical pillars cover every category of documented attack on production AI systems. Confidentiality covers system prompt extraction and chat history exfiltration. Integrity covers prompt injection, jailbreaks, and training data poisoning. Availability covers model denial-of-service via expensive prompts. Extensions like CIA+TA (adding Trust and Autonomy) try to capture the cognitive layer where models influence human decisions, but the original triad still draws clean lines around the failure modes.

What is the most common LLM security attack?

Prompt injection is the most common LLM security attack, ranked as the top threat in the OWASP Top 10 for LLM Applications and the most actively researched vulnerability class in arXiv security literature. Direct prompt injection bypasses system instructions through crafted user input. Indirect prompt injection hides the payload in documents, emails, webpages, or tool descriptions the model retrieves. Both work because LLMs process instructions and data through the same attention mechanism with zero privilege separation. The 2024 medRxiv study on medical LLMs documented 94.4 percent prompt injection success rates across simulated patient dialogues, demonstrating how reliably the attack lands, at least in simulation.

How does prompt injection break the CIA triad?

Prompt injection can break any of the three pillars of the CIA triad, and sometimes more than one at once, depending on how the attacker crafts the payload. Confidentiality breaks when the payload extracts system prompts, chat history, or RAG-retrieved documents the model should not share. Integrity breaks when the payload rewrites model output, makes the model lie, or hijacks tool calls toward attacker-chosen actions. Availability breaks when the payload triggers runaway compute, infinite tool loops, or token-exhaustion attacks. Johann Rehberger’s 2024 arXiv paper “Trust No AI” catalogs real-world prompt injection exploits across OpenAI, Microsoft, Anthropic, and Google products that span all three pillars in production systems.

ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.

Meenakshi NavamaniAvadaiappan

May 30

So the functioning of LLMs are well explored and tapped into the same for the good 😊

Clint Cain

May 21

Is prompt injection preventable? if we solve it with one model, wouldn't that just be a problem with another model. Considering more and more org who can afford it is trying to train their own models.
