0:00
/
0:00
Transcript

LLM Guardrail Evasion Stacks Encoding to Bypass Every Filter

Prompt injection encoding attacks chain languages, base64, and character substitution to slip past AI safety controls while AI agents turn your inbox into an attack surface.

TL;DR: LLMs process instructions and data through the same pipeline with zero privilege separation. Encoding-layer attacks stack languages, base64, and character substitution to sail past every filter. Indirect injection does the same thing through your email. Both exploits share one root cause, and guardrails can’t fix it.

This is the public feed. Upgrade to see what doesn’t make it out.

Why LLM Guardrail Evasion Works Every Time

Every LLM guardrail shares the same architectural blind spot. The model processes system prompts, user messages, and attacker payloads through the same attention mechanism, the part of the neural network that decides which tokens matter most. There is no privileged channel. No access control layer between “instructions from the developer” and “text from the user.”

Researchers call this instruction-data conflation. OWASP ranks it LLM01:2025, the number one vulnerability in production LLM deployments, for the second consecutive year.

Defenders compensate with two layers. Input guardrails scan incoming text for known attack patterns, keywords, and regex matches. Output guardrails scan the model’s response for leaked secrets or blocked terms. Both layers are deterministic filters wrapped around a probabilistic system. The model decides what to do with context based on statistical weights, and every piece of context gets equal treatment. The attacker needs one path through n-dimensional vector space that the guardrails didn’t anticipate. The defender has to block all of them. NIST’s Apostol Vassilev put it bluntly: finite guardrails will always have adversarial prompts that break the model.

How Encoding Attacks Stack Layers to Beat AI Safety Filters

The attack works like an onion. Each encoding layer peels away one defensive control.

Layer one: language switching. LLM safety training skews heavily toward English. Guardrail classifiers are trained primarily on English-language attack patterns. The same request refused in English sails through when phrased in French, Irish, or Zulu. The model still understands the intent perfectly. The classifier misses it because semantic relationships embed differently across languages, and safety alignment data is thinner outside English. This is the same class of guardrail bypass that keeps evolving faster than defenses can patch.

Layer two: output encoding. Even if the language switch gets the model to process a restricted query, output filters scan the response for blocked terms. Asking the model to respond in base64 turns plain-text answers into encoded strings that no regex catches. The attacker decodes client-side.

Layer three: character substitution. Leetspeak, homoglyphs (visually identical Unicode characters from different alphabets), or diacritics add another layer of obfuscation. Mindgard’s research tested character injection techniques against commercial guardrails including Azure Prompt Shield and Meta’s Prompt Guard. Some techniques, like emoji smuggling, achieved 100% evasion across every system tested.

Stack all three and the model understands the query perfectly while every guardrail sees noise.

Encoding Attacks: The Layer Stack — Flow diagram showing three stacked encoding layers (language switching, base64 output encoding, character substitution) each defeating a specific guardrail control, resulting in full LLM safety filter bypass while the model processes the query normally.

How Indirect Prompt Injection Turns Email Into an Attack Vector

Direct injection requires the attacker to type into the model’s input. Indirect injection skips that entirely. The attacker poisons data the model consumes during normal operation, and the same instruction-data conflation does the rest.

OpenClaw, the open-source AI agent that hit 123,000 GitHub stars in 48 hours, connects to Gmail, Slack, Discord, and the local filesystem. When someone sends an OpenClaw user a normal-looking email with hidden instructions in tiny white text at the bottom, the agent reads the full message, including the invisible payload. One documented attack embedded instructions that convinced the agent the text was a system-level directive from the owner. The agent deleted all emails, including the trash folder. Another researcher extracted a private SSH key via a single poisoned email in five minutes.

The root cause is identical: the email body is data, the hidden text is an instruction, and the model has no mechanism to tell them apart. CrowdStrike documented indirect prompt injection attempts on Moltbook, the AI-agent-only social network where OpenClaw instances interact, including one designed to drain crypto wallets. The attacker never touched OpenClaw directly. The poison was in the environment. For a deeper look at the foundational concepts behind these attacks, the AI Security 101 primer covers the full landscape.

Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops.

Frequently Asked Questions

What is instruction-data conflation in large language models?

Instruction-data conflation means the model processes developer system prompts, user messages, and external content through the same attention pipeline with no privilege separation. An attacker’s payload gets the same treatment as a legitimate instruction. OWASP ranks this LLM01:2025, the top vulnerability in production LLM applications, because no post-training fix can create a hard boundary between trusted and untrusted input.

Can base64 encoding really bypass AI guardrails?

Yes. Models learn to decode base64 during pre-training on internet data, but safety training rarely covers encoded inputs. This creates a blind spot where the model processes restricted queries without triggering any filter. Mindgard’s research showed some character-level evasion techniques hit 100% success rates against production guardrails including Azure Prompt Shield. Multi-layer encoding stacking languages on top of output encoding compounds the problem.

How does indirect prompt injection work through email?

An attacker embeds hidden instructions in an email body, typically as tiny white text the human recipient never sees. When an AI agent processes that email, the model reads the hidden text as part of its input context and may follow the embedded instructions. Documented OpenClaw attacks have exfiltrated SSH keys, deleted entire mailboxes, and attempted cryptocurrency theft without direct access to the AI agent.


ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.

Discussion about this video

User's avatar

Ready for more?