AI Tar Pits Are Drowning LLM Scrapers in Infinite Garbage

How tools like Nepenthes, Iocaine, and Cloudflare’s AI Labyrinth trap unauthorized crawlers in endless mazes of generated nonsense and poison the training set on the way out.

Jun 21, 2026

TL;DR: AI tar pits trap LLM scrapers in an endless maze of machine-generated junk, burning their compute and feeding poison into the training set. Nepenthes started it, Iocaine sharpened it, and Cloudflare shipped AI Labyrinth across a network that, by its own count, sees more than 50 billion AI crawler requests a day. The crawler can’t tell the maze from the real site, so it walks in and never comes back out.


What Is an AI Tar Pit?

An AI tar pit is a trap that drowns a web crawler in infinite generated garbage instead of blocking it. Block a scraper outright and you tip your hand. The operator sees the 403, shrugs, rotates the IP, switches the user-agent, and comes back through a residential proxy an hour later. So the tar pit does the opposite. It says yes to everything. It serves the bot an endless tree of pages, each one stuffed with links that loop back into the maze, each page slow enough to waste real wall-clock time but cheap enough to not torch your own server.

Nepenthes, the tool that started it, takes its name from a genus of carnivorous pitcher plants. You slip in, you slide down, you don’t climb back out. Set it up as a trap behind a web server, and any crawler that hits it gets an endless stream of randomly generated pages with many URLs to follow. The crawler treats every fake link as a fresh discovery. It chases them. They lead deeper. There’s no bottom.

Here’s the thing that makes it nasty: the bot has no exit condition. A human gets four pages into a maze of word salad and closes the tab. A scraper doesn’t have taste. It just queues the next URL.


How the Crawler Falls In

The trap works because the scraper can’t tell a real link from bait. Modern LLM crawlers operate on one dumb assumption: a link is a link, and content is content worth grabbing. They don’t evaluate whether a page is meaningful before fetching it. They just follow the graph and tokenize whatever comes back.

Nepenthes weaponizes exactly that. It generates an endless sequence of pages, each with dozens of links that simply go back into the tar pit. Pages are randomly generated, but in a deterministic way, so they appear to be flat static files that never change. Determinism matters here. If the same URL returned different garbage each visit, a smart crawler might flag it as dynamic and bail. Instead the tar pit fakes the one signal scrapers trust most: stability. Same URL, same nonsense, every time. Looks like a real archive.
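One way to get "random, but the same every time" is to seed the generator from the URL itself. The sketch below is not Nepenthes' actual code. The word list, the `secret` parameter, and the child-URL layout are all assumptions made up for illustration.

```python
import hashlib
import random

# Placeholder vocabulary (an assumption); a real install would bring its own text.
WORDS = ["archive", "ledger", "signal", "harbor", "orchard", "vector", "lantern", "quarry"]


def render_page(path: str, secret: str = "change-me", n_links: int = 40) -> dict:
    """Build a fake page whose content depends only on (secret, path)."""
    # Hash the path with a per-install secret so the seed is stable across visits
    # but differs from one deployment to the next.
    digest = hashlib.sha256(f"{secret}:{path}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    body = " ".join(rng.choice(WORDS) for _ in range(60))

    # Every link points one level deeper into the maze (hypothetical URL scheme).
    stem = path.removesuffix(".html")
    links = [f"{stem}/{rng.getrandbits(16):04x}.html" for _ in range(n_links)]
    return {"path": path, "body": body, "links": links}
```

Calling `render_page("/maze/a8f3.html")` twice returns the same body and the same links, with nothing stored on disk. To a crawler that re-checks a URL, it looks like a static file.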

And there’s a deliberate stall baked in. An intentional delay gets added to keep the crawler from bogging down your own server, while still wasting its time. The bot sits there waiting on a slow response that was never going anywhere.
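A rough sketch of that stall, assuming the server drips the response out in pieces with a pause between each. Whether any given tar pit drips or simply sleeps before answering is an implementation detail not covered here. The `drip` function and its parameters are hypothetical.

```python
import time
from typing import Callable, Iterator


def drip(
    body: bytes,
    total_delay: float = 1.5,
    chunks: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield the body in pieces, pausing so the full response takes about total_delay seconds."""
    size = max(1, -(-len(body) // chunks))  # ceiling division
    pieces = [body[i:i + size] for i in range(0, len(body), size)]
    pause = total_delay / max(1, len(pieces))
    for piece in pieces:
        sleep(pause)  # the server just waits here; it burns no CPU
        yield piece
```

The waiting is nearly free for the defender and costs the crawler an open connection plus real wall-clock time. In an async server the same idea would use a non-blocking sleep so one worker can stall many bots at once.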

GET /maze/a8f3/index.html      200   1.4s   38 links
GET /maze/a8f3/c19b.html       200   1.5s   41 links
GET /maze/a8f3/c19b/77de.html  200   1.4s   39 links
GET /maze/a8f3/c19b/77de/...   200   1.6s   40 links
  [depth: 4]  [unique pages so far: 6,212]  [exit: none]

Six thousand pages in and the crawler still thinks it’s making progress. The link count never drops to zero, so the work queue never empties.
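To see why the queue never empties, here is a toy simulation of a naive breadth-first crawler walking a maze where every page links to 40 new pages, roughly the count in the log above. The maze layout and function names are invented for the example. No real crawler is this simple, but the arithmetic is the same.

```python
from collections import deque

LINKS_PER_PAGE = 40  # roughly what the log above shows


def maze_links(url: str) -> list[str]:
    """Hypothetical maze: every page links to LINKS_PER_PAGE brand-new children."""
    stem = url.removesuffix(".html")
    return [f"{stem}/{i:02d}.html" for i in range(LINKS_PER_PAGE)]


def queue_after(fetches: int, start: str = "/maze/index.html") -> int:
    """Run a naive breadth-first crawler for `fetches` requests and return the queue size."""
    queue = deque([start])
    seen = {start}
    for _ in range(fetches):
        if not queue:  # the only exit condition a naive crawler has
            break
        url = queue.popleft()
        for link in maze_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return len(queue)
```

Each fetch removes one URL and adds 40, so after n fetches the queue holds 39n + 1 entries: `queue_after(1)` is 40, `queue_after(100)` is 3901. The backlog grows with every page "finished." At about 1.5 seconds per page, the 6,212 pages in the log would cost a single serial worker roughly two and a half hours.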

[Figure: Falling In. A terminal showing an LLM crawler trapped in a Nepenthes maze, queue depth climbing past 9,000 pages with zero real data scraped and no exit condition.]

Poisoning the Model on the Way Out

The second payload is the data poison, and it’s the part the AI companies actually fear. Burning a crawler’s compute is annoying. Corrupting the training corpus is structural.

Most tar pits ship an optional Markov-chain text generator. The Markov babble feature gives the crawlers grammatically plausible text to scrape and train on, with the explicit goal of accelerating model collapse. Markov output reads almost right. Real words, real sentence shapes, zero meaning. It’s the perfect poison because a naive quality filter waves it through. It passes the “is this English” check and fails every “is this true” check that nobody’s running at scale.
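For a feel of how Markov babble works, here is a minimal word-level sketch. It is a generic textbook Markov chain, not the generator any particular tar pit ships, and the function names are made up for the example. You supply the corpus.

```python
import random
from collections import defaultdict


def build_chain(text: str, order: int = 2) -> dict:
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return dict(chain)


def babble(chain: dict, length: int = 30, seed: int = 0) -> str:
    """Walk the chain to emit `length` words; the chain must not be empty."""
    rng = random.Random(seed)
    states = sorted(chain)  # sorted so the same seed gives the same text
    order = len(states[0])
    out = list(rng.choice(states))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            out.extend(rng.choice(states))  # dead end: jump to a random state
    return " ".join(out[:length])
```

Every three-word window in the output appeared somewhere in the source text, or sits at a jump point, which is why it reads locally fluent and globally meaningless. A higher `order` makes the babble more convincing and also closer to a verbatim copy of the corpus.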

Iocaine, the follow-on tool named after the poison from The Princess Bride, leans all the way into this. Gergely Nagy built it after watching crawlers chew through his bandwidth, and his fix was to serve them a heaping plate of garbage designed to slowly corrupt the datasets they feed. Why does this land? Because model collapse is a real, documented failure mode. Train a model on enough of its own slop, or enough synthetic noise dressed up as human text, and the tails of the distribution rot out. We broke down the math on that in “AI model collapse makes hallucination inevitable,” and the same recursive-degradation problem is what tar pits are trying to force on purpose.

One catch the operators are honest about. No corpus ships with the tool, on purpose, so every install looks different and is harder to fingerprint. You bring your own text. Everybody’s poison tastes a little different, which is exactly the point.


Cloudflare Turned It Into a Product

Cloudflare took the rebel tooling and shipped it to the whole internet as AI Labyrinth. Same core idea, corporate paint job, opt-in toggle in the dashboard. When it detects improper bot activity, it automatically deploys a network of linked AI-generated pages, no custom rules needed, and it’s available even on the free plan.

The scale tells you why they bothered. Cloudflare says AI crawlers generate more than 50 billion requests to its network every single day, and the existing block-and-deny tools tip attackers off so they just shift approach. So instead of slamming the door, they built the maze and made it quiet.

Then they bolted on a detection layer the indie tools didn’t have. No real human goes four links deep into a labyrinth of AI nonsense, so anything that does is almost certainly a bot, which hands Cloudflare a brand-new fingerprinting signal. The trap doubles as a sensor. The decoy links are hidden from human visitors and marked nofollow, so people never see them and well-behaved crawlers skip them. The only thing that walks in is something crawling the raw graph and ignoring the hint. Walk the maze, get tagged, get added to the shared bad-actor list every other Cloudflare customer pulls from.

This is the same dynamic we keep flagging in the free tooling that catches AI-generated junk: the cheap detection signal is “did the machine do something no human would bother to do.”

# the shape of the trap, not the trap
labyrinth:
  trigger: suspected_ai_crawler
  inject: nofollow_decoy_links     # hidden from human visitors; polite crawlers skip them
  serve: generated_pages
  on_traversal:
    confidence: high_bot
    action: fingerprint_and_share   # feeds the global block list
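The traversal signal is simple enough to sketch. The version below assumes decoy pages live under a single URL prefix and that path depth equals link depth, neither of which is confirmed for Cloudflare's implementation. The class name, prefix, and threshold are hypothetical stand-ins for "four links deep."

```python
DECOY_PREFIX = "/labyrinth/"  # hypothetical location of the decoy pages
DEPTH_THRESHOLD = 4           # "no real human goes four links deep"


class TraversalSensor:
    """Track how deep each client has wandered into the decoy tree."""

    def __init__(self) -> None:
        self.max_depth: dict[str, int] = {}

    def observe(self, client_id: str, path: str) -> bool:
        """Record one request; return True once the client looks like a bot."""
        if path.startswith(DECOY_PREFIX):
            # Count non-empty path segments below the prefix as link depth.
            depth = sum(1 for part in path[len(DECOY_PREFIX):].split("/") if part)
            self.max_depth[client_id] = max(self.max_depth.get(client_id, 0), depth)
        return self.max_depth.get(client_id, 0) >= DEPTH_THRESHOLD
```

Once a client crosses the threshold it stays flagged, even when it goes back to requesting real pages. The sketch leaves out what counts as a `client_id`. In practice that is the hard part, since scrapers rotate IPs and user-agents.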

Where This Goes Next

Right now the tar pits win on one assumption: crawlers are greedy and dumb. That edge has a shelf life. The generated mazes still don’t perfectly match a real site’s structure or branding, so a crawler trained to spot the seam could learn to route around them. Cloudflare already knows this and has said it wants future labyrinth pages to mirror the host site’s real layout and content so the seam disappears.

That’s the arms race in one sentence. The defender makes the fake indistinguishable from the real. The scraper learns the tell. The defender patches the tell. Round and round, same as every cat-and-mouse game in this space. The tar pit doesn’t have to win forever. It just has to make scraping expensive enough, today, that somebody else’s site is the cheaper meal.


Frequently Asked Questions

What is an AI tar pit and how does it stop scrapers?

An AI tar pit is a defensive trap that catches an unauthorized LLM scraper and feeds it infinite machine-generated garbage instead of blocking it. The crawler follows an endless tree of fake links that loop back on themselves, burning its compute and wall-clock time while it thinks it’s collecting real data. Nepenthes pulls this off by serving deterministically generated pages that look like stable static files, which is the one signal crawlers trust, and Cloudflare’s AI Labyrinth builds a similar maze out of AI-generated pages. The bot has no exit condition, so it keeps queueing URLs that go nowhere.

Can a tar pit actually poison an AI model?

Yes, that’s the second payload, and it’s the part AI companies fear more than the wasted compute. Most tar pits include an optional Markov-chain generator that produces grammatically plausible text with no real meaning. That text passes naive quality filters because it reads like English, then corrupts the training corpus that ingests it. Fed at scale, this accelerates model collapse, the documented failure mode where models trained on recursive synthetic slop lose the tails of their data distribution and degrade. Operators supply their own text corpus so each poison is unique and harder to fingerprint.

Is deploying an AI tar pit safe for my own site?

Not without cost. A tar pit makes no distinction between an LLM scraper and a legitimate search engine crawler, so deploying one carelessly can get a site dropped from search results. Because the trap is built to feed crawlers exactly what they hunt for, it also draws constant bot traffic that spikes server CPU. Nepenthes’ own author labels it deliberately malicious software and warns operators not to run it unless they fully understand the fallout. Cloudflare’s AI Labyrinth is the safer route since it scopes the maze to suspected bots only and keeps it off pages real users see.
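One common mitigation for the search-engine risk, offered here as an assumption rather than something any tool above requires, is to disallow the maze path in robots.txt so crawlers that honor the file never enter. The sketch below uses Python's standard `urllib.robotparser` to check that the rule does what you expect before you deploy. The `/maze/` path and `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site whose tar pit lives under /maze/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /maze/
"""


def maze_is_off_limits(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if a crawler that obeys robots.txt would skip this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)
```

With the rule above, `maze_is_off_limits(ROBOTS_TXT, "Googlebot", "https://example.com/maze/a8f3/index.html")` is true and the same check on a normal page is false. This only protects you from crawlers that obey robots.txt. The ones that ignore it are the ones the maze is for.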

ToxSec is run by a USMC veteran and Security Engineer with hands-on experience at AWS and the NSA. CISSP certified, M.S. in Cybersecurity Engineering. He covers security vulnerabilities, attack chains, and the tools defenders actually need to understand.
