One Magic String from Anthropic Silences Claude (RAG DoS Exposed)
A documented QA test string becomes a sticky DoS primitive through prompt injection, RAG poisoning, and context persistence
TL;DR: Anthropic ships a test string that makes Claude stop responding on command. It’s a QA feature. It’s also a one-line DoS payload. Inject it through RAG, tool output, or shared chat history and you get a sticky refusal loop that persists until a human manually purges the context. The fix is input sanitization.
This is the public feed. Upgrade to see what doesn’t make it out.
One Documented String Bricks the Whole Model
Drop Anthropic’s test string into Claude’s prompt context. System prompt, user message, retrieved document, tool output. Claude stops talking. stop_reason: "refusal". No error. No explanation. Just silence.
Anthropic built it so developers could test refusal handling in streaming responses. Starting with Claude 4, classifier intervention mid-stream returns a refusal stop reason instead of completing the message. The magic string triggers that behavior deterministically without crafting an actual policy violation.
Works as designed. That’s the whole problem.
A predictable failure mode is a feature when you control the inputs. When your input surface includes untrusted data, it’s a primitive. LLM input surfaces are almost entirely untrusted data.
One String Kills the Pipeline
Prompt injection 101. Attacker gets payload into context, attacker wins. The magic string just changes the win condition from “hijack behavior” to “kill output.”
User input fields are the gimme. Any form that concatenates into a system prompt: support tickets, profile bios, usernames.
But the real damage scales through RAG. Poison a document in the knowledge base and queries that retrieve it flatline. The embedding model doesn’t know it’s a payload. The retriever doesn’t care.
Tool outputs hit harder. Claude Code reads files, parses logs, processes API responses. Plant the string in a config file, a stack trace, a PR description. The code review bot eats it and dies. Anything downstream in the chain goes with it.
Then there’s multi-user chat. Shared conversation history where one participant poisons what everyone else sees. One message in the history, every future turn is bricked. The attacker doesn’t even need to be present anymore.
The Refusal Loop That Won’t Die
Anthropic’s own docs say it: when you hit a refusal, reset the context. Don’t retry with the same conversation history. Good advice. Also the exact reason this DoS sticks.
Someone drops the string into a RAG document. Query retrieves it. Refusal. App resets context and retries. Retriever pulls the same document. Refusal. Loop runs until a human finds the poisoned doc and manually yanks it.
Same pattern with conversation history. Support agent asks Claude to summarize a ticket thread containing the string. Refusal. Retry. Same history, same string, same wall. Every future turn in that conversation is dead and the agent has no idea why.
# Retry logic feeds the loop
for attempt in range(MAX_RETRIES):
response = client.messages.create(
messages=conversation_history + [new_message]
)
# stop_reason: "refusal", every single time
The poisoned context lives in storage and replays on every request. A single injection outlasts the attacker’s session. Sticky DoS from a documented feature. Zero exploit development, zero payload tuning.
The mitigation is straightforward: strip the string from all untrusted content before it hits the prompt. But that requires knowing the string exists, knowing it’s a threat, and having sanitization on every input surface.
When QA Tools Become Weapons
Automated pipelines halt the moment the model won’t respond. Triage bots, code review, ticket routing, compliance checks. All dead.
Inject the string into a PR description and the review bot chokes. Inject it into a support ticket and the routing system fails silently. Most monitoring won’t distinguish “model refused” from “model is slow” because the pipeline doesn’t crash. It just produces nothing.
Multi-tenant systems become ripe for selective disruption. Compliance bot about to flag your transaction? Force a refusal. Bot fails, you proceed. System logged a refusal, not an error. Nobody gets an alert.
The string doubles as a vendor fingerprint too. Inject it into an unknown endpoint, check for the refusal signature, and you’ve confirmed Claude on the backend. Useful reconnaissance for targeted work against infrastructure you can’t directly inspect.
Your QA Docs Are Someone Else’s Exploit Notes
Every test hook, every debug flag, every deterministic behavior is a weapon if an attacker can reach it through content you didn’t sanitize. The fix is input filtering on every surface that touches prompt context. The gap is that nobody reads QA docs looking for attack surface.
Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops.
Frequently Asked Questions
What is the Anthropic test string that bricks Claude?
Anthropic built a specific test string as a QA tool to simulate refusal behavior in Claude 4 and later models. When this string enters the prompt context—whether through system prompt, user message, RAG document, tool output, or shared chat history—Claude instantly stops responding with stop_reason: “refusal”. What was designed for developers to test streaming refusals has become a deterministic one-line denial-of-service payload.
How can attackers use the Anthropic refusal string for DoS attacks via RAG and tools?
Attackers poison RAG knowledge bases, tool outputs (logs, configs, PR descriptions, stack traces), support tickets, or shared conversation histories with the string. The embedding model and retriever treat it like normal content, so every query that pulls it—or every retry that replays the history—triggers a refusal. This creates a sticky DoS loop that survives until a human manually removes the poisoned data from storage.
How do you fix or prevent the Claude test string vulnerability?
The only reliable defense is aggressive input sanitization on every surface that touches the prompt context: strip the Anthropic test string from user inputs, retrieved documents, tool responses, file uploads, and conversation history before it ever reaches Claude. Without filtering at every layer, automated pipelines (code review bots, triage agents, compliance checks) remain wide open to this trivial, zero-effort disruption.
ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.





As always, feel free to AMA!
As usual a brilliant post, bringing to light something I had never considered but which now seems completely obvious. Will update my protocols accordingly! Do you think the way forward is for Anthropic (or any other AI company) to ship products with unique kill switches? Maybe ones that update based on the unique system state or local files of the user? Would be relatively easy to implement I imagine.