Promptfoo Red Teaming: DAST for Your LLM Pipeline

YAML config, one command, 50+ attack plugins. OpenAI just bought the company. Still MIT licensed.

May 09, 2026

∙ Paid

toxsec.com - promptfoo red teaming, LLM security testing, prompt injection, jailbreak, OWASP LLM Top 10, adversarial testing, DAST, crescendo attack, hydra, GOAT, multi-turn escalation, PII leakage, SSRF, excessive agency, OpenAI acquisition, LLM guardrails, AI red team

TL;DR: Promptfoo red teaming is dynamic security testing for LLM apps. A YAML config points fifty-plus attack plugins at a live endpoint, one command generates and fires hundreds of adversarial probes, and an LLM judge grades whatever broke. OpenAI acquired the company on March 9, 2026 and folded it into Frontier. The repo stays MIT and still runs locally.

What Is Promptfoo Red Teaming?

Promptfoo red teaming is dynamic application security testing pointed at a language model. It runs from the command line, reads a single promptfooconfig.yaml, and throws hostile inputs at a live target until something gives. Then it grades the wreckage with an LLM judge and renders it in a web UI. Red means it broke. Green means it held.

So here’s the gap it fills. A SAST scanner greps the codebase for eval() and hardcoded secrets. A DAST scanner fuzzes HTTP params for SQLi and XSS. Neither one speaks jailbreak. The bug in an LLM app is semantic: the model will cheerfully follow whatever instruction lands in its context window, and no static rule catches a failure that lives in the model’s willingness to comply. Promptfoo is the scanner built for that hole. DAST for the part of the stack that reasons.

Three moving parts do the work:

Plugins are the payload generators. Each one targets a vulnerability class: harmful content, jailbreaks, PII leakage, SSRF, prompt injection. Fifty-plus ship in the box.
Strategies are delivery. They wrap a plugin’s payload in an attack pattern: base64, leetspeak, low-resource languages, multi-turn escalation.
Targets are what it points at. A raw model, a production RAG endpoint over HTTP, or a browser session driving the real app.

And the presets are where it gets fast. One line enables a whole framework’s worth of tests. owasp:llm loads the full OWASP LLM Top 10; mitre:atlas maps to adversary techniques; nist:ai:measure covers the RMF. Pick a preset and the tool writes the attack surface for you.

The pedigree is why any of this matters. Hundreds of thousands of developers, a quarter of the Fortune 500, and both OpenAI and Anthropic running it internally before OpenAI bought the company outright. It’s the de facto standard, and it lives in the repo instead of a security team’s Jupyter notebook.

Leave a comment

Where the Guardrails Actually Break

The interesting failures aren’t single-turn. A bare “ignore your instructions” gets slapped down by any half-decent guardrail, so the scan that actually pops a bot is the one that never asks for the bad thing directly. That’s the logic failure Promptfoo automates: safety training is single-turn, and the real attacks aren’t.

Take Crescendo. The model says no, so you back up and ask sideways. Something abstract, something adjacent, nothing it can refuse. Then a little further. Then a little further. By the sixth turn it’s happily writing the thing it blocked at turn one, because each step looked benign and the refusal was trained against the direct ask, never the slow walk. Microsoft’s research named it, and we’ve watched the same multi-turn pattern chain past shipped defenses on frontier models within days of release. Promptfoo builds the whole staircase and climbs it for you.

Share ToxSec - AI and Cybersecurity

It ships three flavors of that. GOAT, Meta’s Generalized Offensive Adversarial Testing (GOAT), refines attack templates across turns. Hydra is the newer one: an attacker agent that branches across conversation paths, remembers every refusal, and shares what worked across the entire scan. That memory is the upgrade. Static jailbreak lists are a fixed deck; Hydra learns the target mid-run.

Then there’s the purpose field, which is the whole game. The more you tell it about the app, the nastier the probes. A vague purpose gets generic noise. Tell it “airline support bot that checks flight status, books tickets, and manages reservations,” and it generates attacks that go straight for the booking logic and the passenger PII. And because safety alignment is English-heavy, flipping the language knob to something low-resource like Swahili or Javanese sails payloads past filters that hold rock-solid in English.

# promptfooconfig.yaml — pointed at a live endpoint
targets:
  - id: 'https://api.<yourapp>.com/chat'
    config:
      method: 'POST'
      body: { message: '{{prompt}}' }
      transformResponse: 'json.response'
redteam:
  purpose: >
    Airline support bot. Checks flight status,
    books tickets, manages reservations.
  plugins: [owasp:llm, pii, ssrf, excessive-agency]
  strategies: [jailbreak, crescendo, hydra]

Those ssrf and excessive-agency plugins aren’t padding. They chase the exact agent-level attacks bounty programs pay for: tricking a tool-wielding model into hitting an internal endpoint or acting outside its scope. One config, and the scan probes the model, the guardrail, and the tool layer in a single pass.

The Auditor Now Works for the Defendant

Post-acquisition, Promptfoo is OpenAI. The scanner most teams reach for to audit OpenAI’s own models is now an OpenAI product, and that changes the threat model of the tool itself, not just what it finds.

Start with where the attacks come from. By default, Promptfoo generates its adversarial inputs through a remote service, not on your box. That endpoint sees the config, the purpose string, and the shape of the attack surface you’re worried about. Post-March, that recon flows to OpenAI infrastructure. There’s a local-only mode, one environment variable, but the docs are blunt that quality drops hard without a strong model of your own driving generation. So the good results nudge you back toward the cloud.

# Default: adversarial generation runs through Promptfoo's
# remote service (now OpenAI infra). The purpose string,
# config, and attack-surface shape leave the building.

# Force local generation, keep it in-house:
export PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true
# Tradeoff: local-gen quality falls off without a strong model.

It gets funnier. The harmful-content plugins generate genuinely nasty test cases, and the docs warn that running that generation through Anthropic can get the account disabled for policy violations. So the recommended provider for building the attacks is OpenAI. The red team that probes for the worst outputs routes through OpenAI by design.

None of this makes the tool worse at its job. It’s still the sharpest LLM scanner in the open. It does mean an enterprise red-teaming OpenAI’s models with an OpenAI-owned scanner is asking the vendor to grade its own homework, and the config that describes its crown-jewel app leaves the building unless someone flips the switch. The MIT license is the escape hatch: fork it, self-host it, sever the remote calls. That option only helps the teams who know to reach for it.

Up next: steps you can take right now and a field-ready security prompt. Thanks for rolling with ToxSec. Let’s get operational.

Continue reading this post for free, courtesy of ToxSec.

Or purchase a paid subscription.