About this tool
What this tester answers
The Prompt Injection Tester runs a curated set of adversarial probes against your system prompt and reports which ones the model defended, which ones partially leaked, and which ones it complied with. Use it before shipping a customer facing agent, after editing the system prompt, or as part of a CI gate for prompt changes.
The probes are organised into six categories drawn from the OWASP LLM Top-10: direct injection, role override, indirect injection, prompt leak, jailbreak, and tool misuse. Each probe maps to a known failure pattern, not a one off curio.
How to use it
Paste your real production system prompt in the textarea, pick a provider and model, and paste your API key. The key stays in this browser tab; the request goes directly from your browser to OpenAI or Anthropic. Click Run and the tester sends each adversarial prompt one at a time, then classifies each reply.
The verdict per probe is Defended, Partial leak, Failed, or Error. Defended means the reply did not include the trigger phrase the prompt was supposed to refuse. Partial means some signals leaked. Failed means the model complied with the attack. Error means the request failed (network, auth, rate limit).
How the verdicts are computed
Each adversarial prompt has a list of expected refusal signals: substrings the reply should not contain. The classifier checks whether the reply contains all of them (Failed), some of them (Partial leak), or none of them (Defended). This is a simple, transparent test, not a model graded eval.
Simple substring matching has limits. A model may refuse with one phrasing that the classifier scores as Defended, but with another phrasing that the classifier scores as Partial. Always read the actual replies for any probe that scored Partial or Failed before treating the result as final.
Where prompt defenses usually fail
The most common failure is over reliance on the system prompt for security. Models can be talked out of system prompt instructions surprisingly easily, especially with role override (DAN, developer mode) and indirect injection through attached content. The system prompt is a soft control, not a security boundary.
The second common failure is letting tool calls reflect untrusted input back into the prompt path. A user supplied URL that a tool fetches and returns becomes new prompt content. Defenses include strict allowlists for tool inputs, output sanitisation, and an LLM judge or regex pass between tool output and the next model call.
When this tester is the right tool and when it is not
Use this tester for fast feedback during prompt engineering, as a smoke test before deploying a new agent surface, or as part of a CI gate that flags regressions when the prompt changes.
It is not a security audit. Real audits include manual red teaming, threat modelling against your specific data and tools, and ongoing monitoring in production. For high stakes systems, follow this tester with a focused red team engagement.
Shipping an agent that touches money or identity?
Production grade agentic AI systems need defenses that go beyond the system prompt. Bring the architecture for a security focused review.
Book an architecture reviewFrequently asked questions
- What does this tester do?
- It sends your system prompt plus a series of adversarial user prompts to an LLM provider you choose, then checks whether the model leaked the system prompt, complied with role overrides, or repeated trigger phrases the prompt was supposed to refuse. The verdict per probe is Defended, Partial leak, Failed, or Error.
- Why bring my own API key?
- Running probes costs real provider credits. Hosting the keys on our side would mean paying every visitor's bill and gating access. BYO key keeps the tool free, fast, and private. Your key is held only in this browser tab. It never reaches our backend; the requests go directly from your browser to OpenAI or Anthropic.
- Does this catch every prompt injection?
- No. The probes cover common categories (direct injection, role override, indirect injection, prompt leak, jailbreak, tool misuse) but every defense has gaps. Use this tool as a smoke test, not a security audit. If your system handles money, identity, or compliance, follow up with a focused red team engagement.
- What is indirect prompt injection?
- Indirect prompt injection happens when malicious instructions are embedded in content the model is asked to read (a web page, an email, a PDF) rather than typed by the user. Defenses must treat any external content as untrusted, isolate it from system instructions, and refuse to execute instructions that originate inside attached content.
- Why did the model leak the system prompt?
- Most general purpose models will repeat their system prompt if asked plainly. Defenses include explicit refusal instructions in the system prompt, RAG style separation between trust contexts, and post processing that detects prompt fragments before reply. The OWASP LLM Top-10 lists prompt leakage as a recurring risk class.
- Can I add my own adversarial prompts?
- Not via the UI yet. The corpus is loaded from a JSON file in the repo. If you want to test a specific attack, fork the repo and add it. The corpus is intentionally small and curated so each probe represents a category rather than a brute force list.
- Will my prompt be logged?
- Not on our side. We do not run a server in the path between the input fields and the LLM provider. The provider does log inputs per their privacy policy. If your system prompt is sensitive, run the test against a model and account whose data retention policy you trust.
Related services and reading
From smoke test to production hardening.