AI-agent security

When an AI agent reads a web page, a sub-agent reply, or a package’s files, that text becomes part of what the model “knows” — and it can be turned into a command. These tools draw the line between data the agent reads and instructions the agent follows.

Why this matters

This class of attack is indirect prompt injection, and OWASP ranks prompt injection as LLM01 — the number-one risk in its Top 10 for LLM Applications. It isn’t theoretical: researchers have documented agents steered into leaking secrets and running commands from content they merely read.

The injected instructions are invisible to a human — hidden in zero-width Unicode, off-screen CSS, HTML comments, or planted config files — but read loud and clear by the model, which has no way to know the text came from an untrusted source. Left unguarded, an agent can rewrite your CLAUDE.md and poison every future session, lift a “diagnostic” command off a page that exfiltrates your credentials, or run an attacker’s payload through a node -e one-liner. In the May 2026 TrapDoor campaign, malicious packages did exactly this — planting hidden instructions in CLAUDE.md / .cursorrules to turn AI assistants into accomplices.

Full explainer: What is prompt injection? — how it works, why models fall for it, the attack vectors, and how to defend.

Further reading: OWASP — LLM01: Prompt Injection · Palo Alto Unit 42 — AI agent prompt injection · TrapDoor (The Hacker News).

[Figure: How safe-fetch stops indirect prompt injection. A poisoned web page carrying hidden instructions is fetched through safe-fetch, which strips the hidden instructions and wraps the result as UNTRUSTED-WEB, so the agent reads it as data, not commands.]
A booby-trapped page can’t become a command: safe-fetch labels it as untrusted data first.

The tools

safe-fetch

A Docker-isolated URL fetcher for AI agents. It fetches inside a hardened, throw-away container, strips the hidden injection vectors (invisible Unicode, hidden/encoded HTML, fake delimiters), and wraps the result in <UNTRUSTED-WEB> tags so the agent treats it as data. Optional Claude Code hooks add the model rule that enforces the boundary, plus a safe-fetch search that runs search results through the same path.

brew install sharkyger/tap/safe-fetch
safe-fetch https://example.com
safe-fetch --install-claude-hooks

github.com/sharkyger/safe-fetch · MIT
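
To make the strip-and-wrap idea concrete, here is a minimal Python sketch. It is not safe-fetch's actual code: the function names, the choice to drop every Unicode format-category (Cf) character, and the regular expressions are assumptions made for illustration. Only the <UNTRUSTED-WEB> tag name comes from the description above.

```python
import re
import unicodedata

# Assumed patterns for this sketch; the real tool's rules may differ.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
FAKE_DELIMITER = re.compile(r"</?\s*UNTRUSTED-WEB[^>]*>", re.IGNORECASE)


def strip_invisible(text: str) -> str:
    """Drop format-category (Cf) characters: zero-width spaces and joiners,
    bidi controls, tag characters and similar invisible code points."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def sanitise(text: str) -> str:
    """Remove invisible characters, HTML comments and forged wrapper tags."""
    text = strip_invisible(text)
    # Repeat until nothing changes, so a tag split around another tag or a
    # comment cannot reassemble itself after a single pass.
    while True:
        cleaned = FAKE_DELIMITER.sub("", HTML_COMMENT.sub("", text))
        if cleaned == text:
            return cleaned
        text = cleaned


def wrap_untrusted(text: str) -> str:
    """Wrap sanitised content so the agent can tell it is data."""
    return f"<UNTRUSTED-WEB>\n{sanitise(text)}\n</UNTRUSTED-WEB>"
```

Stripping forged closing tags matters as much as stripping hidden text: without it, a page could close the wrapper early and place its own instructions outside the untrusted block. Dropping all Cf characters is deliberately blunt and also removes legitimate joiners, for example in some emoji sequences.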

mcp-safe-fetch

The same sanitiser as an MCP server, for Claude Desktop and other MCP clients — plus an SSRF guard that rejects IP-literals and pins the resolved IP across redirects, so a fetch can’t be steered at internal services.

docker pull ghcr.io/sharkyger/mcp-safe-fetch:latest

Add it to claude_desktop_config.json as an MCP server, restart Claude Desktop, and add the untrusted-data rule to your project instructions. (macOS only for now.)

github.com/sharkyger/mcp-safe-fetch · MIT
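
The SSRF guard described above can be sketched in Python as follows. This is an illustration under stated assumptions, not mcp-safe-fetch's implementation: pin_address, BlockedURL and the injected resolve callable are hypothetical names, and the rule that every resolved address must be globally routable is one reasonable reading of "can't be steered at internal services".

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit


class BlockedURL(ValueError):
    """Raised when a URL must not be fetched."""


def is_ip_literal(host: str) -> bool:
    """True if host is written as an IPv4 or IPv6 address, not a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def system_resolve(host: str) -> list[str]:
    """Default resolver: every address the operating system returns."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def pin_address(url: str,
                resolve: Callable[[str], Iterable[str]] = system_resolve) -> str:
    """Validate url and return the one IP address the connection should use."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        raise BlockedURL(f"unsupported URL: {url!r}")
    if is_ip_literal(host):
        raise BlockedURL(f"IP-literal host rejected: {host}")
    addresses = [ipaddress.ip_address(a) for a in resolve(host)]
    # Refuse if any answer is private, loopback, link-local or otherwise
    # not globally routable.
    if not addresses or not all(a.is_global for a in addresses):
        raise BlockedURL(f"{host} resolves to a non-public address")
    return str(addresses[0])
```

In this sketch, a fetcher would call pin_address for the initial URL and again for every redirect target, then connect to the returned address instead of resolving the name a second time. Connecting to the pinned address is what closes the gap between the check and the request, where a DNS answer could otherwise change.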

claude-code-prompt-injection-gate

The hooks that stop fetched or sub-agent text from being run as instructions in Claude Code — and gate writes to the files an attacker would target. Fetch hooks block non-allowlisted hosts and inline node -e/python -c fetches; a sub-agent hook wraps output as untrusted; a Write/Edit gate requires an operator-minted marker before anything touches CLAUDE.md, settings, hooks or project memory.

brew install sharkyger/tap/safe-fetch
safe-fetch --install-claude-hooks

github.com/sharkyger/claude-code-prompt-injection-gate · MIT


New to the topic?

Start with the full explainer — What is prompt injection?.

The same TrapDoor campaign also shipped malicious packages, the other half of the threat. See the supply-chain gates that block them before they install.