When building MCP servers, tool responses can become prompt injection vectors because attacker-controlled data (request paths, headers, error details) may end up in fields that LLMs treat as trusted instructions. The solution is to explicitly separate trusted guidance from untrusted evidence in the response structure: trusted fields are generated only from server-controlled values (enums, counters, static templates), while raw attacker-controlled data is isolated under clearly labeled 'untrustedData' fields. The MCP outputSchema should also annotate trust boundaries so clients and models have explicit signals. Regression tests should inject hostile strings and assert they never appear in trusted fields. For broader agent workflows, runtime guards can scan fetched content and apply rate limits before untrusted input reaches the LLM.

6m read timeFrom blog.arcjet.com
Post cover image
Table of contents
Tool output is model inputWhat the MCP spec saysThe pattern we useSchema text is part of the defenseTesting the boundaryRuntime enforcement with GuardsChecklist for agent tool builders

Sort: