Building enterprise AI agents requires a defense-in-depth safety architecture. This post shares lessons from implementing a two-layer shield system with Llama Stack: PromptGuard (86M parameters, CPU-based) as the first layer, detecting prompt injections and jailbreaks, and Llama Guard 3 8B (GPU-based) as the second layer, covering content safety across 14 categories.

Key findings include the need to exclude certain safety categories (Privacy, Specialized Advice, Code Interpreter Abuse, Self-Harm) to prevent false positives in legitimate IT workflows such as employee lookups. A notable challenge arose with LangGraph-based small-prompt architectures, where PromptGuard flagged internal state-machine prompts as injection attempts instead of evaluating the actual user input. The post emphasizes that safety shields require careful tuning per use case and cannot simply be enabled and left unattended.
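As a rough illustration of the two-layer flow described above, the sketch below chains both shields through the Llama Stack client's safety API. It is not the post's code: the shield IDs, server URL, and messages are assumptions for the example, and the Llama Guard category exclusions are assumed to be set in the shield's provider configuration rather than passed at call time.

```python
# Minimal sketch of a two-layer shield check, assuming shields registered
# with the Llama Stack server under the IDs "prompt-guard" and
# "llama-guard-3-8b" (IDs and URL are illustrative).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")


def check_input(user_message: str) -> bool:
    """Return True if the message passes both shields, False otherwise."""
    messages = [{"role": "user", "content": user_message}]

    # Layer 1: PromptGuard (small, CPU-based) screens for prompt injection
    # and jailbreak attempts before anything reaches the GPU.
    layer1 = client.safety.run_shield(
        shield_id="prompt-guard", messages=messages, params={}
    )
    if layer1.violation:
        print(f"Blocked by PromptGuard: {layer1.violation.user_message}")
        return False

    # Layer 2: Llama Guard 3 8B (GPU-based) checks content safety.
    # Category exclusions (e.g. Specialized Advice, Privacy, Self-Harm,
    # Code Interpreter Abuse) are assumed to be configured on the shield's
    # provider, not here.
    layer2 = client.safety.run_shield(
        shield_id="llama-guard-3-8b", messages=messages, params={}
    )
    if layer2.violation:
        print(f"Blocked by Llama Guard: {layer2.violation.user_message}")
        return False

    return True
```

Ordering the layers this way means most hostile inputs are rejected by the cheap CPU model and never consume GPU time on the larger one.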

11 min read · From developers.redhat.com
Table of contents
- About AI quickstarts
- Llama Stack shield architecture
- Configuration: Defense-in-depth in practice
- The false positive problem: When safety becomes too safe
- The small prompt challenge: Shield validation issues
- Closing thoughts
- Next steps
