Don't Make Models Guess Security and Safety: Symbolic Guardrails for Domain-Specific AI Agents

There is increasing interest in integrating tool-invoking AI agents into domain-specific commercial software, where unintended tool calls can cause serious security and safety incidents. This has drawn growing research attention, and many agent security and safety benchmarks have emerged that implicitly shape how the community approaches these problems. Yet existing work exhibits a blind spot: it emphasizes training-based methods and neural guardrails, which reduce the likelihood of insecure or unsafe actions but cannot guarantee their prevention. It generally overlooks opportunities for deductive, symbolic guardrails grounded in standard software engineering practices, which can provide guarantees for some security and safety requirements. Our study has three parts: (1) a systematic review of 80 agent security and safety benchmarks, finding that 85% of benchmarks do not state verifiable requirements (61% provide none, and 24% give only high-level goals); (2) an applicability analysis of which security and safety requirements symbolic guardrails can and cannot enforce on τ²-Bench, CAR-bench, and MedAgentBench, finding that 74% of requirements are symbolically enforceable and 95% of these need only simple, low-cost checks; and (3) an empirical evaluation of symbolic guardrails on the same three benchmarks, finding that symbolic guardrails improve security and safety without sacrificing utility, and often improve it. Our work draws attention to the potential of symbolic guardrails for AI agents, suggesting them as an overlooked but practical path toward deploying domain-specific AI agents in risk-averse commercial software. We release all code and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
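To make the notion of a symbolic guardrail concrete, the following is a minimal Python sketch of a deterministic check that runs before every tool call. It is an illustration under stated assumptions, not the implementation from the released artifacts: the names `ToolCall`, `GuardrailViolation`, `guarded_execute`, the `issue_refund` tool, and both rules are hypothetical and do not come from the paper or the three benchmarks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical structure)."""
    name: str
    args: dict[str, Any]


class GuardrailViolation(Exception):
    """Raised when a proposed tool call breaks a stated requirement."""


# A rule inspects the call and trusted context; it returns an error
# message if the requirement is violated, otherwise None.
Rule = Callable[[ToolCall, dict[str, Any]], Optional[str]]


def only_own_account(call: ToolCall, ctx: dict[str, Any]) -> Optional[str]:
    # Hypothetical requirement: the agent may act only for the logged-in user.
    if "user_id" in call.args and call.args["user_id"] != ctx["authenticated_user"]:
        return "tool call targets a different user's account"
    return None


def refund_within_limit(call: ToolCall, ctx: dict[str, Any]) -> Optional[str]:
    # Hypothetical requirement: refunds above a fixed cap are never allowed.
    if call.name == "issue_refund" and call.args["amount"] > ctx["max_refund"]:
        return "refund exceeds the configured limit"
    return None


def guarded_execute(
    call: ToolCall,
    ctx: dict[str, Any],
    rules: list[Rule],
    tools: dict[str, Callable[..., Any]],
) -> Any:
    """Run the tool only if it is allowlisted and every rule passes."""
    if call.name not in tools:
        raise GuardrailViolation(f"unknown tool: {call.name}")
    for rule in rules:
        message = rule(call, ctx)
        if message is not None:
            raise GuardrailViolation(message)
    return tools[call.name](**call.args)
```

Because the checks are ordinary code evaluated against trusted context rather than model output, a violating call is rejected every time regardless of how the model was prompted, which is the kind of guarantee a probabilistic neural guardrail cannot offer.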
