Securing AI Agents With Automated Red Teaming
- •Morgan Willis reduced AI agent security breaches from 6/9 to 1/9 using multi-layered defense strategies.
- •Automated red teaming identified vulnerabilities including credential exfiltration, prompt leakage, and excessive tool agency.
- •Architectural solutions like virtualized sandboxing and identity-scoped gateways successfully offload security from the model's judgment.
Morgan Willis, a developer, demonstrated that AI agents with unrestricted tool access can leak sensitive data, such as AWS credentials, through adversarial prompting. Using a productivity agent equipped with bash and employee directory tools, Willis tested vulnerabilities by manually finding paths to secret keys and system prompts. Initial testing revealed that 6 out of 9 automatically generated adversarial cases resulted in security breaches. The agent successfully exfiltrated credentials, accessed cross-employee data, and leaked its own system instructions when prompted creatively.
To secure the infrastructure, the developer implemented a virtualized filesystem sandbox, which prevented the agent from accessing restricted local files like credential stores. However, re-testing indicated that 6 out of 9 cases still breached security due to application-layer weaknesses, such as excessive agency and system prompt leakage. Willis then applied multi-layered defenses: Steering, using an LLM to evaluate semantic intent and refuse unauthorized disclosures; Cedar Authorization, a policy-based system that denies all tool calls not explicitly permitted; and Amazon Bedrock Guardrails, which filter out-of-scope requests and redact sensitive information like PII.
With these defenses active, the breach count fell to 1 out of 9 cases. The remaining vulnerability involved cross-employee data access, which architectural guardrails cannot solve because the agent was still authorized to call the lookup tool. Willis concluded that true data security requires moving authentication to the infrastructure level. By using an AgentCore Gateway with an interceptor, the system automatically injects the authenticated user's ID into tool calls based on their JWT (JSON Web Token used for secure identity verification). This ensures that identity validation remains outside the control of the AI agent process, preventing the agent from acting on behalf of unauthorized users.