Amazon Releases Automated AI Agent Failure Diagnostics
- AWS released Strands Evals SDK detectors to automate root cause analysis for failed AI agent traces.
- The tool categorizes failures into nine types and maps causal chains to distinguish root issues from symptoms.
- Integration with CI/CD pipelines allows automated diagnosis and fix recommendations for agents in production.
Amazon Bedrock users can now automate the diagnosis of failed AI agents using the newly released detectors in the Strands Evals SDK. Traditional evaluation metrics confirm that an agent is failing (for example, a drop in goal success rate from 85 percent to 70 percent), but they do not explain why it fails or how to resolve the issue. The new detector pipeline automates this workflow by analyzing execution traces span by span, identifying failures, and offering actionable recommendations, reducing the time spent on manual diagnosis from hours to minutes.
The detection process runs in two phases, both powered by LLM-based analysis. Phase 1 performs failure detection across a taxonomy of nine categories, including hallucination, orchestration errors, and incorrect actions. It maps each identified failure to a specific span and cites evidence extracted from the execution trace. Phase 2 conducts root cause analysis, linking those failures into causal chains. By classifying failures as primary, secondary, or tertiary, the tool separates the root issue from its downstream symptoms and recommends whether the fix belongs in the system prompt or in the tool definitions.
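To make the causal-chain idea concrete, here is a minimal sketch in plain Python. It does not use the Strands Evals SDK; the `Failure` record, the `Role` enum, the `caused_by` link, and the `causal_chain` helper are all hypothetical names chosen for illustration, and the sample evidence strings are invented. It only shows how span-level findings might be linked back to a root issue.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    """Position of a failure in a causal chain (names assumed for this sketch)."""
    PRIMARY = 1
    SECONDARY = 2
    TERTIARY = 3


@dataclass(frozen=True)
class Failure:
    """Hypothetical record for one detected failure; not the SDK's data model."""
    span_id: str
    category: str                    # e.g. "hallucination"
    evidence: str                    # excerpt from the trace supporting the finding
    role: Role
    caused_by: Optional[str] = None  # span_id of the upstream failure, if any


def causal_chain(failures, span_id):
    """Follow caused_by links from one failure to its root; return root-first."""
    by_span = {f.span_id: f for f in failures}
    chain, seen = [], set()
    current = by_span.get(span_id)
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)    # guard against accidental cycles
        chain.append(current)
        current = by_span.get(current.caused_by) if current.caused_by else None
    return list(reversed(chain))


failures = [
    Failure("span-2", "hallucination",
            "Agent cited an order ID absent from tool output", Role.PRIMARY),
    Failure("span-5", "incorrect actions",
            "Refund tool called with the invented order ID", Role.SECONDARY,
            caused_by="span-2"),
    Failure("span-7", "orchestration errors",
            "Retry loop after the refund tool returned an error", Role.TERTIARY,
            caused_by="span-5"),
]

chain = causal_chain(failures, "span-7")
root = chain[0]
print(" -> ".join(f.span_id for f in chain))  # span-2 -> span-5 -> span-7
```

Starting from the last symptom and walking upstream lands on the primary failure, which is the one a fix should target.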
Developers can integrate these diagnostic tools directly into their CI/CD evaluation pipelines using the DiagnosisConfig feature. Two trigger modes are available: ON_FAILURE, which runs only when a test fails to save on LLM inference costs, and ALWAYS, which performs analysis on every case to identify suboptimal behavior even in successful sessions. The SDK also includes a CloudWatchProvider to fetch production traces directly from Amazon CloudWatch Logs for historical analysis. These detectors are framework-agnostic, supporting traces from any system that exports OpenTelemetry data, including LangChain and Strands Agents.
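The difference between the two trigger modes can be sketched as a simple gate in front of the diagnosis step. The mode names `ON_FAILURE` and `ALWAYS` come from the article; the `TriggerMode` enum, `should_diagnose`, and `diagnosed_cases` below are hypothetical stand-ins, not the actual `DiagnosisConfig` interface, and the test case names are made up.

```python
from enum import Enum


class TriggerMode(Enum):
    """Stand-in for the two trigger modes described in the article."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def should_diagnose(mode: TriggerMode, test_passed: bool) -> bool:
    """Decide whether a test case is sent to the LLM-based diagnosis step."""
    if mode is TriggerMode.ALWAYS:
        return True
    return not test_passed  # ON_FAILURE: only failing cases incur LLM cost


def diagnosed_cases(mode: TriggerMode, results: dict) -> list:
    """Return the names of the cases that would be diagnosed under a mode."""
    return [name for name, passed in results.items()
            if should_diagnose(mode, passed)]


# Hypothetical evaluation results: case name -> whether the test passed.
results = {"refund_flow": True, "order_lookup": False, "address_change": True}

print(diagnosed_cases(TriggerMode.ON_FAILURE, results))  # ['order_lookup']
print(diagnosed_cases(TriggerMode.ALWAYS, results))
```

In this toy run, `ON_FAILURE` diagnoses one case out of three while `ALWAYS` diagnoses all three, which is the cost-versus-coverage trade-off the two modes represent.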
Recommended practice is to start with a MEDIUM confidence threshold, which balances signal against noise. Developers are advised to fix primary failures first, since resolving them often clears the secondary and tertiary symptoms that cascade from them. Because the detectors use Amazon Bedrock for analysis, users should monitor usage costs in AWS Cost Explorer, particularly when pipelines are configured to run frequently.
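As a rough illustration of that triage advice, the sketch below filters findings by a confidence threshold and then orders what remains so primary failures come first. Only the MEDIUM level is named in the article; the `LOW` and `HIGH` levels, the `Confidence` enum, the dictionary layout of a finding, and the `triage` function are assumptions made for this example rather than SDK features.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Assumed three-level scale; the article names only MEDIUM."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Lower number means the failure is handled earlier.
ROLE_ORDER = {"primary": 0, "secondary": 1, "tertiary": 2}


def triage(findings, threshold=Confidence.MEDIUM):
    """Drop findings below the threshold, then list primary failures first."""
    kept = [f for f in findings if f["confidence"] >= threshold]
    return sorted(kept, key=lambda f: ROLE_ORDER[f["role"]])


# Hypothetical findings from one diagnosed trace.
findings = [
    {"span_id": "span-7", "role": "tertiary", "confidence": Confidence.HIGH},
    {"span_id": "span-3", "role": "secondary", "confidence": Confidence.LOW},
    {"span_id": "span-2", "role": "primary", "confidence": Confidence.MEDIUM},
]

queue = triage(findings)
print([f["span_id"] for f in queue])  # ['span-2', 'span-7']
```

Raising the threshold to `Confidence.HIGH` would leave only `span-7` in this example, which shows how a stricter setting can hide a primary failure that was detected with moderate confidence.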