
AWS Launches Multimodal Evaluators for Image-to-Text Tasks

• AWS released four multimodal evaluators for image-to-text tasks in the Strands Evals SDK.
• The evaluators automatically catch hallucinations and factual errors by judging outputs directly against the source images.
• Anthropic Claude Sonnet 4.6 is the default judge model, chosen to balance accuracy, cost, and latency.

Amazon Web Services (AWS) has launched four multimodal evaluators in the Strands Evals software development kit (SDK) to automate the verification of image-to-text outputs. The tools target common failure modes in visual tasks such as document extraction, chart interpretation, and screenshot summarization, where text-only evaluation often fails to detect hallucinations or grounding errors. Gartner projects that 80% of enterprise software will be multimodal by 2030, up from less than 10% in 2024, a shift that increases the need for automated quality assessment.

The four evaluators are Overall Quality, Correctness, Faithfulness, and Instruction Following. Each one takes the source image together with the query and the model response, and returns both a score and a diagnostic reasoning string. Overall Quality returns a score on a 1-5 Likert scale, while the other three return binary pass/fail judgments. The evaluators support a reference-based mode, which compares outputs against a gold-standard answer, and a reference-free mode for production environments that lack ground truth. Developers can integrate them into existing workflows to automatically detect issues such as unsupported inferences, format violations, and factual inaccuracies.
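To make the input and output shapes described above concrete, here is a minimal Python sketch. It is not the Strands Evals API: the names `JudgeInput`, `JudgeResult`, and `validate` are hypothetical and exist only for illustration. The sketch assumes a missing reference means reference-free mode, and that a result is well-formed when Overall Quality carries an integer from 1 to 5 and the other three evaluators carry a boolean pass/fail.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class JudgeInput:
    """Hypothetical container for what an evaluator receives."""
    image_bytes: bytes
    query: str
    response: str
    reference: Optional[str] = None  # gold-standard answer, if one exists

    @property
    def mode(self) -> str:
        # Assumption: no reference means reference-free evaluation.
        return "reference-based" if self.reference is not None else "reference-free"


@dataclass(frozen=True)
class JudgeResult:
    """Hypothetical container for what an evaluator returns."""
    evaluator: str
    score: Union[int, bool]  # 1-5 for Overall Quality, True/False otherwise
    reasoning: str           # diagnostic explanation from the judge


BINARY_EVALUATORS = {"Correctness", "Faithfulness", "Instruction Following"}


def validate(result: JudgeResult) -> JudgeResult:
    """Check that the score type matches the evaluator's scale."""
    if result.evaluator == "Overall Quality":
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(result.score, int) and not isinstance(result.score, bool)
        if not is_int or not 1 <= result.score <= 5:
            raise ValueError("Overall Quality expects an integer from 1 to 5")
    elif result.evaluator in BINARY_EVALUATORS:
        if not isinstance(result.score, bool):
            raise ValueError(f"{result.evaluator} expects a pass/fail boolean")
    else:
        raise ValueError(f"unknown evaluator: {result.evaluator}")
    if not result.reasoning.strip():
        raise ValueError("a diagnostic reasoning string is required")
    return result
```

A check like this could sit between the judge call and any downstream reporting, so that a malformed judgment fails loudly instead of being averaged into a score.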

Testing by the team showed that giving the image directly to a multimodal judge model aligns more closely with human scores than judging from a text-only description of the image. Anthropic Claude Sonnet 4.6 on Amazon Bedrock serves as the default judge model, balancing accuracy, cost, and latency. The team also found that requiring the judge to write out its reasoning before giving a score significantly improved alignment with human assessment. They further recommend using diverse calibration examples and multi-dimensional rubrics, rather than holistic prompts, so that the judge can distinguish between different types of errors. The evaluators are now available in the Strands Evals framework, where they can support debugging and continuous integration pipelines.
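The two prompt-design recommendations above, reasoning before scoring and a multi-dimensional rubric, can be sketched in Python as follows. This is an illustration under stated assumptions, not the prompts or parsing used by Strands Evals. The rubric wording, the `REASONING:` and `VERDICTS:` labels, and the function names are all hypothetical. The sketch covers only the text part of the judge request and assumes the image is attached to it separately.

```python
import json
import re
from typing import Dict, Tuple

# Hypothetical rubric: one question per error type instead of one holistic prompt.
RUBRIC = {
    "correctness": "Is the response factually correct with respect to the image?",
    "faithfulness": "Is every claim in the response supported by the image?",
    "instruction_following": "Does the response follow the query's instructions and format?",
}


def build_judge_prompt(query: str, response: str, rubric: Dict[str, str] = RUBRIC) -> str:
    """Build the text part of a judge prompt that asks for reasoning first."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in rubric.items())
    example = json.dumps({name: "pass" for name in rubric})
    return (
        "You are grading a model response against the attached image.\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        "Judge each criterion separately:\n"
        f"{criteria}\n"
        "First write your analysis after the label 'REASONING:'.\n"
        "Then, on a new line, write 'VERDICTS:' followed by a JSON object "
        f'mapping each criterion to "pass" or "fail", for example {example}.'
    )


def parse_judge_output(text: str, rubric: Dict[str, str] = RUBRIC) -> Tuple[str, Dict[str, bool]]:
    """Parse judge output, rejecting replies that score before reasoning."""
    match = re.search(r"REASONING:(.*?)VERDICTS:(.*)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("expected 'REASONING:' followed by 'VERDICTS:'")
    reasoning = match.group(1).strip()
    if not reasoning:
        raise ValueError("reasoning must not be empty")
    raw = json.loads(match.group(2))  # JSONDecodeError is a ValueError
    if not isinstance(raw, dict) or set(raw) != set(rubric):
        raise ValueError("verdicts must cover exactly the rubric criteria")
    verdicts = {}
    for name, value in raw.items():
        if value not in ("pass", "fail"):
            raise ValueError(f"invalid verdict for {name}: {value!r}")
        verdicts[name] = value == "pass"
    return reasoning, verdicts
```

Keeping one verdict per criterion means a single response can fail faithfulness while passing instruction following, which a single holistic score would hide.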
