LLM Judge Bias Distorts Agent Benchmark Rankings

• Single-judge LLM evaluations produce inconsistent scores, with models swinging up to 47 points based on the grader.
• Opus-4-7 maintained first place across all judges, but lower-ranked model positions shifted significantly depending on the scorer.
• Using multiple judges and binary rubric criteria is recommended to reduce bias and improve benchmark stability.

A new benchmark analysis reveals that using a single LLM to grade agent performance creates significant scoring volatility, with individual models swinging as much as 47 percentage points on specific skills depending on the judge. The evaluation tested six models across eleven agent skills and found that reported success metrics depend heavily on the choice of judge, not on model capability alone. Researchers at Tessl conducted the study, grading each model's output independently with three different scorers: Sonnet, GPT-5.5, and Opus-4-7. Results indicate that Sonnet is the most generous grader and GPT-5.5 the strictest, with an average scoring gap of 6.9 points between them.

Rankings were unstable below the top spot: Opus-4-7 held first place under all three judges, but the relative order of the other models shifted significantly. For example, gpt-5.3 placed third when graded by Sonnet but dropped to fifth under both GPT-5.5 and Opus-4-7. The study also observed self-judge bias: Opus-4-7 scored itself 4.6 points higher than the average of the scores given by the other two judges. GPT-5.5, in contrast, did not exhibit the same self-favoring pattern.
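The self-judge comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (a judge's score for its own model minus the mean of the other judges' scores for that model), not the researchers' code; the judge names, model names, and scores are hypothetical placeholders, and `self_judge_boost` is an assumed helper name.

```python
from statistics import mean

# Hypothetical scores (judge -> model -> score on a 0-100 scale).
# These are placeholders for illustration, not figures from the study.
scores = {
    "judge_a": {"model_a": 80.0, "model_b": 70.0},
    "judge_b": {"model_a": 70.0, "model_b": 66.0},
    "judge_c": {"model_a": 74.0, "model_b": 68.0},
}

def self_judge_boost(scores, judge, own_model):
    """Score the judge gives its own model minus the mean score
    the other judges give that same model."""
    own = scores[judge][own_model]
    others = [by_model[own_model]
              for name, by_model in scores.items() if name != judge]
    return own - mean(others)

# Assume judge_a and model_a are the same underlying model.
boost = self_judge_boost(scores, "judge_a", "model_a")
print(f"self-judge boost: {boost:+.1f} points")
```

A positive value means the judge rates its own output above what the other judges give it; a value near zero corresponds to the pattern the article reports for GPT-5.5.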

The researchers attribute these discrepancies to how different judges handle output precision. Generous judges often give partial credit for outputs that are approximately correct, while stricter judges penalize deviations from specific requirements. Consequently, skills requiring qualitative assessment can swing by up to 25 percentage points, whereas tasks defined by binary, verifiable outcomes (such as checking whether a file was successfully deleted) remain stable regardless of the judge.
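To make the contrast concrete, here is a minimal sketch of a binary, verifiable criterion of the kind mentioned above. It assumes a made-up task ("delete target.txt and leave keep.txt alone"); the file names and the helpers `file_deleted` and `rubric_score` are hypothetical and do not come from the study. Because each check is a plain true/false fact about the file system, no judge interpretation is involved.

```python
import os
import tempfile

def file_deleted(path):
    """Binary criterion: passes only if nothing exists at `path`."""
    return not os.path.exists(path)

def rubric_score(checks):
    """Percentage of binary criteria that passed (each value is True or False)."""
    return 100.0 * sum(checks.values()) / len(checks)

# Simulated task: "delete target.txt and leave keep.txt alone".
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "target.txt")
    keep = os.path.join(workdir, "keep.txt")
    for path in (target, keep):
        with open(path, "w") as handle:
            handle.write("data")

    os.remove(target)  # stand-in for the agent's action

    # Evaluate the criteria while the working directory still exists.
    checks = {
        "target_deleted": file_deleted(target),
        "keep_untouched": os.path.exists(keep),
    }

score = rubric_score(checks)
print(checks, score)
```

Any grader, human or LLM, that runs these checks gets the same result, which is why such criteria leave little room for judge-to-judge variation.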

To improve the reliability of evaluation metrics, the researchers advise running multiple judges and averaging the results to smooth out individual preferences. They also recommend designing rubrics around binary criteria whenever possible to minimize subjective interpretation. For tasks where precision is critical, the study suggests that stricter judges like GPT-5.5 offer more informative data on whether an agent follows specifications exactly rather than merely approximating them.
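The multi-judge recommendation can be sketched as follows. The scores are hypothetical placeholders chosen so that two models trade places depending on the judge; they are not data from the study, and the names `ranking`, `averaged`, and `swing` are assumptions made for this example.

```python
from statistics import mean

# Hypothetical overall scores (judge -> model -> score on a 0-100 scale).
# Placeholders for illustration only, not results from the study.
scores = {
    "judge_a": {"model_x": 90.0, "model_y": 78.0, "model_z": 75.0},
    "judge_b": {"model_x": 84.0, "model_y": 66.0, "model_z": 72.0},
    "judge_c": {"model_x": 87.0, "model_y": 69.0, "model_z": 72.0},
}

def ranking(by_model):
    """Model names ordered from highest to lowest score."""
    return sorted(by_model, key=by_model.get, reverse=True)

models = list(scores["judge_a"])

# Ranking under each individual judge.
per_judge_rank = {judge: ranking(by_model) for judge, by_model in scores.items()}

# Mean score per model across judges, and the ranking it produces.
averaged = {m: mean(by_model[m] for by_model in scores.values()) for m in models}
averaged_rank = ranking(averaged)

# Swing: gap between the most and least generous judge for each model.
swing = {
    m: max(by_model[m] for by_model in scores.values())
       - min(by_model[m] for by_model in scores.values())
    for m in models
}

print(per_judge_rank)
print(averaged_rank, swing)
```

In this toy data the second and third places flip between judges, while the averaged ranking settles on a single order. Reporting the swing next to the mean shows which results are sensitive to the choice of grader.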
