GPT-4o Struggles as Autonomous Explainer for Medical Text

• GPT-4o fails to match traditional SHAP and IG explainers in medical text feature attribution reliability.
• Study evaluates GPT-4o as an autonomous explainer on 80,901 tokens across 200 diverse clinical research studies.
• GPT-4o achieved significantly lower AOPC faithfulness scores (0.025-0.029) than SHAP (0.222) and IG (0.225) baselines.

A study published in JMIR Medical Informatics on June 10, 2026, evaluated the effectiveness of GPT-4o as an autonomous explainer for a BioLinkBERT model classifying medical text. Researchers Fan Zhou, Ashirbani Saha, and Cynthia Lokker compared GPT-4o against established interpretability methods, Shapley Additive Explanations (SHAP) and Integrated Gradients (IG), to determine feature importance in text classification.

The study used a stratified sample of 200 medical studies from the McMaster Premium Literature Service (PLUS) and Clinical Hedges databases, focusing on difficult, low-confidence predictions. The evaluation involved a feature space of 80,901 tokens across 6,369 unique identifiers. SHAP (AOPC 0.222, 95% CI 0.200-0.244) and IG (AOPC 0.225, 95% CI 0.202-0.247) demonstrated high faithfulness and identified relevant clinical terms like “randomized,” whereas GPT-4o showed significantly lower faithfulness scores (AOPC 0.025-0.029). Correlation analysis found a Pearson correlation of r=0.367 between SHAP and IG attributions, while GPT-4o showed limited alignment with either baseline (r≤0.032).
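To make the two metrics above concrete, here is a minimal Python sketch of how an AOPC-style faithfulness score and a Pearson correlation between two attribution vectors can be computed. It is an illustration only, not the study's protocol: the masking token, the number of perturbation steps, the toy classifier `toy_predict`, and the two example score vectors are all assumptions made for this sketch, and AOPC definitions vary between papers.

```python
import math
from typing import Callable, List, Optional, Sequence


def aopc(tokens: Sequence[str], scores: Sequence[float],
         predict: Callable[[Sequence[str]], float],
         max_k: Optional[int] = None, mask: str = "[MASK]") -> float:
    """Mean drop in predicted probability as the top-k ranked tokens are masked."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Rank token positions from most to least important.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    steps = len(tokens) if max_k is None else min(max_k, len(tokens))
    base = predict(tokens)
    perturbed: List[str] = list(tokens)
    drops = []
    for k in range(steps):
        perturbed[order[k]] = mask          # cumulatively mask the next token
        drops.append(base - predict(perturbed))
    return sum(drops) / len(drops) if drops else 0.0


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def toy_predict(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in classifier: a logistic score over two keywords."""
    weights = {"randomized": 2.0, "trial": 1.0}
    logit = sum(weights.get(t, 0.0) for t in tokens) - 1.5
    return 1.0 / (1.0 + math.exp(-logit))


tokens = ["a", "randomized", "controlled", "trial"]
faithful_scores = [0.0, 0.9, 0.1, 0.5]    # ranks the keywords the model uses first
unfaithful_scores = [0.9, 0.0, 0.5, 0.1]  # ranks tokens the model ignores first

aopc_faithful = aopc(tokens, faithful_scores, toy_predict, max_k=2)
aopc_unfaithful = aopc(tokens, unfaithful_scores, toy_predict, max_k=2)
agreement = pearson(faithful_scores, unfaithful_scores)

print(f"AOPC (faithful ranking):   {aopc_faithful:.3f}")
print(f"AOPC (unfaithful ranking): {aopc_unfaithful:.3f}")
print(f"Pearson r between the two: {agreement:.3f}")
```

In this toy setup, an explainer that ranks the tokens the classifier actually relies on produces a larger probability drop, and therefore a higher AOPC, than one that ranks irrelevant tokens first. That is the sense in which a low AOPC signals low faithfulness.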

Researchers concluded that GPT-4o lacks the reliability required for perturbation-based explainability, because it did not synthesize feature importance as accurately as the traditional frameworks. The study also noted that the iterative API calls made GPT-4o significantly slower and more computationally expensive than IG, which proved to be the most efficient method for identifying feature importance in clinical literature analysis.
