LLM Performance in Radiographic Jaw Lesion Detection

• Gemini 2.5 outperformed competitors in radiolucent and mixed-density jaw lesion radiographic analysis.
• ChatGPT 4.0 demonstrated superior performance in evaluating radiopaque jaw lesions among the tested models.
• The study shows significant variability in diagnostic accuracy across the three LLMs, which require further validation before clinical use.

Researchers evaluated the diagnostic accuracy of three AI chatbots (ChatGPT 4.0, Gemini 2.5, and Microsoft Copilot) in identifying jaw lesions on 120 panoramic radiographs. The study, published in Diagnostics on July 1, 2026, required each model to analyze images featuring mixed, radiolucent, and radiopaque lesion densities. Standardized scoring criteria included morphology, border characteristics, effects on adjacent structures, and overall biological behavior indicators.

Statistical analysis using the Kruskal–Wallis test revealed significant performance differences across the models. Gemini 2.5 achieved the highest diagnostic scores for radiolucent lesions (11.49 ± 4.97) and mixed-density lesions (9.01 ± 5.78). In contrast, ChatGPT 4.0 performed best when analyzing radiopaque lesions (10.93 ± 2.88). Microsoft Copilot consistently recorded the lowest diagnostic scores across all lesion categories studied.
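For readers unfamiliar with the Kruskal–Wallis test, the sketch below shows how such a comparison of three groups of scores could be computed. It is a minimal illustration, not the authors' analysis code: the function names and the three score lists are hypothetical, and the numbers are invented for demonstration rather than taken from the study. With exactly three groups, the chi-square approximation has 2 degrees of freedom, so the p-value reduces to exp(-H/2).

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # i and j are 0-based positions
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the standard tie correction.

    Assumes at least two observations that are not all identical.
    """
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    return h / (1 - tie_term / (n ** 3 - n))


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom (three groups)."""
    return math.exp(-h / 2)


# Hypothetical per-image scores for three models (invented, NOT study data).
scores_model_a = [12, 9, 15, 11, 14]
scores_model_b = [10, 8, 13, 7, 9]
scores_model_c = [4, 6, 3, 5, 2]

h_stat = kruskal_wallis_h(scores_model_a, scores_model_b, scores_model_c)
print(f"H = {h_stat:.3f}, approximate p = {p_value_three_groups(h_stat):.4f}")
```

A small p-value only indicates that at least one group differs from the others; identifying which model differs requires a follow-up pairwise comparison, which this sketch does not include.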

The authors concluded that while these large language models show potential as supportive clinical tools for radiographic evaluation, their variable performance necessitates further validation before adoption in routine dental practice. The study highlights that diagnostic capabilities remain dependent on the specific model and the radiographic pattern of the lesion.
