
LLM Performance in Mitral Valve Surgery Patient Education

• Study compares five LLMs for mitral valve surgery patient education across three primary dimensions.
• ChatGPT-4o and Gemini 2.5 Pro Preview achieve higher accuracy scores than other tested models.
• Claude 3.7 Sonnet provides the most readable, simplified content for patient communication purposes.

Banu Bahriye Akdag, M. Bademci, and I. Peker evaluated five large language models—ChatGPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, DeepSeek-V3, and Microsoft Copilot—on their ability to answer seven common patient questions regarding mitral valve surgery (MVS). Published on June 29, 2026, in BMC Medical Informatics and Decision Making, the study measured model output across three criteria: accuracy, completeness, and readability.

Results showed statistically significant differences among the models across all dimensions (p < 0.001). ChatGPT-4o and Gemini 2.5 Pro Preview led in accuracy, each achieving a median score of 5, compared with 4 for Claude 3.7 Sonnet and Microsoft Copilot. For completeness, Gemini 2.5 Pro Preview led with a median score of 5, while Claude 3.7 Sonnet scored 3 (p < 0.001). Conversely, Claude 3.7 Sonnet produced the most readable responses, scoring 10.90 on the SMOG Index and 8.0 on the Flesch-Kincaid Grade Level scale, compared with 12.24 and 9.04, respectively, for ChatGPT-4o (p < 0.006 and p < 0.004). On both scales, a lower score indicates text that requires fewer years of schooling to read. The researchers concluded that LLMs show promise for patient education but require professional clinical oversight because accuracy and completeness vary between models.
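
For readers unfamiliar with the two readability scales, the sketch below shows how the SMOG Index and the Flesch-Kincaid Grade Level are computed from sentence, word, and syllable counts, using their standard published formulas. It is an illustration only. The source does not say which tool the researchers used, and the function names and the vowel-group syllable counter here are assumptions that only approximate true syllable counts.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a likely silent trailing 'e' ("valve"), but not "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_text(text: str):
    """Return (sentences, words) using simple punctuation and letter rules."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    return sentences, words


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences, words = split_text(text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def smog_index(text: str) -> float:
    """SMOG Index: 1.0430*sqrt(polysyllables * 30/sentences) + 3.1291."""
    sentences, words = split_text(text)
    # Polysyllabic words are those with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    plain = "The cat sat. The dog ran."
    clinical = "The mitral valve operation requires cardiopulmonary bypass."
    for label, sample in (("plain", plain), ("clinical", clinical)):
        print(label, round(flesch_kincaid_grade(sample), 2), round(smog_index(sample), 2))
```

Both formulas reward short sentences and short words, which is why a response can score well on readability while scoring lower on completeness. SMOG was designed for samples of about 30 sentences, so values for very short passages like the ones above are rough indications at best.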
