LLM Performance Study in Clinical Pharmacy Cases

• The study evaluates the performance of LLMs and prompt engineering on internal medicine clinical pharmacy tasks.
• OpenEvidence showed significantly higher reference validity than ChatGPT 4o across all conditions (p < 0.001).
• Neither the domain-specific RAG model nor the prompt template significantly improved overall response accuracy and completeness.

Researchers evaluated the effectiveness of large language models (LLMs) and prompt engineering in internal medicine clinical pharmacy in a single-center, prospective study published in the Journal of the American College of Clinical Pharmacy on June 21, 2026. A clinical pharmacy specialist created 50 case questions for testing. The experiment used a two-by-two factorial design that compared a general-purpose model, ChatGPT 4o, against OpenEvidence, a healthcare-specific system equipped with retrieval-augmented generation (RAG), a method that connects LLMs to external, verified data sources.

The study also assessed whether a structured prompt engineering template improved outputs. The primary endpoint was a composite of response accuracy and completeness; two pharmacists graded the responses and a third provided reconciliation. For the primary outcome, there was no statistically significant interaction between the choice of model and the use of the prompt template.

The predicted probabilities of meeting the accuracy and completeness benchmark were 0.54 for ChatGPT 4o without a template, 0.60 for ChatGPT 4o with a template, 0.64 for OpenEvidence without a template, and 0.52 for OpenEvidence with a template. OpenEvidence did, however, demonstrate significantly higher reference validity than ChatGPT 4o across all conditions (p < 0.001). Although neither the RAG-equipped system nor the prompt template improved overall accuracy or completeness, the researchers suggest that domain-specific systems show promise because of their superior citation reliability. The study was conducted under the University of Maryland, Baltimore IRB, protocol HP-00112497.
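
The four predicted probabilities reported above can be compared with simple arithmetic. The sketch below is only an illustration of how a two-by-two factorial layout is read, not the study's statistical analysis: the function names `template_effect` and `interaction_contrast` are hypothetical, and the only data used are the four values quoted in this article.

```python
# Predicted probabilities of meeting the accuracy-and-completeness benchmark,
# copied from the article. Keys are (model, template_used).
predicted = {
    ("ChatGPT 4o", False): 0.54,
    ("ChatGPT 4o", True): 0.60,
    ("OpenEvidence", False): 0.64,
    ("OpenEvidence", True): 0.52,
}


def template_effect(model: str) -> float:
    """Change in predicted probability when the prompt template is added."""
    return round(predicted[(model, True)] - predicted[(model, False)], 2)


def interaction_contrast() -> float:
    """Difference between the two template effects (a difference in differences).

    This is a descriptive contrast on the probability scale only; it carries
    no information about uncertainty or statistical significance.
    """
    return round(template_effect("OpenEvidence") - template_effect("ChatGPT 4o"), 2)


if __name__ == "__main__":
    for model in ("ChatGPT 4o", "OpenEvidence"):
        print(f"{model}: template effect = {template_effect(model):+.2f}")
    print(f"Interaction contrast = {interaction_contrast():+.2f}")
```

On these point estimates the template moves the two systems in opposite directions, but the arithmetic says nothing about variability. The study itself reported no statistically significant interaction, so these differences should not be read as real effects.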
