LiVersa AI Hepatology E-Consult Tool Accuracy Study
- UCSF researchers evaluated LiVersa, a customized LLM that drafts hepatology e-consult responses, from January to March 2025.
- Human reviewers rated 83% of LiVersa drafts as providing appropriate recommendations, though 3.4% posed severe harm risks.
- OpenAI-o1 acted as a more stringent judge, flagging more potential harm than human clinicians did during quality assessment.
Researchers at the University of California San Francisco (UCSF) evaluated LiVersa, a customized large language model designed to assist with hepatology e-consults, between January and March 2025. The study analyzed 61 e-consult cases, primarily involving abnormal liver function tests (34%), hepatitis B (23%), and abnormal imaging (21%).
LiVersa drafts did not differ significantly from human-written responses in length, with average word counts of 284 versus 264 (p=0.47) and average sentence lengths of 24 versus 25 words (p=0.44). According to human expert reviews, 83% of drafts provided appropriate case-specific recommendations, and 72% served as reasonable starting points. However, 10% contained misleading information, and 3.4% carried a risk of severe harm.
The researchers also compared human reviewers to an "LLM-as-a-judge" approach using the OpenAI-o1 model. Human experts rated 48% of drafts as clinically equivalent, whereas the model-based reviewers were more conservative, rating only 27% as equivalent and 67% as potentially harmful. Despite this discrepancy, the two reviewer types were statistically equivalent on key accuracy metrics under two one-sided tests (TOST, p<0.05). The findings indicate that LLMs show promise for drafting clinical responses, but the risk of harm makes human oversight essential during implementation.
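For readers unfamiliar with TOST, the sketch below shows how an equivalence test differs from an ordinary significance test: it asks whether the difference between two rates can be shown to lie inside a pre-chosen margin. It is a simplified illustration, not the study's analysis. It assumes two independent proportions and a normal approximation, and the function name `tost_two_proportions`, the counts, and the 0.2 margin are all hypothetical; the summary reports neither the margin nor the per-metric counts, and ratings of the same drafts by two reviewer types would more properly call for a paired method.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(x1: int, n1: int, x2: int, n2: int, margin: float) -> float:
    """Return the TOST p-value for equivalence of two independent proportions.

    Equivalence is claimed when the returned value is below the chosen
    alpha, meaning the difference p1 - p2 is shown to lie within
    (-margin, +margin). Uses an unpooled normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0.0:
        raise ValueError("standard error is zero; approximation is undefined")

    # Test 1: null hypothesis diff <= -margin, rejected for large z.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # Test 2: null hypothesis diff >= +margin, rejected for small z.
    p_upper = normal_cdf((diff - margin) / se)

    # Both one-sided tests must reject, so report the larger p-value.
    return max(p_lower, p_upper)


# Hypothetical counts: reviewer type A marks 50 of 60 drafts accurate,
# reviewer type B marks 48 of 60, with an assumed margin of 0.2.
p_value = tost_two_proportions(50, 60, 48, 60, margin=0.2)
print(f"TOST p-value: {p_value:.4f}")
```

Note the reversed logic: a small p-value here supports equivalence, whereas in the word-count comparison above a large p-value only indicates that no difference was detected.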