LiVersa AI Hepatology E-Consult Tool Accuracy Study

• UCSF researchers evaluated LiVersa, a customized LLM that drafts responses to hepatology e-consults, from January to March 2025.
• Human reviewers rated 83% of LiVersa drafts as providing appropriate recommendations, though 3.4% posed a risk of severe harm.
• OpenAI-o1, used as an automated judge, was more stringent than human clinicians and flagged more drafts as potentially harmful.

Researchers at the University of California, San Francisco (UCSF) evaluated LiVersa, a customized large language model designed to assist with hepatology e-consults, between January and March 2025. The study analyzed 61 e-consult cases, most commonly involving abnormal liver function tests (34%), hepatitis B (23%), and abnormal imaging (21%).

LiVersa drafts did not differ significantly in length from human-written responses, with average word counts of 284 versus 264 (p=0.47) and average sentence lengths of 24 versus 25 words (p=0.44). According to human expert reviewers, 83% of drafts provided appropriate case-specific recommendations, and 72% served as reasonable starting points. However, 10% contained misleading information, and 3.4% carried a risk of severe harm.

The researchers also compared human reviewers with an "LLM-as-a-judge" approach using the OpenAI-o1 model. Human experts rated 48% of drafts as clinically equivalent, whereas the LLM judge was stricter, rating only 27% as equivalent and flagging 67% as potentially harmful. Despite this discrepancy, the two reviewer types were statistically equivalent on key accuracy metrics (two one-sided tests, TOST, p<0.05). The findings suggest that LLMs show promise for drafting clinical responses, but the risk of harm makes human oversight mandatory during implementation.
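For readers unfamiliar with TOST, the sketch below shows how an equivalence test between two reviewer types could look for a single yes/no rating. It is only an illustration: the summary does not say which test statistic or equivalence margin the study used, so the normal-approximation z-test, the function name `tost_two_proportions`, the counts, and the 20-point margin are all assumptions, not the study's data or method.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(x1, n1, x2, n2, margin):
    """Two one-sided z-tests (TOST) for equivalence of two proportions.

    Returns (p_lower, p_upper, p_tost). Equivalence within +/- margin is
    supported when p_tost is below the chosen alpha (e.g. 0.05).
    Assumes both sample proportions are strictly between 0 and 1.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard error of the difference in proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist()
    # Test 1: H0 diff <= -margin, H1 diff > -margin
    p_lower = 1 - z.cdf((diff + margin) / se)
    # Test 2: H0 diff >= +margin, H1 diff < +margin
    p_upper = z.cdf((diff - margin) / se)
    # Both one-sided tests must reject, so report the larger p-value
    return p_lower, p_upper, max(p_lower, p_upper)


# Hypothetical counts (NOT from the study): drafts rated "accurate"
# by human reviewers vs. by the LLM judge, out of 60 drafts each.
p_lower, p_upper, p_tost = tost_two_proportions(50, 60, 48, 60, margin=0.20)
print(f"TOST p-value: {p_tost:.4f}")
```

Note the direction of the logic: unlike an ordinary significance test, a small TOST p-value is evidence that the two groups agree to within the margin, which is how reviewers can be "equivalent" on accuracy while still differing on harm ratings.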
