Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records
Judit M. Wulcan, Kevin L Jacques, Mary Ann Lee, et al.

Frontiers in Veterinary Science, Year: 2025, Volume: 11

Published: Jan. 16, 2025

Large language models (LLMs) can extract information from veterinary electronic health records (EHRs), but performance differences between models, the effect of hyperparameter settings, and the influence of text ambiguity have not been previously evaluated. This study addresses these gaps by comparing GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions and by investigating the relationship between human interobserver agreement and LLM errors. The LLMs and five humans were tasked with identifying six clinical signs associated with feline chronic enteropathy in 250 EHRs from a referral hospital. When compared to the majority opinion of human respondents, GPT-4o demonstrated 96.9% sensitivity [interquartile range (IQR) 92.9-99.3%], 97.6% specificity (IQR 96.5-98.5%), 80.7% positive predictive value (IQR 70.8-84.6%), 99.5% negative predictive value (IQR 99.0-99.9%), 84.4% F1 score (IQR 77.3-90.4%), and 96.3% balanced accuracy (IQR 95.0-97.9%). The performance of GPT-4o was significantly better than that of its predecessor, GPT-3.5 Turbo, particularly with respect to sensitivity, where GPT-3.5 Turbo achieved only 81.7% (IQR 78.9-84.8%). GPT-4o also demonstrated greater reproducibility than human pairs, with an average Cohen's kappa of 0.98 (IQR 0.98-0.99) compared to 0.80 (IQR 0.78-0.81) for humans. Most GPT-4o errors occurred in instances where humans disagreed [35/43 (81.4%)], suggesting that these errors were more likely caused by ambiguity in the EHRs than by explicit model faults. Using LLMs to automate information extraction from veterinary EHRs is a viable alternative to manual extraction, but it requires validation for the intended setting to ensure reliability.
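
As context for the statistics reported above, the sketch below shows how the per-sign classification metrics (sensitivity, specificity, positive and negative predictive value, F1 score, balanced accuracy) and Cohen's kappa for rater-pair reproducibility are conventionally computed. This is an illustrative sketch, not code from the paper; the function names and toy data are hypothetical.

```python
# Minimal sketch (not the authors' code; names and toy data are hypothetical)
# of the per-clinical-sign evaluation metrics reported in the abstract.

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Metrics for one clinical sign, judged against the human majority opinion."""
    sensitivity = tp / (tp + fn)                # true positive rate (recall)
    specificity = tn / (tn + fp)                # true negative rate
    ppv = tp / (tp + fp)                        # positive predictive value (precision)
    npv = tn / (tn + fn)                        # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1,
            "balanced_accuracy": balanced_accuracy}

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two binary raters on the same records."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pos_a, pos_b = sum(a) / n, sum(b) / n
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    return (observed - expected) / (1 - expected)

# Toy example: two raters labeling 10 records, agreeing on 9 of them.
print(cohens_kappa([1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
                   [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]))  # ~0.78
```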

Language: English

Cited by: 0