Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology
Mauro Giuffrè, Kisung You, Zengchang Pang

et al.

npj Digital Medicine, Journal year: 2025, Number 8(1)

Published: May 3, 2025

Large language models (LLMs) generate plausible text responses to medical questions, but inaccurate responses pose significant risks in clinical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81-0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy through rejection sampling, by 8.36% overall. EVAL offers scalable potential to assess LLM responses in high-stakes medical settings.
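The reward-guided rejection sampling that EVAL applies can be illustrated with a minimal best-of-n sketch. The `toy_reward` heuristic below is an assumption for illustration only; in the paper the reward comes from a model trained on human-graded responses.

```python
# Minimal sketch of reward-guided rejection sampling (best-of-n).
# toy_reward is a hypothetical stand-in, not the paper's reward model.

def toy_reward(answer: str) -> float:
    # Illustrative heuristic: prefer answers mentioning key clinical terms.
    keywords = {"endoscopy", "resuscitation", "transfusion"}
    words = set(answer.lower().replace(".", "").split())
    return float(len(keywords & words))

def best_of_n(candidates, reward):
    """Keep the highest-reward candidate; reject (discard) the rest."""
    return max(candidates, key=reward)

candidates = [
    "Drink more water.",
    "Begin resuscitation and arrange early endoscopy.",
]
print(best_of_n(candidates, toy_reward))
```

In practice the candidate pool would be n samples drawn from the same LLM at non-zero temperature, with the reward model picking the answer to surface.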

Language: English

Large language models improve the identification of emergency department visits for symptomatic kidney stones
Cosmin A. Bejan, Amy E. McCart Reed, Matthew Mikula

et al.

Scientific Reports, Journal year: 2025, Number 15(1)

Published: Jan. 28, 2025

Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning approaches to analyze emergency department (ED) reports and determine if the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate disparities by the models with respect to race and gender. A clinical expert assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) and GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows the models to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively handle diversity.
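The macro-F1 metric reported above (e.g., 0.833 for GPT-4) averages per-class F1 scores without weighting by class frequency, so the minority class counts as much as the majority class. A from-scratch sketch, not the authors' code:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 for each class, then take the
    unweighted mean across classes."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy binary labels: visit due to symptomatic stone (1) or not (0).
print(round(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]), 3))  # 0.733
```

The 95% CIs in the study would typically come from bootstrap resampling over the test set, which this sketch omits.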

Language: English

Cited by

3

Generative AI Chatbots for Reliable Cancer Information: Evaluating web-search, multilingual, and reference capabilities of emerging large language models
Bradley D. Menz, Natansh D. Modi, Ahmad Y. Abuhelwa

et al.

European Journal of Cancer, Journal year: 2025, Number 218, pp. 115274

Published: Feb. 4, 2025

Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. 48% (162/336) included valid references, but 39 references were .com links, reflecting quality concerns. Responses frequently exceeded an eighth-grade reading level, and many outputs were overly complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs can safely support global communities and meet online health information standards.
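The eighth-grade readability threshold mentioned above is commonly checked with the Flesch-Kincaid grade level. A minimal sketch, assuming vowel-group counting as a rough syllable approximation (real readability tools use dictionary-based syllabification):

```python
import re

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level. Syllables are approximated by
    counting vowel groups per word, so scores are rough."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

print(fk_grade("The scan was clear. No cancer was found."))   # low grade
print(fk_grade("Retrieval-augmented generation dynamically "
               "incorporates authoritative documentation."))   # high grade
```

Scores above roughly 8 indicate text harder than an eighth-grade reading level, the cutoff often recommended for patient-facing health information.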

Language: English

Cited by

3

Bridging the gap: a practical step-by-step approach to warrant safe implementation of large language models in healthcare
Jessica D. Workum, Davy van de Sande, Diederik Gommers

et al.

Frontiers in Artificial Intelligence, Journal year: 2025, Number 8

Published: Jan. 27, 2025

Large Language Models (LLMs) offer considerable potential to enhance various aspects of healthcare, from aiding with administrative tasks to clinical decision support. However, despite the growing use of LLMs in healthcare, a critical gap persists in clear, actionable guidelines available to healthcare organizations and providers to ensure their responsible and safe implementation. In this paper, we propose a practical step-by-step approach to bridge this gap and support the safe implementation of LLMs into healthcare. The recommendations in this manuscript include protecting patient privacy, adapting models to healthcare-specific needs, adjusting hyperparameters appropriately, ensuring proper medical prompt engineering, distinguishing between clinical decision support (CDS) and non-CDS applications, systematically evaluating LLM outputs using a structured approach, and implementing a solid model governance structure. We furthermore propose the ACUTE mnemonic for assessing LLM responses based on Accuracy, Consistency, semantically Unaltered outputs, Traceability, and Ethical considerations. Together, these recommendations aim to provide a clear pathway for safe implementation of LLMs in clinical practice.
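The ACUTE mnemonic is essentially a per-response checklist. A minimal sketch of how such an assessment could be recorded; the pass/fail booleans and the `total()` helper are conventions of this sketch, not a scoring rubric defined by the authors:

```python
from dataclasses import dataclass, fields

@dataclass
class AcuteAssessment:
    """One reviewer's checklist for a single LLM response,
    following the five ACUTE criteria."""
    accuracy: bool      # factually correct content
    consistency: bool   # stable across repeated queries
    unaltered: bool     # semantically Unaltered outputs
    traceability: bool  # sources/reasoning can be traced
    ethical: bool       # no harmful or biased content

    def total(self) -> int:
        # Number of criteria the response passed (0-5).
        return sum(getattr(self, f.name) for f in fields(self))

review = AcuteAssessment(True, True, False, True, True)
print(review.total())  # 4
```

Structuring the checklist as data makes it easy to aggregate assessments across many responses when evaluating a model systematically.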

Language: English

Cited by

2

Prompt injection attacks on vision language models in oncology
Jan Clusmann, Dyke Ferber, Isabella C. Wiest

et al.

Nature Communications, Journal year: 2025, Number 16(1)

Published: Feb. 1, 2025

Language: English

Cited by

2

Evaluating Adherence to Canadian Radiology Guidelines for Incidental Hepatobiliary Findings Using RAG-Enabled LLMs

Nicholas Dietrich, Brett Stubbert

Canadian Association of Radiologists Journal, Journal year: 2025, Number unknown

Published: Feb. 27, 2025

Purpose: Large language models (LLMs) have the potential to support clinical decision-making but often lack training on the latest guidelines. Retrieval-augmented generation (RAG) may enhance guideline adherence by dynamically integrating external information. This study evaluates the performance of two LLMs, GPT-4o and o1-mini, with and without RAG, in adhering to Canadian radiology guidelines for incidental hepatobiliary findings. Methods: A customized RAG architecture was developed to integrate guideline-based recommendations into LLM prompts. Clinical cases were curated and used to prompt the models with and without RAG. Primary analyses assessed adherence rates, with comparisons made between LLMs. Secondary analyses evaluated reading ease, grade level, and response times of the generated outputs. Results: A total of 319 responses were evaluated. Adherence rates were 81.7% for GPT-4o without RAG versus 97.2% with RAG, and 79.3% for o1-mini without RAG versus 95.1% with RAG. Model adherence differed significantly across groups, with RAG-enabled configurations outperforming their non-RAG counterparts (P < .05). RAG-enabled models demonstrated improved reading ease and lower grade-level scores; however, all model outputs remained at advanced comprehension levels. Response times increased slightly due to the additional retrieval processing but remained clinically acceptable. Conclusions: These findings suggest RAG improves guideline adherence without substantially compromising readability or response times. This approach holds promise for advancing evidence-based care and warrants further validation in broader settings.
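The retrieve-then-prompt flow behind such a RAG setup can be sketched minimally. The lexical word-overlap retriever below is a toy assumption; production RAG pipelines (including, presumably, the one in this study) typically use dense embedding search over the guideline corpus:

```python
def retrieve(query: str, snippets, k: int = 1):
    """Toy lexical retriever: rank guideline snippets by word overlap
    with the query and return the top k."""
    q = set(query.lower().split())
    return sorted(snippets,
                  key=lambda s: len(q & set(s.lower().split())),
                  reverse=True)[:k]

def build_prompt(case: str, snippets) -> str:
    """Inject the retrieved guideline text into the LLM prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Guideline excerpts:\n{context}\n\n"
            f"Case: {case}\nRecommend management per the guidelines.")

# Hypothetical guideline snippets, not actual CAR recommendations.
guidelines = [
    "gallbladder polyp under 10 mm: follow-up ultrasound",
    "simple hepatic cyst: no follow-up required",
]
top = retrieve("incidental gallbladder polyp on CT", guidelines)
print(build_prompt("7 mm gallbladder polyp found incidentally", top))
```

Grounding the prompt in retrieved guideline text is what lets a model answer from current recommendations rather than from whatever was in its training data, which is the effect the adherence-rate comparison above measures.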

Language: English

Cited by

2

Use of large language models as clinical decision support tools for management of pancreatic adenocarcinoma using National Comprehensive Cancer Network guidelines
Kristen Kaiser, Alexa J. Hughes, Anthony D. Yang

et al.

Surgery, Journal year: 2025, Number unknown, pp. 109267

Published: March 1, 2025

Language: English

Cited by

2

Application of large language models in medicine
Fenglin Liu, Hongjian Zhou, 博司 熊谷

et al.

Nature Reviews Bioengineering, Journal year: 2025, Number unknown

Published: April 7, 2025

Language: English

Cited by

2

Accuracy and consistency of publicly available Large Language Models as clinical decision support tools for the management of colon cancer

Kristen N. Kaiser, Alexa J. Hughes, Anthony D. Yang

et al.

Journal of Surgical Oncology, Journal year: 2024, Number unknown

Published: Aug. 19, 2024

Large Language Models (LLMs; e.g., ChatGPT) may be used to assist clinicians and form the basis of future clinical decision support (CDS) tools for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care in simulated clinical scenarios and (2) define response variation between and within LLMs.

Language: English

Cited by

9

Large language models could make natural language again the universal interface of healthcare
Jakob Nikolas Kather, Dyke Ferber, Isabella C. Wiest

et al.

Nature Medicine, Journal year: 2024, Number 30(10), pp. 2708 - 2710

Published: Aug. 23, 2024

Language: English

Cited by

9

Use of Artificial Intelligence for Liver Diseases: A Survey from the EASL Congress 2024
Laura Žigutytė, Thomas Sorz, Jan Clusmann

et al.

JHEP Reports, Journal year: 2024, Number 6(12), pp. 101209

Published: Sep. 6, 2024

Language: English

Cited by

8