Assessing Large Language Models for Oncology Data Inference From Radiology Reports
L. Chen, Travis Zack,

Arda Demirci

et al.

JCO Clinical Cancer Informatics, Journal Year: 2024, Issue 8

Published: Dec. 1, 2024

We examined the effectiveness of proprietary and open large language models (LLMs) in detecting disease presence, location, and treatment response of pancreatic cancer from radiology reports.

Language: English

A future role for health applications of large language models depends on regulators enforcing safety standards
Oscar Freyer, Isabella C. Wiest, Jakob Nikolas Kather

et al.

The Lancet Digital Health, Journal Year: 2024, Issue 6(9), pp. e662 - e672

Published: Aug. 23, 2024

Amid the rapid integration of artificial intelligence into clinical settings, large language models (LLMs), such as Generative Pre-trained Transformer-4, have emerged as multifaceted tools with potential for health-care delivery, diagnosis, and patient care. However, the deployment of LLMs raises substantial regulatory and safety concerns. Due to their high output variability, poor inherent explainability, and the risk of so-called AI hallucinations, LLM-based applications that serve a medical purpose face challenges in gaining approval as medical devices under US and EU laws, including the recently passed Artificial Intelligence Act. Despite unaddressed risks to patients, including misdiagnosis and unverified advice, such applications are already available on the market. The regulatory ambiguity surrounding these tools creates an urgent need for frameworks that accommodate their unique capabilities and limitations. Alongside the development of such frameworks, existing regulations should be enforced. If regulators fear enforcing regulations in a market dominated by supply or technology companies, the consequences of harm to laypeople will force belated action, damaging the potential of LLM-based applications for medical advice.

Language: English

Cited by: 26

Large Language Models lack essential metacognition for reliable medical reasoning

Maxime Griot,

Coralie Hemptinne, Jean Vanderdonckt

et al.

Nature Communications, Journal Year: 2025, Issue 16(1)

Published: Jan. 14, 2025

Language: English

Cited by: 3

Large language models improve the identification of emergency department visits for symptomatic kidney stones
Cosmin A. Bejan, Amy E. McCart Reed,

Matthew Mikula

et al.

Scientific Reports, Journal Year: 2025, Issue 15(1)

Published: Jan. 28, 2025

Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance LLM performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate disparities in LLM performance with respect to race and gender. A medical expert assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841) were achieved by GPT-4, followed by GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows the models to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to handle demographic diversity effectively.
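For readers less familiar with the metric reported above, the sketch below shows one common way to compute a macro-F1 score with a bootstrap 95% confidence interval for a binary classifier of this kind. It assumes scikit-learn and NumPy; the label vectors are made-up placeholders, not the study's data or its exact evaluation protocol.

```python
# Minimal sketch: macro-F1 with a bootstrap 95% CI for a binary
# "symptomatic kidney stone" classifier. Placeholder labels only.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical gold labels and model predictions (1 = visit due to stones).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])

point_estimate = f1_score(y_true, y_pred, average="macro")

# Non-parametric bootstrap over reports to obtain a 95% CI.
boot_scores = []
n = len(y_true)
for _ in range(2000):
    idx = rng.integers(0, n, size=n)  # resample reports with replacement
    boot_scores.append(
        f1_score(y_true[idx], y_pred[idx], average="macro",
                 labels=[0, 1], zero_division=0)
    )
lo, hi = np.percentile(boot_scores, [2.5, 97.5])

print(f"macro-F1 = {point_estimate:.3f}, 95% CI {lo:.3f}-{hi:.3f}")
```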

Language: English

Cited by: 2

The Large Language Model ChatGPT-4 Demonstrates Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting with Various Causes of Knee Pain
Kyle N. Kunze, Nathan H. Varady, Michael Mazzucco

et al.

Arthroscopy: The Journal of Arthroscopic and Related Surgery, Journal Year: 2024, Issue unknown

Published: June 1, 2024

Language: English

Cited by: 14

Superhuman performance on urology board questions using an explainable language model enhanced with European Association of Urology guidelines

Martin J. Hetz,

Nicolas Carl,

Sarah Haggenmüller

et al.

ESMO Real World Data and Digital Oncology, Journal Year: 2024, Issue 6, pp. 100078 - 100078

Published: Oct. 4, 2024

Language: English

Cited by: 7

Fine-Tuning LLMs for Specialized Use Cases
D. M. Anisuzzaman, Jeffrey G. Malins, Paul A. Friedman

et al.

Mayo Clinic Proceedings: Digital Health, Journal Year: 2024, Issue 3(1), pp. 100184 - 100184

Published: Nov. 29, 2024

Large language models (LLMs) are a type of artificial intelligence that operates by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely detailed instructions. Products created using LLMs, such as ChatGPT from OpenAI and Claude from Anthropic, have gained a huge amount of traction and user engagement and have revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction. Fine-tuning is the process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these approaches by describing several specific examples of fine-tuning across medical subspecialties. Finally, we close with a consideration of the benefits and limitations associated with fine-tuning LLMs for specialized use cases, with an emphasis on concerns in the field of medicine.
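As a rough illustration of the general fine-tuning steps such a review describes (load a pretrained model, tokenize a custom corpus, continue training on it, save the adapted weights), here is a minimal sketch using the Hugging Face transformers library with a small GPT-2 checkpoint. The corpus snippets, hyperparameters, and output directory are placeholders, not the article's method.

```python
# Minimal sketch of supervised fine-tuning a small causal LLM on a custom
# text corpus. Corpus, epochs, and learning rate are illustrative only.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small base model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical domain-specific training snippets.
corpus = [
    "Impression: no acute intracranial abnormality.",
    "Findings: 4 mm non-obstructing stone in the left renal pelvis.",
]

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # a few passes over the custom data
    for text in corpus:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        # For causal LM fine-tuning, the labels are the input ids themselves.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-finetuned-demo")  # adapted weights
tokenizer.save_pretrained("gpt2-finetuned-demo")
```

In practice the same loop is usually delegated to a trainer utility and combined with parameter-efficient methods, but the core steps (pretrained weights, domain data, continued next-token training) are the ones sketched here.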

Language: English

Cited by: 7

Use of Artificial Intelligence for Liver Diseases: A Survey from the EASL Congress 2024
Laura Žigutytė, Thomas Sorz, Jan Clusmann

et al.

JHEP Reports, Journal Year: 2024, Issue 6(12), pp. 101209 - 101209

Published: Sep. 6, 2024

Language: English

Cited by: 4

Generative AI Chatbots for Reliable Cancer Information: Evaluating web-search, multilingual, and reference capabilities of emerging large language models
Bradley D. Menz, Natansh D. Modi, Ahmad Y. Abuhelwa

et al.

European Journal of Cancer, Journal Year: 2025, Issue 218, pp. 115274 - 115274

Published: Feb. 4, 2025

Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. Overall, 48 % (162/336) of responses included valid references, but 39 % of references were .com links, reflecting quality concerns. Responses frequently exceeded an eighth-grade reading level, and many outputs were overly complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs can safely support global cancer information needs and meet online health information standards.
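The eighth-grade readability threshold mentioned above is typically checked with a standard formula such as the Flesch-Kincaid grade. A minimal sketch, assuming the textstat package and a made-up sample response rather than the study's data:

```python
# Minimal sketch: flag chatbot answers that exceed an eighth-grade reading
# level using the Flesch-Kincaid grade from the textstat package.
import textstat

# Hypothetical model responses (placeholders, not study data).
responses = {
    "ChatGPT": (
        "Persistent unexplained weight loss, unusual bleeding, or a lump "
        "that does not go away should be discussed with a doctor promptly."
    ),
}

GRADE_LIMIT = 8  # common readability target for patient-facing material

for model_name, text in responses.items():
    grade = textstat.flesch_kincaid_grade(text)
    flag = "above target" if grade > GRADE_LIMIT else "ok"
    print(f"{model_name}: grade {grade:.1f} ({flag})")
```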

Language: English

Cited by: 0

Enhancing healthcare resource allocation through large language models
Fang Wan, Kezhi Wang, Tao Wang

et al.

Swarm and Evolutionary Computation, Journal Year: 2025, Issue 94, pp. 101859 - 101859

Published: Feb. 5, 2025

Language: English

Cited by: 0

Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study
Jessica D. Workum,

Bas W. S. Volkers,

Davy van de Sande

et al.

Critical Care, Journal Year: 2025, Issue 29(1)

Published: Feb. 10, 2025

Background: Large language models (LLMs) show increasing potential for use in healthcare, both for administrative support and for clinical decision making. However, reports on their performance in critical care medicine are lacking.

Methods: This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407, and Llama 3.1 70B) on 1181 multiple choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of questions at European Diploma in Intensive Care examination level. Their performance was compared to random guessing and to 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.

Results: GPT-4o achieved the highest accuracy at 93.3%, followed by Llama 3.1 70B (87.5%), Mistral Large 2407 (87.9%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%). Random guessing yielded 41.5% (p < 0.001). On the practice test, all models surpassed the physicians, scoring 89.0%, 80.9%, 84.4%, 80.3%, and 66.5%, respectively, compared with 42.7% for random guessing (p < 0.001) and 61.9% for the physicians. In contrast to the other models (p < 0.001), GPT-3.5-turbo did not significantly outperform the physicians (p = 0.196). Despite high overall accuracy, all models gave consistently incorrect answers to some questions. The most expensive model was GPT-4o, costing over 25 times more than the least expensive model, GPT-4o-mini.

Conclusions: LLMs exhibit exceptional performance on expert-level critical care questions, with four models outperforming physicians on a European-level practice exam. GPT-4o led in performance but raised concerns about cost and energy consumption. Even so, all models produced consistently incorrect answers, highlighting the need for thorough and ongoing evaluations to guide responsible implementation in clinical settings.
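As an illustration of the kinds of metrics named above (accuracy, answer consistency across repeated runs, and comparison against random guessing), here is a minimal sketch in Python. It assumes SciPy for the binomial test; the answer key and model answers are tiny made-up placeholders, not the benchmark's protocol or data.

```python
# Minimal sketch: scoring repeated MCQ runs for accuracy and answer
# consistency, plus a binomial test against a random-guessing baseline.
from statistics import mean
from scipy.stats import binomtest

# Correct option per question (5-option MCQs -> chance level p = 0.2).
answer_key = {"q1": "C", "q2": "A", "q3": "E", "q4": "B"}

# Hypothetical model answers over three repeated runs of the same exam.
runs = [
    {"q1": "C", "q2": "A", "q3": "E", "q4": "D"},
    {"q1": "C", "q2": "A", "q3": "E", "q4": "D"},
    {"q1": "C", "q2": "B", "q3": "E", "q4": "D"},
]

# Accuracy: mean fraction of correct answers per run.
accuracy = mean(
    sum(run[q] == a for q, a in answer_key.items()) / len(answer_key)
    for run in runs
)

# Consistency: fraction of questions answered identically in every run.
consistency = mean(len({run[q] for run in runs}) == 1 for q in answer_key)

# Does the last run beat guessing among 5 options (p = 0.2)?
correct = sum(runs[-1][q] == a for q, a in answer_key.items())
test = binomtest(correct, n=len(answer_key), p=0.2, alternative="greater")

print(f"accuracy={accuracy:.2f} consistency={consistency:.2f} p={test.pvalue:.3f}")
```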

Language: English

Cited by: 0