Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries
Christopher Y. K. Williams, Jaskaran Bains, Tianyu Tang et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 4, 2024

Abstract Importance Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. Objective To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and to evaluate the prevalence and type of errors across each section of the summary. Design Cross-sectional study. Setting University of California, San Francisco ED. Participants We identified all adult ED visits from 2012 to 2023 with an ED clinician note and randomly selected a sample of 100 visits for GPT-summarization. Exposure The potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary. Main Outcomes and Measures GPT-generated summaries were evaluated by independent Emergency Medicine physician reviewers across three evaluation criteria: 1) inaccuracy of GPT-summarized information; 2) hallucination of information; and 3) omission of relevant information. On identifying an error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors. Results From 202,059 eligible ED visits, we sampled 100 for GPT-generated summarization and subsequent expert-driven evaluation. In total, 33% of GPT-4-generated summaries and 10% of GPT-3.5-turbo-generated summaries were entirely error-free across all evaluated domains. Summaries were mostly accurate, with inaccuracies found in only a small minority of cases; however, 42% of summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies were most commonly found in the Plan sections of GPT-generated summaries, while omissions were concentrated in sections describing patients’ Physical Examination findings or History of Presenting Complaint. Conclusions and Relevance In this cross-sectional study of ED encounters, we found that LLMs could generate accurate discharge summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors is important to facilitate clinician review of such content and prevent patient harm.
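The per-section error analysis described in this abstract can be illustrated with a minimal sketch. The (section, error_type) annotation format below is a hypothetical simplification for illustration, not the study's actual review pipeline:

```python
from collections import Counter

# Hypothetical sketch: tallying reviewer-flagged errors per summary section.
# The (section, error_type) tuple format is an assumed simplification of the
# study's review process, for illustration only.

def error_prevalence(flags, n_summaries):
    """Frequency of each (section, error_type) flag, normalized by summary count."""
    counts = Counter(flags)
    return {key: round(n / n_summaries, 2) for key, n in counts.items()}

# Toy annotations over a hypothetical set of 100 reviewed summaries.
flags = [
    ("Plan", "inaccuracy"),
    ("Plan", "hallucination"),
    ("Physical Examination", "omission"),
    ("Plan", "inaccuracy"),
]
rates = error_prevalence(flags, n_summaries=100)
```

Aggregating flags this way makes it straightforward to see where errors concentrate (e.g. inaccuracies in the Plan section), which is the kind of localization the abstract argues is needed to focus clinician review.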

Language: English

A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports
Madhumita Sushil, Travis Zack, Divneet Mandair et al.

Journal of the American Medical Informatics Association, Journal Year: 2024, Volume and Issue: 31(10), P. 2315 - 2327

Published: June 20, 2024

Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
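A zero-shot setup of the kind this study evaluates can be sketched as prompt construction plus constrained label parsing. The prompt wording, label set, and helper names below are illustrative assumptions, not the authors' actual protocol or model interface:

```python
# Hypothetical sketch of zero-shot inference for a pathology-report task.
# The prompt template, label set, and parsing helper are illustrative
# assumptions, not the study's actual protocol.

def build_zero_shot_prompt(report, task, labels):
    """Compose a zero-shot prompt constraining the answer to a fixed label set."""
    options = ", ".join(labels)
    return (
        "You are extracting information from a breast cancer pathology report.\n"
        f"Task: {task}\n"
        f"Report:\n{report}\n"
        f"Answer with exactly one of: {options}."
    )

def parse_label(response, labels):
    """Map a free-text model response back onto the allowed label set."""
    lowered = response.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    return None  # unparseable responses would be routed to manual review

labels = ["positive", "negative", "not mentioned"]
prompt = build_zero_shot_prompt(
    "ER: strongly positive (90%). PR: negative.",
    "Determine the ER status.",
    labels,
)
```

The appeal of this setup is that no annotated training set is needed: the label schema lives entirely in the prompt, and only the parsing step needs to be adapted per task.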

Language: English

Citations

12

ChatGPT and assistive AI in structured radiology reporting: A systematic review
Ethan Sacoransky, Benjamin Y. M. Kwan, Donald Soboleski et al.

Current Problems in Diagnostic Radiology, Journal Year: 2024, Volume and Issue: 53(6), P. 728 - 737

Published: July 9, 2024

The rise of transformer-based large language models (LLMs), such as ChatGPT, has captured global attention with recent advancements in artificial intelligence (AI). ChatGPT demonstrates growing potential in structured radiology reporting, a field where AI has traditionally focused on image analysis. A comprehensive search of MEDLINE and Embase was conducted from inception through May 2024, and primary studies discussing ChatGPT's role in structured radiology reporting were selected based on their content. Of the 268 articles screened, eight were ultimately included in this review. These studies explored various applications of ChatGPT, including generating structured reports from unstructured reports, extracting data from free text, generating impressions from findings, and creating structured reports from imaging data. All studies demonstrated optimism regarding ChatGPT's potential to aid radiologists, though common critiques included data privacy concerns, reliability, medical errors, and a lack of medical-specific training. ChatGPT and assistive AI have significant potential to transform radiology reporting, enhancing accuracy and standardization while optimizing healthcare resources. Future developments may involve integrating dynamic few-shot prompting and Retrieval Augmented Generation (RAG) into diagnostic workflows. Continued research, development, and ethical oversight are crucial to fully realize AI's potential in radiology.
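One of the applications surveyed, slotting free-text content into a fixed report template, can be sketched as simple post-processing of a dictation or an LLM draft. The section headers and placeholder text below are assumptions for illustration, not a reporting standard:

```python
import re

# Illustrative sketch: slotting free-text radiology content under fixed
# section headers. TEMPLATE_SECTIONS and the placeholder are assumptions.

TEMPLATE_SECTIONS = ["INDICATION", "TECHNIQUE", "FINDINGS", "IMPRESSION"]

def structure_report(free_text):
    """Assign text following each recognized header to that section; sections
    never mentioned keep an explicit placeholder so omissions stay visible."""
    report = {s: "Not provided." for s in TEMPLATE_SECTIONS}
    pattern = re.compile(rf"({'|'.join(TEMPLATE_SECTIONS)}):", re.IGNORECASE)
    parts = pattern.split(free_text)
    # re.split keeps captured headers: [preamble, header1, body1, header2, body2, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        report[header.upper()] = body.strip()
    return report
```

Making missing sections explicit rather than silently dropping them mirrors the standardization benefit the review highlights: a fixed template makes omissions immediately visible to the reading radiologist.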

Language: English

Citations

12

OpenAI’s GPT-4o in surgical oncology: Revolutionary advances in generative artificial intelligence
Ning Zhu, Nan Zhang, Qipeng Shao et al.

European Journal of Cancer, Journal Year: 2024, Volume and Issue: 206, P. 114132 - 114132

Published: May 26, 2024

Language: English

Citations

10

Human‐in‐the‐loop machine learning for healthcare: Current progress and future opportunities in electronic health records
Han Yuan, Lican Kang, Yong Li et al.

Medicine Advances, Journal Year: 2024, Volume and Issue: 2(3), P. 318 - 322

Published: Aug. 23, 2024

This article performs a literature search to determine the current progress, identify research gaps, and highlight future opportunities of human-in-the-loop (HITL) approaches across the machine learning lifecycle, including data preparation, feature engineering, model development, and deployment. Machine learning (ML), particularly deep learning, has emerged as a fundamental analytical tool for various medical tasks in electronic health records (EHRs) [1]. However, purely data-driven methods do not serve as a panacea for all encountered problems, such as data annotation. To address these issues, HITL ML has increasingly gained prominence; it leverages human expertise to improve ML-based analyses [2]. In this commentary, we perform a literature search to identify current progress (Figure 1) and highlight future opportunities of HITL across the ML pipeline for electronic health records.

The first phase in which HITL enhances ML is data preparation, which includes the extraction, preprocessing, and annotation of large-scale raw EHRs into formats operable for downstream modeling [3]. Across the three data preparation steps, annotation has been the focal point of the latest research, because the traditional paradigm indiscriminately annotates all available samples by default, which places an unnecessary burden on experts in time-urgent settings. Bull et al. [4] designed an interactive platform that enables clinicians to verify or correct labels predicted by ML. Evaluated on two databases, the developed platform quickly generated accurate models with reduced annotation needs. Similar strategies have been implemented for detecting unauthorized access in data extraction [5] and deidentifying free text [6] in preprocessing. Given the powerful zero-shot inference ability of foundation models [7], future studies may use them, such as GPT-4, to replace homemade models for computer-aided annotations [8]. Moreover, existing studies predominantly focus on annotation; hence, there remains a vast unexplored blue ocean of HITL integration, such as noise filtering and missing value imputation [9].

Building on well-prepared datasets, subsequent applications of HITL-ML span feature engineering and model development. Feature engineering without HITL relies on either fully automated or manual methods, which demand large amounts of computation resources or expert involvement. The incorporation of HITL enabled the generation of novel features of comparable quality at speeds up to 20 times faster than the original methods [10]. In classic ML, feature engineering was long deemed essential, preceding model development in numerous contexts given its demonstrated efficacy in enhancing performance. In recent years, a notable shift has occurred toward end-to-end learning, gradually rendering feature engineering less pivotal. An exemplification of this trend can be observed in artificial neural networks, where shallow layers undertake the task of feature engineering, thereby enabling automatic and seamless optimization during model development. Within this paradigm, HITL improves both architecture design and parameter optimization. Sheng et al. [11] invited doctors to modify the structure of causal relationships in a knowledge graph distilled from EHRs, demonstrating elevated model accuracy as well as interpretability. Rather than adjusting models post-training, Tari et al. [12] applied HITL during training by adding human preferences to target gold-standard labels. [11, 12] augmented model development through HITL; however, the aspect of fairness has not been sufficiently emphasized. Future researchers could introduce post-hoc recalibrations following [11], or alternatively, embed a fairness objective into the training process like [12] to mitigate potential inequities [13]. Furthermore, present HITL research centers on EHR classification and regression; future endeavors could venture into HITL support for privacy-preserving synthetic and pseudo data generation [14].

Once model development is complete, the final phase of the ML lifecycle is deployment, which encompasses continuous monitoring and updating of trained models. HITL has been incorporated to ensure accuracy, interpretability, and compatibility with temporal and spatial shifts. Doctors are engaged to double-check intervention suggestions from models, such as positive infection cases [15] and medication doses [16]. Instead of seeking clinician approval for all decisions, Zheng et al. [17] proposed that models should be able to distinguish difficult cases from simple ones, so that each can be solved by humans and machines respectively. In addition to ensuring accuracy through [15-17], Elshawi et al. [18] and Yuan et al. [19] used clinician-labeled concepts to interpret model behaviors and clarified their advantages over explanations derived solely from raw features. Research on HITL deployment could broaden to consider privacy. Resource efficiency should also not be neglected when models are executed on mobile devices with limited computing capability [20]. Even with cloud infrastructure, for time- and privacy-sensitive applications, efficient models that run on edge devices bring low-latency benefits. Finally, most previous deployments were simulated using retrospective data, which prompts future work to resolve performance deterioration in prospective clinical landscapes [21].

We have shown that HITL refines and catalyzes ML advancements, yielding clinical tools that are interpretable. Despite this elucidation, the full potential of HITL has yet to be harnessed. Figure 2 presents a schematic overview of existing gaps and opportunities across the ML lifecycle. The synergistic interaction among healthcare professionals, engineers, and high-performance computers is poised to fulfill the promise of human-computer collaboration: enhancing efficiency and robustness while fostering impartiality and privacy preservation in healthcare systems.
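The idea of routing difficult cases to clinicians while letting the model handle simple ones, as proposed in [17], can be sketched as confidence-based triage. The classifier stub and the 0.9 threshold below are illustrative assumptions, not the cited system:

```python
# Minimal sketch of confidence-based human-in-the-loop triage: confident model
# predictions are auto-accepted, uncertain ones queue for clinician review.
# predict_with_confidence() and the threshold are illustrative assumptions.

def predict_with_confidence(note):
    """Stand-in for a trained EHR classifier returning (label, confidence)."""
    if "ambiguous" in note:  # toy rule marking hard cases
        return ("positive", 0.55)
    return ("negative", 0.97)

def triage(notes, threshold=0.9):
    """Split notes into auto-accepted and clinician-review queues."""
    auto, review_queue = [], []
    for note in notes:
        label, conf = predict_with_confidence(note)
        target = auto if conf >= threshold else review_queue
        target.append((note, label, conf))
    return auto, review_queue

auto, queue = triage(["routine follow-up note", "ambiguous infection note"])
```

The threshold is the key design knob: raising it sends more cases to clinicians (safer but costlier), while lowering it automates more of the workload, which is exactly the human-machine division of labor this commentary discusses.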

Language: English

Citations

10
