Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries
Christopher Y. K. Williams, Jaskaran Bains, Tianyu Tang

et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 4, 2024

Abstract. Importance: Large language models (LLMs) possess a range of capabilities that may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. Objective: To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and to evaluate the prevalence and type of errors across each section of the summary. Design: Cross-sectional study. Setting: University of California, San Francisco ED. Participants: We identified all adult ED visits from 2012 to 2023 with an ED clinician note and randomly selected a sample of 100 for GPT-summarization. Exposure: The potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary. Main Outcomes and Measures: GPT-4-generated summaries were evaluated by independent Emergency Medicine physician reviewers across three evaluation criteria: 1) inaccuracy of GPT-summarized information; 2) hallucination of information; 3) omission of relevant information. On identifying an error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors. Results: From 202,059 eligible ED visits, we sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of GPT-4-generated summaries and 10% of those generated by GPT-3.5-turbo were entirely error-free across all domains. Summaries were mostly accurate, with inaccuracies found in only a small proportion of cases; however, 42% of summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while omissions were concentrated in sections describing patients' Physical Examination findings or History of Presenting Complaint. Conclusions and Relevance: In this cross-sectional study of ED encounters, we found that LLMs could generate accurate discharge summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of these errors is important to facilitate clinical review of such content and prevent patient harm.

Language: English

Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications
Rajesh Bhayana

Radiology, Journal Year: 2024, Volume and Issue: 310(1)

Published: Jan. 1, 2024

Although chatbots have existed for decades, the emergence of transformer-based large language models (LLMs) has captivated the world through the most recent wave of artificial intelligence chatbots, including ChatGPT. Transformers are a type of neural network architecture that enables better contextual understanding and efficient training on massive amounts of unlabeled data, such as unstructured text from the internet. As LLMs have increased in size, their improved performance and emergent abilities have revolutionized natural language processing. Since language is integral to human thought, applications based on LLMs have transformative potential in many industries. In fact, LLM-based chatbots have demonstrated human-level performance on professional benchmarks, including in radiology. LLMs offer numerous clinical and research applications in radiology, several of which have been explored in the literature with encouraging results. Multimodal LLMs can simultaneously interpret images and generate reports, closely mimicking current diagnostic pathways. Thus, from requisition to report, LLMs have the opportunity to positively impact nearly every step of the radiology journey. Yet, these impressive capabilities are not without limitations. This article reviews these limitations and mitigation strategies, as well as current and potential uses of LLMs, including multimodal models. Also reviewed are existing applications that can enhance efficiency in supervised settings.

Language: English

Citations: 105

Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy
Roman Johannes Gertz, Thomas Dratsch, Alexander C. Bunck

et al.

Radiology, Journal Year: 2024, Volume and Issue: 311(1)

Published: April 1, 2024

Background: Errors in radiology reports may occur because of resident-to-attending discrepancies, speech recognition inaccuracies, and large workload. Large language models, such as GPT-4 (ChatGPT; OpenAI), may assist in generating reports. Purpose: To assess the effectiveness of GPT-4 in identifying common errors in radiology reports, focusing on performance, time, and cost-efficiency. Materials and Methods: In this retrospective study, 200 radiology reports (radiography and cross-sectional imaging [CT and MRI]) were compiled between June 2023 and December 2023 at one institution. A total of 150 errors from five error categories (omission, insertion, spelling, side confusion, and other) were intentionally inserted into 100 of the reports, which were used as the reference standard. Six radiologists (two senior radiologists, two attending physicians, two residents) and GPT-4 were tasked with detecting these errors. Overall error detection performance, detection by error category, and reading time were assessed using Wald χ2 tests and paired-sample t tests. Results: GPT-4 (detection rate, 82.7%; 124 of 150; 95% CI: 75.8, 87.9) matched the average detection performance of radiologists independent of their experience (senior radiologists, 89.3% [134 of 150; 95% CI: 83.4, 93.3]; attending physicians, 80.0% [120 of 150; 95% CI: 72.9, 85.6]; residents; P value range, .522–.99). One radiologist outperformed GPT-4 (detection rate, 94.7%; 142 of 150; 95% CI: 89.8, 97.3; P = .006). GPT-4 required less processing time per report than the fastest human reader in the study (mean, 3.5 seconds ± 0.5 [SD] vs 25.1 seconds ± 20.1, respectively; P < .001; Cohen d = −1.08). The use of GPT-4 resulted in a lower mean correction cost per report than that of the most cost-efficient radiologist ($0.03 ± 0.01 vs $0.42 ± 0.41; P < .001; Cohen d = −1.12). Conclusion: GPT-4's error detection rate was comparable with that of radiologists, potentially reducing work hours and cost. © RSNA, 2024. See also the editorial by Forman in this issue.
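The interval estimates quoted above (for example, a detection rate of 124 of 150 with 95% CI: 75.8, 87.9) are consistent with the Wilson score interval for a binomial proportion. A minimal sketch, assuming the Wilson method was used (the function below is our illustration, not code from the paper):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> 95% CI)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# GPT-4's detection rate from the study: 124 of 150 inserted errors found
lo, hi = wilson_ci(124, 150)
print(f"{124/150:.1%} (95% CI: {lo:.1%}, {hi:.1%})")  # 82.7% (95% CI: 75.8%, 87.9%)
```

Reproducing the paper's stated bounds from its raw counts is a quick sanity check on which interval method a study used.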

Language: English

Citations: 44

Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications
Khadijeh Moulaei, Atiye Yadegari, Mahdi Baharestani

et al.

International Journal of Medical Informatics, Journal Year: 2024, Volume and Issue: 188, P. 105474 - 105474

Published: May 8, 2024

Language: English

Citations: 44

Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department
Christopher Y. K. Williams, Travis Zack, Brenda Y. Miao

et al.

JAMA Network Open, Journal Year: 2024, Volume and Issue: 7(5), P. e248895 - e248895

Published: May 7, 2024

The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of acuity, a measure of a patient's illness severity and the level of required medical attention, is one of the foundational elements of clinical reasoning in emergency medicine.

Language: English

Citations: 38

The Temperature Feature of ChatGPT: Modifying Creativity for Clinical Research
Joshua Davis, Liesbet Van Bulck, Brigitte N. Durieux

et al.

JMIR Human Factors, Journal Year: 2024, Volume and Issue: 11, P. e53559 - e53559

Published: Jan. 24, 2024

More clinicians and researchers are exploring uses for large language model chatbots, such as ChatGPT, for research, dissemination, and educational purposes. It therefore becomes increasingly relevant to consider the full potential of this tool, including special features that are currently available through the application programming interface. One of these features is a variable called temperature, which changes the degree of randomness involved in the model's generated output. This is of particular interest to researchers. By lowering this variable, one can generate more consistent outputs; by increasing it, one can receive more creative responses. For researchers who use these tools for a variety of tasks, the ability to tailor outputs to be less random may be beneficial for work that demands consistency. Additionally, access to more creative text generation can enable scientific authors to describe their research in more general terms and potentially connect with a broader public, for example on social media. In this viewpoint, we present the temperature feature, discuss its potential uses, and provide some examples.
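Mechanically, temperature rescales the model's next-token logits before they are normalized into sampling probabilities. The sketch below is our illustration of that effect (the logit values are made up, not from the article): a low temperature sharpens the distribution toward the top token, a high one flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before normalizing:
    # T < 1 sharpens the distribution (more deterministic output),
    # T > 1 flattens it (more varied, "creative" output).
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
cold = softmax_with_temperature(logits, 0.2)  # top token dominates
hot = softmax_with_temperature(logits, 2.0)   # probability mass spreads out
```

In the OpenAI API this corresponds to the `temperature` request parameter, which accepts values from 0 to 2 with a default of 1.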

Language: English

Citations: 35

Large Language Models: A Guide for Radiologists
Sunkyu Kim, Choong‐kun Lee, Seung‐seob Kim

et al.

Korean Journal of Radiology, Journal Year: 2024, Volume and Issue: 25(2), P. 126 - 126

Published: Jan. 1, 2024

Large language models (LLMs) have revolutionized the global technology landscape beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are being addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs, through a succinct overview of the topic and a summary of radiology-specific aspects, from the basics to potential future directions.

Language: English

Citations: 29

Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models
Shaohong Wu, Wenjuan Tong, Ming‐De Li

et al.

Radiology, Journal Year: 2024, Volume and Issue: 310(3)

Published: March 1, 2024

Background: Large language models (LLMs) hold substantial promise for medical imaging interpretation. However, there is a lack of studies on their feasibility in handling reasoning questions associated with US diagnosis. Purpose: To investigate the viability of leveraging three publicly available LLMs to enhance the consistency and diagnostic accuracy of thyroid nodule US diagnosis based on standardized reporting, with pathology as the reference standard. Materials and Methods: US images of thyroid nodules with pathologic results were retrospectively collected from a tertiary referral hospital between July 2022 and December 2022 and used to evaluate malignancy diagnoses generated by three LLMs: OpenAI's ChatGPT 3.5 and 4.0 and Google's Bard. Inter- and intra-LLM agreement of diagnosis were evaluated. Then, diagnostic performance, including accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), was evaluated and compared across interactive approaches: a human reader combined with LLMs, an image-to-text model combined with LLMs, and an end-to-end convolutional neural network model. Results: A total of 1161 US images (498 benign, 663 malignant nodules) from 725 patients (mean age, 42.2 years ± 14.1 [SD]; 516 women) were analyzed. ChatGPT 4.0 and Bard displayed substantial to almost perfect agreement (κ range, 0.65–0.86 [95% CI: 0.64, 0.86]), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36–0.68 [95% CI: 0.36, 0.68]). ChatGPT 4.0 had an accuracy of 78%–86% (95% CI: 76%, 88%) and a sensitivity of 86%–95% (95% CI: 83%, 96%), while Bard had an accuracy of 74%–86% and a sensitivity of 74%–91%. Moreover, the image-to-text–LLM strategy exhibited an AUC (0.83 [95% CI: 0.80, 0.85]) and accuracy (84% [95% CI: 82%, 86%]) comparable to those of the human–LLM interaction strategy with two senior readers and one junior reader, and exceeding those of the junior reader alone. Conclusion: LLMs, particularly when integrated into interactive approaches, show potential for enhancing consistency and accuracy in thyroid US imaging, although performance was not optimal with ChatGPT 3.5. © RSNA, 2024
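The agreement figures above are κ coefficients, which correct raw percent agreement for the agreement expected by chance. A minimal sketch of Cohen's κ for two raters making binary benign/malignant calls (the labels below are made up for illustration, not study data):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' labels on the same cases."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(c1[l] * c2[l] for l in set(r1) | set(r2)) / n**2
    return (observed - expected) / (1 - expected)

# hypothetical malignancy calls from two LLM runs (1 = malignant, 0 = benign)
run_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
run_b = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0]
kappa = cohens_kappa(run_a, run_b)  # 0.583: "moderate" agreement
```

The same statistic is used both between models (inter-LLM) and between repeated runs of one model (intra-LLM); κ above about 0.8 is conventionally read as almost perfect agreement.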

Language: English

Citations: 25

Data Extraction from Free-Text Reports on Mechanical Thrombectomy in Acute Ischemic Stroke Using ChatGPT: A Retrospective Analysis
Nils Christian Lehnen, Franziska Dorn, Isabella C. Wiest

et al.

Radiology, Journal Year: 2024, Volume and Issue: 311(1)

Published: April 1, 2024

Background: Procedural details of mechanical thrombectomy in patients with ischemic stroke are important predictors of clinical outcome and are collected for prospective studies or national registries. To date, these data are collected manually by human readers, a labor-intensive task that is prone to errors. Purpose: To evaluate the use of the large language models (LLMs) GPT-4 and GPT-3.5 to extract procedural data from neuroradiology reports on mechanical thrombectomy in patients with ischemic stroke. Materials and Methods: This retrospective study included consecutive patients who underwent mechanical thrombectomy between November 2022 and September 2023 at institution 1 and between 2016 and December 2019 at institution 2. A set of 20 reports was used to optimize the prompt, and the ability of the LLMs to extract procedural details was compared using the McNemar test. Data extracted by an interventional neuroradiologist served as the reference standard. Results: A total of 100 internal reports (mean age, 74.7 years ± 13.2 [SD]; 53 female patients) and 30 external reports (mean age, 72.7 years ± 13.5; 18 male patients) were included. All reports were successfully processed by GPT-4 and GPT-3.5. Of 2800 data entries, GPT-4 extracted 2631 (94.0% [95% CI: 93.0, 94.8]; range per category, 61%–100%) data points correctly without the need for further postprocessing. With 1788 correct data entries, GPT-3.5 produced fewer correct entries than did GPT-4 (63.9% [95% CI: 62.0, 65.6]; range per category, 14%–99%; P < .001). For the external reports, GPT-4 extracted 760 of 840 (90.5% [95% CI: 88.3, 92.4]) entries correctly, while GPT-3.5 extracted 539 (64.2% [95% CI: 60.8, 67.4]; P < .001). Conclusion: Compared with GPT-3.5, GPT-4 more frequently extracted correct procedural data from free-text reports on mechanical thrombectomy performed in patients with ischemic stroke. © RSNA, 2024. Supplemental material is available for this article.
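The McNemar test used above compares two paired classifiers through the cases on which they disagree. A sketch of the continuity-corrected statistic follows; the paper reports only per-model totals, so the discordant counts below are made-up placeholders, not the study's data:

```python
def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square statistic.

    b = entries only classifier A got right, c = entries only B got right.
    Compare against chi-square with 1 df (critical value 3.84 at alpha = .05).
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# hypothetical discordant pairs: entries only GPT-4 extracted correctly (b)
# vs entries only GPT-3.5 extracted correctly (c)
stat = mcnemar_statistic(b=900, c=60)
significant = stat > 3.84  # True for these counts
```

Because the test uses only the discordant cells, the many entries both models extract correctly contribute nothing, which makes it well suited to comparing two models run on the same reports.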

Language: English

Citations: 23

ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives

Pedram Keshavarz, Sara Bagherieh, Seyed Ali Nabipoorashrafi

et al.

Diagnostic and Interventional Imaging, Journal Year: 2024, Volume and Issue: 105(7-8), P. 251 - 265

Published: April 27, 2024

The purpose of this study was to systematically review the reported performance of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications.

Language: English

Citations: 23

Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports
Bastien Le Guellec, Alexandre Lefèvre, Charlotte Geay

et al.

Radiology Artificial Intelligence, Journal Year: 2024, Volume and Issue: 6(4)

Published: May 8, 2024

Purpose: To assess the performance of a local open-source large language model (LLM) in various information extraction tasks from real-life emergency brain MRI reports. Materials and Methods: All consecutive emergency brain MRI reports written in 2022 at a French quaternary center were retrospectively reviewed. Two radiologists identified MRI scans that were performed in the emergency department for headaches. Four radiologists scored the reports' conclusions as either normal or abnormal. Abnormalities were labeled as either headache-causing or incidental. Vicuna (LMSYS Org), an open-source LLM, performed the same tasks. Vicuna's performance metrics were evaluated using the radiologists' consensus as the reference standard. Results: Among the 2398 reports written during the study period, 595 were included, with headaches as the indication (median age of patients, 35 years [IQR, 26–51 years]; 68% [403 of 595] women). A positive finding was reported in 227 of 595 (38%) cases, 136 of which could explain the headache. The LLM had a sensitivity of 98.0% (95% CI: 96.5, 99.0) and specificity of 99.3% (95% CI: 98.8, 99.7) for detecting the presence of headache in the clinical context, a sensitivity of 99.4% (95% CI: 98.3, 99.9) and specificity of 98.6% (95% CI: 92.2, 100.0) for the use of contrast medium injection, a sensitivity of 96.0% (95% CI: 92.5, 98.2) and specificity of 98.9% for the categorization of conclusions as normal or abnormal, and a sensitivity of 88.2% (95% CI: 81.6, 93.1) and specificity of 73% (95% CI: 62, 81) for causal inference between findings and headache. Conclusion: An open-source LLM was able to extract information from free-text radiology reports with excellent accuracy without requiring further training.
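The sensitivity and specificity figures above come straight from the confusion matrix between the LLM's labels and the radiologists' consensus. A minimal sketch (the counts are made up for illustration, not the study's data):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical counts: LLM labels vs radiologist consensus on "abnormal"
# tp/fn count truly abnormal reports, tn/fp count truly normal ones
sens, spec = sensitivity_specificity(tp=96, fn=4, tn=89, fp=1)
# sens = 0.96 (96% of abnormal reports flagged)
# spec ~ 0.989 (almost no normal reports mislabeled)
```

Keeping both numbers matters for extraction tasks: a model can reach high sensitivity simply by over-calling a label, which specificity exposes.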

Language: English

Citations: 19