From Revisions to Insights: Converting Radiology Report Revisions into Actionable Educational Feedback Using Generative AI Models
Shawn Lyo, Suyash Mohan, Alvand Hassankhani

et al.

Deleted Journal, Journal year: 2024, Issue: unknown

Published: Aug. 19, 2024

Expert feedback on trainees' preliminary reports is crucial for radiologic training, but real-time feedback can be challenging due to non-contemporaneous, remote reading and increasing imaging volumes. Trainee report revisions contain valuable educational feedback, but synthesizing it from raw revision data is challenging. Generative AI models can potentially analyze these revisions and provide structured, actionable feedback. This study used the OpenAI GPT-4 Turbo API on paired synthesized open-source analogs of preliminary and finalized reports to identify discrepancies, categorize their severity and type, and suggest review topics. Radiologists reviewed the output by grading discrepancy detection, evaluating category accuracy, and rating suggested topic relevance. The reproducibility of discrepancy detection at maximal sensitivity was also examined. The model exhibited high sensitivity, detecting significantly more discrepancies than radiologists (W = 19.0, p < 0.001), with a strong positive correlation (r = 0.778, p < 0.001). Interrater reliability for severity and type categorization was fair (Fleiss' kappa 0.346 and 0.340, respectively; weighted kappa 0.622 for severity). The LLM achieved an F1 score of 0.66 for severity and 0.64 for type. Generated teaching points were considered relevant in ~85% of cases, and relevance correlated with severity (Spearman ρ = 0.76). Reproducibility was moderate to good for the number of detected discrepancies (ICC(2,1) = 0.690) and substantial otherwise (0.718; 0.94). Generative AI models can effectively generate actionable educational feedback from report revisions, offering promise for enhancing radiology training.
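The workflow this abstract describes, prompting an LLM to compare a preliminary and a finalized report and return structured discrepancies, can be sketched as below. This is a hypothetical illustration, not the study's code: the prompt wording, field names (`severity`, `type`, `teaching_point`), and the canned model reply are all assumptions, and no API call is made here.

```python
import json

# Hypothetical system instruction asking for structured discrepancy output.
SYSTEM_PROMPT = (
    "You are a radiology education assistant. Compare the preliminary and "
    "final reports. Return a JSON list of discrepancies, each with fields "
    "'description', 'severity' (minor/moderate/major), 'type', and "
    "'teaching_point'."
)

def build_comparison_prompt(preliminary: str, final: str) -> str:
    """Assemble the user message containing both report versions."""
    return (
        f"PRELIMINARY REPORT:\n{preliminary}\n\n"
        f"FINAL REPORT:\n{final}\n\n"
        "List every substantive discrepancy as JSON."
    )

def parse_discrepancies(model_output: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start, end = model_output.find("["), model_output.rfind("]") + 1
    return json.loads(model_output[start:end])

# Example with a canned model reply (no network access involved):
reply = (
    '[{"description": "Missed rib fracture", "severity": "major", '
    '"type": "missed finding", "teaching_point": "Review rib series."}]'
)
findings = parse_discrepancies(reply)
print(findings[0]["severity"])  # major
```

In practice the two prompt strings would be sent to a chat-completion endpoint and the reply passed through `parse_discrepancies`.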

Language: English

Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy
Roman Johannes Gertz, Thomas Dratsch, Alexander C. Bunck

et al.

Radiology, Journal year: 2024, Issue 311(1)

Published: April 1, 2024

Background Errors in radiology reports may occur because of resident-to-attending discrepancies, speech recognition inaccuracies, and large workload. Large language models, such as GPT-4 (ChatGPT; OpenAI), may assist in generating reports. Purpose To assess the effectiveness of GPT-4 in identifying common errors in radiology reports, focusing on performance, time, and cost-efficiency. Materials and Methods In this retrospective study, 200 reports (radiography and cross-sectional imaging [CT and MRI]) were compiled between June 2023 and December 2023 at one institution. There were 150 errors from five error categories (omission, insertion, spelling, side confusion, other) intentionally inserted into 100 of the reports, which were used as the reference standard. Six radiologists (two senior radiologists, two attending physicians, two residents) were tasked with detecting these errors. Overall detection performance, detection by category, and reading time were assessed using Wald χ2 tests and paired-sample t tests. Results GPT-4 (detection rate, 82.7%; 124 of 150; 95% CI: 75.8, 87.9) matched the average performance of radiologists independent of their experience (senior radiologists, 89.3% [134 of 150; 95% CI: 83.4, 93.3]; attending physicians, 80.0% [120 of 150; 95% CI: 72.9, 85.6]; residents; P value range, .522–.99). One senior radiologist outperformed GPT-4 (detection rate, 94.7%; 142 of 150; 95% CI: 89.8, 97.3; P = .006). GPT-4 required less processing time per report than the fastest human reader in the study (mean, 3.5 seconds ± 0.5 [SD] vs 25.1 seconds ± 20.1, respectively; P < .001; Cohen d = −1.08). The use of GPT-4 resulted in a lower mean correction cost per report than the most cost-efficient radiologist ($0.03 ± 0.01 vs $0.42 ± 0.41; Cohen d = −1.12). Conclusion The error detection rate of GPT-4 was comparable with that of radiologists, potentially reducing work hours and cost. © RSNA, 2024 See also the editorial by Forman in this issue.
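The reference standard above was built by deliberately inserting errors from five categories into reports. A toy sketch of one such insertion, fabricating a "side confusion" error by swapping laterality, is shown below; the helper name and the approach are illustrative assumptions, not the study's actual protocol code.

```python
import re

# The five inserted error categories named in the abstract.
ERROR_CATEGORIES = ["omission", "insertion", "spelling", "side confusion", "other"]

def insert_side_confusion(report: str) -> str:
    """Swap the first 'left'/'right' occurrence to fabricate a laterality error."""
    def flip(m: re.Match) -> str:
        return "right" if m.group(0).lower() == "left" else "left"
    return re.sub(r"\b(left|right)\b", flip, report, count=1, flags=re.IGNORECASE)

original = "Opacity in the left lower lobe."
print(insert_side_confusion(original))  # Opacity in the right lower lobe.
```

A corrupted report produced this way can then be paired with its original to score a reader's (or a model's) detection rate.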

Language: English

Cited: 44

Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination
Satheesh Krishna, Nishaant Bhambra, Robert R. Bleakney

et al.

Radiology, Journal year: 2024, Issue 311(2)

Published: May 1, 2024

Background ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and confident language when it is incorrect may limit its utility. Purpose To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination. Materials and Methods In this exploratory prospective study, 150 multiple-choice questions, previously used to benchmark ChatGPT, were administered to default versions (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1–10 (with 10 being the highest level and 1 the lowest) on the third attempt and after each challenge prompt. Results Neither version showed a difference in accuracy across attempts: for the first, second, and third attempts, GPT-3.5 scored 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); GPT-4 scored 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both versions had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After challenge prompts, both changed their responses for most questions, though GPT-3.5 did so more frequently than GPT-4 (97.3% [146 of 150] vs 71.3% [107 of 150]; P < .001). Both rated "high confidence" (≥8 on the 1–10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; GPT-4, 77% [27 of 35]; P = .89). Conclusion Default versions of GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both showed poor repeatability and robustness and were overconfident, with answers easily influenced by an adversarial prompt. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Ballard in this issue.
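The robustness protocol described above (answer once, then issue the same adversarial challenge three times and record whether the answer changes) can be sketched with stand-in responders. The two toy "models" below are hypothetical stubs standing in for an API-backed model; only the adversarial prompt text comes from the abstract.

```python
# The fixed adversarial challenge used in the study.
ADVERSARIAL_PROMPT = "Your answer choice is incorrect. Please choose a different option."

def stubborn_model(question: str, history: list) -> str:
    """Stand-in responder that keeps its first answer (a robust model)."""
    return "B"

def compliant_model(question: str, history: list) -> str:
    """Stand-in responder that flips its answer after any challenge."""
    return "C" if history else "B"

def robustness_trial(model, question: str, n_challenges: int = 3) -> dict:
    """Record the initial answer and whether any challenge changed it."""
    history: list = []
    initial = model(question, history)
    changed = False
    for _ in range(n_challenges):
        history.append(ADVERSARIAL_PROMPT)
        if model(question, history) != initial:
            changed = True
    return {"initial": initial, "changed": changed}

print(robustness_trial(stubborn_model, "Q1")["changed"])   # False
print(robustness_trial(compliant_model, "Q1")["changed"])  # True
```

Aggregating `changed` over the 150 questions yields the per-model response-change rates reported in the abstract.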

Language: English

Cited: 28

Deep Learning in Breast Cancer Imaging: State of the Art and Recent Advancements in Early 2024
Alessandro Carriero, Léon Groenhoff, Elizaveta Vologina

et al.

Diagnostics, Journal year: 2024, Issue 14(8), pp. 848 - 848

Published: April 19, 2024

The rapid advancement of artificial intelligence (AI) has significantly impacted various aspects of healthcare, particularly in the medical imaging field. This review focuses on recent developments in the application of deep learning (DL) techniques to breast cancer imaging. DL models, a subset of AI algorithms inspired by human brain architecture, have demonstrated remarkable success in analyzing complex medical images, enhancing diagnostic precision, and streamlining workflows. DL models have been applied to breast cancer diagnosis via mammography, ultrasonography, and magnetic resonance imaging. Furthermore, DL-based radiomic approaches may play a role in risk assessment, prognosis prediction, and therapeutic response monitoring. Nevertheless, several challenges have limited widespread adoption in clinical practice, emphasizing the importance of rigorous validation, interpretability, and technical considerations when implementing DL solutions. By examining fundamental concepts and synthesizing the latest advancements and trends, this narrative review aims to provide valuable and up-to-date insights for radiologists seeking to harness the power of DL in breast cancer care.

Language: English

Cited: 27

Natural language processing for chest X‐ray reports in the transformer era: BERT‐like encoders for comprehension and GPT‐like decoders for generation
Han Yuan

iRadiology, Journal year: 2025, Issue: unknown

Published: Jan. 6, 2025

We conducted a comprehensive literature search in PubMed to illustrate the current landscape of transformer-based tools from the perspective of the transformer's two integral components: the encoder, exemplified by BERT, and the decoder, characterized by GPT. We also discussed adoption barriers and potential solutions in terms of computational burdens, interpretability concerns, ethical issues, hallucination problems, and malpractice and legal liabilities. We hope that this commentary will serve as a foundational introduction for radiologists seeking to explore the evolving technical landscape of chest X-ray report analysis in the transformer era. Natural language processing (NLP) has gained widespread use in computer-assisted chest X-ray (CXR) analysis, particularly since the renaissance of deep learning (DL) in the 2012 ImageNet challenge. While early endeavors predominantly employed recurrent neural networks (RNN) and convolutional neural networks (CNN) [1], a revolution was brought by the transformer [2], and its success can be attributed to three key factors [3]. First, the self-attention mechanism enables simultaneous processing of multiple parts of an input sequence, offering significantly greater efficiency compared with earlier models such as RNN [4]. Second, the architecture exhibits exceptional scalability, supporting models with over 100 billion parameters that capture intricate linguistic relationships in human language [5]. Third, the availability of a vast internet-based corpus and advances in computational power have made pre-training and fine-tuning of large-scale models feasible [6]. The development of transformers enabled the resolution of previously intractable problems and achieves expert-level performance across a broad range of CXR analytical tasks, such as named entity recognition, question answering, and extractive summarization [7]. In this commentary (Figure 1), we discuss the current landscape, adoption barriers, and potential solutions, covering comprehension with BERT-like encoders and generation with GPT-like decoders. As our primary focus is NLP, the classification criterion was based on text modules, and we excluded research purely focusing on vision transformers (ViT). The literature pipeline identified relevant articles published between June 12, 2017, when the transformer model was first introduced, and October 4, 2024.
We followed previous systematic reviews [3, 8, 9] to design four groups of keywords: (1) "transformer"; (2) "clinical notes", "clinical reports", "clinical narratives", "clinical text", "medical text"; (3) "natural language processing", "text mining", "information extraction"; (4) "radiography", "chest film", "chest radiograph", "radiograph", "X-rays". As a means of communication between radiologists and referring physicians, radiology reports contain high-density information on patients' conditions [10]. Much like physicians interpreting reports, the first step of NLP is understanding report content, and an important application is explicitly converting it into a format suitable for subsequent tasks. One notable model is BERT [11], which stands for bidirectional encoder representations from transformers. In contrast to predecessors that rely on large amounts of expert annotations for supervised training [12], BERT undergoes self-supervised training on unlabeled datasets to understand linguistic patterns and is subsequently fine-tuned on a small annotated set for the target task [12, 13], yielding superior performance in tasks such as classification [14], named entity recognition [15], question answering [16], and semantics optimization [17]. In the context of healthcare, Olthof et al. [18] built and evaluated BERT models on tasks of varying complexities, disease prevalence, and sample sizes, demonstrating that BERT statistically outperformed conventional DL models such as CNN in area under the curve and F1-score, with t-test p-values less than 0.05. Beyond general-domain models, adapting BERT to domain-specific corpora can further enhance effectiveness in various tasks. Yan et al. [19] adapted four BERT-like encoders using millions of radiology reports to tackle three tasks: identifying sentences that describe abnormal findings, assigning diagnostic codes, and extracting key sentences to summarize reports. Their results demonstrated that domain adaptation yielded significant improvements in accuracy and ROUGE metrics across all tasks. Most BERT-relevant studies target sentence-, paragraph-, or report-level predictions, while encoders are also well-suited to word-level pattern recognition. Chambon et al. [20] leveraged a biomedical-specific BERT [21] to estimate the probability of individual tokens containing protected health information, and replaced identified sensitive tokens with synthetic surrogates to ensure privacy preservation.
Similarly, Weng et al. [22] developed a system utilizing ALBERT [23], a lite BERT with reduced parameters, to identify keywords unrelated to target diseases, thereby reducing false-positive alarms and outperforming regular expression-, syntactic grammar-, and DL-based baselines. BERT-derived labels have also been applied to develop classifiers targeting other modalities [13]. Nowak et al. [24] systematically explored the utility of BERT-generated silver labels linked to corresponding radiographs for training image classifiers. Compared with models trained exclusively on radiologist-annotated gold labels, integrating silver labels led to improved discriminability; silver labels proved effective in settings with limited gold labels, and even more so in cases with abundant labels. Zhang et al. [25] introduced a novel approach toward more generalizable classifiers: rather than relying on predefined categories, they first used BERT to extract entities and relationships, second constructed a knowledge graph from these extractions, and third refined the graph with domain expertise. Unlike traditional multiclass classifiers, the resulting model not only categorized each report but also revealed interpretable categories, such as those linking anatomical regions and signs. In addition to deriving labels, BERT's advanced capabilities enabled an unprecedented innovation: direct supervision of pixel-level segmentation of medical images [26]. Li et al. [26] proposed a text-augmented lesion segmentation paradigm that integrated BERT-based textual features to compensate for deficiencies in radiograph quality and refine pseudo-labels in semi-supervision. These studies highlight the strength of BERT in comprehending healthcare-related text, from annotation systems to multi-modality applications beyond text. Meanwhile, researchers have reported failures on complex clinical tasks. Sushil et al. [27] found that BERT implementations for clinical inference achieved a test accuracy of 0.778; adaptations on medical textbooks reached 0.833 but still fell short of human experts. Potential limitations lie in BERT's relatively modest parameter size and its reliance on corpora, such as books, Wikipedia, and selected databases, that are inadequate for clinical reasoning [28]. Consequently, its ability to learn complex inference remains constrained. These shortcomings are being alleviated by GPT-like decoders, which incorporate hundreds of billions of parameters trained on internet-scale corpora [29].
Following the advent of encoders, generative pre-trained transformers (GPT) [30] represent the next groundbreaking leap, breaking barriers by enabling non-experts to perform tasks through freely conversational interaction without any coding. CvT2DistilGPT2 [31], a prominent report generator of the era, utilizes a ViT encoder and a GPT-2 decoder. Its experiments indicated that combining visual encoders with GPT decoders surpassed earlier encoder–decoder architectures in specific generation applications, and state-of-the-art methods now integrate decoders. TranSQ [32] is another such framework. It emulates the reasoning process of radiologists when generating reports: formulating hypothesis embeddings that represent implicit intentions, querying visual features extracted from the image, synthesizing semantic features via cross-modality fusion, and transforming candidate sentences with DistilGPT [33]. It attained BLEU-4 scores of 0.205 and 0.409; in comparison, the best-performing baseline among 17 retrieval- and generation-based methods scored 0.188 and 0.383, highlighting the capability of unified multi-modality modeling. Though decoders have dominated the general domain, the long short-term memory (LSTM) family [34] retains good performance in report generation, partially because of the highly templated characteristics of radiology reports [32]. Kaur and Mittal [35] combined classical architectures, using CNN feature extraction and an LSTM token decoder, with modules to generate numerical inputs prior to decoding and to shortlist disease-relevant terms afterward. Their solution attained scores of 0.767 and 0.897, suggesting such approaches remain viable backbones in certain scenarios. Quantitative evaluation comparing model outputs against ground truth should be supplemented by human evaluation: Boag et al. [36] studied automated generation metrics and found divergence from clinical accuracy, and a discrepancy between accuracy and readability has also been reported [37]. Accordingly, we emphasize the involvement of radiologists in rating correctness and readability. In the preceding sections, we reviewed applications of encoders and decoders. Although remarkable and well-established, these tools face problems: some require integration of specialized expertise [31, 38], while others necessitate further resolution. The computational demands of the transformer era are substantial. For example, the large version of BERT contains 334 million parameters and GPT-3 contains 175 billion; in contrast, support vector machines [39] and random forests [40] require only a few hundred thousand parameters. As a result, many healthcare providers cannot afford the costs of tailoring such models from scratch. To address this, we offer several recommendations.
For development, we suggest leveraging open-access models as building blocks for fine-tuning; considering model scales, we recommend parameter-efficient fine-tuning, a technique that updates a subset of a model's parameters while leaving the majority of weights unchanged [41]. An exemplificative study by Taylor et al. [42] empirically validated such techniques within the clinical domain. We also advocate prompt engineering techniques, such as retrieval-augmented generation, which craft informative and instructive prompts to guide decoders' output without changing model weights [43]. Ranjit et al. [44] proposed a method to retrieve the most relevant contextual prompts, producing concise and accurate reports while retaining critical entities. Last but not least, obtaining approval from ethics committees to share anonymous data can facilitate collaboration with external partners, helping alleviate resource burdens. Interpretability is critical in healthcare, where decisions directly impact lives. Transformers are often regarded as black-box models: unlike simple models that are easy to render explainable, modern architectures with many layers and neurons are difficult to dissect and visualize, though doing so provides insights into their functionality [45-48]. Explaining model behavior remains a challenge due to the complexity associated with the exponential scaling of neuron numbers [49]. Even though internal activations are challenging to interpret, preliminary work analyzing the influence of inputs on outputs has shown a high degree of alignment with human assessments [50, 51]. A further advantage of decoders lies in their flexibility to align with instructions, which allows users to obtain expected outputs and request explanations of those outputs, fostering enhanced usability [52, 53]. We refer readers seeking an overview and detailed insights to surveys [54-56]. Ethical considerations are paramount for transformers, given their power and the nuanced nature of clinical datasets. Two concerns are pressing: private data and a representative population. To protect patient privacy, data should be anonymized during development and deployment stages so that sensitive information is neither learned [57] nor inadvertently disclosed under certain prompts [58]. Dataset representativeness is another issue, as underrepresentation of minority groups can exacerbate disparities and perpetuate inequities [59]. To mitigate this risk, developers should prioritize inclusivity in data collection, and maintainers should continuously monitor for equitable outcomes [60]. Fourth, transformers produce coherent responses to diverse user queries, solving a wide range of tasks [61], but their predictions derive from internet text rather than well-defined radiological logic [62]. Therefore, they continue to suffer from hallucinations, a phenomenon in which output appears plausible but is factually incorrect, nonsensical, or unfaithful to users' input [63].
Current mitigation efforts broadly span training and post-training stages. During training, strategies include in-house fine-tuning and reinforcement learning guided by radiologists' feedback [62, 64]. Post-training strategies encompass hallucination detection, external knowledge, multi-agent collaboration, and radiologist-in-the-loop frameworks [62, 65]. Due to space constraints, we encourage readers to refer to [62, 66-68] for detailed strategies. Lastly, even after such refinements, these models may present risks, potentially leading to errors and legal liabilities [69]. Errors arise from several sources: inaccurate model output, clinician nonadherence to correct recommendations, and poor workflows [70]. Determining responsibility for adverse events is an issue involving multiple stakeholders, including software developers, maintenance teams, radiology departments, and radiologists [71]. The European Commission's focus on the safety and liability implications of artificial intelligence applies medical device laws and demonstrates that liability generally falls under civil product law, which typically pertains to developers. However, the law stops short of a strict definitive framework; given the inherent ambiguity of algorithms, questions surrounding liability will likely be addressed by courts through case law. Under existing frameworks, radiologists should follow the standard of care, treating these tools as supplementary and confirmatory aids rather than substitutes for clinical practice, which is beneficial for all stakeholders. Additionally, radiology departments that implement such tools should involve radiologists throughout the entire life cycle [72] and prepare in-depth training programs to familiarize staff with models that differ from routine statistical tests and resist full interpretation as black boxes [73]. Moreover, managing expectations is important: both unrealistic optimism, as seen in claims of replacement of human expertise, and undue pessimism, as in perceiving no utility, should be avoided [74-77]. Han Yuan: Conceptualization; data curation; formal analysis; investigation; project administration; validation; visualization; writing—original draft; writing—review and editing. None. The author declares he has no conflicts of interest. The study was exempt from review by an ethics committee as it does not involve human participants, animal subjects, or primary data collection. Not applicable. Data sharing does not apply as no datasets were generated or analyzed.
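The retrieval-augmented prompting recommendation in the commentary above can be sketched in miniature: retrieve the prior report sentences most similar to the new findings and prepend them as style exemplars before asking a decoder to draft text. Everything here is an illustrative assumption, with a toy three-sentence corpus and a deliberately simple token-overlap score standing in for a real retriever and embedding model.

```python
from collections import Counter

# Toy exemplar corpus standing in for a database of prior report sentences.
CORPUS = [
    "Heart size is normal. Lungs are clear.",
    "Right lower lobe opacity concerning for pneumonia.",
    "Small left pleural effusion, stable from prior.",
]

def token_overlap(a: str, b: str) -> int:
    """Count shared lowercase tokens between two sentences."""
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Return the k corpus sentences with the highest token overlap."""
    return sorted(corpus, key=lambda s: token_overlap(query, s), reverse=True)[:k]

def build_rag_prompt(findings: str) -> str:
    """Prepend retrieved exemplars to the generation instruction."""
    context = "\n".join(retrieve(findings, CORPUS, k=2))
    return (f"Reference sentences:\n{context}\n\n"
            f"Findings: {findings}\nDraft an impression in the same style.")

print(retrieve("left pleural effusion", CORPUS)[0])
# Small left pleural effusion, stable from prior.
```

A production system would swap the overlap score for dense embeddings, but the prompt-assembly step stays the same.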

Language: English

Cited: 4

Use of AI in Cardiac CT and MRI: A Scientific Statement from the ESCR, EuSoMII, NASCI, SCCT, SCMR, SIIM, and RSNA
Domenico Mastrodicasa, Marly van Assen, Merel Huisman

et al.

Radiology, Journal year: 2025, Issue 314(1)

Published: Jan. 1, 2025

Artificial intelligence (AI) offers promising solutions for many steps of the cardiac imaging workflow, from patient and test selection through image acquisition, reconstruction, and interpretation, extending to prognostication and reporting. Despite the development of many AI algorithms, these tools are at various stages of maturity and face challenges for clinical implementation. This scientific statement, endorsed by several societies in the field, provides an overview of the current landscape of AI applications in cardiac CT and MRI. Each section is organized into questions and statements that address key topics, including ethical, legal, and environmental sustainability considerations. A technology readiness level range of 1 to 9 summarizes the maturity of each application and reflects its progression from preliminary research toward clinical use. This document aims to bridge the gap between burgeoning AI developments and their limited clinical adoption.

Language: English

Cited: 4

Leveraging Large Language Models in Radiology Research: A Comprehensive User Guide
Joshua D. Brown, Leon Lenchik, Fayhaa Doja

et al.

Academic Radiology, Journal year: 2025, Issue: unknown

Published: Jan. 1, 2025

Language: English

Cited: 2

Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment
Cody Savage, Adway Kanhere, Vishwa S. Parekh

et al.

Radiology, Journal year: 2025, Issue 314(1)

Published: Jan. 1, 2025

Open-source large language models and multimodal foundation models offer several practical advantages for clinical and research objectives in radiology over their proprietary counterparts but require further validation before widespread adoption.

Language: English

Cited: 2

Optimizing Large Language Models in Radiology and Mitigating Pitfalls: Prompt Engineering and Fine-tuning
T. Kim, Michael Makutonin, Reza Sirous

et al.

Radiographics, Journal year: 2025, Issue 45(4)

Published: March 6, 2025

Large language models (LLMs) such as generative pretrained transformers (GPTs) have had a major impact on society, and there is increasing interest in using these models for applications in medicine and radiology. This article presents techniques to optimize LLMs and describes their known challenges and limitations. Specifically, the authors explore how best to craft natural language prompts, a process known as prompt engineering, to elicit more accurate and desirable responses. The authors also explain how fine-tuning is conducted, in which a general model, such as GPT-4, is further trained on a specific use case, such as summarizing clinical notes, to improve reliability and relevance. Despite the enormous potential of these models, substantial challenges limit their widespread implementation. These tools differ substantially from traditional health technology in their complexity and their probabilistic and nondeterministic nature, and these differences lead to issues such as "hallucinations," biases, lack of reliability, and security risks. Therefore, the authors provide radiologists with baseline knowledge of the technology underpinning these models and an understanding of how to use them, in addition to exploring best practices in prompt engineering and fine-tuning. Also discussed are current proof-of-concept use cases of LLMs in the radiology literature, such as clinical decision support and report generation, and the limitations preventing their current adoption. ©RSNA, 2025 See the invited commentary by Chung and Mongan in this issue.
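The prompt-engineering practice this article describes can be illustrated with a few-shot prompt: worked examples are placed before the new case so the model imitates their format. The task, example findings, and wording below are illustrative assumptions, not content from the article.

```python
# Hypothetical worked examples (findings -> follow-up recommendation pairs).
FEW_SHOT_EXAMPLES = [
    ("CT chest: 8 mm right upper lobe nodule.",
     "Follow-up CT in 3 months per Fleischner guidelines."),
    ("Radiograph: no acute cardiopulmonary abnormality.",
     "No follow-up imaging required."),
]

def few_shot_prompt(new_findings: str) -> str:
    """Assemble instruction + worked examples + the new case."""
    parts = ["Suggest follow-up for each findings section.\n"]
    for findings, followup in FEW_SHOT_EXAMPLES:
        parts.append(f"Findings: {findings}\nFollow-up: {followup}\n")
    # End with an open slot so the model completes the pattern.
    parts.append(f"Findings: {new_findings}\nFollow-up:")
    return "\n".join(parts)

prompt = few_shot_prompt("MRI brain: stable 5 mm meningioma.")
print(prompt.endswith("Follow-up:"))  # True
```

Ending the prompt on the open `Follow-up:` slot is the key design choice: the model's most likely continuation is a recommendation in the same style as the examples.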

Language: English

Cited: 2

Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer
Rajesh Bhayana, Bipin Nanda, Taher Dehkharghanian

et al.

Radiology, Journal year: 2024, Issue 311(3)

Published: June 1, 2024

Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable and resectability criteria are applied inconsistently. Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore their performance in categorizing tumor resectability. Materials and Methods In this institutional review board–approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology–designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18–29, 2023) were prompted to create synoptic reports with the same features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)–generated reports to determine resectability, and accuracy and review time were compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression were used where appropriate. Results GPT-4 outperformed GPT-3.5 in synoptic report creation (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all extracted features and had higher precision for extracting superior mesenteric artery involvement (100% vs 88.8%). For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which in turn outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, P = .03), while spending less time per report (58%; 95% CI: 0.53, 0.62). Conclusion GPT-4 created near-perfect PDAC synoptic reports, and chain-of-thought prompting achieved high accuracy in categorizing resectability, making surgeons more accurate and efficient. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Chang in this issue.
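The three prompting strategies compared above (default knowledge, in-context knowledge, chain-of-thought) differ only in how the prompt is assembled around the same report, which can be sketched as below. The criteria text is a placeholder and the wording is a hypothetical illustration, not the study's actual prompts or NCCN language.

```python
# Placeholder standing in for the resectability criteria text supplied
# in-context; the real study used NCCN criteria.
CRITERIA_PLACEHOLDER = "<resectability criteria text would go here>"

def make_prompt(report: str, strategy: str = "default") -> str:
    """Build one of three prompt variants around the same report."""
    base = (f"Report:\n{report}\n\n"
            "Categorize resectability (resectable, borderline, unresectable).")
    if strategy == "default":
        # Rely solely on the model's internal knowledge.
        return base
    if strategy == "in_context":
        # Supply the criteria verbatim inside the prompt.
        return f"Criteria:\n{CRITERIA_PLACEHOLDER}\n\n{base}"
    if strategy == "chain_of_thought":
        # Criteria plus an instruction to reason stepwise before answering.
        return (f"Criteria:\n{CRITERIA_PLACEHOLDER}\n\n{base}\n"
                "First assess each vessel's involvement step by step, "
                "then state the final category.")
    raise ValueError(f"unknown strategy: {strategy}")

print("step by step" in make_prompt("SMA encasement >180 degrees.", "chain_of_thought"))  # True
```

The abstract's accuracy ordering (chain-of-thought > in-context > default) corresponds to adding first the criteria and then the stepwise-reasoning instruction.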

Language: English

Cited: 15

ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology
Daisuke Horiuchi, Hiroyuki Tatekawa, Tatsushi Oura

et al.

European Radiology, Journal year: 2024, Issue 35(1), pp. 506 - 516

Published: July 12, 2024

To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in musculoskeletal radiology.

Language: English

Cited: 14