Cited by Prostate-MRI reporting should be done with the aid of AI systems: Pros

Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases

Severin Schramm, Silas Preis, Marie-Christin Metz et al.

Radiology, Journal Year: 2025, Volume and Issue: 314(1)

Published: Jan. 1, 2025

Textual descriptions of radiologic image findings play a critical role in GPT-4 with vision–based differential diagnosis, underlining the importance of radiologist experts even with multimodal large language models.

Language: English

Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study

Su Hwan Kim, Jonas Wihl, Severin Schramm et al.

European Radiology, Journal Year: 2025, Volume and Issue: unknown

Published: March 7, 2025

This study investigated the impact of human-large language model (LLM) collaboration on the accuracy and efficiency of brain MRI differential diagnosis. In this retrospective study, forty brain MRI cases with a challenging but definitive diagnosis were randomized into two groups of twenty cases each. Six radiology residents with an average experience of 6.3 months in reading brain MRI exams evaluated one set of cases supported by conventional internet search (Conventional) and the other set utilizing an LLM-based search engine and hybrid chatbot. A cross-over design ensured that each case was examined with both workflows in equal frequency. For each case, readers were instructed to determine the three most likely differential diagnoses. LLM responses were analyzed by a panel of radiologists. Benefits and challenges in human-LLM interaction were derived from observations and participant feedback. LLM-assisted brain MRI differential diagnosis yielded superior accuracy (70/114; 61.4% (LLM-assisted) vs 53/114; 46.5% (conventional) correct diagnoses, p = 0.033, chi-square test). No difference in interpretation time or level of confidence was observed. An analysis of LLM responses revealed that correct LLM suggestions translated into correct reader responses in 82.1% of cases (60/73). Inaccurate descriptions (9.2% of cases), LLM hallucinations (11.5% of cases), and insufficient contextualization of LLM responses were identified as challenges related to human-LLM interaction. Human-LLM collaboration has the potential to improve brain MRI differential diagnosis. Yet, several challenges must be addressed to ensure effective adoption and user acceptance. Question: While large language models have the potential to support radiological differential diagnosis, the role of human-LLM collaboration in this context remains underexplored. Findings: LLM-assisted brain MRI differential diagnosis yielded superior accuracy over conventional internet search. Inaccurate descriptions, LLM hallucinations, and insufficient contextualization were identified as potential challenges. Clinical relevance: Our results highlight the potential of an LLM-assisted workflow to increase diagnostic accuracy but underline the necessity to study collaborative efforts between humans and LLMs over LLMs in isolation.
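The accuracy comparison reported in this abstract (70/114 vs 53/114 correct diagnoses, chi-square test) can be re-derived from the counts alone. The sketch below is a minimal illustration, not the authors' analysis code: the function name `chi_square_2x2` is hypothetical, and the use of Yates' continuity correction is an assumption, since the abstract only says "chi-square test". With the correction the p-value comes out at roughly 0.03, in line with the reported 0.033; without it the p-value is somewhat smaller.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test (1 degree of freedom) for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). Yates' continuity correction is applied
    by default; whether the study used it is an assumption.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * diff ** 2 / denom
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: LLM-assisted, conventional. Columns: correct, incorrect.
stat, p = chi_square_2x2(70, 114 - 70, 53, 114 - 53)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

Calling `chi_square_2x2(70, 44, 53, 61, yates=False)` gives the uncorrected statistic for comparison.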

Language: English

Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports

Su Hwan Kim, Severin Schramm, Lisa C. Adams et al.

npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1)

Published: Feb. 12, 2025

Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap to proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for challenging, real-world cases.

Language: English

An assessment of ChatGPT in error detection for thyroid ultrasound reports: A comparative study with ultrasound physicians

Zhirong Xu, Jia-yi Ye, Weiwen Luo et al.

Digital Health, Journal Year: 2025, Volume and Issue: 11

Published: Feb. 1, 2025

Background: This study evaluates the performance of GPT-4o in detecting errors in ACR TIRADS ultrasound reports and its potential to reduce report generation time. Methods: A retrospective analysis of 200 thyroid ultrasound reports from the Second Affiliated Hospital of Fujian Medical University was conducted, with reports categorized as correct or containing up to three errors. GPT-4o's performance was compared with that of physicians of varying experience levels in error detection and processing time. Results: GPT-4o detected 90.0% (180/200) of errors, slightly less than the best-performing senior physician's 93.0% (186/200), with no significant difference (p = 0.281). Its error detection rate was comparable to that of the physicians overall (p = 0.098 to 0.866). It outperformed Resident 2 for diagnostic errors (87% vs. 69%). Reader agreement was low (Cohen's kappa = 0 to 0.31). GPT-4o reviewed reports significantly faster than all physicians (0.79 vs. 1.8 to 3.1 h, p < 0.001), making it a reliable and efficient tool for medical imaging. Conclusions: GPT-4o is comparable to experienced physicians in error detection and improves efficiency, offering a valuable tool for enhancing accuracy and aiding junior residents.

Language: English

On-table monitoring of prostate MRI could enable tailored utilisation of gadolinium contrast

Tom Syer, Bruno Carmo, Nimalam Sanmugalingam et al.

European Radiology, Journal Year: 2025, Volume and Issue: unknown

Published: March 15, 2025

Objectives: To compare the impact of on-table monitoring vs standard-of-care multiparametric MRI (mpMRI) on the utilisation of gadolinium contrast in prostate MRI. Materials and methods: This retrospective observational study of prospectively acquired data was conducted at a single institution over an 18-month period. A cohort of patients undergoing MRI for suspected prostate cancer (PCa) underwent on-table monitoring, where their T2 and DWI images were reviewed by a supervising radiologist during the scan to decide whether to acquire dynamic contrast-enhanced (DCE) sequences. MRI scans were reported using PI-RADS v2.1, and patients were followed up with biopsy for at least 12 months. The rate of gadolinium administration, biopsy rates, and diagnostic accuracy were compared to those of a control group undergoing mpMRI during the same period using propensity score matching. Estimates of cost savings were also calculated. Results: 1410 patients were identified; after propensity score matching, 598 patients were analysed, with 178 undergoing on-table monitoring. Of these, 75.8% (135/178) did not receive gadolinium. Contrast was used mainly for indeterminate lesions (27/43) and significant artefacts on bpMRI (14/43). When comparing the monitored and non-monitored groups, there was a comparable number of biopsies performed (52.2% vs 49.5%, p = 0.54), 3/5 scoring rates (10.1% vs 7.4%, p = 0.27), sensitivity (98.3% vs 99.2%, p = 0.56), and specificity (63.9% vs 70.7%, p = 0.18) for detection of clinically-significant PCa. When acquired, DCE was deemed helpful in 67.4% (29/43) of cases and improved both PI-QUALv2 and reader confidence scores. There was an estimated saving of £56,677 over the study. Conclusion: On-table monitoring significantly reduced the need for gadolinium contrast without compromising diagnostic accuracy and biopsy rates. Key Points. Question: Default use of gadolinium contrast in prostate MRI is not always of clinical benefit and has associated side effects and healthcare costs. Findings: On-table monitoring avoided the use of gadolinium in 75.8% of patients, reducing associated costs whilst maintaining clinically significant cancer detection and improving diagnostic confidence. Clinical relevance: On-table monitoring offers personalised patient protocolling with a significant reduction in the use of gadolinium and its associated side effects and costs, potentially maximising the advantages of both multiparametric and biparametric prostate MRI.

Language: English

Exploring whether ChatGPT-4 with image analysis capabilities can diagnose osteosarcoma from X-ray images

Yi Ren, Yusheng Guo, Qingliu He et al.

Experimental Hematology and Oncology, Journal Year: 2024, Volume and Issue: 13(1)

Published: July 27, 2024

The generation of radiological results from image data represents a pivotal aspect of medical image analysis. The latest iteration of ChatGPT-4, a large multimodal model that integrates both text and image inputs, including dermatoscopy images, histology images, and X-ray images, has attracted considerable attention in the field of radiology. To further investigate the performance of ChatGPT-4 in medical image recognition, we examined the ability of ChatGPT-4 to recognize credible osteosarcoma X-ray images. The results demonstrated that ChatGPT-4 can more accurately diagnose bone with or without significant space-occupying lesions but has a limited ability to differentiate between malignant lesions in bone compared to adjacent normal tissue. Thus far, the current capabilities of ChatGPT-4 are insufficient to make a reliable imaging diagnosis of osteosarcoma. Therefore, users should be aware of the limitations of this technology.

Language: English

Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases

Turay Cesur, Yasin Celal Güneş, Eren Çamur et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 25, 2024

Purpose: This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 Large Language Models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated. Materials and Methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. The LLMs and Radiologist-III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Score) were analyzed using chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests. Results: The unassisted diagnostic accuracy was 72.5% for the cardiac radiologist, 53.8% for General Radiologist-I, and 51.3% for General Radiologist-II. With ChatGPT-4o, the accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for General Radiologists-I and II were statistically significant (P≤0.006). All radiologists' DDx Scores improved significantly with ChatGPT-4o assistance (P≤0.05). Remarkably, Radiologist-I's GPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the Cardiac Radiologist's unassisted performance (P>0.05). Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by […] (70.0%). Regarding the DDx Score, […] outperformed all models (P<0.05). The accuracy of Radiologist-III improved from 48.8% to 63.8% with GPT-4o assistance (P<0.001). Conclusion: ChatGPT-4o may enhance the diagnostic performance of general radiologists for cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.

Language: English

Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation

Marc Huppertz, Robert Siepmann, David Topp et al.

European Radiology, Journal Year: 2024, Volume and Issue: unknown

Published: Oct. 18, 2024

Objectives: ChatGPT-4 Vision (GPT-4V) is a state-of-the-art multimodal large language model (LLM) that may be queried using images. We aimed to evaluate the tool's diagnostic performance when autonomously assessing clinical imaging studies. Materials and methods: A total of 206 imaging studies (i.e., radiography (n = 60), CT (n = 60), MRI (n = 60), and angiography (n = 26)) with unequivocal findings and established reference diagnoses from the radiologic practice of a university hospital were accessed. Readings were performed uncontextualized, with only the image provided, and contextualized, with additional clinical and demographic information. Responses were assessed along multiple diagnostic dimensions and analyzed using appropriate statistical tests. Results: With its pronounced propensity to favor context over image information, the tool's diagnostic accuracy improved from 8.3% (uncontextualized) to 29.1% (contextualized, first diagnosis correct) and 63.6% (contextualized, correct diagnosis among differential diagnoses) (p ≤ 0.001, Cochran's Q test). Diagnostic accuracy declined by up to 30% when 20 images were re-read after 30 and 90 days and seemed unrelated to the tool's self-reported confidence (Spearman's ρ = 0.117 (p = 0.776)). While the described imaging findings matched the suggested diagnoses in 92.7%, indicating valid diagnostic reasoning, the tool fabricated 258 imaging findings in 412 responses and misidentified imaging modalities or anatomic regions in 65 images. Conclusion: GPT-4V, in its current form, cannot reliably interpret radiologic images. Its tendency to disregard the image, fabricate findings, and misidentify details, especially without context, may misguide healthcare providers and put patients at risk. Key Points. Question: Can Generative Pre-trained Transformer 4 Vision (GPT-4V) interpret radiologic images, with and without context? Findings: GPT-4V performed poorly, demonstrating diagnostic accuracy rates of 8% (uncontextualized), 29% (contextualized, most likely diagnosis correct), and 64% (contextualized, correct diagnosis among differential diagnoses). Clinical relevance: The utility of commercial multimodal large language models, such as GPT-4V, in radiologic practice is limited. Without context, diagnostic errors and fabricated findings may compromise patient safety and misguide clinical decision-making. These models must be further refined to be beneficial.

Language: English

Cultivating diagnostic clarity: The importance of reporting artificial intelligence confidence levels in radiologic diagnoses

Mobina Fathi, Kimia Vakili, Ramtin Hajibeygi et al.

Clinical Imaging, Journal Year: 2024, Volume and Issue: 117, P. 110356 - 110356

Published: Nov. 13, 2024

Language: English

Prostate-MRI reporting should be done with the aid of AI systems: Pros

Tobias Penzkofer

European Radiology, Journal Year: 2024, Volume and Issue: 34(12), P. 7728 - 7730

Published: July 9, 2024

Language: English