Evaluating GPT-4V's Diagnostic Accuracy and Visual Integration in Neuroradiology: A Case-Based Study Using Board-Style Exam Questions DOI
Wei Tao

Published: Dec. 5, 2024

BACKGROUND The integration of multimodal capabilities in GPT-4V represents an advancement in AI's application to clinical fields, particularly neuroradiology. Despite preliminary evidence of capability in medical imaging interpretation, questions remain about its performance in complex scenarios requiring integrated analysis of clinical history and imaging findings. OBJECTIVE To evaluate GPT-4V's diagnostic performance on neuroradiology board-style multiple-choice questions integrating both clinical data and imaging. METHODS Twenty-nine cases from the RSNA Case Collection, each including a clinical vignette and CT/MRI images, were presented to GPT-4V. The model evaluated the imaging studies and clinical data, selecting answer options while quantifying the relative influence of image versus text information on its decision-making. RESULTS GPT-4V achieved 75.86% accuracy, with images contributing an average of 66.9% to the final answers. The model relied more heavily on images for incorrectly answered questions (75% image-based) than for correctly answered ones (61.74%). CONCLUSIONS These findings suggest a potential over-reliance on imaging in cases where clinical context is crucial. Our results highlight the need for improved multimodal AI models, with future development focusing on refining decision-making processes to enhance diagnostic accuracy.
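As a rough illustration of how summary statistics of this kind can be derived from per-case results, here is a minimal Python sketch. It assumes each case record stores whether the answer was correct and the model's reported image-reliance percentage; the field names and values are illustrative, not taken from the study.

```python
# Minimal sketch: overall accuracy and mean image reliance, split by correctness.
# Records are illustrative placeholders, not data from the paper.
from statistics import mean

cases = [
    {"correct": True, "image_weight_pct": 60.0},
    {"correct": False, "image_weight_pct": 80.0},
    {"correct": True, "image_weight_pct": 55.0},
]

accuracy = 100 * sum(c["correct"] for c in cases) / len(cases)
mean_image_weight = mean(c["image_weight_pct"] for c in cases)
weight_correct = mean(c["image_weight_pct"] for c in cases if c["correct"])
weight_incorrect = mean(c["image_weight_pct"] for c in cases if not c["correct"])

print(f"Accuracy: {accuracy:.2f}%")
print(f"Mean image contribution: {mean_image_weight:.1f}%")
print(f"Image reliance, correct vs incorrect: {weight_correct:.1f}% vs {weight_incorrect:.1f}%")
```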

Language: English

Artificial intelligence in rheumatology research: what is it good for? DOI Creative Commons
José Miguel Sequí-Sabater, Diego Benavent

RMD Open, Journal year: 2025, Issue 11(1), P. e004309

Published: Jan. 1, 2025

Artificial intelligence (AI) is transforming rheumatology research, with a myriad of studies aiming to improve diagnosis, prognosis and treatment prediction, while also showing potential capability to optimise the research workflow, drug discovery and clinical trials. Machine learning, a key element of discriminative AI, has demonstrated the ability to accurately classify rheumatic diseases and predict therapeutic outcomes by using diverse data types, including structured databases, imaging and text. In parallel, generative AI, driven by large language models, is becoming a powerful tool for optimising the research workflow by supporting content generation, literature review automation and decision support. This review explores the current applications and future prospects of both types of AI in rheumatology. It highlights the challenges posed by these technologies, such as ethical concerns and the need for rigorous validation and regulatory oversight. The integration of AI promises substantial advancements but requires a balanced approach to maximise benefits and minimise possible downsides.

Language: English

Cited by

1

Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases DOI
Severin Schramm, Silas Preis, Marie‐Christin Metz

et al.

Radiology, Journal year: 2025, Issue 314(1)

Published: Jan. 1, 2025

Textual descriptions of radiologic image findings play a critical role in GPT-4 with vision-based differential diagnosis, underlining the importance of radiologist experts even when using multimodal large language models.

Language: English

Cited by

1

Can ChatGPT4-vision identify radiologic progression of multiple sclerosis on brain MRI? DOI Creative Commons
Brendan S. Kelly, Sophie Duignan, Prateek Mathur

et al.

European Radiology Experimental, Journal year: 2025, Issue 9(1)

Published: Jan. 15, 2025

Abstract Background The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V with pretrained U-Net and vision transformer (ViT) models for the identification of progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI). Methods Paired coregistered MR images with and without progression were provided to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared with pretrained U-Net and ViT models. Accuracy was the primary evaluation metric, and 95% confidence intervals (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21–74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2–5 MRI studies (496 total). Results One hundred seventy patients were included: 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both the U-Net and the ViT had 94% (95% CI: 89–98%) accuracy, while GPT4V had 85% (77–91%). GPT4V gave cautious nonanswers in six cases. For GPT4V, precision (specificity), recall (sensitivity), and F1 score were 89% (75–93%), 92% (82–98%), and 91% (82–97%); for the U-Net they were 100% (100–100%), 88% (78–96%), and 94% (88–98%); for the ViT, recall was 94% (88–100%), with precision and F1 confidence intervals of 87–100% and 89–98%. Conclusion GPT4V's performance, combined with its accessibility, suggests that it has the potential to impact AI radiology research. However, the misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use. Relevance statement GPT4V identified radiologic progression of MS in a simplified experimental setting; it is not a medical device, and its widespread availability highlights the need for caution and education of lay users, especially those with limited access to expert healthcare. Key Points Without fine-tuning or prior coding experience, GPT4V could perform a change detection task with reasonable accuracy. In absolute terms, on this "spot the difference" task, it remained inferior to state-of-the-art computer vision methods. GPT4V's metrics were more similar to those of the ViT than of the U-net. This is an exploratory study and GPT4V is not intended for use as a medical device. Graphical Abstract
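For readers unfamiliar with the metrics quoted above, the sketch below shows how accuracy with a bootstrapped 95% CI, precision, recall, and F1 can be computed for a binary progression/no-progression task. The labels are randomly generated placeholders, not study data, and this is only one plausible way to bootstrap the interval.

```python
# Hedged sketch: accuracy with a bootstrapped 95% CI, plus precision, recall, F1,
# on synthetic binary labels (1 = progression, 0 = stable).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_pred = np.where(rng.random(100) < 0.9, y_true, 1 - y_true)  # ~90% agreement

def accuracy(t, p):
    return np.mean(t == p)

# Bootstrap the accuracy CI by resampling case indices with replacement.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot.append(accuracy(y_true[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy {accuracy(y_true, y_pred):.2%} (95% CI {ci_low:.2%}-{ci_high:.2%})")
print(f"precision {precision:.2%}, recall {recall:.2%}, F1 {f1:.2%}")
```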

Language: English

Cited by

0

Artificial Intelligence-Based Chatbots’ Ability to Interpret Mammography Images: A Comparison of Chat-GPT 4o and Claude 3.5 DOI Open Access
Betül Nalan Karahan, Emre Emekli, Mahmut Altuğ Altın

et al.

European Journal of Therapeutics, Journal year: 2025, Issue 31(1), pp. 28-34

Published: Feb. 28, 2025

Objectives: The aim of this study is to compare the ability of artificial intelligence-based chatbots, ChatGPT-4o and Claude 3.5, to interpret mammography images. The study focuses on evaluating their accuracy and consistency in BI-RADS classification and breast parenchymal type assessment. It also aims to explore the potential of these technologies to reduce radiologists' workload and to identify their limitations in medical image analysis. Methods: A total of 53 mammography images obtained between January and July 2024 were analyzed, with the same anonymized images provided to both chatbots under identical prompts. Results: The results showed accuracy rates for BI-RADS classification ranging from 18.87% to 26.42% for ChatGPT-4o and 18.7% for Claude 3.5. When the categories were grouped into a benign group (BI-RADS 1, 2) and a malignant group (BI-RADS 4, 5), the combined accuracy was 57.5% (initial evaluation) and 55% (second evaluation), compared with 47.5% for Claude 3.5. Accuracy for breast parenchymal type assessment was 30.19% and 22.64% for ChatGPT-4o and Claude 3.5, respectively. Conclusions: The findings indicate that these chatbots demonstrate limited reliability in interpreting mammography images. These results highlight the need for further optimization, larger datasets, and more advanced training processes to improve their performance.
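A small illustrative sketch of the benign/malignant grouping described above: BI-RADS 1-2 are pooled as "benign" and 4-5 as "malignant" before accuracy is recomputed on the coarser labels. The category values below are invented examples, not data from the study.

```python
# Illustrative sketch: recompute accuracy after pooling BI-RADS categories.
from typing import Optional

def to_group(birads: int) -> Optional[str]:
    if birads in (1, 2):
        return "benign"
    if birads in (4, 5):
        return "malignant"
    return None  # e.g. BI-RADS 0 or 3: excluded from the grouped comparison

ground_truth = [1, 2, 4, 5, 2, 4]   # made-up reference categories
chatbot_pred = [2, 4, 4, 5, 1, 2]   # made-up chatbot categories

pairs = [(to_group(g), to_group(p)) for g, p in zip(ground_truth, chatbot_pred)
         if to_group(g) is not None and to_group(p) is not None]
grouped_accuracy = sum(g == p for g, p in pairs) / len(pairs)
print(f"Grouped (benign vs malignant) accuracy: {grouped_accuracy:.1%}")
```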

Language: English

Cited by

0

Evaluation of radiology residents’ reporting skills using large language models: an observational study DOI Creative Commons
Natsuko Atsukawa, Hiroyuki Tatekawa, Tatsushi Oura

et al.

Japanese Journal of Radiology, Journal year: 2025, Issue unknown

Published: March 8, 2025

Large language models (LLMs) have the potential to objectively evaluate radiology resident reports; however, research on their use for feedback in training and for assessment of skill development remains limited. This study aimed to assess the effectiveness of LLMs in revising radiology reports by comparing them with reports verified by board-certified radiologists and to analyze the progression of residents' reporting skills over time. To identify the LLM that best aligned with human radiologists, 100 reports were randomly selected from 7376 reports authored by nine first-year residents. The reports were evaluated based on six criteria: (1) addition of missing positive findings, (2) deletion of unnecessary findings, (3) addition of missing negative findings, (4) correction of the expression of findings, (5) correction of the diagnosis, and (6) proposal of additional examinations or treatments. Reports were segmented into four time-based terms, and 900 reports (450 CT and 450 MRI) were chosen from the initial and final terms of the residents' first year. The revision rates for each criterion were compared between the first and last terms using the Wilcoxon signed-rank test. Among the three LLMs evaluated (ChatGPT-4 Omni (GPT-4o), Claude-3.5 Sonnet, and Claude-3 Opus), GPT-4o demonstrated the highest level of agreement with the radiologists. Significant improvements were noted for Criteria 1-3 when revisions were assessed with GPT-4o (Criteria 1, 2, and 3: P < 0.001, P = 0.023, and P = 0.004, respectively). No significant changes were observed for Criteria 4-6. Despite this, all criteria except Criterion 6 showed progressive enhancement. LLMs can effectively provide feedback on commonly corrected areas in resident reports, enabling residents to improve their weaknesses and monitor their progress. Additionally, LLMs may help reduce the workload of radiologists who mentor residents.
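The first-term versus final-term comparison described above rests on a paired nonparametric test. The sketch below shows what such a Wilcoxon signed-rank comparison of revision rates could look like with SciPy; the per-resident rates are invented placeholders, not data from the study.

```python
# Minimal sketch: paired Wilcoxon signed-rank test on revision rates.
from scipy.stats import wilcoxon

# Revision rate for one criterion, one value per resident (n = 9), both terms.
first_term = [0.42, 0.38, 0.51, 0.47, 0.40, 0.44, 0.36, 0.49, 0.45]
final_term = [0.30, 0.29, 0.41, 0.35, 0.33, 0.31, 0.28, 0.37, 0.36]

stat, p_value = wilcoxon(first_term, final_term)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p_value:.4f}")
```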

Language: English

Cited by

0

A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians DOI Creative Commons
Hirotaka Takita, Daijiro Kabata, Shannon L. Walston

et al.

npj Digital Medicine, Journal year: 2025, Issue 8(1)

Published: March 22, 2025

Abstract While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant difference was found between generative AI and physicians overall (p = 0.10) or between generative AI and non-expert physicians (p = 0.93). However, generative AI performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher accuracy compared with non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities, with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest a potential role in enhancing healthcare delivery and medical education when implemented with an appropriate understanding of its limitations.
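As a hedged illustration of how study-level accuracies might be pooled in a meta-analysis of this kind, the sketch below applies a DerSimonian-Laird random-effects model to logit-transformed proportions. The study counts are invented, and the actual analysis in the paper may use a different pooling approach.

```python
# Sketch of a random-effects pooling of accuracy proportions (DerSimonian-Laird).
# (correct diagnoses, total cases) per study; values are invented placeholders.
import numpy as np

studies = [(30, 60), (45, 80), (12, 30), (70, 120)]

logits, variances = [], []
for k, n in studies:
    k_adj, n_adj = k + 0.5, n + 1.0            # continuity correction
    p = k_adj / n_adj
    logits.append(np.log(p / (1 - p)))
    variances.append(1 / k_adj + 1 / (n_adj - k_adj))
logits, variances = np.array(logits), np.array(variances)

# Between-study variance tau^2 (DerSimonian-Laird estimator).
w = 1 / variances
fixed = np.sum(w * logits) / np.sum(w)
q = np.sum(w * (logits - fixed) ** 2)
df = len(studies) - 1
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate, transformed back to a proportion.
w_re = 1 / (variances + tau2)
pooled_logit = np.sum(w_re * logits) / np.sum(w_re)
pooled_accuracy = 1 / (1 + np.exp(-pooled_logit))
print(f"Pooled accuracy (random effects): {pooled_accuracy:.1%}")
```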

Language: English

Cited by

0

Exploring Turkish equivalents of terms for musculoskeletal radiology: insights for a standardized terminology DOI Creative Commons
Zeynep Başer

Journal of Health Sciences and Medicine, Journal year: 2025, Issue 8(2), pp. 275-285

Published: March 21, 2025

Aims: This study aimed to provide an analysis of the Turkish equivalents of English terms used in musculoskeletal radiology. Methods: The present study focuses on terms with global endorsement in radiology and explores how their Turkish equivalents are used in reference books (the Turkish translations of the books Diagnostic Imaging: Musculoskeletal: Trauma and Non-Traumatic Disease). Furthermore, it attempts to picture how AI-based tools (i.e., neural machine translation tools such as DeepL and Google Translate, and the AI chatbot ChatGPT) vary in translating these terms. Results: The study found that the most common translation strategies in musculoskeletal radiology were borrowing and literal translation, with several strategies combined in complex terms. Tools like Google Translate, DeepL and ChatGPT showed high similarity to human translations, but differences were observed in word choice, strategy use, and orthographic variations. These differences, though minor, highlight the challenges of achieving consistency and accuracy in AI-generated medical translations. Conclusion: The study provides a list of musculoskeletal radiology terminology in Turkish and English and presents translations by specialists and AI tools. Careful evaluation is essential to ensure standardized terminology, particularly in subspecialities.
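As a minimal illustration of comparing an AI-generated translation with a human reference term, the sketch below uses Python's difflib to compute a crude character-level similarity ratio. The Turkish terms are examples chosen to contrast a native-word rendering with a borrowing, and are not taken from the study's term list.

```python
# Illustrative sketch: crude similarity between a human and an AI translation.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Character-level ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

human_term = "stres kırığı"     # example specialist translation of "stress fracture"
ai_term = "stres fraktürü"      # example machine translation using a borrowed term
print(f"similarity = {similarity(human_term, ai_term):.2f}")
```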

Language: English

Cited by

0

Comparison of ChatGPT's Diagnostic and Management Accuracy of Foot and Ankle Bone–Related Pathologies to Orthopaedic Surgeons DOI
Maritza Diane Essis, Hayden Hartman, Wei Shao Tung

et al.

Journal of the American Academy of Orthopaedic Surgeons, Journal year: 2025, Issue unknown

Published: April 10, 2025

Introduction: The steep rise in utilization of large language model chatbots, such as ChatGPT, has spilled into medicine in recent years. The newest version, ChatGPT-4, has passed medical licensure examinations and, specifically in orthopaedics, performed at the level of a postgraduate year three orthopaedic surgery resident on Orthopaedic In-Service Training Examination question bank sets. The purpose of this study was to evaluate ChatGPT-4's diagnostic and decision-making capacity for the clinical management of bone-related injuries of the foot and ankle. Methods: Eight foot and ankle cases were presented to ChatGPT-4 and subsequently evaluated by fellowship-trained foot and ankle surgeons. Cases were scored using a criteria-based Likert scale, graded from a total score of 5 (lowest) to 25 (highest) across five criteria. ChatGPT-4 was referred to as "Dr. GPT," establishing a peer dynamic so that the role of an orthopaedic surgeon was emulated by the chatbot. Results: The average score for each case was 4.53 of 5, with an overall average sum of 22.7 across cases. The pathology with the highest score was second metatarsal stress fracture (24.3), whereas the lowest was hallux rigidus (21.3). Kendall correlation analysis of interrater reliability showed variable agreement among surgeons, without statistical significance. Conclusion: ChatGPT-4 effectively diagnosed and provided appropriate treatment options for simple foot and ankle bone-related pathologies. Importantly, ChatGPT did not fabricate information (ie, the hallucination phenomenon), which has been previously well documented in the literature, and notably received its second-highest score on this criterion. However, ChatGPT struggled to provide comprehensive information beyond standard treatment options. Overall, ChatGPT has the potential to serve as a widely accessible resource for patients and nonorthopaedic clinicians, although limitations may exist in the delivery of comprehensive information.
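The interrater reliability analysis mentioned above uses Kendall correlation. The sketch below shows how Kendall's tau between two raters' per-case scores can be computed with SciPy; the scores are invented placeholders on the study's 5-25 total-score scale.

```python
# Minimal sketch: Kendall's tau as a measure of agreement between two raters.
from scipy.stats import kendalltau

surgeon_a = [23, 21, 24, 22, 25, 20, 23, 24]   # one total score per case (n = 8)
surgeon_b = [22, 22, 25, 21, 24, 21, 22, 23]

tau, p_value = kendalltau(surgeon_a, surgeon_b)
print(f"Kendall tau = {tau:.2f}, p = {p_value:.3f}")
```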

Language: English

Cited by

0

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis DOI Creative Commons

Guxue Shan, Xiaonan Chen, Chen Wang

et al.

JMIR Medical Informatics, Journal year: 2025, Issue 13, P. e64963

Published: April 25, 2025

Abstract Background With the rapid development of artificial intelligence (AI) technology, especially generative AI, large language models (LLMs) have shown great potential in the medical field. Through training on massive data, they can understand complex medical texts, quickly analyze medical records, and directly provide health counseling and diagnostic advice, even for rare diseases. However, no study has yet compared and extensively discussed the diagnostic performance of LLMs with that of physicians. Objective This study systematically reviewed the accuracy of clinical diagnoses provided by LLMs and physicians and provides a reference for further clinical application. Methods We conducted searches in CNKI (China National Knowledge Infrastructure), VIP Database, SinoMed, PubMed, Web of Science, Embase, and CINAHL (Cumulative Index to Nursing and Allied Health Literature) from January 1, 2017, to the present. A total of 2 reviewers independently screened the literature and extracted relevant information. The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which evaluates both the risk of bias and the applicability of the included studies. Results A total of 30 studies involving 19 LLMs and 4762 cases were included. The quality assessment indicated a high risk of bias in the majority of studies, the primary cause being that the case diagnoses were already known. For the optimal model, diagnostic accuracy ranged from 25% to 97.8%, while triage accuracy ranged from 66.5% to 98%. Conclusions LLMs have demonstrated considerable diagnostic capabilities and significant potential for application across various clinical cases. Although their accuracy still falls short of that of clinical professionals, if used cautiously, they could become one of the best intelligent assistants in the field of human health care.

Language: English

Cited by

0

Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs DOI
Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh

et al.

Radiology, Journal year: 2024, Issue 313(3)

Published: Dec. 1, 2024

Large language models accurately answered New England Journal of Medicine Image Challenge cases using radiologic image inputs, outperforming a medical student; however, their accuracy decreased with shorter text lengths, regardless of image inputs.

Language: English

Cited by

3