Evaluating ChatGPT-4 Vision on Brazil’s National Undergraduate Computer Science Exam
Nabor C. Mendonça

ACM Transactions on Computing Education, Journal Year: 2024, Volume and Issue: 24(3), P. 1 - 56

Published: June 20, 2024

The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI’s most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil’s 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam’s open and multiple-choice questions in their original image format, and allowing for reassessment in response to differing answer keys, we were able to evaluate the model’s reasoning and self-reflection capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model’s performance and the score distribution of the human participants suggests that multimodal LLMs can provide a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that, while the model shows promise in educational evaluations, human oversight remains crucial for verifying the accuracy of its answers and ensuring the fairness of high-stakes educational assessments. The paper’s research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021 .
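
The paper's actual evaluation harness lives in the repository linked above; the submission workflow it describes can nonetheless be illustrated with a minimal sketch. The snippet below assumes the openai Python SDK (v1+); the model name, prompt wording, and image file name are illustrative placeholders, not the study's configuration.

```python
import base64

from openai import OpenAI  # assumes the openai Python SDK, v1 or later

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_exam_question(image_path: str) -> str:
    """Submit one exam question, captured as an image, to a vision-capable model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # vision model of that era; name is illustrative
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Answer the question shown in the image. Explain your "
                         "reasoning, then state the letter of your chosen option."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


print(ask_exam_question("enade_2021_q01.png"))  # hypothetical file name
```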

Language: English

Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs
Wang Li, Xi Chen, Xiangwen Deng et al.

npj Digital Medicine, Journal Year: 2024, Volume and Issue: 7(1)

Published: Feb. 20, 2024

Abstract The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs’ pertinent theoretical knowledge from computer science to their clinical application is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore prompt engineering for LLMs and to examine their reliability, prompts of different styles were designed and used to ask LLMs about their agreement with the American Academy of Orthopaedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the LLMs’ findings with the guidelines across evidence levels, and assessed the reliability of the different prompts by asking the same question repeatedly. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and a strong performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs and prompts was not stable (Fleiss kappa ranged from −0.002 to 0.984). This study revealed that the different prompts had variable effects across the various models, with gpt-4-Web with ROT prompting being the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
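
The Fleiss kappa reported above measures how consistently a model answers when the same question is asked repeatedly. Below is a minimal sketch of how such a reliability score can be computed, assuming the statsmodels package and fabricated placeholder data; the matrix shape and answer categories are illustrative only.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: 10 guideline statements, each asked 5 times; every cell is
# the model's categorical answer (0 = disagree, 1 = neutral, 2 = agree).
rng = np.random.default_rng(0)
answers = rng.integers(0, 3, size=(10, 5))

# aggregate_raters converts the (statements x repetitions) matrix into the
# per-statement category counts that fleiss_kappa expects.
counts, _ = aggregate_raters(answers)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")  # near 1 = highly reliable; near 0 = chance-level
```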

Language: English

Citations: 92

Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications
Khadijeh Moulaei, Atiye Yadegari, Mahdi Baharestani et al.

International Journal of Medical Informatics, Journal Year: 2024, Volume and Issue: 188, P. 105474 - 105474

Published: May 8, 2024

Language: English

Citations: 48

The impact of large language models on radiology: a guide for radiologists on the latest innovations in AI
Takeshi Nakaura, Rintaro Ito, Daiju Ueda et al.

Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(7), P. 685 - 696

Published: March 29, 2024

Abstract The advent of Deep Learning (DL) has significantly propelled the field of diagnostic radiology forward by enhancing image analysis and interpretation. The introduction of the Transformer architecture, followed by the development of Large Language Models (LLMs), has further revolutionized this domain. LLMs now possess the potential to automate and refine the radiology workflow, extending from report generation to assistance in diagnostics and patient care. The integration of multimodal technology with LLMs could potentially leapfrog these applications to unprecedented levels. However, LLMs come with unresolved challenges, such as information hallucinations and biases, which can affect their clinical reliability. Despite these issues, legislative and guideline frameworks have yet to catch up with the technological advancements. Radiologists must acquire a thorough understanding of these technologies to leverage LLMs’ potential to the fullest while maintaining medical safety and ethics. This review aims to aid in that endeavor.

Language: English

Citations: 32

Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases

Yuki Sonoda, Ryo Kurokawa, Yuta Nakamura et al.

Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(11), P. 1231 - 1235

Published: July 1, 2024

Abstract Purpose Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology’s Diagnosis Please cases, a monthly quiz series for radiology experts. Materials and methods Clinical histories and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from cases published between 1998 and 2023. The top differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro via their respective application programming interfaces. A comparative analysis among these models was conducted using Cochran’s Q and post hoc McNemar’s tests. Results The accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, respectively, which further improved to 49.4% and 62.0% for GPT-4o and Claude 3 Opus when considering the accuracy of any of the differential diagnoses. Significant differences were observed for all pairs of models. Conclusion Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving these radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurately evaluated and worded descriptions of imaging findings.
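
The statistical design described above, an omnibus Cochran's Q test across three paired models followed by post hoc McNemar's tests, can be sketched as follows. The 0/1 correctness data are randomly generated stand-ins, not the study's results, and statsmodels is assumed.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Hypothetical 0/1 correctness of three models on the same 324 cases
# (columns: GPT-4o, Claude 3 Opus, Gemini 1.5 Pro); random stand-in data.
rng = np.random.default_rng(1)
correct = rng.integers(0, 2, size=(324, 3))

# Cochran's Q asks whether the three paired success proportions differ overall.
q = cochrans_q(correct)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# Post hoc: exact McNemar tests on the 2x2 discordance table of each model pair.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    a, b = correct[:, i], correct[:, j]
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    print(f"pair {i}-{j}: p = {mcnemar(table, exact=True).pvalue:.4f}")
```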

Language: English

Citations: 28

ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives

Pedram Keshavarz, Sara Bagherieh, Seyed Ali Nabipoorashrafi et al.

Diagnostic and Interventional Imaging, Journal Year: 2024, Volume and Issue: 105(7-8), P. 251 - 265

Published: April 27, 2024

The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications.

Language: English

Citations: 27

GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination
Yuichiro Hirano, Shouhei Hanaoka, Takahiro Nakao et al.

Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(8), P. 918 - 926

Published: May 11, 2024

Abstract Purpose To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI’s latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methods The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those without unanimous agreement on the answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar’s exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon’s signed-rank test. Results The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 (41%). A statistical analysis found no significant difference in performance between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores than the GPT-4 T responses. Conclusion No enhancement in accuracy was observed when image input was added to GPT-4 Turbo.
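
The paired design above, in which the same 139 questions are scored under both models, is what licenses Wilcoxon's signed-rank test on the Likert legitimacy scores. Here is a minimal sketch assuming scipy and fabricated placeholder scores, not the study's data.

```python
import numpy as np
from scipy.stats import wilcoxon

# Fabricated five-point Likert legitimacy scores for the same 139 questions,
# one paired score per question for each model (placeholder data only).
rng = np.random.default_rng(2)
scores_text_only = rng.integers(1, 6, size=139)             # GPT-4 T
scores_multimodal = np.clip(
    scores_text_only - rng.integers(0, 2, size=139), 1, 5)  # GPT-4TV

# Wilcoxon's signed-rank test on the paired scores; ties (zero differences)
# are dropped under the default zero_method="wilcox".
stat, p = wilcoxon(scores_multimodal, scores_text_only)
print(f"W = {stat:.1f}, p = {p:.4f}")
```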

Language: English

Citations: 26

Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
Giacomo Rossettini, Lia Rodeghiero, Federica Corradi et al.

BMC Medical Education, Journal Year: 2024, Volume and Issue: 24(1)

Published: June 26, 2024

Abstract Background Artificial intelligence (AI) chatbots are emerging educational tools for students in the healthcare sciences. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers on the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the chatbots’ responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. Methods An observational cross-sectional design was performed in September 2023. Accuracy was evaluated on the CINECA test, where questions were formatted using a multiple-choice structure with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding questions that were not applicable (e.g., those containing images). Narrative coherence was analyzed by the absolute and relative frequencies of errors. Results Overall, 820 questions were inputted into all chatbots, with 20 questions not imported into ChatGPT-4 (n = 808) and Google Gemini due to technical limitations. We found significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence analysis revealed “Logical reasoning” as the prevalent category among correct answers (n = 622, 81.5%) and “Logical error” as the prevalent category among incorrect answers (n = 40, 88.9%). Conclusions Our main findings reveal that: (A) the AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence was primarily logical. Although the AI chatbots showed promising accuracy in a university entrance standardized examination, we encourage candidates to cautiously incorporate this new technology as a supplement to their learning rather than a primary resource. Trial registration Not required.
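
The omnibus chi-squared comparison with Bonferroni-corrected post hoc tests described above can be sketched as follows, assuming scipy; the correct/incorrect counts are invented for illustration and do not reproduce the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented correct/incorrect counts per chatbot (rows); not the study's data.
counts = np.array([[700, 108],    # ChatGPT-4 (n = 808)
                   [660, 160],    # Microsoft Copilot (n = 820)
                   [520, 288]])   # Google Gemini (n = 808)

# Omnibus chi-squared test across all three chatbots.
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")

# Post hoc: pairwise 2x2 tests, multiplying each p-value by the number of
# comparisons (Bonferroni correction).
pairs = [(0, 1), (0, 2), (1, 2)]
for i, j in pairs:
    _, p_pair, _, _ = chi2_contingency(counts[[i, j]])
    print(f"pair {i}-{j}: Bonferroni-adjusted p = {min(p_pair * len(pairs), 1.0):.3g}")
```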

Language: English

Citations: 24

Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study
Giovanni Maria Iannantuono, Dara Bracken-Clarke, Fatima Karzai et al.

The Oncologist, Journal Year: 2024, Volume and Issue: 29(5), P. 407 - 414

Published: Feb. 3, 2024

Abstract Background The capability of large language models (LLMs) to understand and generate human-readable text has prompted the investigation of their potential as educational and management tools for patients with cancer and their healthcare providers. Materials and Methods We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to 4 domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each section). Questions were manually submitted to the LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently. Results ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% of them (P < .0001). The number of reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%). In terms of accuracy, the answers deemed fully correct were 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .03). Furthermore, the answers deemed highly relevant were 71.9% and 77.4% for ChatGPT-4 and ChatGPT-3.5, respectively (P = .04). Regarding readability, the answers deemed highly readable were more frequent for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) compared with Google Bard (87.5%) (P = .02). Conclusion ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated a relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all 3 LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.

Language: English

Citations: 19

Performance of GPT-4 on the American College of Radiology In-training Examination: Evaluating Accuracy, Model Drift, and Fine-tuning
David L. Payne, Kush Purohit, Walter Morales Borrero et al.

Academic Radiology, Journal Year: 2024, Volume and Issue: 31(7), P. 3046 - 3054

Published: April 22, 2024

Language: English

Citations: 16

Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations
Tatsushi Oura, Hiroyuki Tatekawa, Daisuke Horiuchi et al.

Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(12), P. 1392 - 1398

Published: July 20, 2024

Abstract Purpose The performance of vision-language models (VLMs) with image interpretation capabilities, such as GPT-4 omni (GPT-4o), GPT-4 vision (GPT-4V), and Claude-3, has not been compared and remains unexplored in specialized radiological fields, including nuclear medicine and interventional radiology. This study aimed to evaluate and compare the diagnostic accuracy of various VLMs, including GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus, using Japanese diagnostic radiology, nuclear medicine, and interventional radiology (JDR, JNM, and JIR, respectively) board certification tests. Materials and methods In total, 383 questions from the JDR test (358 with images), 300 questions from the JNM test (92 with images), and 322 questions from the JIR test (96 with images) from 2019 to 2023 were consecutively collected. The accuracy rates of GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus were calculated for all questions and for questions with images. The accuracy rates of the VLMs were compared using McNemar’s test. Results GPT-4o demonstrated the highest accuracy rates across all evaluations, with the JDR test (all questions, 49%; questions with images, 48%), the JNM test (all questions, 64%; questions with images, 59%), and the JIR test (all questions, 43%; questions with images, 34%), followed by Claude-3 Opus, with the JDR test (all questions, 40%; questions with images, 38%), the JNM test (all questions, 42%; questions with images, 43%), and the JIR test (30%). McNemar’s test showed that GPT-4o significantly outperformed the other VLMs in most comparisons (P < 0.007). Conclusion GPT-4o had the highest success rates for both questions with images and all questions across the JDR, JNM, and JIR board certification tests.

Language: English

Citations: 12