Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions
Malik Sallam,

Khaled Al‐Salahat,

Huda Eid

et al.

Advances in Medical Education and Practice, Journal year: 2024, Issue: Volume 15, pp. 857 - 871

Published: Sep. 1, 2024

Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models in various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared with human students at a postgraduate master's level in Medical Laboratory Sciences.

Language: English

Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications
Khadijeh Moulaei,

Atiye Yadegari,

Mahdi Baharestani

et al.

International Journal of Medical Informatics, Journal year: 2024, Issue: 188, pp. 105474 - 105474

Published: May 8, 2024

Language: English

Cited by

44

A systematic review of large language models and their implications in medical education

Harrison C. Lucas,

Jeffrey S. Upperman, Jamie R. Robinson

et al.

Medical Education, Journal year: 2024, Issue: unknown

Published: April 19, 2024

Abstract Introduction In the past year, use of large language models (LLMs) has generated significant interest and excitement because of their potential to revolutionise various fields, including medical education for aspiring physicians. Although students undergo a demanding educational process to become competent health care professionals, the emergence of LLMs presents a promising solution to challenges like information overload, time constraints and pressure on clinical educators. However, integrating LLMs into medical education raises critical concerns for educators, professionals and students. This systematic review aims to explore LLM applications in medical education, specifically their impact on students' learning experiences. Methods A search was performed in PubMed, Web of Science and Embase for articles discussing LLMs, using selected keywords related to LLMs and medical education, from ChatGPT's debut until February 2024. Only articles available in full text and in English were reviewed. The credibility of each study was critically appraised by two independent reviewers. Results The search identified 166 studies, of which 40 were found to be relevant to the study. Among the key themes were LLM capabilities and benefits such as personalised learning, alongside concerns regarding content accuracy. Importantly, 42.5% of these studies evaluated LLMs, including ChatGPT, in a novel way, in contexts such as medical exams and clinical/biomedical information, highlighting their potential for replicating human-level performance in medical knowledge. The remaining studies broadly discussed the prospective role of LLMs in medical education, reflecting a keen interest in their future despite current constraints. Conclusions Responsible implementation of LLMs offers an opportunity to enhance medical education, but ensuring content accuracy, emphasising skill-building and maintaining ethical safeguards are crucial. Continuous evaluation and interdisciplinary collaboration are essential for the appropriate integration of LLMs in medical education.

Language: English

Cited by

43

Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: A Systematic Review and Meta-Analysis (Preprint)
Mingxin Liu, Tsuyoshi Okuhara, Xinyi Chang

et al.

Journal of Medical Internet Research, Journal year: 2024, Issue: 26, pp. e60807 - e60807

Published: June 15, 2024

Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on these examinations in multiple environments showed remarkable differences. At this stage, there is still a lack of comprehensive understanding of the variability in ChatGPT's performance across different medical licensing examinations.

Language: English

Cited by

32

Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study
Takahiro Nakao, Soichiro Miki, Yuta Nakamura

et al.

JMIR Medical Education, Journal year: 2024, Issue: 10, pp. e54393 - e54393

Published: March 12, 2024

Previous research applying large language models (LLMs) to medicine has focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability of recognizing images.

Language: English

Cited by

26

GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination
Yuichiro Hirano, Shouhei Hanaoka, Takahiro Nakao

et al.

Japanese Journal of Radiology, Journal year: 2024, Issue: 42(8), pp. 918 - 926

Published: May 11, 2024

Abstract Purpose To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methods The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed and provided ground-truth answers, consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on the answers, and those including images rejected by the OpenAI application programming interface. The input for GPT-4TV included both text and images, whereas that for GPT-4 T was entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare the models using Wilcoxon's signed-rank test. Results The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%) and GPT-4 T answered 57 (41%). The statistical analysis found no significant difference in performance between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores than the GPT-4 T responses. Conclusion No enhancement in accuracy was observed when image input was added.
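
As a rough illustration of the two paired tests named in the methods (a minimal sketch, not the authors' code; the per-question outcomes and Likert scores below are hypothetical):

```python
# Minimal sketch of the paired comparisons described above: McNemar's
# exact test on per-question correctness and Wilcoxon's signed-rank
# test on five-point legitimacy scores. All data here are made up.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n = 139  # number of questions, as reported above

# Hypothetical per-question correctness (1 = correct) for each model.
correct_tv = rng.integers(0, 2, n)  # e.g., GPT-4TV
correct_t = rng.integers(0, 2, n)   # e.g., GPT-4 T

# 2x2 paired table: rows = GPT-4TV correct/incorrect,
# columns = GPT-4 T correct/incorrect.
table = np.array([
    [np.sum((correct_tv == 1) & (correct_t == 1)),
     np.sum((correct_tv == 1) & (correct_t == 0))],
    [np.sum((correct_tv == 0) & (correct_t == 1)),
     np.sum((correct_tv == 0) & (correct_t == 0))],
])
res = mcnemar(table, exact=True)  # exact test on the discordant pairs
print("McNemar p =", res.pvalue)

# Hypothetical paired five-point Likert legitimacy scores per response.
scores_tv = rng.integers(1, 6, n)
scores_t = rng.integers(1, 6, n)
w = wilcoxon(scores_tv, scores_t)
print("Wilcoxon signed-rank p =", w.pvalue)
```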

Language: English

Cited by

25

Recent Advances in Large Language Models for Healthcare
Khalid Nassiri, Moulay A. Akhloufi

BioMedInformatics, Journal year: 2024, Issue: 4(2), pp. 1097 - 1143

Published: April 16, 2024

Recent advances in the field of large language models (LLMs) underline their high potential for applications in a variety of sectors. Their use in healthcare, in particular, holds out promising prospects for improving medical practices. As we highlight in this paper, LLMs have demonstrated remarkable capabilities in language understanding and generation that could indeed be put to good use in the medical field. We also present the main architectures of these models, such as GPT, Bloom, or LLaMA, composed of billions of parameters. We then examine recent trends in the datasets used to train these models, classifying them according to different criteria, such as size, source, and subject (patient records, scientific articles, etc.). We note that LLMs could help improve patient care, accelerate medical research, and optimize the efficiency of healthcare systems, for example through assisted diagnosis. We also address several technical and ethical issues that need to be resolved before LLMs can be used extensively in healthcare. Consequently, we propose a discussion of the possibilities offered by these new generations of linguistic models and of their limitations when deployed in the domain of healthcare.

Language: English

Cited by

23

Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
Giacomo Rossettini,

Lia Rodeghiero,

Federica Corradi

et al.

BMC Medical Education, Journal year: 2024, Issue: 24(1)

Published: June 26, 2024

Abstract Background Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers on the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the chatbots' responses (i.e., text output) based on qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question. Methods An observational cross-sectional design was performed in September 2023. Accuracy was evaluated on the CINECA test, where questions were formatted using a multiple-choice structure with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding questions that were not applicable (e.g., those containing images). Narrative coherence was analyzed by absolute and relative frequencies of errors. Results Overall, 820 questions were inputted into all chatbots; 20 were not imported into ChatGPT-4 (n = 808) and Gemini due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Gemini and Copilot vs Gemini comparisons (p < 0.001). The narrative coherence analysis revealed "Logical reasoning" as the prevalent theme behind correct answers (n = 622, 81.5%) and "Logical error" behind incorrect answers (n = 40, 88.9%). Conclusions Our main findings reveal that: (A) the AI chatbots performed well; (B) ChatGPT-4 and Copilot performed better than Gemini; and (C) their narrative coherence is primarily logical. Although the AI chatbots showed promising accuracy in a university entrance standardized examination test, we encourage candidates to cautiously incorporate this new technology as a supplement to learning rather than a primary resource. Trial registration: Not required.
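
To make the statistical procedure concrete, here is a hedged sketch (illustrative counts only, not the study's data or code) of a chi-squared omnibus test followed by Bonferroni-corrected pairwise comparisons:

```python
# Sketch of a chi-squared test with Bonferroni-corrected post hoc
# pairwise comparisons, as in the design described above.
# The counts below are invented for illustration.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

# Rows: chatbots; columns: [correct, incorrect] answer counts.
counts = {
    "ChatGPT-4": [700, 108],
    "Copilot":   [690, 130],
    "Gemini":    [600, 208],
}

# Omnibus test across all three chatbots.
table = np.array(list(counts.values()))
chi2, p, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2={chi2:.2f}, dof={dof}, p={p:.4g}")

# Post hoc: pairwise 2x2 tests against a Bonferroni-adjusted threshold.
pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)  # 0.05 divided by 3 pairwise comparisons
for a, b in pairs:
    chi2, p, _, _ = chi2_contingency(np.array([counts[a], counts[b]]))
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: p={p:.4g} ({verdict} at alpha={alpha:.4f})")
```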

Language: English

Cited by

22

Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts
Malik Sallam, Dhia Mousa

Mesopotamian Journal of Artificial Intelligence in Healthcare, Journal year: 2024, Issue: 2024, pp. 1 - 7

Published: Jan. 10, 2024

Background: The role of artificial intelligence (AI) is increasingly recognized to enhance digital health literacy. This is of particular importance given the widespread availability and popularity of AI chatbots such as ChatGPT, whose possible impact on health literacy involves the need to understand the models' performance across different languages, dialects, and cultural contexts. This study aimed to evaluate the performance of ChatGPT in response to prompting in two Arabic dialects, namely Tunisian and Jordanian. Methods: This descriptive study followed the METRICS checklist for the design and reporting of AI-based studies in healthcare. Ten general health queries were translated into the Jordanian and Tunisian dialects by bilingual native speakers. The responses of two models, ChatGPT-3.5 and ChatGPT-4, to prompts in Tunisian, Jordanian, and English were evaluated using the CLEAR tool tailored for the assessment of health information generated by AI models. Results: ChatGPT-3.5 performance in the Arabic dialects was categorized as average, with an overall score of 2.83, compared to above-average performance in English with a score of 3.40. ChatGPT-4 showed a similar pattern with marginally better outcomes, scoring 3.20 in the dialects while its English responses were rated 3.53. The CLEAR component scores were consistently superior for English compared to the dialects in both models, although not all differences reached statistical significance. Using the English content as a reference, the dialect responses were significantly inferior overall (P<.001). Conclusion: The findings highlight a critical dialectical gap in ChatGPT performance, underlining the importance of linguistic diversity in AI development, particularly for health-related content. Collaborative efforts among AI developers, linguists, and healthcare professionals are needed to improve dialect performance. Future studies are recommended to broaden the scope to an extensive range of languages and dialects, which would help achieve equitable access to AI-generated health information for various communities.
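
The abstract does not name the significance test used, so the following is only a hedged illustration of how per-response overall CLEAR scores (assumed here to be the mean of five 5-point Likert items) might be compared between English and dialect prompts with a nonparametric test:

```python
# Hedged illustration only: the study's actual statistical test is not
# named in the abstract. CLEAR ratings are on 5-point Likert scales;
# the overall score is taken here as the mean of five items per
# response. All ratings below are hypothetical.
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-response item ratings, shape (n_responses, 5 items).
english = np.array([[4, 3, 4, 3, 4], [4, 4, 3, 3, 4], [3, 4, 4, 3, 3],
                    [4, 4, 4, 3, 4], [3, 3, 4, 4, 4]])
dialect = np.array([[3, 2, 3, 3, 3], [2, 3, 3, 2, 3], [3, 3, 2, 3, 2],
                    [3, 2, 2, 3, 3], [2, 3, 3, 3, 2]])

english_overall = english.mean(axis=1)  # overall CLEAR score per response
dialect_overall = dialect.mean(axis=1)
print("English mean:", english_overall.mean())
print("Dialect mean:", dialect_overall.mean())

stat, p = mannwhitneyu(english_overall, dialect_overall,
                       alternative="two-sided")
print(f"Mann-Whitney U = {stat}, p = {p:.4f}")
```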

Language: English

Cited by

16

AI-driven translations for kidney transplant equity in Hispanic populations
Oscar A. Garcia Valencia, Charat Thongprayoon, Caroline C. Jadlowiec

и другие.

Scientific Reports, Journal year: 2024, Issue: 14(1)

Published: April 12, 2024

Abstract Health equity and accessing Spanish kidney transplant information continue to be a substantial challenge facing the Hispanic community. This study evaluated ChatGPT's capabilities in translating 54 English frequently asked questions (FAQs) into Spanish using two versions of the AI model, GPT-3.5 and GPT-4.0. The FAQs included 19 from the Organ Procurement and Transplantation Network (OPTN), 15 from the National Health Service (NHS), and 20 from the National Kidney Foundation (NKF). Two native Spanish-speaking nephrologists, both of whom are of Mexican heritage, scored the translations for linguistic accuracy and cultural sensitivity tailored to Hispanics using a 1-5 rubric. The inter-rater reliability of the evaluators, measured by Cohen's Kappa, was 0.85. Overall linguistic accuracy was 4.89 ± 0.31 for GPT-3.5 versus 4.94 ± 0.23 for GPT-4.0 (non-significant, p = 0.23). Both versions scored 4.96 ± 0.19 for cultural sensitivity (p = 1.00). By source, linguistic accuracy scores ranged from 4.84 ± 0.37 to 4.95 ± 0.22 across the OPTN, NHS, and NKF question sets, and for cultural sensitivity both versions reached 5.00 ± 0.00 on the NKF set. These high scores demonstrate that ChatGPT effectively translated the FAQs across the three systems. The findings suggest ChatGPT's potential to promote health equity by improving access to essential kidney transplant information. Additional research should evaluate its medical translation performance in diverse contexts and languages. Such English-to-Spanish translation capability may increase access to vital transplant information for underserved Spanish-speaking patients.
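
Inter-rater reliability on a 1-5 rubric like the one above is commonly computed as Cohen's Kappa; a minimal sketch follows (hypothetical ratings, not the study's data; the abstract does not say whether a weighted variant was used):

```python
# Minimal sketch: Cohen's Kappa between two raters scoring the same
# translations on a 1-5 rubric, as in the study above. Ratings are
# hypothetical. Linear weighting credits near-agreement on ordinal
# scales; whether the study used weighting is not stated.
from sklearn.metrics import cohen_kappa_score

rater_1 = [5, 5, 4, 5, 3, 5, 4, 4, 5, 5]
rater_2 = [5, 4, 4, 5, 3, 5, 5, 4, 5, 4]

print("unweighted kappa:     ", cohen_kappa_score(rater_1, rater_2))
print("linear-weighted kappa:", cohen_kappa_score(rater_1, rater_2,
                                                  weights="linear"))
```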

Language: English

Cited by

12

Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing
Krzysztof Kochanek, Henryk Skarżyński, W. Wiktor Jędrzejczak

и другие.

Cureus, Journal year: 2024, Issue: unknown

Published: May 8, 2024

Introduction: ChatGPT has been tested in many disciplines, but only a few studies have involved hearing diagnosis and none physiology or audiology more generally. The consistency of the chatbot's responses when the same question is posed multiple times has not been well investigated either. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and ChatGPT 4 on test questions concerning objective measures of hearing. Of particular interest was short-term repeatability, which was tested here on four separate days extended over one week. Methods: We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of hearing testing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version) on each of four days (two in one week and two in the following week). Responses were evaluated against the answer key. To evaluate repeatability over time, percent agreement and Cohen's Kappa were calculated. Results: The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. ChatGPT 3.5 consistently failed to pass the threshold of 50% correct responses. Within a single day, the repeatability of responses was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's Kappa of 0.67-0.71 and 0.81-0.84, respectively). Between different days, repeatability was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4 (Cohen's Kappa of 0.65-0.69 and 0.80-0.85, respectively). Conclusion: ChatGPT 4 outperforms ChatGPT 3.5 with higher accuracy and repeatability over time. However, the great variability of responses casts doubt on possible professional applications of both versions.
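
The two repeatability metrics used above can be computed as in the sketch below (made-up answer letters, not the study's data):

```python
# Sketch of the repeatability metrics described above: percent
# agreement and Cohen's Kappa between two runs of the same 30
# multiple-choice questions. The answer letters are invented.
from sklearn.metrics import cohen_kappa_score

run_1 = list("ABCDABCDABCDABCDABCDABCDABCDAB")  # answers, run 1
run_2 = list("ABCDABCDABCDABCDABCAABCDABCDBB")  # answers, run 2
assert len(run_1) == len(run_2) == 30

agreement = sum(a == b for a, b in zip(run_1, run_2)) / len(run_1)
print(f"percent agreement: {agreement:.0%}")
print(f"Cohen's Kappa:     {cohen_kappa_score(run_1, run_2):.2f}")
```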

Language: English

Cited by

11