Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine
Yudai Kaneda, Akari Tayuinosho, Rika Tomoyose et al.

Journal of Evaluation in Clinical Practice, Journal Year: 2024, Volume and Issue: 30(6), P. 1017 - 1023

Published: May 19, 2024

ChatGPT, a large-scale language model, is a notable example of AI's potential in health care. However, its effectiveness in clinical settings, especially when compared to human physicians, is not fully understood. This study evaluates ChatGPT's capabilities and limitations in answering questions for Japanese internal medicine specialists, aiming to clarify its accuracy and its tendencies in both correct and incorrect responses.

Language: English

Generative Artificial Intelligence and Education
Edward Palmer, Walter Barbieri

SpringerBriefs in Education, Journal Year: 2025, Volume and Issue: unknown, P. 117 - 130

Published: Jan. 1, 2025

Language: English

Citations: 0

Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination

Y. Y. Xiong, Zongqian Zhan, Chuwen Zhong et al.

European Journal of Dental Education, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 31, 2025

This study aimed to simulate diverse scenarios of students employing LLMs for CDLE examination preparation, providing a detailed evaluation of their performance in medical education. A stratified random sampling strategy was implemented to select, and subsequently revise, 200 questions from the CDLE. Seven LLMs recognised for exceptional performance in the Chinese domain were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics included accuracy, agreement rate and teaching effectiveness score. Wald χ2 tests and Kruskal-Wallis tests were employed to determine whether differences among the models across the various scenarios, before and after adversarial testing, were statistically significant. The majority of the tested models met the passing threshold on the benchmark, with Doubao-pro 32k and Qwen2-72b (81%) achieving the highest accuracy rates. The models demonstrated up to 98% agreement with the reference answers when generating explanations. Although significant differences existed among the teaching effectiveness scores based on a Likert scale, all of these models showed a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 showed the greatest reduction (14.6%, p = 0.001). Models trained on Chinese corpora, such as Doubao-pro 32k, showed superior accuracy, although for some comparisons there was no significant difference; however, during adversarial testing GPT-4 showed less diminished performance, displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations in generated content.
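To make the abstract's metrics concrete, here is a minimal sketch of how accuracy, agreement rate, and a before/after adversarial comparison could be computed. The counts are invented for illustration only, and a Pearson chi-squared contingency test stands in for the paper's Wald χ2 test:

```python
# Hypothetical CDLE-style metrics: accuracy, agreement rate, and a
# chi-squared test of the pre- vs post-adversarial accuracy drop.
# All counts below are invented placeholders, not the paper's data.
from scipy.stats import chi2_contingency

def accuracy(correct: int, total: int) -> float:
    """Fraction of exam questions answered correctly."""
    return correct / total

def agreement_rate(matching: int, total: int) -> float:
    """Fraction of model explanations consistent with the reference answers."""
    return matching / total

# Example: 162/200 correct before adversarial rewording, 158/200 after.
before_correct, after_correct, n = 162, 158, 200
table = [[before_correct, n - before_correct],
         [after_correct, n - after_correct]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"accuracy before={accuracy(before_correct, n):.1%}, "
      f"after={accuracy(after_correct, n):.1%}, chi2={chi2:.3f}, p={p:.3f}")
```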

Language: English

Citations: 0

Evaluating the Performance of Large Language Models in Anatomy Education: Advancing Anatomy Learning with ChatGPT-4o
Fatma Ok, Burak Karip, Fulya Temizsoy Korkmaz et al.

European Journal of Therapeutics, Journal Year: 2025, Volume and Issue: 31(1), P. 35 - 43

Published: Feb. 28, 2025

Objective: Large language models (LLMs), such as ChatGPT, Gemini, and Copilot, have garnered significant attention across various domains, including education. Their application is becoming increasingly prevalent, particularly in medical education, where rapid access to accurate and up-to-date information is imperative. This study aimed to assess the validity, accuracy, and comprehensiveness of utilizing LLMs for the preparation of lecture notes for medical school anatomy courses. Methods: The study evaluated the performance of four large language models—ChatGPT-4o, ChatGPT-4o-Mini, Gemini, and Copilot—in generating lecture notes for anatomy students. In the first phase, the notes produced by these models using identical prompts were compared against a widely used anatomy textbook through thematic analysis of their relevance and alignment with standard educational materials. In the second phase, the generated content was evaluated with content validity index (CVI) analysis. The threshold values for S-CVI/Ave and S-CVI/UA were set at 0.90 and 0.80, respectively, to determine the acceptability of the content. Results: ChatGPT-4o demonstrated the highest performance, achieving a theme success rate of 94.6% and a subtheme success rate of 76.2%. ChatGPT-4o-Mini followed, with rates of 89.2% and 62.3%, respectively. Copilot achieved moderate results, with 91.8% and 54.9%, while Gemini showed the lowest rates, 86.4% and 52.3%. In the Content Validity Index analysis, ChatGPT-4o again outperformed the other models, exceeding both thresholds with an S-CVI/Ave value of 0.943 and an S-CVI/UA value of 0.857. ChatGPT-4o-Mini partially met the criteria (0.714 and 0.800), falling slightly short of the thresholds. Copilot and Gemini, however, exhibited significantly lower CVI results: Copilot obtained values of 0.486 and 0.286, and Gemini obtained the lowest scores, 0.286 and 0.143. Conclusion: The generated content was assessed using two distinct methods, revealing that ChatGPT-4o performed best in both evaluations. These results suggest that educators and students could benefit from adopting ChatGPT-4o as a supplementary tool for lecture note generation. Conversely, models like Copilot and Gemini require further improvements to meet the standards necessary for reliable educational use.
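The CVI thresholds cited above follow standard content-validity conventions. As an illustration only (the expert ratings below are invented, not the study's data), here is a sketch of how I-CVI, S-CVI/Ave, and S-CVI/UA are typically computed:

```python
# Minimal sketch of the content validity index (CVI) computations named in
# the abstract. Each row is one item of generated lecture content; each
# column is one expert's relevance rating on a 4-point scale, where a
# rating of 3 or 4 counts as "relevant". Ratings are invented placeholders.
ratings = [
    [4, 4, 3, 4],
    [3, 4, 4, 4],
    [2, 4, 3, 3],
    [4, 4, 4, 4],
]

def i_cvi(item: list[int]) -> float:
    """Item-level CVI: share of experts rating the item 3 or 4."""
    return sum(r >= 3 for r in item) / len(item)

i_cvis = [i_cvi(item) for item in ratings]
s_cvi_ave = sum(i_cvis) / len(i_cvis)                    # average of item CVIs
s_cvi_ua = sum(v == 1.0 for v in i_cvis) / len(i_cvis)   # universal agreement

print(f"S-CVI/Ave={s_cvi_ave:.3f} (threshold 0.90), "
      f"S-CVI/UA={s_cvi_ua:.3f} (threshold 0.80)")
```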

Language: English

Citations: 0

Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study

Remco Jongkind, Erik Elings, Erik Joukes et al.

MedEdPublish, Journal Year: 2025, Volume and Issue: 15, P. 11 - 11

Published: March 26, 2025

Background: Generative AI (GenAI) such as ChatGPT can take over tasks that previously could only be done by humans. Although GenAI provides many educational opportunities, it also poses risks of invalid assessments and irrelevant learning outcomes. This article presents a broadly applicable method to (1) determine current assessment validity, (2) assess which learning outcomes are impacted by student use of GenAI, and (3) decide whether to alter assessment formats and/or learning outcomes. The method is exemplified by a case study on our medical informatics curriculum. We developed a five-step method to evaluate and address the impact of GenAI: in a collaborative manner, the courses in the curriculum were analysed, and their assessment plans were adapted to GenAI usage together with the teachers. Results: 57% of assessments, especially those involving writing and programming, were at risk of reduced validity and relevance. The GenAI impact was more closely related to content and structure than to complexity according to Bloom's taxonomy. During retreats, lecturers discussed the relevance of learning outcomes and whether students should be able to achieve them with or without GenAI. Furthermore, the results led to a plan to increase GenAI literacy across the years of study. Subsequently, course coordinators were asked to either adjust assessments to preclude GenAI use, or to include GenAI literacy. For 64% of the impacted assessments the format was changed, and for 36% the learning outcomes were adapted. Conclusion: The majority of assessments and learning outcomes were impacted by GenAI, leading us to adapt the curriculum. Our method and results offer a potential blueprint for institutions facing similar challenges.

Language: English

Citations: 0

Performance of Large Language Models (ChatGPT and Gemini Advanced) in Gastrointestinal Pathology and Clinical Review of Applications in Gastroenterology

Swachi Jain, Baidarbhi Chakraborty, Ashish Agarwal et al.

Cureus, Journal Year: 2025, Volume and Issue: unknown

Published: April 2, 2025

Introduction: Artificial intelligence (AI) chatbots have been widely tested for their performance on various examinations, but with limited data on clinical scenarios. The role of Chat Generative Pre-Trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States) and Gemini Advanced (Google LLC, Mountain View, California, United States) in multiple aspects of gastroenterology, including answering patient questions, providing medical advice, and serving as tools to potentially assist healthcare providers, has shown some promise, though associated with many limitations. We aimed to study the performance of ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced across 20 clinicopathologic scenarios in the relatively unexplored realm of gastrointestinal pathology. Materials and methods: Twenty clinicopathological scenarios in gastrointestinal pathology were provided to these three large language models. Two fellowship-trained pathologists independently assessed the responses, evaluating both diagnostic accuracy and confidence; the results were then compared using the chi-squared test. We also evaluated each model's ability in four key areas, namely, to (1) provide differential diagnoses, (2) interpret immunohistochemical stains, (3) deliver a concise final diagnosis, and (4) provide an explanation of its thought process, using a five-point scoring system. The mean and median scores, standard deviation (SD), and interquartile ranges were calculated. A comparative analysis of these parameters was conducted using the Mann-Whitney U test, with a p-value <0.05 considered statistically significant. Other parameters evaluated included tumor, node, metastasis (TNM) staging accuracy and the incidence of pseudo-references or "hallucinations" while citing reference material. Results: Gemini Advanced (diagnostic accuracy: p=0.01; final diagnosis: p=0.03) and ChatGPT-4.0 (interpretation of immunohistochemistry (IHC) stains: p=0.001; final diagnosis: p=0.002) performed significantly better in certain realms than ChatGPT-3.5, indicating continuously improving training sets. However, the mean performances ranged between 3.0 and 3.7 at best, classified as average. None of the models could provide accurate TNM staging for the scenarios, and 25-50% of the cited references did not exist (hallucinations). Conclusion: This study indicated that the AI models are evolving, but they need human supervision and definite improvements before being used in clinical medicine. This study is the first of its kind to our knowledge.
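As a hedged illustration of the score comparison described above (the ratings, sample sizes, and model labels below are placeholders, not the study's data), a Mann-Whitney U test on two sets of five-point scores might look like this:

```python
# Sketch of comparing five-point ratings for two models with a
# Mann-Whitney U test. All scores are invented placeholders.
from statistics import mean, median
from scipy.stats import mannwhitneyu

gpt4_scores  = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]   # e.g., IHC-interpretation ratings
gpt35_scores = [3, 2, 3, 3, 4, 2, 3, 3, 2, 3]

u, p = mannwhitneyu(gpt4_scores, gpt35_scores, alternative="two-sided")
print(f"model A mean={mean(gpt4_scores):.1f}, median={median(gpt4_scores)}")
print(f"model B mean={mean(gpt35_scores):.1f}, median={median(gpt35_scores)}")
print(f"Mann-Whitney U={u}, p={p:.4f}  (p < 0.05 -> significant difference)")
```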

Language: English

Citations: 0

Semantic Clinical Artificial Intelligence vs Native Large Language Model Performance on the USMLE
Peter L. Elkin, G.K. Mehta, Frank LeHouillier et al.

JAMA Network Open, Journal Year: 2025, Volume and Issue: 8(4), P. e256359 - e256359

Published: April 22, 2025

Importance: Large language models (LLMs) are being implemented in health care. Enhanced accuracy and methods to maintain accuracy over time are needed to maximize LLM benefits. Objective: To evaluate whether performance on the US Medical Licensing Examination (USMLE) can be improved by including formally represented semantic clinical knowledge. Design, Setting, and Participants: This comparative effectiveness research study was conducted between June 2024 and February 2025 at the Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, New York, using sample questions from USMLE Steps 1, 2, and 3. Intervention: Semantic clinical artificial intelligence (SCAI) was developed to insert formally represented semantic clinical knowledge into LLMs through retrieval augmented generation (RAG). Main Outcomes and Measures: The SCAI method was evaluated by comparing the text-based performance of 3 Llama LLMs (13B, 70B, and 405B; Meta), with and without RAG, in answering sample questions, as determined against the answer key. Results: The models were tested with 87 Step 1, 103 Step 2, and 123 Step 3 sample questions. The 13B model enhanced with RAG was associated with significantly improved performance on Step 1 but only just met the 60% passing threshold (74 correct [60.2%]). The 70B and 405B models passed all steps with RAG. The 70B model with RAG scored 80 (92.0%) correctly on Step 1, 82 (79.6%) on Step 2, and 112 (91.1%) on Step 3, while the 405B model with RAG scored 79 (90.8%) on Step 1, 87 (84.5%) on Step 2, and 117 (95.1%) on Step 3. Significant improvements were also found for Step 3, with the larger-parameter models performing better than the smaller model. Conclusions and Relevance: In this study, USMLE scores were higher with RAG, performing as well as or better than the native models without augmentation. Forms of reasoning by LLMs, like clinical reasoning, have the potential to improve performance on important medical questions. Improving health care with targeted, up-to-date knowledge is an important step toward implementation and acceptance.
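Below is a minimal sketch of the retrieval-augmented generation pattern the abstract describes, using an invented keyword-overlap retriever and placeholder knowledge snippets; the authors' SCAI system retrieves formally represented semantic knowledge, which this toy example does not reproduce:

```python
# Illustrative RAG loop: retrieve relevant knowledge snippets and prepend
# them to the exam question before it is sent to an LLM. The knowledge
# base and prompt format are hypothetical, not the authors' system.
def retrieve(question: str, knowledge: list[str], k: int = 2) -> list[str]:
    """Rank knowledge snippets by crude word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(knowledge,
                    key=lambda s: len(q_words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, knowledge: list[str]) -> str:
    """Assemble the augmented prompt passed to the model."""
    facts = "\n".join(f"- {fact}" for fact in retrieve(question, knowledge))
    return f"Relevant clinical knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"

knowledge_base = [
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Metformin is first-line therapy for type 2 diabetes mellitus.",
]
print(build_prompt("Which drug class improves survival after myocardial infarction?",
                   knowledge_base))
```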

Language: English

Citations: 0

Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis

Ling Wang, Jinglin Li, Boyang Zhuang et al.

Journal of Medical Internet Research, Journal Year: 2025, Volume and Issue: 27, P. e64486 - e64486

Published: April 30, 2025

Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, due to the high degree of specialization, complexity, and specificity of medicine, which results in extremely high accuracy requirements, controversy remains about whether LLMs can be used in this domain. More and more studies have evaluated the performance of various types of LLMs, but the conclusions are inconsistent. This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions and to provide high-level evidence-based support for their future development in the medical field. In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading the published reports. The NMA was conducted to compare the accuracy of different LLMs across question types, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification, and was performed using Bayesian and frequentist methods. Indirect intercomparisons between programs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM. We examined 168 articles encompassing 35,896 questions and 3063 cases. Of these studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance on objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at open-ended questions. In top 1 and top 3 diagnosis of cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked highest, while Claude Opus (SUCRA=0.9672) performed well in top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value in the triage and classification area. Our analysis shows that ChatGPT-4o has an advantage when answering objective questions, while for diagnostic questions human experts may be more credible: humans are more accurate at top 1 and top 3 diagnosis, Claude Opus performs better at top 5 diagnosis, and for triage and classification Gemini is advantageous. This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management scenarios. PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.
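For readers unfamiliar with SUCRA, the following sketch shows how the statistic is conventionally computed from posterior rank probabilities; the probabilities below are invented, not values from this meta-analysis:

```python
# SUCRA: given one row of posterior rank probabilities per model
# (probability of being ranked 1st, 2nd, ...), SUCRA is the mean of the
# cumulative ranking probabilities over the first (a - 1) ranks.
from itertools import accumulate

def sucra(rank_probs: list[float]) -> float:
    """SUCRA = sum of cumulative rank probabilities / (a - 1)."""
    a = len(rank_probs)
    cumulative = list(accumulate(rank_probs))[:-1]  # drop the final 1.0
    return sum(cumulative) / (a - 1)

rank_probabilities = {
    "model_A": [0.70, 0.20, 0.10],   # mostly ranked first -> SUCRA near 1
    "model_B": [0.25, 0.60, 0.15],
    "model_C": [0.05, 0.20, 0.75],   # mostly ranked last -> SUCRA near 0
}
for name, probs in rank_probabilities.items():
    print(f"{name}: SUCRA={sucra(probs):.3f}")
```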

Language: English

Citations: 0

Assessing the Current Limitations of Large-Language Models in Advancing Healthcare Education (Preprint)

Janghyeon Kim, Bathri Vajravelu

JMIR Formative Research, Journal Year: 2024, Volume and Issue: 9, P. e51319 - e51319

Published: Sept. 3, 2024

The integration of large language models (LLMs), as seen with the generative pretrained transformer series, into health care education and clinical management represents a transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet its embracement also elicits considerable concerns that necessitate careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvements: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) algorithmic bias, (6) exhibition of moral instability, (7) technological limitations of plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges. Future research should focus on strategically addressing the persistent challenges highlighted in this paper, opening the door for effective measures that can improve the use of LLMs in health care education.

Language: English

Citations: 3

Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control

Yan Wang, Lihua Liang, R. Li et al.

Journal of Multidisciplinary Healthcare, Journal Year: 2024, Volume and Issue: 17, P. 3917 - 3929

Published: Aug. 1, 2024

Chatbots, which are based on large language models, are increasingly being used in public health. However, the effectiveness of chatbot responses has been debated, and their performance in myopia prevention and control has not been fully explored. This study aimed to evaluate three well-known chatbots—ChatGPT, Claude, and Bard—in responding to public health questions about myopia.

Language: English

Citations: 2

Optimizing Natural Language Processing: A Comparative Analysis of GPT-3.5, GPT-4, and GPT-4o
Manuel Ayala-Chauvín, Fátima Avilés-Castillo

Data & Metadata, Journal Year: 2024, Volume and Issue: 3

Published: Jan. 1, 2024

In the last decade, the advancement of artificial intelligence has transformed multiple sectors, with natural language processing standing out as one of the most dynamic and promising areas. This study focused on comparing the GPT-3.5, GPT-4, and GPT-4o models, evaluating their efficiency and performance in natural language processing tasks such as text generation, machine translation, and sentiment analysis. Using a controlled experimental design, the response speed and the quality of the outputs generated by each model were measured. The results showed that GPT-4o significantly outperforms GPT-4 in terms of speed, completing tasks 25% faster in text generation and 20% faster in translation; in sentiment analysis, it was 30% faster than GPT-4. Additionally, the analysis of output quality, assessed using human reviews, showed that while GPT-3.5 delivers fast and consistent responses, GPT-4 and GPT-4o produce higher-quality and more detailed content. The findings suggest that GPT-4o is ideal for applications that require speed and consistency, while GPT-4, although slower, might be preferred in contexts where accuracy and detail are more important. The study highlights the need to balance efficiency and quality in the selection of language models and suggests implementing additional automatic evaluations in future research to complement the current results.
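A rough sketch of the kind of latency benchmark the study describes follows; the `call_model` function and its simulated latencies are placeholders, and the study's actual prompts and harness are not shown here:

```python
# Illustrative latency benchmark: time several runs of a model call per
# task and compare relative speed-ups. `call_model` is a stand-in that
# sleeps instead of hitting a real API.
import time
from statistics import mean

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call; sleeps to simulate latency."""
    time.sleep({"gpt-4": 0.05, "gpt-4o": 0.04}.get(model, 0.03))
    return "response"

def mean_latency(model: str, prompt: str, runs: int = 5) -> float:
    """Average wall-clock time of `runs` calls to the model."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(model, prompt)
        timings.append(time.perf_counter() - start)
    return mean(timings)

t4 = mean_latency("gpt-4", "Translate 'hello' to French.")
t4o = mean_latency("gpt-4o", "Translate 'hello' to French.")
print(f"gpt-4o is {(t4 - t4o) / t4:.0%} faster than gpt-4 on this task")
```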

Language: English

Citations: 2