Evaluation of Chat Generative Pre-trained Transformer and Microsoft Copilot Performance on the American Society of Surgery of the Hand Self-Assessment Examinations

Taylor R. Rakauskas, António Costa, Claudio Moriconi et al.

Journal of Hand Surgery Global Online, Journal Year: 2024, Volume and Issue: 7(1), P. 23 - 28

Published: Nov. 13, 2024

Artificial intelligence advancements have the potential to transform medical education and patient care. The increasing popularity of large language models has raised important questions regarding their accuracy and agreement with human users. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT), versions 3.5 and 4, as well as Microsoft Copilot, which is powered by ChatGPT-4, on self-assessment examinations for hand surgery and to compare results between versions. Input included 1,000 questions across 5 years (2015-2019) of examinations provided by the American Society for Surgery of the Hand. The primary outcomes were correctness, percentage concordance relative to other users, and whether an additional prompt was required. Secondary outcomes were performance according to question type and difficulty. All question formats, including image-based questions, were used in the analysis. ChatGPT-3.5 correctly answered 51.6% of questions and ChatGPT-4 correctly answered 63.4%, a statistically significant difference. Copilot correctly answered 59.9%, outperforming ChatGPT-3.5 but scoring significantly lower than ChatGPT-4. However, ChatGPT-3.5 sided with an average of 72.2% of users when correct and 62.1% when incorrect, compared with 67.0% and 53.2%, respectively, for ChatGPT-4, and 79.7% and 52.1% for Copilot. The highest-scoring subject was Miscellaneous and the lowest-scoring was Neuromuscular for all models. In this study, Copilot performed better in each subspecialty than did ChatGPT-3.5; it was more accurate than ChatGPT-3.5 but less accurate than ChatGPT-4. The models were able to "pass" the 2015-2019 Hand examinations. While these tools hold promise within medical education, caution should be exercised, and detailed evaluation of their consistency is needed. Future studies should explore how these models perform across multiple trials and contexts to truly assess their reliability.
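For readers unfamiliar with the outcome measures, accuracy and user concordance conditioned on correctness are simple to compute once each question is paired with the examination key and the human response data. Below is a minimal sketch assuming a hypothetical per-question record layout; the field names and data are illustrative, not taken from the study.

```python
# Hypothetical scoring of model answers against an exam key, plus
# concordance with human examinees split by correct/incorrect answers.
from dataclasses import dataclass

@dataclass
class Item:
    model_answer: str          # letter chosen by the model, e.g. "B"
    keyed_answer: str          # correct letter from the examination key
    pct_users_agreeing: float  # share of examinees who chose model_answer

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else float("nan")

def score(items: list[Item]) -> dict[str, float]:
    correct = [i for i in items if i.model_answer == i.keyed_answer]
    incorrect = [i for i in items if i.model_answer != i.keyed_answer]
    return {
        "accuracy": len(correct) / len(items),
        "concordance_when_correct": mean([i.pct_users_agreeing for i in correct]),
        "concordance_when_incorrect": mean([i.pct_users_agreeing for i in incorrect]),
    }

# Toy usage with invented items:
print(score([Item("B", "B", 72.0), Item("C", "A", 55.0), Item("D", "D", 80.0)]))
```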

Language: English

The Performance of ChatGPT on the American Society for Surgery of the Hand Self-Assessment Examination
S. Arango, Jason C. Flynn, Jacob Zeitlin et al.

Cureus, Journal Year: 2024, Volume and Issue: unknown

Published: April 24, 2024

This study aims to compare the performance of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4 (GPT-4) on the American Society for Surgery of the Hand (ASSH) Self-Assessment Examination (SAE) and to determine their potential as educational tools.

Language: English

Citations: 10

AI in Hand Surgery: Assessing Large Language Models in the Classification and Management of Hand Injuries
Sophia M. Pressman, Sahar Borna, Cesar A. Gomez-Cabello et al.

Journal of Clinical Medicine, Journal Year: 2024, Volume and Issue: 13(10), P. 2832 - 2832

Published: May 11, 2024

Background: OpenAI's ChatGPT (San Francisco, CA, USA) and Google's Gemini (Mountain View, CA, USA) are two large language models that show promise in improving and expediting medical decision making in hand surgery. Evaluating the applications of these models within the field of hand surgery is warranted. This study aims to evaluate ChatGPT-4 and Gemini in classifying hand injuries and recommending treatment. Methods: The models were given 68 fictionalized clinical vignettes of hand injuries twice. They were asked to use a specific classification system and to recommend surgical or nonsurgical management. Classifications were scored based on correctness. Results were analyzed using descriptive statistics, a paired two-tailed t-test, and sensitivity testing. Results: Gemini, which correctly classified 70.6% of injuries, demonstrated superior classification ability over ChatGPT (mean score 1.46 vs. 0.87, p-value < 0.001). For management, ChatGPT showed higher sensitivity for recommending surgical intervention compared with Gemini (98.0% vs. 88.8%), but lower specificity (68.4% vs. 94.7%). When compared with ChatGPT, Gemini demonstrated greater response replicability. Conclusions: Large language models like ChatGPT and Gemini show promise in assisting medical decision making, particularly in hand surgery, with Gemini generally outperforming ChatGPT. These findings emphasize the importance of considering the strengths and limitations of different models when integrating them into clinical practice.
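As a note on the metrics reported above: sensitivity and specificity follow directly from confusion-matrix counts. Below is a minimal sketch; the counts are invented purely to reproduce percentages of the same magnitude as those reported, not the study's actual data.

```python
# Sensitivity/specificity from confusion-matrix counts (illustrative only).

def sensitivity(tp: int, fn: int) -> float:
    # True-positive rate: surgical cases correctly recommended for surgery.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # True-negative rate: nonsurgical cases correctly managed without surgery.
    return tn / (tn + fp)

# Invented counts: 49 of 50 surgical vignettes flagged (98.0% sensitivity),
# 13 of 19 nonsurgical vignettes spared surgery (~68.4% specificity).
print(f"sensitivity = {sensitivity(tp=49, fn=1):.3f}")   # 0.980
print(f"specificity = {specificity(tn=13, fp=6):.3f}")   # 0.684
```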

Language: English

Citations: 8

Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

Cindy Ho, Tiffany Tian, Alessandra T. Ayers et al.

BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1)

Published: Nov. 26, 2024

The large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". Defining high-quality LLM output has not been done consistently by researchers over the past 1.5 years, and we identified a high degree of variation in how studies reported their findings and assessed performance. Standardized reporting of the qualitative evaluation metrics used to assess the quality of LLM outputs can be developed to facilitate research on LLMs in healthcare.

Language: English

Citations: 5

From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
Markus Kipp

Information, Journal Year: 2024, Volume and Issue: 15(9), P. 543 - 543

Published: Sept. 5, 2024

ChatGPT is a large language model trained on increasingly large datasets to perform diverse language-based tasks. It is capable of answering multiple-choice questions, such as those posed by medical examinations, and has been generating considerable attention in both academic and non-academic domains in recent months. In this study, we aimed to assess GPT's performance on anatomical questions retrieved from medical licensing examinations in Germany. Two different versions, GPT-3.5 and GPT-4.o, were compared. GPT-3.5 demonstrated moderate accuracy, correctly answering 60–64% of the questions from the autumn 2022 and spring 2021 exams. In contrast, GPT-4.o showed significant improvement, achieving 93% accuracy on the autumn 2022 exam and 100% on the spring 2021 exam. When tested on 30 unique questions not available online, GPT-4.o maintained a 96% accuracy rate. Furthermore, GPT-4.o consistently outperformed medical students across six state exams, with a statistically significant mean score of 95.54% compared with the students' 72.15%. The study demonstrates that GPT-4.o outperforms its predecessor, GPT-3.5, and a cohort of medical students, indicating its potential as a powerful tool in medical education and assessment. This improvement highlights the rapid evolution of LLMs and suggests that AI could play an important role in supporting and enhancing medical training, potentially offering supplementary resources for educators and professionals. However, further research is needed to assess the limitations and practical applications of AI systems in real-world practice.

Language: English

Citations: 4

Comparison of the performance of French orthopedic surgery residents and the ChatGPT-4/4o artificial intelligence on the French orthopedic and trauma surgery specialty diploma (DES) examinations

Nabih Maraqa, Ramy Samargandi, A. Poichotte et al.

Revue de Chirurgie Orthopédique et Traumatologique, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Citations: 0

Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment
Yihong Qiu, Chang Liu

Global Medical Education, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 13, 2025

Abstract Objectives: Artificial intelligence (AI) is being increasingly used in medical education. This narrative review presents a comprehensive analysis of generative AI tools' performance in answering and generating medical exam questions, thereby providing a broader perspective on AI's strengths and limitations in the medical education context. Methods: The Scopus database was searched for studies on generative AI in medical examinations from 2022 to 2024. Duplicates were removed, and relevant full texts were retrieved following inclusion and exclusion criteria. Narrative analysis and descriptive statistics were used to analyze the contents of the included studies. Results: A total of 70 studies were included for analysis. The results showed that performance varied across different question types and specialty topics, with the best average accuracy in psychiatry, and that performance was influenced by the prompts used. With well-crafted prompts, the models can efficiently produce high-quality examination questions. Conclusion: Generative AI possesses the ability to answer and generate medical examination questions using carefully designed prompts. Its potential use in medical assessment is vast, ranging from detecting question errors and aiding exam preparation to facilitating formative assessments and supporting personalized learning. However, it is crucial that educators always double-check AI responses to maintain accuracy and prevent the spread of misinformation.
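To make the "well-crafted prompts" point concrete, a question-generation prompt can be templated along the lines below. The wording and structure are illustrative assumptions, not prompts taken from the reviewed studies.

```python
# Hypothetical prompt template for AI-assisted exam-question generation.
PROMPT_TEMPLATE = """You are an examiner writing board-style questions.
Write one single-best-answer multiple-choice question on {topic}
at {difficulty} difficulty. Provide:
1. A clinical vignette stem.
2. Five options labelled A-E, with exactly one correct option.
3. The correct answer letter and a short explanation citing a source.
Every item must be verifiable; do not invent statistics."""

print(PROMPT_TEMPLATE.format(topic="scaphoid fractures", difficulty="intermediate"))
```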

Language: English

Citations: 0

Matching Human Expertise: ChatGPT’s Performance on Hand Surgery Examinations

Zachary A. Kirschenbaum, Yuri Han, Kiera L. Vrindten et al.

Hand, Journal Year: 2025, Volume and Issue: unknown

Published: March 20, 2025

Background: The integration of artificial intelligence (AI) into health care has witnessed significant advancements, particularly with AI-driven tools like ChatGPT. Initial evaluations indicated that ChatGPT 3.5 did not perform as well as humans on specialized hand surgery self-assessment examinations. The purpose of this study is to evaluate the performance of ChatGPT 4o on American Society for Surgery of the Hand (ASSH) questions and to determine whether enhanced techniques, such as better prompts and file search, improve accuracy. Methods: Using data from the ASSH examinations (2008-2013), we explored the impact of model version, prompt, and file search on the accuracy of AI-generated responses. We used OpenAI’s application programming interface to automate question input and response scoring. Statistical analysis was conducted using one-way analysis of variance, and the Kuder-Richardson Formula 20 (KR-20) was used to assess the reliability of the test. Results: The results indicate that the latest AI models, with better prompting and access to peer-reviewed literature, can achieve accuracy levels comparable to human examinees on text-based questions. ChatGPT 4o performed significantly better than ChatGPT 3.5 and showed a marked improvement in capabilities. The KR-20 for the 2013 examination was 0.946, indicating a very reliable test. Conclusions: These findings highlight AI's potential to support medical education and clinical practice, demonstrating performance at a human-equivalent level on text-based questions. Our results suggest its utility as a supplementary tool in educational settings and as a supportive resource in clinical practice.
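The pipeline described (API-automated question input and scoring, with KR-20 for reliability) can be sketched as below. This is a hedged illustration, not the study's actual code: the model name, system prompt, and data layout are assumptions; only the KR-20 formula itself is standard.

```python
# Sketch: automate multiple-choice answering via OpenAI's API, then estimate
# test reliability with the Kuder-Richardson Formula 20 (KR-20).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question_text: str) -> str:
    """Send one exam question; return the single answer letter."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name, for illustration
        messages=[
            {"role": "system", "content": "Answer with a single letter A-E only."},
            {"role": "user", "content": question_text},
        ],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def kr20(responses: list[list[int]]) -> float:
    """KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / var(total scores)).

    responses[examinee][item] is 1 if correct, 0 otherwise;
    requires at least two examinees and two items.
    """
    k = len(responses[0])
    n = len(responses)
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / (n - 1)  # sample variance
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # item difficulty p_i
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)
```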

Language: English

Citations: 0

Exploring the Current Applications of Artificial Intelligence in Orthopaedic Surgical Training: A Systematic Scoping Review
Ahmed Al-Saadawi, Sam Tehranchi, S. Faisal Ahmed et al.

Cureus, Journal Year: 2025, Volume and Issue: unknown

Published: April 3, 2025

In recent years, the integration of artificial intelligence (AI) in surgical education has become prominent, as evidenced by the growing number of publications. Given the unique requirements and challenges associated with orthopaedic training, we conducted a systematic scoping review that examined applications of AI in this setting only. A comprehensive search was conducted across four databases: Embase, CENTRAL, Medline, and Scopus. Original research articles that utilised an AI model within a specific orthopaedic educational context were considered for inclusion. Data from the included studies were extracted onto a bespoke form, followed by thematic analysis to detect patterns in the data. Our findings were then summarised descriptively. A total of 21 studies were included in the review, encompassing 273 participants. In relation to AI applications, two overarching themes were identified: refinement of surgical competencies and enhancement of knowledge acquisition. All studies, with the exception of one, were published in the last five years. Twelve distinct AI models were identified, with large language models accounting for over half of the applications. Multiple promising interventions were highlighted, particularly the use of personalised automated feedback in evaluating performance on surgical tasks. AI holds major potential to revolutionise orthopaedic training. However, the evidence supporting its use in this field remains limited. Further studies, preferably randomised controlled trials with larger sample sizes, are required to strengthen the evidence base.

Language: English

Citations: 0

Large Language Model Use Cases in Healthcare Research are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices
Kyle N. Kunze, Cameron Gerhold, Udit Dave et al.

Arthroscopy The Journal of Arthroscopic and Related Surgery, Journal Year: 2025, Volume and Issue: unknown

Published: April 1, 2025

Language: English

Citations: 0

Comparison of Hand Surgery Certification Exams in Europe and the United States Using ChatGPT 4.0

Salman Hasan, Kyros Ipaktchi, Nicolás Meyer et al.

Journal of Hand and Microsurgery, Journal Year: 2025, Volume and Issue: unknown, P. 100258 - 100258

Published: May 1, 2025

Language: English

Citations: 0