Comparing AI-Generated and Clinician-Created Personalized Self-Management Guidance for Knee Osteoarthritis Patients: A Blinded Observational Study (Preprint) DOI Creative Commons
Kai Du, Ao Li, Qi Zuo

et al.

Published: Oct. 22, 2024

BACKGROUND: Personalized education is crucial for effective knee osteoarthritis (OA) management, but providing it remains challenging due to an imbalanced patient-provider ratio and limited resources. This study explores the potential of GPT-4, a large language model, to generate tailored self-management guidance and compares its performance with physician-provided advice. OBJECTIVE: This study aims to evaluate the effectiveness of GPT-4 in generating personalized education materials for knee OA patients and to compare it with experienced clinicians, specifically in terms of efficiency, readability, accuracy, personalization, comprehensiveness, and safety. By leveraging patient data from previous trials, we evaluated whether AI can improve the quality and accuracy of guidance and thereby improve care outcomes. METHODS: A two-phase, blinded, observational study was conducted using data from a previous trial. In phase one, two orthopedic specialists created personalized education materials. In phase two, the same patient data were input into GPT-4 by a physician to generate content. Materials were assessed for efficiency (words per minute), readability (Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index), accuracy, personalization, comprehensiveness, and safety. RESULTS: GPT-4 demonstrated higher efficiency than clinicians (median 530.03 vs. 37.29 WPM, P < 0.001). GPT-4 content exhibited superior readability on the Flesch-Kincaid Grade Level and the other readability indices. Expert evaluations revealed that GPT-4 outperformed clinicians on accuracy (5.307 ± 1.731 vs. 4.76 ± 1.098, P = 0.047), personalization (54.32 ± 6.212 vs. 33.2 ± 5.395, P < 0.001), comprehensiveness (51.74 ± 6.471 vs. 35.26 ± 6.657), and safety (61 vs. 50). CONCLUSIONS: GPT-4 shows promise for generating high-quality, personalized self-management guidance for knee OA, in several respects surpassing human experts. The study provides novel evidence for enabling precise, intelligent patient education. Further research is needed to validate these findings in larger populations and to assess their impact.
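The abstract grades the two sources of guidance with standard readability indices and a words-per-minute efficiency measure. As a rough illustration only (not the study's own pipeline), the sketch below computes two of the named indices, Flesch-Kincaid Grade Level and Gunning Fog, from simple text statistics; the syllable counter is a crude heuristic assumption.

```python
# Rough sketch of how two of the readability metrics named in the abstract
# are commonly computed; NOT the study's own pipeline. The syllable counter
# is a crude heuristic.
import re

def count_syllables(word: str) -> int:
    # Heuristic: count groups of consecutive vowels (at least 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    return {
        # Flesch-Kincaid Grade Level
        "fkgl": 0.39 * wps + 11.8 * spw - 15.59,
        # Gunning Fog Index
        "fog": 0.4 * (wps + 100 * complex_words / len(words)),
    }

advice = ("Walk for thirty minutes on most days. "
          "Stop and rest if your knee pain gets worse.")
print(readability(advice))
```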

Language: English

The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study DOI Open Access
Keiichi Ohta,

Satomi Ohta

Cureus, Journal Year: 2023, Volume and Issue: unknown

Published: Dec. 12, 2023

Purpose This study aims to evaluate the performance of three large language models (LLMs), Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and to assess their potential clinical applications in Japan. Methods A total of 185 questions from the JNDE were used. These questions were categorized by question type and category. McNemar's test compared correct response rates between two LLMs, while Fisher's exact test evaluated the LLMs within each category. Results The overall correct response rate was 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher rate than Bard and GPT-3.5. In the category of essential questions, GPT-4 achieved 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-3.5 and Bard fell short of this benchmark, with Bard attaining 77.6% and GPT-3.5 only 52.5%; the scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, GPT-4, Bard, and GPT-3.5 scored 71.2%, 58.5%, and 52.5%, respectively, and GPT-4 outperformed the other models. For professional dental questions, the scores were 51.6%, 45.3%, and 35.9%, respectively, but the differences among the models were not statistically significant. All LLMs demonstrated lower accuracy on professional dentistry questions than on the other question types. Conclusions GPT-4 achieved the highest score on the JNDE, followed by Bard and GPT-3.5. However, only GPT-4 surpassed the passing criterion for essential questions. To further understand the application of LLMs worldwide, more research on examinations across different languages is required.
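The Methods name two concrete tests: McNemar's test for paired per-question correctness of two models, and Fisher's exact test within a question category. The sketch below shows how such comparisons are typically run; the counts are made up for illustration and are not the paper's data.

```python
# Illustrative sketch (not the paper's data) of the two tests named in the
# Methods: McNemar's test for paired correctness of two LLMs on the same
# questions, and Fisher's exact test for comparing models within a category.
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import fisher_exact

# Paired outcomes on the same questions:
# rows = model A (correct / wrong), columns = model B (correct / wrong).
paired_table = [[110, 26],   # hypothetical counts
                [13, 36]]
result = mcnemar(paired_table, exact=True)
print("McNemar p-value:", result.pvalue)

# Within one question category: correct vs. incorrect counts for two models.
category_table = [[33, 8],   # model A: correct, incorrect (hypothetical)
                  [21, 20]]  # model B: correct, incorrect (hypothetical)
odds_ratio, p_value = fisher_exact(category_table)
print("Fisher exact p-value:", p_value)
```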

Language: English

Citations

21

The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease DOI
Bright Huo,

Elisa Calabrese,

Patricia Sylla

et al.

Surgical Endoscopy, Journal Year: 2024, Volume and Issue: 38(5), P. 2320 - 2330

Published: April 17, 2024

Language: English

Citations

8

Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test DOI Creative Commons
Andrea Moglia, Κωνσταντίνος Γεωργίου, Pietro Cerveri

et al.

Artificial Intelligence Review, Journal Year: 2024, Volume and Issue: 57(9)

Published: Aug. 6, 2024

Abstract Large language models (LLMs) have the intrinsic potential to acquire medical knowledge. Several studies assessing LLMs on medical examinations have been published. However, there is no reported evidence on tests related to robot-assisted surgery. The aims of this study were to perform the first systematic review of this evidence and to establish whether ChatGPT, GPT-4, and Bard can pass the Fundamentals of Robotic Surgery (FRS) didactic test. A literature search was performed on PubMed, Web of Science, Scopus, and arXiv following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach. A total of 45 studies were analyzed. GPT-4 passed several national qualifying examinations with questions in English, Chinese, and Japanese using zero-shot and few-shot learning. Med-PaLM 2 obtained similar scores on the United States Medical Licensing Examination with more refined prompt engineering techniques. Five different 2023 releases of ChatGPT, one of GPT-4, and one of Bard were tested on the FRS, with seven attempts for each release. The passing score was 79.5%. ChatGPT achieved mean scores of 64.6%, 65.6%, 75.0%, 78.9%, and 72.7%, respectively, from the first to the fifth release on the FRS, vs. 91.5% for GPT-4 and 79.5% for Bard. GPT-4 outperformed all ChatGPT releases with a statistically significant difference (p < 0.001), but not Bard (p = 0.002). Our findings agree with the other studies included in the review. We highlighted the challenges and potential of LLMs to transform the education of healthcare professionals at all stages of learning, by assisting teachers in the preparation of teaching contents and trainees in the acquisition of knowledge, up to becoming an assessment framework for learners.

Language: English

Citations

8

AI Versus MD: Evaluating the surgical decision-making accuracy of ChatGPT-4 DOI

Deanna Palenzuela,

John T. Mullen, Roy Phitayakorn

et al.

Surgery, Journal Year: 2024, Volume and Issue: 176(2), P. 241 - 245

Published: May 19, 2024

Language: English

Citations

7

Artificial Intelligence and ChatGPT in Abdominopelvic Surgery: A Systematic Review of Applications and Impact DOI Open Access
Marta Goglia, Marco Pace,

Marco Yusef

et al.

In Vivo, Journal Year: 2024, Volume and Issue: 38(3), P. 1009 - 1015

Published: Jan. 1, 2024

Background/Aim: The integration of AI and natural language processing technologies, such as ChatGPT, into surgical practice has shown promising potential in enhancing various aspects of abdominopelvic procedures. This systematic review aims to comprehensively evaluate the current state of research on the applications and impact of artificial intelligence (AI) and ChatGPT in surgery, summarizing the existing literature and providing a comprehensive overview of the diverse applications, effectiveness, challenges, and future directions of these innovative technologies. Materials and Methods: A search of major electronic databases, including PubMed, Google Scholar, Cochrane Library, and Web of Science, was conducted from October to November 2023 to identify relevant studies. Inclusion criteria encompassed studies that investigated the utilization of ChatGPT in surgical settings, including, but not limited to, preoperative planning, intraoperative decision-making, postoperative care, and patient communication. Results: Fourteen studies met the inclusion criteria and were included in this review. The majority analysed ChatGPT's data output and decision making, while two reported general resident perceptions of the tool applied to clinical practice. Most studies reported high accuracy in the decision-making process, although with a non-negligible number of errors. Conclusion: This review contributes to the understanding of the role of AI and ChatGPT in surgery and provides insight into their applications; the synthesis of available evidence will inform future directions, guidelines, and the development of these technologies to optimize their benefits for patient care within the surgical domain.

Language: English

Citations

5

Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT DOI Creative Commons
Kyu Hong Lee, Ro Woon Lee,

Ye Eun Kwon

et al.

Diagnostics, Journal Year: 2023, Volume and Issue: 14(1), P. 90 - 90

Published: Dec. 30, 2023

This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of the two technologies in chest X-ray reading and to explore their potential applications in the medical imaging diagnosis domain. The methodology consisted of randomly selecting 2000 chest X-ray images from a single institution's patient database; two radiologists evaluated the readings provided by KARA-CXR and ChatGPT. Five qualitative factors were used to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher accuracy compared to ChatGPT. In the 'Acceptable' accuracy category, KARA-CXR was rated at 70.50% and 68.00% by the two observers, while ChatGPT was rated at 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with KARA-CXR at 0.74 and GPT4 at 0.73. For 'False Findings', KARA-CXR scored 68.50% and ChatGPT 37.00%, with high interobserver agreements of 0.96 for KARA-CXR and 0.97 for GPT4. For 'Location Inaccuracy' and 'Hallucinations', KARA-CXR outperformed ChatGPT by significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, which is higher than ChatGPT's 38%; interobserver agreement was high for KARA-CXR (0.91) and ChatGPT (0.85) in the hallucination category. In conclusion, this study demonstrates the potential of KARA-CXR in chest X-ray diagnostics, and it also shows that, in this domain, ChatGPT has relatively lower performance.

Language: English

Citations

13

ChatGPT’s Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type DOI Creative Commons
Kyu Hong Lee, Ro Woon Lee

Diagnostics, Journal Year: 2024, Volume and Issue: 14(2), P. 171 - 171

Published: Jan. 12, 2024

Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on its performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and the answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed and similarly categorized. The study utilized Cohen's kappa coefficient for assessing interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward questions, with over 85% of answers classified as correct. However, its performance varied significantly across the multiple-choice questions, with correct response rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring a deeper understanding and context. In conclusion, this study critically evaluates the accuracy and limitations of ChatGPT in addressing questions related to Magnetic Resonance Imaging (MRI), highlighting its potential in the healthcare sector, particularly in radiology. The findings demonstrate that ChatGPT, while proficient in responding to straightforward questions, exhibits variability in its ability to accurately answer complex questions that require a more profound understanding. This discrepancy underscores the nuanced role AI can play in medical education and clinical decision-making, necessitating a balanced approach to its application.
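The abstract states that interobserver agreement between the independent researchers was assessed with Cohen's kappa coefficient. The sketch below shows the standard two-rater computation; the category labels are made up for illustration and are not the study's data.

```python
# Minimal sketch of the interobserver-agreement statistic named in the
# abstract (Cohen's kappa between two raters); the labels are made up.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["correct", "correct", "partially correct", "incorrect",
           "correct", "partially correct", "correct", "incorrect"]
rater_2 = ["correct", "partially correct", "partially correct", "incorrect",
           "correct", "correct", "correct", "incorrect"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa (chance-corrected agreement): {kappa:.2f}")
```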

Language: English

Citations

4

Assessing the performance of Microsoft Copilot, GPT-4 and Google Gemini in ophthalmology DOI Creative Commons

Meziane Silhadi,

Wissam B. Nassrallah, David Mikhail

et al.

Canadian Journal of Ophthalmology, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 1, 2025

To evaluate the performance of large language models (LLMs), specifically Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced), in answering ophthalmological questions, and to assess the impact of prompting techniques on their accuracy. Prospective qualitative study. A total of 300 questions from StatPearls were tested, covering a range of subspecialties and image-based tasks. Each question was evaluated using 2 prompting techniques: zero-shot forced prompting (prompt 1) and combined role-based and plan-and-solve+ prompting (prompt 2). With zero-shot forced prompting, GPT-4o demonstrated significantly superior overall performance, correctly answering 72.3% of questions and outperforming all other models, including Copilot (53.7%), GPT-4o mini (62.0%), Gemini (54.3%), and Gemini Advanced (62.0%) (p < 0.0001). The models showed notable improvements with prompt 2 over prompt 1, which elevated Copilot's accuracy from the lowest (53.7%) to the second highest (72.3%) among the LLMs. While newer iterations of LLMs, such as Gemini Advanced, outperformed their less advanced counterparts (Gemini), this study emphasizes the need for caution in clinical applications of these models. The choice of prompting technique significantly influences accuracy, highlighting the necessity of further research to refine LLM capabilities, particularly in visual data interpretation, and to ensure safe integration into medical practice.
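The abstract's central finding is that the prompting technique itself shifted accuracy substantially. The study's actual prompt wording is not reproduced in this excerpt, so the sketch below is a hypothetical illustration of the two named techniques: zero-shot forced prompting versus a role-based, plan-and-solve+ style prompt; the sample question and options are invented.

```python
# Hypothetical illustration of the two prompting techniques named in the
# abstract; the study's actual prompt wording is not reproduced here.

def prompt_1_zero_shot_forced(question: str, options: list[str]) -> str:
    # Zero-shot "forced" prompting: demand a single answer choice, no reasoning.
    opts = "\n".join(options)
    return f"{question}\n{opts}\nAnswer with the single best option letter only."

def prompt_2_role_plan_and_solve(question: str, options: list[str]) -> str:
    # Role-based + plan-and-solve+ style: assign an expert role, ask the model
    # to devise a plan, execute it step by step, then commit to one option.
    opts = "\n".join(options)
    return ("You are an experienced ophthalmologist.\n"
            f"{question}\n{opts}\n"
            "First, devise a short plan for reaching the answer. Then carry out "
            "the plan step by step, checking each step, and finish with the "
            "single best option letter.")

q = "A patient presents with acute painless monocular vision loss. Most likely cause?"
opts = ["A. Central retinal artery occlusion", "B. Acute angle-closure glaucoma",
        "C. Optic neuritis", "D. Corneal abrasion"]
print(prompt_1_zero_shot_forced(q, opts))
print(prompt_2_role_plan_and_solve(q, opts))
```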

Language: English

Citations

0

Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions DOI Creative Commons
Jean‐Paul Bereuter,

Mark Enrik Geissler,

Anna Klimová

et al.

Journal of surgical education, Journal Year: 2025, Volume and Issue: 82(4), P. 103442 - 103442

Published: Feb. 9, 2025

Recent studies investigated the potential of large language models (LLMs) for clinical decision making and for answering exam questions based on text input. Recent developments have extended these LLMs with vision capabilities; such image-processing models are called vision-language models (VLMs). However, there is limited investigation of the applicability of VLMs and of their capability to answer exam questions with image content. Therefore, the aim of this study was to examine the performance of publicly accessible LLMs and VLMs on 2 different surgical question sets consisting of text- and image-based questions. Original questions from subsets of the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available models (GPT-4, Claude-3 Sonnet, Gemini-1.5). Model outputs were benchmarked for their accuracy on text- and image-based questions. Additionally, the LLMs' performance was compared with students' average historical performance (AHP) on these exams. Moreover, performance variations were analyzed in relation to question difficulty and the respective image type. Overall, all LLMs achieved scores equivalent to passing grades (≥60%) across both text-based datasets. On image-based questions, only GPT-4 exceeded the score required to pass, significantly outperforming Claude-3 Sonnet and Gemini-1.5 (GPT: 78% vs. Claude-3: 58% and Gemini-1.5: 57.3%; p < 0.001). GPT-4 also outperformed students' AHP on both question sets (83.7% vs. an AHP of 67.8%; p < 0.001; and vs. an AHP of 67.4%). GPT-4 demonstrated substantial vision capabilities, and it holds considerable potential for use in the education of trainee surgeons.
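The excerpt does not state which statistical test was used to compare a model's accuracy with the students' average historical performance (AHP); one simple option for comparing an observed accuracy against a fixed historical rate is an exact binomial test, sketched below with made-up counts.

```python
# Sketch, under the assumption that an exact binomial test is an acceptable
# way to compare an observed LLM accuracy with a fixed AHP rate; the counts
# below are made up for illustration and are not the paper's data.
from scipy.stats import binomtest

n_questions = 160        # hypothetical number of questions in one subset
n_correct = 134          # hypothetical number answered correctly (~83.7%)
student_ahp = 0.678      # students' average historical performance

result = binomtest(n_correct, n_questions, p=student_ahp, alternative="greater")
print(f"Observed accuracy: {n_correct / n_questions:.3f}")
print(f"One-sided p-value vs. AHP of {student_ahp}: {result.pvalue:.4g}")
```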

Language: English

Citations

0

Evaluating ChatGPT-4's performance on oral and maxillofacial queries: Chain of Thought and standard method DOI Creative Commons
Kaiyuan Ji, Z. D. Wu, Jing Han

et al.

Frontiers in Oral Health, Journal Year: 2025, Volume and Issue: 6

Published: Feb. 12, 2025

Objectives Oral and maxillofacial diseases affect approximately 3.5 billion people worldwide. With the continuous advancement of artificial intelligence technologies, particularly the application of generative pre-trained transformers like ChatGPT-4, there is potential to enhance public awareness, prevention, and early detection of these diseases. This study evaluated the performance of ChatGPT-4 in addressing oral and maxillofacial disease questions using the standard approach and the Chain of Thought (CoT) method, aiming to gain a deeper understanding of its capabilities, potential, and limitations. Materials and methods Three experts, drawing on their extensive experience with the questions most commonly encountered in clinical settings, selected 130 open-ended questions and 1,805 multiple-choice questions from the national dental licensing examination. These questions encompass 12 areas of oral and maxillofacial surgery, including Prosthodontics, Pediatric Dentistry, Maxillofacial Tumors, Salivary Gland Diseases, and Infections. Results Using the CoT approach, ChatGPT-4 exhibited marked enhancements in accuracy, structure, completeness, professionalism, and overall impression for the open-ended questions, with statistically significant differences compared to the standard method on general inquiries. For the multiple-choice questions, the CoT method boosted ChatGPT-4's accuracy across all major subjects, achieving an overall increase of 3.1%. Conclusions When employing ChatGPT-4 to address oral and maxillofacial disease questions, incorporating CoT as a querying strategy can help improve the quality of its answers to such issues. However, it is not advisable to consider it a substitute for doctors.
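The study compares a standard direct query with a Chain of Thought (CoT) query. The exact CoT instruction used by the authors is not given in this excerpt; the sketch below illustrates the general pattern of appending an explicit step-by-step reasoning instruction, with an invented example question.

```python
# Hypothetical sketch of the two querying styles compared in this study:
# a standard direct question vs. a Chain of Thought (CoT) query that asks
# the model to reason step by step before answering. The study's actual
# prompt wording is not reproduced here.

def standard_query(question: str) -> str:
    return question

def cot_query(question: str) -> str:
    return (f"{question}\n"
            "Let's think step by step: list the relevant findings, consider "
            "the likely diagnoses or options one by one, and then state your "
            "final answer clearly.")

question = ("A 7-year-old presents with swelling and pain below the ear that "
            "worsens at mealtimes. What is the most likely salivary gland problem?")
print(standard_query(question))
print(cot_query(question))
```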

Language: English

Citations

0