ChatGPT as a Source for Patient Information on Patellofemoral Surgery—A Comparative Study Amongst Laymen, Doctors, and Experts
Andreas Frodl, Andreas Fuchs, Tayfun Yilmaz

et al.

Clinics and Practice, Journal Year: 2024, Volume and Issue: 14(6), P. 2376 - 2384

Published: Nov. 5, 2024

In November 2022, OpenAI launched ChatGPT for public use through a free online platform. ChatGPT is an artificial intelligence (AI) chatbot trained on a broad dataset encompassing a wide range of topics, including medical literature. Its usability in the medical field and the quality of AI-generated responses are widely discussed subjects of current investigations. Patellofemoral pain is one of the most common conditions among young adults, often prompting patients to seek advice. This study examines ChatGPT as a source of information regarding patellofemoral surgery, hypothesizing that there will be differences in the evaluation of responses generated by ChatGPT between populations with different levels of expertise in patellofemoral disorders.

Language: English

Can Large Language Models Replace Coding Specialists? Evaluating GPT Performance in Medical Coding Tasks
Yeli Feng

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 8, 2025

Abstract Purpose: Large Language Models (LLMs), GPT in particular, have demonstrated near human-level performance in the medical domain, from summarizing clinical notes and passing medical licensing examinations to predictive tasks such as disease diagnosis and treatment recommendations. However, there is currently little research on their efficacy for medical coding, a pivotal component of health informatics, clinical trials, and reimbursement management. This study proposes a prompt framework and investigates its effectiveness in medical coding tasks. Methods: First, the prompt framework is proposed. It aims to improve the accuracy of complex coding tasks by leveraging state-of-the-art (SOTA) prompting techniques, including meta prompts, multi-shot learning, and dynamic in-context learning, to extract task-specific knowledge. The framework is implemented with a combination of the commercial GPT-4o and an open-source LLM. It is then evaluated on three different coding tasks. Finally, ablation studies are presented to validate and analyze the contribution of each module in the proposed framework. Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the 30 most frequent MS-DRG codes, with a top-5 accuracy of 90.0%. The result is comparable to the SOTA of 69.4%, which, to the best of our knowledge, is achieved by fine-tuning a LLaMA model. On clinical trial criteria, the framework achieves a macro F1 score of 68.4 on the CHIP-CTC test dataset in Chinese, close to the 70.9 achieved by a supervised model-training method in comparison. For the less semantically complex task, it achieves an F1 score of 79.7 on CHIP-STS, which is not competitive with most supervised methods. Conclusion: This study demonstrates that, for medical coding tasks, a carefully designed prompt-based approach can achieve performance similar to supervised approaches. Currently, it can be very helpful as an assistant to coding specialists, but it does not replace them. With the rapid advancement of LLMs, their potential to reliably automate medical coding in the future cannot be underestimated.
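As a rough illustration of what multi-shot prompting with dynamic in-context example retrieval involves, a minimal Python sketch follows. It is not the paper's implementation: the meta prompt, the example pool, and the string-similarity retrieval are hypothetical stand-ins (a real system would retrieve examples by embedding similarity from MIMIC-IV).

```python
# Hypothetical sketch of a multi-shot, dynamically retrieved in-context
# prompt for MS-DRG code prediction. All names and examples below are
# illustrative, not the paper's actual framework.
from difflib import SequenceMatcher

META_PROMPT = (
    "You are a medical coding assistant. Given a discharge summary, "
    "answer with the single most likely MS-DRG code."
)

# Labeled pool; in practice these would come from a coding dataset.
EXAMPLE_POOL = [
    ("Admitted with chest pain, PCI performed ...", "247"),
    ("Laparoscopic cholecystectomy without CC/MCC ...", "418"),
    ("Community-acquired pneumonia, treated with antibiotics ...", "194"),
]

def retrieve_examples(note: str, k: int = 2):
    """Pick the k pool examples most similar to the query note
    (a crude stand-in for embedding-based dynamic retrieval)."""
    scored = sorted(
        EXAMPLE_POOL,
        key=lambda ex: SequenceMatcher(None, note, ex[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(note: str) -> str:
    """Assemble meta prompt + retrieved shots + the query."""
    shots = "\n\n".join(
        f"Summary: {text}\nMS-DRG: {code}"
        for text, code in retrieve_examples(note)
    )
    return f"{META_PROMPT}\n\n{shots}\n\nSummary: {note}\nMS-DRG:"

print(build_prompt("Elderly patient with sepsis secondary to UTI ..."))
```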

Language: English

Citations: 0

Comparative Evaluation of Chatbot Responses on Coronary Artery Disease
Levent Pay, Ahmet Çağdaş Yumurtaş, Tuğba Çetin

et al.

Turk Kardiyoloji Dernegi Arsivi-Archives of the Turkish Society of Cardiology, Journal Year: 2025, Volume and Issue: unknown, P. 35 - 43

Published: Jan. 1, 2025

Objective: Coronary artery disease (CAD) is the leading cause of morbidity and mortality globally. The growing interest in natural language processing chatbots (NLPCs) has driven their inevitable widespread adoption in healthcare. The purpose of this study was to evaluate the accuracy and reproducibility of responses provided by NLPCs, such as ChatGPT, Gemini, and Bing, to frequently asked questions about CAD. Methods: Fifty frequently asked questions about CAD were asked twice, with a one-week interval, on ChatGPT, Gemini, and Bing. Two cardiologists independently scored the answers into four categories: comprehensive/correct (1), incomplete/partially correct (2), a mix of accurate and inaccurate/misleading (3), and completely inaccurate/irrelevant (4). The reproducibility of each NLPC's responses was assessed. Results: ChatGPT's responses were 86% comprehensive/correct and 14% incomplete/partially correct. In contrast, Gemini gave 68% comprehensive/correct responses, 30% incomplete/partially correct, and 2% inaccurate/misleading information. Bing delivered 60% comprehensive/correct, 26% incomplete/partially correct, and 8% inaccurate/misleading information. Reproducibility scores were 88% for ChatGPT, 84% for Gemini, and 70% for Bing. Conclusion: ChatGPT demonstrates significant potential to improve patient education on coronary artery disease, providing more sensitive responses compared with Bing and Gemini.
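The reproducibility score reported here reduces to simple percent agreement: the share of questions that received the same category on both asks. A minimal sketch, assuming that interpretation and using made-up ratings:

```python
# Percent-agreement reproducibility between two asks of the same
# questions. Category codes follow the 1-4 scale described above;
# the rating data are fabricated for demonstration.

def reproducibility(first_run: list[int], second_run: list[int]) -> float:
    """Percentage of questions scored in the same category both times."""
    assert len(first_run) == len(second_run)
    same = sum(a == b for a, b in zip(first_run, second_run))
    return 100.0 * same / len(first_run)

# 1 = comprehensive/correct ... 4 = completely inaccurate/irrelevant
week1 = [1, 1, 2, 1, 3, 1, 2, 1, 1, 4]
week2 = [1, 1, 2, 2, 3, 1, 1, 1, 1, 4]
print(f"Reproducibility: {reproducibility(week1, week2):.0f}%")  # 80%
```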

Language: English

Citations: 0

Effectiveness of Large Language Models in Stroke Rehabilitation Health Education: A Comparative Study of ChatGPT-4, MedGo, Qwen, and ERNIE Bot (Preprint)

Shiqi Qiang, Yang Liao, Yongchun Gu

et al.

Published: Feb. 28, 2025

BACKGROUND: Stroke is a leading cause of disability and death worldwide, with home-based rehabilitation playing a crucial role in improving patient prognosis and quality of life. Traditional health education models often fall short in terms of precision, personalization, and accessibility. In contrast, large language models (LLMs) are gaining attention for their potential in medical health education, owing to their advanced natural language processing capabilities. However, the effectiveness of LLMs in stroke rehabilitation health education remains uncertain. OBJECTIVE: This study evaluates four LLMs (ChatGPT-4, MedGo, Qwen, and ERNIE Bot) in stroke rehabilitation health education. The aim is to offer patients more precise and secure health education pathways while exploring the feasibility of using LLMs to guide health education. METHODS: In the first phase of this study, a literature review and expert interviews identified 15 common questions and 2 clinical cases relevant to stroke rehabilitation. These were input into the four LLMs in simulated consultations. Six experts (2 clinicians, 2 nursing specialists, 2 rehabilitation therapists) evaluated the LLM-generated responses on a 5-point Likert scale, assessing accuracy, completeness, readability, safety, and humanity. In the second phase, the top two performing LLMs from phase one were selected. Thirty patients undergoing stroke rehabilitation were recruited. Each patient asked both models 3 questions and rated their satisfaction; the responses were also assessed for text length and recommended reading age using a Chinese readability analysis tool. Data were analyzed using one-way ANOVA, post hoc Tukey HSD tests, and paired t-tests. RESULTS: The results revealed significant differences across the five dimensions: accuracy (P = .002), completeness (P < .001), readability (P = .04), safety (P = .007), and humanity (P < .001). ChatGPT-4 outperformed all other models in each dimension, with scores for accuracy (M = 4.28, SD = 0.84), completeness (M = 4.35, SD = 0.75), readability (SD = 0.85), safety (M = 4.38, SD = 0.81), and user-friendliness (M = 4.65, SD = 0.66). MedGo also excelled in accuracy (M = 4.06, SD = 0.78) and in one further dimension (SD = 0.74). Qwen and ERNIE Bot scored significantly lower across all dimensions compared with ChatGPT-4 and MedGo. ChatGPT-4 generated the longest responses (M = 1338.35, SD = 236.03) and had the highest recommended reading age (M = 12.88). Overall, ChatGPT-4 performed best, and MedGo provided the clearest responses. CONCLUSIONS: LLMs have shown strong performance in stroke rehabilitation health education, demonstrating potential for real-world applications, but further improvements are needed in professionalism and safety oversight.
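A minimal sketch of the statistical pipeline named in the methods (one-way ANOVA across the four models, Tukey HSD post hoc comparisons, and a paired t-test for the two finalists), run on simulated Likert ratings rather than the study's data:

```python
# Simulated 5-point Likert ratings stand in for the study's data.
# scipy.stats.tukey_hsd requires a recent SciPy release.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
chatgpt4 = rng.integers(3, 6, 30)  # ratings in {3, 4, 5}
medgo = rng.integers(3, 6, 30)
qwen = rng.integers(2, 5, 30)      # ratings in {2, 3, 4}
ernie = rng.integers(2, 5, 30)

# Omnibus test for any difference among the four models.
f_stat, p = stats.f_oneway(chatgpt4, medgo, qwen, ernie)
print(f"one-way ANOVA: F={f_stat:.2f}, p={p:.3f}")

# Pairwise post hoc comparisons.
print(stats.tukey_hsd(chatgpt4, medgo, qwen, ernie))

# Paired comparison of the two finalists on the same questions.
t, p_pair = stats.ttest_rel(chatgpt4, medgo)
print(f"paired t-test ChatGPT-4 vs MedGo: t={t:.2f}, p={p_pair:.3f}")
```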

Language: English

Citations: 0

Accuracy of online symptom assessment applications, large language models, and laypeople for self-triage decisions
Marvin Kopka, Niklas von Kalckreuth, Markus A. Feufel

et al.

npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1)

Published: March 25, 2025

Abstract: Symptom-Assessment Applications (SAAs, e.g., NHS 111 online) that assist laypeople in deciding if and where to seek care (self-triage) are gaining popularity, and Large Language Models (LLMs) are increasingly used too. However, there is no evidence synthesis on the accuracy of LLMs, and no review has contextualized the accuracy of SAAs and LLMs. This systematic review evaluates the accuracy of both SAAs and LLMs and compares them to laypeople. A total of 1549 studies were screened and 19 were included. The accuracy of SAAs was moderate but highly variable (11.5–90.0%), while the accuracy of LLMs (57.8–76.0%) and laypeople (47.3–62.4%) was moderate with low variability. Based on the available evidence, the use of SAAs or LLMs should neither be universally recommended nor discouraged; rather, we suggest that their utility be assessed based on the specific use case and user group under consideration.

Language: English

Citations: 0

Large Language Models’ Responses to Spinal Cord Injury: A Comparative Study of Performance
Jinze Li, Chao Chang, Yanqiu Li

et al.

Journal of Medical Systems, Journal Year: 2025, Volume and Issue: 49(1)

Published: March 25, 2025

Language: English

Citations: 0

Evaluating Sex and Age Biases in Multimodal Large Language Models for Skin Disease Identification from Dermatoscopic Images
Zhiyu Wan, Yuhang Guo, Shunxing Bao

et al.

Health Data Science, Journal Year: 2025, Volume and Issue: 5

Published: Jan. 1, 2025

Background: Multimodal large language models (LLMs) have shown potential in various health-related fields. However, many healthcare studies have raised concerns about the reliability and biases of LLMs in healthcare applications. Methods: To explore the practical application of multimodal LLMs in skin disease identification, and to evaluate sex and age biases, we tested the performance of 2 popular multimodal LLMs, ChatGPT-4 and LLaVA-1.6, across diverse demographic groups, using a subset of a dermatoscopic dataset containing around 10,000 images of 3 skin diseases (melanoma, melanocytic nevi, and benign keratosis-like lesions). Results: In comparison with three deep learning models (VGG16, ResNet50, and Model Derm) based on convolutional neural networks (CNNs) and one vision transformer model (Swin-B), we found that ChatGPT-4 and LLaVA-1.6 demonstrated overall accuracies that were 3% and 23% higher (and F1-scores 4% and 34% higher), respectively, than the best performing CNN-based baseline, while remaining 38% and 26% lower in accuracy (and 19% lower in F1-score) than Swin-B. Meanwhile, the multimodal LLMs are generally unbiased in identifying these skin diseases across sex and age groups, in contrast to Swin-B, which is biased in identifying melanocytic nevi. Conclusions: This study suggests the usefulness and fairness of multimodal LLMs in dermatological applications, aiding physicians and practitioners with diagnostic recommendations and patient screening. To further verify their reliability in healthcare, experiments on larger and more diverse datasets need to be performed in the future.
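A subgroup bias check of the kind described can be sketched as per-group accuracy plus the gap between groups. The records and labels below are illustrative placeholders, not the study's data schema:

```python
# Per-group accuracy and accuracy gap as a simple bias signal.
# (true_label, predicted_label, sex) triples are made-up examples.
from collections import defaultdict

predictions = [
    ("melanoma", "melanoma", "F"),
    ("melanoma", "nevus", "M"),
    ("nevus", "nevus", "F"),
    ("nevus", "nevus", "M"),
    ("keratosis", "keratosis", "F"),
    ("keratosis", "nevus", "M"),
]

hits = defaultdict(int)
totals = defaultdict(int)
for truth, pred, sex in predictions:
    totals[sex] += 1
    hits[sex] += truth == pred

acc = {g: hits[g] / totals[g] for g in totals}
print(acc)                       # accuracy per demographic group
print(abs(acc["F"] - acc["M"]))  # gap between groups
```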

Language: English

Citations: 0

Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots
Himel Mondal, Devendra Nath Tiu, Shaikat Mondal

et al.

Journal of Mid-life Health, Journal Year: 2025, Volume and Issue: 16(1), P. 45 - 50

Published: Jan. 1, 2025

ABSTRACT Background: The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of the information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health information instead of consulting a doctor. In this context, we evaluated responses from six LLM chatbots to questions on midlife health in men and women. Methods: Twenty questions were asked to six different LLM chatbots – ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and rated for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by taking the average of the four criteria. In addition, the responses were analyzed using the Flesch-Kincaid Grade Level to determine how easily they could be understood by the general population. Results: In terms of accuracy, Perplexity scored highest (4.3 ± 1.78), followed by Meta AI (4.26 ± 0.16), while Meta AI scored highest in relevancy (4.35 ± 0.24). Overall, Perplexity (4.28) was followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity showed a readability score of 41.24 ± 10.57 and a grade level of 11.11 ± 1.93, meaning its text was the easiest to read and required a lower level of education. Conclusion: LLM chatbots can answer midlife-related health questions with variable capabilities. Perplexity was found to be the highest scoring chatbot for addressing men's and women's midlife health questions and offers highly accessible information. Hence, chatbots can be used as educational tools, with the appropriate chatbot selected according to its capability.
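The Flesch-Kincaid Grade Level used above is a fixed formula: 0.39 × (words ÷ sentences) + 11.8 × (syllables ÷ words) − 15.59. A short sketch follows; its vowel-group syllable counter is a rough heuristic, so outputs only approximate the dictionary-based counts that published tools use:

```python
# Approximate Flesch-Kincaid Grade Level.
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as one syllable each (heuristic).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

sample = "Hot flashes are common in midlife. Regular exercise may help."
print(f"FK grade level: {fk_grade_level(sample):.1f}")
```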

Language: English

Citations: 0

Large Language Model Use Cases in Healthcare Research are Redundant and Often Lack Appropriate Methodological Conduct: A Scoping Review and Call for Improved Practices
Kyle N. Kunze, Cameron Gerhold, Udit Dave

et al.

Arthroscopy The Journal of Arthroscopic and Related Surgery, Journal Year: 2025, Volume and Issue: unknown

Published: April 1, 2025

Language: English

Citations: 0

Assessing the Accuracy, Completeness and Safety of ChatGPT-4o Responses on Pressure Injuries in Infants: Clinical Applications and Future Implications

Marica Soddu, Andrea De Vito, Giordano Madeddu

et al.

Nursing Reports, Journal Year: 2025, Volume and Issue: 15(4), P. 130 - 130

Published: April 14, 2025

Background/Objectives: The advent of large language models (LLMs) and platforms such as ChatGPT, capable of generating quick and interactive answers to complex questions, opens the way for new approaches to training healthcare professionals, enabling them to acquire up-to-date and specialised information easily. In nursing, they have proven to support clinical decision making, continuing education, the development of care plans, the management of clinical cases, as well as the writing of academic reports and scientific articles. Furthermore, their ability to provide rapid access to information can improve the quality of care and promote evidence-based practice. However, their applicability in clinical practice requires thorough evaluation. This study evaluated the accuracy, completeness and safety of responses generated by ChatGPT-4o on pressure injuries (PIs) in infants. Methods: In January 2025, we analysed 60 queries, subdivided into 12 main topics, on PIs in infants. The questions were developed through consultation of authoritative documents, based on their relevance to nursing practice and clinical potential. A panel of five experts, using a 5-point Likert scale, assessed the responses generated by ChatGPT. Results: Overall, over 90% of ChatGPT-4o responses received relatively high ratings on all three criteria, with the most frequent value being 4. However, when analysing the topics individually, we observed that Medical Device Management and Technological Innovation had the lowest accuracy scores, while Scientific Evidence had the highest. No response was rated as completely incorrect. Conclusions: ChatGPT-4o has shown a good level of accuracy, completeness and safety in addressing questions about PIs in infants, but ongoing updates and the integration of high-quality sources are essential to ensuring its reliability as a decision-support tool.
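Aggregating a five-expert Likert panel per topic, as described, amounts to comparing per-topic means and modal ratings. A small sketch with placeholder topics and scores:

```python
# Compare topics by mean score and most frequent (modal) rating.
# Topic names and ratings are placeholders, not the study's data.
from statistics import mean, mode

ratings_by_topic = {
    "Medical Device Management": [3, 4, 3, 4, 3],
    "Scientific Evidence": [5, 4, 5, 5, 4],
    "Prevention": [4, 4, 5, 4, 4],
}

# Print topics from lowest to highest mean rating.
for topic, scores in sorted(ratings_by_topic.items(), key=lambda kv: mean(kv[1])):
    print(f"{topic}: mean={mean(scores):.1f}, mode={mode(scores)}")
```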

Language: English

Citations: 0

A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions
Asma Musabah Alkalbani, Ahmed Salim Alrawahi, Ahmad Salah

et al.

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: April 16, 2025

Abstract Background: Large Language Models (LLMs) are one of the artificial intelligence (AI) technologies used to understand and generate text, summarize information, and comprehend contextual cues. LLMs have been increasingly used by researchers in various medical applications, but their effectiveness and limitations are still uncertain, especially across specialties. Objective: This review evaluates recent literature on how LLMs are utilized in research studies across 19 medical specialties. It also explores the challenges involved and suggests areas for future focus. Methods: Two reviewers performed searches in PubMed, Web of Science, and Scopus to identify studies published from January 2021 to March 2024. The included studies covered the usage of LLMs in performing medical tasks. Data were extracted and analyzed by five reviewers. To assess the risk of bias, a quality assessment was performed using the revised quality assessment tool for artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI). Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches. A total of 84 studies were included in this review, mainly originating from two countries: the USA (35/84) and China (16/84). Although the reviewed applications spread across specialties, multi-specialty studies were the most common, demonstrated in 22 studies. The various aims included clinical natural language processing (31/84), supporting clinical decision making (20/84), medical education (15/84), diagnoses, and patient management and engagement (3/84). GPT-based and BERT-based models were the most used (83/84). Despite reported positive impacts, such as improved efficiency and accuracy, challenges related to reliability and ethics remain. The overall risk of bias was low in 72 studies, high in 11, and not clear in 3. Conclusion: GPT-based models dominate medical specialty research, with over 98.8% of the reviewed studies using these models. Despite the potential benefits of LLMs in the diagnostic process, a key finding concerns the substantial variability in performance among LLMs. For instance, LLMs' accuracy ranged from 3% in clinical decision support to 90% in some NLP tasks. Heterogeneity in LLM utilization across diverse tasks and contexts prevented a meaningful meta-analysis, as studies lacked standardized methodologies, outcome measures, and implementation approaches. Therefore, wide room for improvement remains, such as developing domain-specific LLMs and establishing data and evaluation standards to ensure reliability and effectiveness.

Language: English

Citations: 0