A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions
Asma Musabah Alkalbani, Ahmed Salim Alrawahi, Ahmad Salah, et al.

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: April 16, 2025

Abstract Background: Large Language Models (LLMs) are artificial intelligence (AI) technologies used to understand and generate text, summarize information, and comprehend contextual cues. LLMs have been increasingly used by researchers in various medical applications, but their effectiveness and limitations remain uncertain, especially across specialties. Objective: This review evaluates recent literature on how LLMs have been utilized in research studies across 19 medical specialties. It also explores the challenges involved and suggests areas for future focus. Methods: Two reviewers performed searches of PubMed, Web of Science, and Scopus to identify studies published from January 2021 to March 2024. The included studies reported the usage of LLMs in performing medical tasks. Data were extracted and analyzed by five reviewers. To assess risk of bias, quality assessment was performed using the revised tool for artificial intelligence-centered diagnostic accuracy (QUADAS-AI). Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches. A total of 84 studies were included, mainly originating from two countries: the USA (35/84) and China (16/84). Although the reviewed applications spread across specialties, multi-specialty applications were demonstrated in 22 studies. The various aims include clinical natural language processing (31/84), supporting clinical decisions (20/84), education (15/84), diagnoses, and patient management and engagement (3/84). GPT-based and BERT-based models were the most used (83/84). Despite reported positive impacts such as improved efficiency and accuracy, concerns related to reliability and ethics remain. The overall risk of bias was low in 72 studies, high in 11, and not clear in 3. Conclusion: GPT-based and BERT-based models dominate medical specialty applications, with over 98.8% of the reviewed studies using these models. Despite potential benefits to clinical processes and diagnostics, a key finding regards the substantial variability in performance among LLMs. For instance, LLMs' performance ranged from 3% in some decision support tasks to 90% in some NLP tasks. Heterogeneity in LLM utilization across diverse tasks and contexts prevented a meaningful meta-analysis, as studies lacked standardized methodologies, outcome measures, and implementation approaches. Therefore, wide room for improvement remains in developing domain-specific models and data and in establishing standards to ensure reliability and effectiveness.

Language: English

Can large language models provide accurate and quality information to parents regarding chronic kidney diseases?
Rüya Naz, Okan Akacı, Hakan Erdoğan, et al.

Journal of Evaluation in Clinical Practice, Journal Year: 2024, Volume and Issue: 30(8), P. 1556 - 1564

Published: July 3, 2024

Artificial Intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. The use of these tools in various medical contexts is currently being studied. However, their performance and the quality of their content have not been evaluated in specific medical fields.

Language: English

Citations: 5

Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware

Jonathan A Carlson, Robin Z Cheng, Alyssa Lange, et al.

Cureus, Journal Year: 2024, Volume and Issue: unknown

Published: Aug. 28, 2024

Purpose: Artificial intelligence (AI) has rapidly gained popularity with the growth of ChatGPT (OpenAI, San Francisco, USA) and other large-language-model chatbots, and these programs have tremendous potential to impact medicine. One important area of consequence in medicine and public health is that patients may use these programs to search for answers to medical questions. Despite the increased utilization of AI chatbots by the public, there is little research assessing their reliability as alternative sources when queried for medical information. This study seeks to elucidate the accuracy and readability of AI chatbots in answering patient questions regarding urology. As vasectomy is one of the most common urologic procedures, this study investigates AI-generated responses to frequently asked vasectomy-related questions. For this study, five popular free-to-access AI platforms were utilized to undertake the investigation. Methods: Fifteen frequently asked vasectomy-related questions were individually queried on each platform from November to December 2023: ChatGPT (OpenAI, San Francisco, USA), Bard (Google Inc., Mountain View, USA), Bing (Microsoft, Redmond, USA), Perplexity (Perplexity AI), and Claude (Anthropic, USA). Responses from each platform were graded by two attending urologists, urology faculty, and a urological resident physician using a Likert (1-6) scale (1 - completely inaccurate, 6 - completely accurate) based on comparison with existing American Urological Association guidelines. Flesch-Kincaid Grade Levels (FKGL) and Flesch Reading Ease Scores (FRES) (1-100) were calculated for each response. To assess differences in Likert, FRES, and FKGL scores, Kruskal-Wallis tests were performed using GraphPad Prism V10.1.0 (GraphPad, San Diego, USA). Alpha was set at 0.05. Results: Analysis shows that ChatGPT provided the most accurate responses, with an average score of 5.04 on the Likert scale. Microsoft Bing (4.91), Anthropic Claude (4.65), Google Bard (4.43), and Perplexity (4.41) followed. All chatbots were found to score, on average, at or above 4.41, corresponding to at least "somewhat accurate." The most readable chatbot received the highest FRES (49.67) and the lowest grade level (10.1) compared to the other chatbots; the remaining platforms scored 46.7 FRES and 10.55 FKGL, 45.57 and 11.56, 36.4 and 13.29, and 30.4 FRES and 14.2 FKGL. Conclusion: As AI gains ground in medicine, and specifically urology, it helps to determine whether chatbots can be reliable sources of freely available medical information. All chatbots were able to achieve at least a "somewhat accurate" rating on the 6-point scale. In terms of readability, all chatbots scored less than 50 on the FRES and at or above a 10th-grade reading level. In this small-scale study, several significant differences in readability were identified between chatbots. However, no significant differences were found among their accuracies. Thus, our study suggests that the major AI chatbots perform similarly in their ability to provide correct information but differ in the ease with which their responses are comprehended by the general public.
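
The two readability indices reported above are standard functions of sentence length and syllable counts. Below is a minimal Python sketch of how FRES and FKGL can be computed for a chatbot response; the syllable counter is a rough heuristic and the function names are illustrative, not taken from the study.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, trim a silent trailing 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # words per sentence
    spw = syllables / len(words)               # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease (0-100)
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fres, fkgl

# Hypothetical chatbot response used only to exercise the functions.
print(readability("A vasectomy is a minor surgical procedure. It blocks the vas deferens."))
```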

Language: English

Citations: 4

Performance of artificial intelligence chatbots in responding to the frequently asked questions of patients regarding dental prostheses

Hossein Esmailpour, Vanya Rasaie, Yasamin Babaee Hemmati, et al.

BMC Oral Health, Journal Year: 2025, Volume and Issue: 25(1)

Published: April 15, 2025

Artificial intelligence (AI) chatbots are increasingly used in healthcare to address patient questions by providing personalized responses. Evaluating their performance is essential to ensure reliability. This study aimed to assess the performance of three AI chatbots in responding to the frequently asked questions (FAQs) of patients regarding dental prostheses. Thirty-one questions were collected from accredited organizations' websites and the "People Also Ask" feature of Google, focusing on removable and fixed prosthodontics. Two board-certified prosthodontists evaluated response quality using the modified Global Quality Score (GQS) on a 5-point Likert scale. Inter-examiner agreement was assessed using weighted kappa. Readability was measured using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) indices. Statistical analyses were performed using repeated-measures ANOVA and the Friedman test, with Bonferroni correction for pairwise comparisons (α = 0.05). Inter-examiner agreement was good. Among the chatbots, Google Gemini had the highest quality score (4.58 ± 0.50), significantly outperforming Microsoft Copilot (3.87 ± 0.89) (P = .004). Readability analysis showed that ChatGPT (10.45 ± 1.26) produced more complex responses compared to Gemini (7.82 ± 1.19) and Copilot (8.38 ± 1.59) (P < .001). FRE scores indicated that ChatGPT's responses were categorized as fairly difficult (53.05 ± 7.16), while Gemini's were in plain English (64.94 ± 7.29), with a significant difference between them. AI chatbots show great potential for answering patient inquiries about dental prostheses. However, improvements are needed to enhance their effectiveness as patient education tools.
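
Inter-examiner agreement on an ordinal scale such as the modified GQS is commonly quantified with a weighted kappa. Below is a minimal sketch using scikit-learn; the two rating lists are invented example data, not the study's ratings, and quadratic weighting is shown as one common choice rather than the study's stated setting.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical GQS ratings (1-5) from two prosthodontists for the same eight responses.
examiner_a = [5, 4, 4, 3, 5, 2, 4, 5]
examiner_b = [5, 4, 3, 3, 5, 3, 4, 4]

# Quadratic weights penalize larger disagreements on the ordinal scale more heavily.
kappa = cohen_kappa_score(examiner_a, examiner_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")
```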

Language: English

Citations: 0

Evaluation of different artificial intelligence applications in responding to regenerative endodontic procedures

Ece Ekmekci, Parla Meva Durmazpınar

BMC Oral Health, Journal Year: 2025, Volume and Issue: 25(1)

Published: Jan. 11, 2025

Language: English

Citations: 0

Evaluating AI Chatbot Responses to Postkidney Transplant Inquiries
Yang Zhan, Xutao Chen, F. Ye, et al.

Transplantation Proceedings, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations: 0

Evaluation of Chatbots in the Emergency Management of Avulsion Injuries
Şeyma Mustuloğlu, Büşra Pınar Deniz

Dental Traumatology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 24, 2025

ABSTRACT Background: This study assessed the accuracy and consistency of responses provided by six Artificial Intelligence (AI) applications, ChatGPT version 3.5 (OpenAI), ChatGPT version 4 (OpenAI), ChatGPT version 4.0 (OpenAI), Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing), to questions related to the emergency management of avulsed teeth. Materials and Methods: Two pediatric dentists developed 18 true-or-false questions regarding dental avulsion and asked them to the public chatbots for 3 days. The responses were recorded and compared with the correct answers. The SPSS program was used to calculate the obtained accuracies and their consistency. Results: ChatGPT 4.0 achieved the highest accuracy rate of 95.6% over the entire time frame, while Perplexity (Perplexity.AI) had the lowest at 67.2%. ChatGPT 4.0 (OpenAI) was the only AI that was in perfect agreement with the real answers, except at noon on day 1. One chatbot showed the weakest consistency (6 times). Conclusions: With the exception of ChatGPT's paid version, 4.0, the chatbots do not seem ready for use as a main resource in managing avulsed teeth during emergencies. It might prove beneficial to incorporate the International Association of Dental Traumatology (IADT) guidelines into chatbot databases, enhancing their accuracy and consistency.

Language: English

Citations: 0

Comparing answers of ChatGPT and Google Gemini to common questions on benign anal conditions

Christophe Maron, Sameh Hany Emile, Nir Horesh, et al.

Techniques in Coloproctology, Journal Year: 2025, Volume and Issue: 29(1)

Published: Jan. 26, 2025

Language: English

Citations: 0

Evaluating the Quality and Readability of Generative Artificial Intelligence (AI) Chatbot Responses in the Management of Achilles Tendon Rupture
Christopher M. Collins, Peter A Giammanco, Monica Guirgus, et al.

Cureus, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 31, 2025

The rise of artificial intelligence (AI), including generative chatbots like ChatGPT (OpenAI, San Francisco, CA, USA), has revolutionized many fields, including healthcare. Patients have gained the ability to prompt these chatbots to generate purportedly accurate and individualized healthcare content. This study analyzed the readability and quality of answers to Achilles tendon rupture questions from six AI chatbots to evaluate and distinguish their potential as patient education resources. The models used were ChatGPT 3.5, ChatGPT 4, Gemini 1.0 (previously Bard; Google, Mountain View, CA, USA), Gemini 1.5 Pro, Claude (Anthropic, USA), and Grok (xAI, Palo Alto, CA, USA), without prior prompting. Each was asked 10 common questions about Achilles tendon rupture, determined by five orthopaedic surgeons. The readability of the responses was measured using the Flesch-Kincaid Reading Grade Level, Gunning Fog, and SMOG (Simple Measure of Gobbledygook) indices. Each response was subsequently graded against the DISCERN criteria by the blinded orthopaedic surgeons. One model generated responses with statistically significantly better reading ease (closest to the average American reading level) than Claude. Additionally, mean DISCERN scores were significantly higher for one model (63.0±5.1) and ChatGPT 4 (63.8±6.2) than for ChatGPT 3.5 (53.8±3.8) and two other models (55.0±3.8 and 54.2±4.8). However, the overall quality rating (question 16 of the DISCERN instrument) for each model averaged at an above-average level (range, 3.4-4.4). Our results indicate that AI chatbots can potentially serve as patient education resources alongside physicians. Although some responses lacked sufficient content, the models performed at above-average quality. With the lowest reading grade levels and highest DISCERN scores, one chatbot outperformed ChatGPT and Claude and emerged as the simplest and most reliable chatbot regarding management of Achilles tendon rupture.
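
The Gunning Fog and SMOG indices used above both estimate a United States reading grade level from sentence length and the proportion of polysyllabic words. A minimal Python sketch with a rough syllable heuristic and illustrative function names (not from the study) follows.

```python
import re

def syllables(word: str) -> int:
    # Rough heuristic: count vowel groups.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def fog_and_smog(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    # Gunning Fog: 0.4 * (words per sentence + percent of complex words)
    fog = 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))
    # SMOG: 1.043 * sqrt(polysyllable count * 30 / sentence count) + 3.1291
    smog = 1.043 * (len(complex_words) * 30 / len(sentences)) ** 0.5 + 3.1291
    return fog, smog

# Hypothetical patient-education sentence used only to exercise the functions.
print(fog_and_smog("Surgical repair of the Achilles tendon restores function. Recovery requires immobilization."))
```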

Language: English

Citations: 0

Bridging the Gap in Neonatal Care: Evaluating AI Chatbots for Chronic Neonatal Lung Disease and Home Oxygen Therapy Management

Weiqin Liu, Hong Wei, Lingling Xiang, et al.

Pediatric Pulmonology, Journal Year: 2025, Volume and Issue: 60(3)

Published: March 1, 2025

To evaluate the accuracy and comprehensiveness of eight free, publicly available large language model (LLM) chatbots in addressing common questions related to chronic neonatal lung disease (CNLD) and home oxygen therapy (HOT). Twenty CNLD and HOT-related questions were curated across nine domains. Responses from ChatGPT-3.5, Google Bard, Bing Chat, Claude 3.5 Sonnet, ERNIE Bot 3.5, and GLM-4 were generated and evaluated by three experienced neonatologists using Likert scales for accuracy and comprehensiveness. Updated LLM models (ChatGPT-4o mini and Gemini 2.0 Flash Experimental) were incorporated to assess rapid technological advancement. Statistical analyses included ANOVA, Kruskal-Wallis tests, and intraclass correlation coefficients. Bing Chat and Claude 3.5 Sonnet demonstrated superior performance, with the highest mean accuracy scores (5.78 ± 0.48 and 5.75 ± 0.54, respectively) and the highest comprehensiveness competence (2.65 ± 0.58 and 2.80 ± 0.41, respectively). In subsequent testing, Gemini 2.0 Flash Experimental and ChatGPT-4o mini achieved comparably high performance. Performance varied across domains, with all models excelling in "equipment safety protocols" and "caregiver support." The chatbots showed self-correction capabilities when prompted. LLMs show promise in providing accurate CNLD/HOT information. However, performance variability and the risk of misinformation necessitate expert oversight and continued refinement before widespread clinical implementation.

Language: English

Citations: 0

Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and deepseek models
Mi Zhou, Yunfeng Pan, Yuye Zhang, et al.

International Journal of Medical Informatics, Journal Year: 2025, Volume and Issue: unknown, P. 105871 - 105871

Published: March 1, 2025

Language: English

Citations: 0