Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

Cindy Ho, Tiffany Tian, Alessandra T. Ayers

et al.

BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1)

Published: Nov. 26, 2024

The large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans, and selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". These criteria for defining high-quality output have been applied consistently by researchers over the past 1.5 years, but we identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess output quality can be developed to facilitate research in healthcare.

Language: English

Current Practices and Perspectives of Artificial Intelligence in the Clinical Management of Eating Disorders: Insights From Clinicians and Community Participants
Jake Linardon, Claudia Liu, Mariel Messer

et al.

International Journal of Eating Disorders, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 19, 2025

ABSTRACT Objective: Artificial intelligence (AI) could revolutionize the delivery of mental health care, helping to streamline clinician workflows and assist with diagnostic and treatment decisions. Yet, before AI can be integrated into practice, it is necessary to understand the perspectives of those who would use these tools, to inform facilitators of and barriers to their uptake. We gathered data on clinician and community participant perspectives on incorporating AI in the clinical management of eating disorders. Method: A survey was distributed internationally to clinicians (n = 116) with experience managing eating disorders (psychologists, psychiatrists, etc.) and to community participants (n = 155) who reported the occurrence of eating disorder behaviors. Results: 59% of clinicians reported using AI systems (most commonly ChatGPT) for professional reasons, compared with 18% of participants using them for help‐related purposes. While more than half of clinicians (58%) and participants (53%) were open to AI helping support them, fewer were enthusiastic about its integration into care (40% and 27%, respectively) or believed that it would significantly improve client outcomes (28% and 13%, respectively). Nine in 10 agreed that AI may be improperly used if individuals are not adequately trained, and that it may pose new privacy and security concerns. Most believed AI will be convenient, beneficial for administrative tasks, and an avenue for continuous support, but will never outperform human relational skills. Conclusion: Although many recognize AI's possible wide‐ranging benefits, most remain cautious and uncertain about its implementation.

Language: English

Citations

7

Large Language Models for Chatbot Health Advice Studies
Bright Huo, Amy Boyle, Nana Marfo

et al.

JAMA Network Open, Journal Year: 2025, Volume and Issue: 8(2), P. e2457879 - e2457879

Published: Feb. 4, 2025

Importance: There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective: To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)–driven chatbots for summarizing evidence and providing health advice, in order to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review: A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian and yielded 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots (chatbot studies). Two reviewers then performed data extraction for the 137 eligible studies. Findings: A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies did not describe a prompt engineering phase. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance: In this systematic review of 137 chatbot health advice studies, reporting quality was heterogeneous and may inform the development of the CHART standards. Ethical, regulatory, and patient safety considerations are crucial as interest in the clinical integration of LLMs grows.

Language: English

Citations

5

Global insights and the impact of generative AI-ChatGPT on multidisciplinary: a systematic review and bibliometric analysis
Nauman Khan, Zahid A. Khan, Anis Koubâa

et al.

Connection Science, Journal Year: 2024, Volume and Issue: 36(1)

Published: May 16, 2024

In 2022, OpenAI's unveiling of the generative AI large language model (LLM) ChatGPT heralded a significant leap forward in human-machine interaction through cutting-edge technologies. With its surging popularity, scholars across various fields have begun to delve into the myriad applications of ChatGPT. While existing literature reviews on LLMs like ChatGPT are available, there is a notable absence of systematic literature reviews (SLRs) and bibliometric analyses assessing the multidisciplinary and geographical breadth of this research. This study aims to bridge this gap by synthesising and evaluating how ChatGPT has been integrated into diverse research areas, focussing on the scope and distribution of studies. Through a review of scholarly articles, we chart the global utilisation of ChatGPT in scientific domains, exploring its contribution to advancing research paradigms and its adoption trends among different disciplines. Our findings reveal widespread endorsement of ChatGPT in multiple fields, with implementations in healthcare (38.6%), computer science/IT (18.6%), and education/research (17.3%). Moreover, our demographic analysis underscores ChatGPT's reach and accessibility, indicating participation from 80 unique countries in ChatGPT-related research, with the USA (719), China (181), and India (157) leading contributions by most frequent keyword occurrence. Additionally, the study highlights the roles of institutions such as King Saud University, the All India Institute of Medical Sciences, and Taipei Medical University as pioneering contributors in the dataset. The study not only sheds light on the vast opportunities and challenges posed by ChatGPT in research pursuits but also acts as a pivotal resource for future inquiries. It emphasises the role of the large language model (LLM) in revolutionising every field. The insights provided in this paper are particularly valuable for academics, researchers, and practitioners across disciplines, as well as policymakers looking to grasp the extensive impact of these technologies on the research community.

Language: English

Citations

13

Large Language Models for Mental Health Applications: A Systematic Review (Preprint)
Zhijun Guo, Alvina G. Lai, Johan H. Thygesen

et al.

JMIR Mental Health, Journal Year: 2024, Volume and Issue: 11, P. e57400 - e57400

Published: Sept. 3, 2024

Background: Large language models (LLMs) are advanced artificial neural networks trained on extensive datasets to accurately understand and generate natural language. While they have received much attention and demonstrated potential in digital health, their application in mental health, particularly in clinical settings, has generated considerable debate. Objective: This systematic review aims to critically assess the use of LLMs in mental health, specifically focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings. By systematically collating and assessing the evidence from current studies, our work analyzes models, methodologies, data sources, and outcomes, thereby highlighting the challenges present and the prospects for use. Methods: Adhering to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, this review searched 5 open-access databases: MEDLINE (accessed by PubMed), IEEE Xplore, Scopus, JMIR, and ACM Digital Library. Keywords used were (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). The study included articles published between January 1, 2017, and April 30, 2024, and excluded articles in languages other than English. Results: In total, 40 articles were evaluated, including 15 (38%) on the detection of mental health conditions and suicidal ideation through text analysis, 7 (18%) on the use of LLMs as conversational agents, and 18 (45%) on other applications and evaluations of LLMs in mental health. The results show good effectiveness in detecting mental health issues and providing accessible, destigmatized eHealth services. However, the assessments also indicate that the current risks associated with LLMs might surpass their benefits. These risks include inconsistencies in generated text; the production of hallucinations; and the absence of a comprehensive, benchmarked ethical framework. Conclusions: This systematic review examines the clinical applications of LLMs in mental health and their inherent risks. The study identifies several issues: the lack of multilingual datasets annotated by experts, concerns regarding the accuracy and reliability of generated content, challenges in interpretability due to the "black box" nature of LLMs, and ongoing ethical dilemmas, including the absence of a clear, benchmarked ethical framework; data privacy issues; and the potential for overreliance by both physicians and patients, which could compromise traditional medical practices. As a result, LLMs should not be considered substitutes for professional mental health services. However, their rapid development underscores their potential as valuable clinical aids, emphasizing the need for continued research in this area. Trial Registration: PROSPERO CRD42024508617; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=508617

Language: English

Citations

13

Applications of large language models in psychiatry: a systematic review
Mahmud Omar, Shelly Soffer, Alexander W. Charney

et al.

Frontiers in Psychiatry, Journal Year: 2024, Volume and Issue: 15

Published: June 24, 2024

Background: With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods: We followed PRISMA guidelines and searched PubMed, Embase, Web of Science, and Scopus, up until March 2024. Results: From 771 retrieved articles, we included 16 that directly examine LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion: Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.

Language: English

Citations

12

Evaluating the Reliability of ChatGPT for Health-Related Questions: A Systematic Review
Mohammad Beheshti, Imad Eddine Toubal, Khuder Alaboud

et al.

Informatics, Journal Year: 2025, Volume and Issue: 12(1), P. 9 - 9

Published: Jan. 17, 2025

The rapid advancement of large language models like ChatGPT has significantly impacted natural language processing, expanding its applications across various fields, including healthcare. However, there remains a significant gap in understanding the consistency and reliability of ChatGPT's performance across different medical domains. We conducted this systematic review according to an LLM-assisted PRISMA setup. The high-recall search term "ChatGPT" yielded 1101 articles from 2023 onwards. Through a dual-phase screening process, initially automated via an LLM and subsequently performed manually by human reviewers, 128 studies were included. The studies covered a range of medical specialties, focusing on diagnosis, disease management, and patient education. The assessment metrics varied, but most studies compared ChatGPT's accuracy against evaluations by clinicians or reliable references. In several areas, ChatGPT demonstrated high accuracy, underscoring its effectiveness; however, some contexts revealed lower accuracy. The mixed outcomes across domains emphasize both the challenges and the opportunities of integrating AI into healthcare. The high accuracy in certain areas suggests that ChatGPT has substantial utility, yet the inconsistent performance across all domains indicates a need for ongoing evaluation and refinement. This review highlights ChatGPT's potential to improve healthcare delivery alongside the necessity of continued research to ensure its reliability.

Language: English

Citations

1

Using ChatGPT in Psychiatry to Design Script Concordance Tests in Undergraduate Medical Education: Mixed Methods Study
Alexandre Hudon, Barnabé Kiepura, Myriam Pelletier

et al.

JMIR Medical Education, Journal Year: 2024, Volume and Issue: 10, P. e54067 - e54067

Published: April 4, 2024

Abstract Background: Undergraduate medical studies represent a wide range of learning opportunities served in the form of various teaching-learning modalities for medical learners. A clinical scenario is frequently used as a modality, followed by multiple-choice and open-ended questions, among other teaching methods. As such, script concordance tests (SCTs) can be used to promote a higher level of clinical reasoning. Recent technological developments have made generative artificial intelligence (AI)–based systems such as ChatGPT (OpenAI) available to assist clinician-educators in creating instructional materials. Objective: The main objective of this project was to explore how SCTs generated by ChatGPT compared to SCTs produced by clinical experts on 3 major elements: the clinical scenario (stem), the questions, and the expert opinion. Methods: This mixed methods study evaluated ChatGPT-generated SCTs alongside expert-created SCTs using a predefined framework. Clinician-educators as well as resident doctors in psychiatry involved in undergraduate medical education in Quebec, Canada, evaluated the 6 SCTs via a web-based survey on 3 criteria: the scenario, the questions, and the expert opinion. They were also asked to describe the strengths and weaknesses of the SCTs. Results: A total of 102 respondents assessed the SCTs. There were no significant distinctions between the 2 types of SCTs concerning the scenario (P=.84), the questions (P=.99), and the expert opinion (P=.07), as interpreted by the respondents. Indeed, respondents struggled to differentiate between ChatGPT- and expert-generated SCTs. ChatGPT showcased promise in expediting SCT design, aligning with Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition criteria, albeit with a tendency toward caricatured scenarios and simplistic content. Conclusions: This study is the first to concentrate on SCT design supported by AI in a period where medicine is changing swiftly and technologies generated from AI are expanding much faster. It suggests that ChatGPT can be a valuable tool for creating educational materials, though further validation is essential to ensure efficacy and accuracy.

Language: English

Citations

7

Evaluating Chat Generative Pre-trained Transformer Responses to Common Pediatric In-toeing Questions
Jason Zarahi Amaral, Rebecca J. Schultz, Benjamin M. Martin

et al.

Journal of Pediatric Orthopaedics, Journal Year: 2024, Volume and Issue: 44(7), P. e592 - e597

Published: April 30, 2024

Objective: Chat generative pre-trained transformer (ChatGPT) has garnered attention in health care for its potential to reshape patient interactions. As patients increasingly rely on artificial intelligence platforms, concerns about information accuracy arise. In-toeing, a common lower extremity variation, often leads to pediatric orthopaedic referrals despite observation being the primary treatment. Our study aims to assess ChatGPT's responses to common in-toeing questions, contributing to discussions on innovation and technology in patient education. Methods: We compiled a list of 34 questions from the "Frequently Asked Questions" sections of 9 health care–affiliated websites, identifying the 25 most encountered. On January 17, 2024, we queried ChatGPT 3.5 in separate sessions and recorded the responses. These questions were posed again on January 21 to assess reproducibility. Two surgeons evaluated the responses using a scale from "excellent (no clarification)" to "unsatisfactory (substantial clarification)." Average ratings were used when the evaluators' grades were within one level of each other. In discordant cases, the senior author provided a decisive rating. Results: We found 46% of responses to be "excellent" and 44% to be "satisfactory (minimal clarification)." In addition, 8% of cases were "satisfactory (moderate clarification)" and 2% were "unsatisfactory." Questions had appropriate readability, with an average Flesch-Kincaid Grade Level of 4.9 (±2.1). However, responses were written at a collegiate level, averaging 12.7 (±1.4). No significant differences were observed between question topics. Furthermore, ChatGPT exhibited moderate consistency after repeated queries, evidenced by a Spearman rho coefficient of 0.55 (P = 0.005). The chatbot appropriately described in-toeing as normal or spontaneously resolving in 62% of cases and consistently recommended evaluation by a provider in 100%. Conclusion: ChatGPT presented a serviceable, though not perfect, representation of the diagnosis and management of in-toeing while demonstrating moderate reproducibility, and its utility could be enhanced by improving readability and incorporating evidence-based guidelines. Level of Evidence: IV—diagnostic.
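For context on the readability scores reported above (grade 4.9 for questions vs. 12.7 for responses), the Flesch-Kincaid Grade Level combines average sentence length with average syllables per word. A minimal sketch in Python, using a naive vowel-group syllable counter (an assumption for illustration; published calculators use dictionary-based syllabification):

```python
import re

def count_syllables(word: str) -> int:
    # Naive estimate: count groups of consecutive vowels (min 1 per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Short, monosyllabic sentences score near (or below) grade 0, while long sentences with polysyllabic words push the score toward collegiate levels, which is why patient-education guidance typically targets the grade 6 to 8 range.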

Language: English

Citations

7

Diagnostic accuracy of large language models in psychiatry
Omid Kohandel Gargari, Farhad Fatehi, Ida Mohammadi

et al.

Asian Journal of Psychiatry, Journal Year: 2024, Volume and Issue: 100, P. 104168 - 104168

Published: July 25, 2024

Language: English

Citations

7

ChatGPT-3.5 passes Poland's medical final examination—Is it possible for ChatGPT to become a doctor in Poland?
Szymon Suwała, Paulina Szulc, Cezary Guzowski

et al.

SAGE Open Medicine, Journal Year: 2024, Volume and Issue: 12

Published: Jan. 1, 2024

Objectives: ChatGPT is an advanced chatbot based on a large language model that has the ability to answer questions. Undoubtedly, it is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. Methods: The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (the exam is considered passed if at least 56% of tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013–2015 and 2021–2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, questions were divided into quintiles of difficulty and duration, as well as question types (simple A-type or complex K-type). The answers provided by ChatGPT were compared with the official answer key and reviewed for any changes resulting from the advancement of medical knowledge. Results: ChatGPT correctly answered 53.4%–64.9% of questions, passing 8 out of 11 examination sessions, though it never achieved average human scores (60%). The correlation between the efficacy of the artificial intelligence and the level of question complexity, difficulty, and length was found to be negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). Conclusions: ChatGPT's performance was deemed satisfactory; however, its observed results were markedly inferior to those of human graduates in the majority of instances. Despite its potential utility in many areas, it is constrained by inherent limitations that prevent it from entirely supplanting human expertise.

Language: English

Citations

6