Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

Cindy Ho, Tiffany Tian, Alessandra T. Ayers et al.

BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1)

Published: Nov. 26, 2024

The large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". The criteria defining high-quality output have not been applied consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess output quality can be developed to facilitate LLM research in healthcare.

Language: English

Use of generative artificial intelligence (AI) in psychiatry and mental health care: a systematic review
Sara Kolding, Robert M. Lundin, Lasse Hansen

et al.

Acta Neuropsychiatrica, Journal Year: 2024, Volume and Issue: unknown, P. 1 - 14

Published: Nov. 11, 2024

Tools based on generative artificial intelligence (AI), such as ChatGPT, have the potential to transform modern society, including the field of medicine. Due to the prominent role of language in psychiatry, e.g., for diagnostic assessment and psychotherapy, these tools may be particularly useful within this medical field. Therefore, the aim of this study was to systematically review the literature on generative AI applications in psychiatry and mental health.

Language: English

Citations: 5

Assessment Study of ChatGPT-3.5’s Performance on the Final Polish Medical Examination: Accuracy in Answering 980 Questions

Julia Siebielec, Michał Ordak, Agata Oskroba et al.

Healthcare, Journal Year: 2024, Volume and Issue: 12(16), P. 1637 - 1637

Published: Aug. 16, 2024

The use of artificial intelligence (AI) in education is growing dynamically, and models such as ChatGPT show potential for enhancing medical education. In Poland, to obtain a medical diploma, candidates must pass the Medical Final Examination, which consists of 200 questions with one correct answer per question; administered in Polish, it assesses students' comprehensive medical knowledge and readiness for clinical practice. The aim of this study was to determine how well ChatGPT-3.5 handles the questions included in this exam. The study considered 980 questions from five sessions of the Medical Final Examination conducted by the Medical Examination Center in the years 2022-2024. The analysis considered the field of medicine, the difficulty index of the questions, and their type, namely theoretical versus case-study questions. The average rate of correct answers achieved by ChatGPT hovered around 60% and was lower (p < 0.001) than the average score of the examinees. The lowest percentage of correct answers was in hematology (42.1%), while the highest was in endocrinology (78.6%). The difficulty index of the questions showed a statistically significant correlation with the correctness of the answers (p = 0.04): questions to which ChatGPT provided incorrect answers differed in difficulty index from those with correct responses. The type of question analyzed did not significantly affect the correctness of the answers (p = 0.46). This indicates that ChatGPT can be an effective tool for assisting in passing the final medical exam, but the results should be interpreted cautiously. It is recommended to further verify these findings using various AI tools.

Language: English

Citations: 4

The Impact of Artificial Intelligence on Human Sexuality: A Five-Year Literature Review 2020–2024
Nicola Döring, Thuy Dung Le, Laura M. Vowels et al.

Current Sexual Health Reports, Journal Year: 2024, Volume and Issue: 17(1)

Published: Dec. 4, 2024

Language: English

Citations: 4

The Goldilocks Zone: Finding the right balance of user and institutional risk for suicide-related generative AI queries
Anna Van Meter, Michael G. Wheaton, Victoria E. Cosgrove

et al.

PLOS Digital Health, Journal Year: 2025, Volume and Issue: 4(1), P. e0000711 - e0000711

Published: Jan. 8, 2025

Generative artificial intelligence (genAI) has the potential to improve healthcare by reducing clinician burden and expanding services, among other uses. There is a significant gap between the need for mental health care and the clinicians available in the United States; this makes mental health care an attractive target for improved efficiency through genAI. Among the most sensitive mental health topics is suicide, and demand for crisis intervention has grown in recent years. We aimed to evaluate the quality of genAI tool responses to suicide-related queries. We entered 10 suicide-related queries into five genAI tools: ChatGPT 3.5, GPT-4, a version of GPT-4 safe for protected health information, Gemini, and Bing Copilot. The response to each query was coded on seven metrics, including the presence of a suicide hotline number, content related to evidence-based suicide prevention interventions, supportive content, and harmful content. Pooling across tools, most responses (79%) were supportive. Only 24% of responses included a hotline number, and only 4% included content consistent with evidence-based suicide prevention interventions. Harmful content was rare (5%); all such instances were delivered by a single tool. Our results suggest that genAI developers have taken a very conservative approach that constrained their models' responses to support-seeking, but little else. Finding a balance between providing much-needed information and avoiding excessive risk is within the capabilities of genAI developers. At this nascent stage of integrating genAI tools into healthcare systems, ensuring parity should be a goal for healthcare organizations.

Language: English

Citations: 0

Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment
Yihong Qiu, Chang Liu

Global Medical Education, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 13, 2025

Abstract Objectives: Artificial intelligence (AI) is being used increasingly in medical education. This narrative review presents a comprehensive analysis of generative AI tools' performance in answering and generating medical exam questions, thereby providing a broader perspective on AI's strengths and limitations in the medical education context. Methods: The Scopus database was searched for studies on generative AI in medical examinations from 2022 to 2024. Duplicates were removed, and relevant full texts were retrieved following inclusion and exclusion criteria. Narrative analysis and descriptive statistics were used to analyze the contents of the included studies. Results: A total of 70 studies were included in the analysis. The results showed that performance varied across different question types and specialty areas, with the best average accuracy in psychiatry, and that performance was influenced by prompts. With well-crafted prompts, models can efficiently produce high-quality examination questions. Conclusion: Generative AI possesses the ability to answer and generate medical exam questions using carefully designed prompts. Its potential use in medical education assessment is vast, ranging from detecting question errors to aiding exam preparation, facilitating formative assessments, and supporting personalized learning. However, it is crucial that educators always double-check AI responses to maintain accuracy and prevent the spread of misinformation.

Language: English

Citations: 0

Navigating the Future of Psychiatry: A Review of Research on Opportunities, Applications, and Challenges of Artificial Intelligence
Jake Linardon

Current Treatment Options in Psychiatry, Journal Year: 2025, Volume and Issue: 12(1)

Published: Feb. 17, 2025

Language: English

Citations: 0

Building Trust with AI: How Essential is Validating AI Models in the Therapeutic Triad of Therapist, Patient, and Artificial Third? Comment on What is the Current and Future Status of Digital Mental Health Interventions?
Alejandro García‐Rudolph, David Sánchez-Pinsach, Anna Gilabert et al.

The Spanish Journal of Psychology, Journal Year: 2025, Volume and Issue: 28

Published: Jan. 1, 2025

Abstract Since the publication of “What is the Current and Future Status of Digital Mental Health Interventions?”, the exponential growth and widespread adoption of ChatGPT have underscored the importance of reassessing its utility in digital mental health interventions. This review critically examined the potential of ChatGPT, particularly focusing on its application within clinical psychology settings, as the technology has continued evolving through 2023 and 2024. Alongside this, our literature review spanned US Medical Licensing Examination (USMLE) validations, assessments of its capacity to interpret human emotions, analyses concerning the identification of depression and its determinants at treatment initiation, and related reported findings. Our review evaluated the capabilities of GPT-3.5 and GPT-4.0 separately in clinical psychology settings, highlighting the potential of conversational AI to overcome traditional barriers to treatment such as stigma and accessibility. Each model displayed different levels of proficiency, indicating a promising yet cautious pathway for integrating AI into clinical psychology practices.

Language: English

Citations: 0

Human–chatbot communication: a systematic review of psychologic studies
Antonina Rafikova, А. Н. Воронин

AI & Society, Journal Year: 2025, Volume and Issue: unknown

Published: March 6, 2025

Language: English

Citations: 0

Unveiling the potential of large language models in transforming chronic disease management: A mixed-method systematic review (Preprint)
Caixia Li, Yina Zhao, Yang Bai et al.

Journal of Medical Internet Research, Journal Year: 2025, Volume and Issue: 27, P. e70535 - e70535

Published: March 19, 2025

Chronic diseases are a major global health burden, accounting for nearly three-quarters of deaths worldwide. Large language models (LLMs) are advanced artificial intelligence systems with transformative potential to optimize chronic disease management; however, robust evidence is lacking. This review aims to synthesize evidence on the feasibility, opportunities, and challenges of LLMs across the chronic disease management spectrum, from prevention and screening to diagnosis, treatment, and long-term care. Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, 11 databases (Cochrane Central Register of Controlled Trials, CINAHL, Embase, IEEE Xplore, MEDLINE via Ovid, ProQuest Health & Medicine Collection, ScienceDirect, Scopus, Web of Science Core Collection, China National Knowledge Internet, and SinoMed) were searched on April 17, 2024. Intervention and simulation studies that examined LLMs in chronic disease management were included. The methodological quality of the included studies was evaluated using a rating rubric designed for simulation-based research and the risk-of-bias in non-randomized studies of interventions tool for quasi-experimental studies. Narrative analysis and descriptive figures were used to summarize study findings. Random-effects meta-analyses were conducted to assess pooled effect estimates of the feasibility of LLMs in chronic disease management. A total of 20 studies examined general-purpose (n=17) and retrieval-augmented generation-enhanced (n=3) LLMs across chronic diseases, including cancer, cardiovascular diseases, and metabolic disorders. LLMs demonstrated feasibility across the management spectrum by generating relevant, comprehensible, and accurate recommendations (pooled rate 71%, 95% CI 0.59-0.83; I2=88.32%) and achieving higher accuracy rates than comparison groups (odds ratio 2.89, 95% CI 1.83-4.58; I2=54.45%). LLMs facilitated equitable access to health information; increased patient awareness regarding ailments, preventive measures, and treatment options; and promoted self-management behaviors in lifestyle modification and symptom coping. Additionally, LLMs can facilitate compassionate emotional support, social connections, and care resources to improve the health outcomes of patients with chronic diseases.
However, LLMs face challenges in addressing privacy, language, and cultural issues, and in undertaking complex tasks, including medication, comorbidity, and personalized regimens requiring real-time adjustments and multiple modalities. LLMs have the potential to transform chronic disease management at the individual, social, and systemic levels, but their direct application in clinical settings is still in its infancy. A multifaceted approach that incorporates data security, domain-specific model fine-tuning, multimodal data integration, and wearables is crucial for the evolution of LLMs into invaluable adjuncts for health professionals. PROSPERO CRD42024545412; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024545412.
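The pooled figures in this abstract (a 71% rate with a 95% CI and an I² statistic) come from random-effects meta-analysis. As a minimal sketch of how such a pooled proportion is typically computed, the following uses the classic DerSimonian–Laird estimator on logit-transformed proportions; the per-study counts are invented for illustration and are not data from the review:

```python
import math

def logit_effects(successes, totals):
    """Logit-transform per-study proportions; return (effects, variances)."""
    effects, variances = [], []
    for s, n in zip(successes, totals):
        p = s / n
        effects.append(math.log(p / (1 - p)))
        variances.append(1.0 / s + 1.0 / (n - s))  # delta-method variance
    return effects, variances

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate with DerSimonian-Laird tau^2."""
    w = [1.0 / v for v in variances]               # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    k = len(effects)
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)             # between-study variance
    wr = [1.0 / (v + tau2) for v in variances]     # random-effects weights
    pooled = sum(wi * e for wi, e in zip(wr, effects)) / sum(wr)
    se = math.sqrt(1.0 / sum(wr))
    i2 = max(0.0, 100.0 * (q - (k - 1)) / q) if q > 0 else 0.0
    return pooled, se, i2

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical accuracy counts from five studies (illustrative only)
successes = [70, 55, 90, 60, 80]
totals = [100, 80, 110, 100, 95]
eff, var = logit_effects(successes, totals)
pooled, se, i2 = dersimonian_laird(eff, var)
lo, hi = expit(pooled - 1.96 * se), expit(pooled + 1.96 * se)
print(f"pooled rate {expit(pooled):.2f}, 95% CI {lo:.2f}-{hi:.2f}, I2 {i2:.1f}%")
```

The pooling happens on the logit scale (where normality is a better approximation) and is back-transformed for reporting; I² expresses the share of total variability attributable to between-study heterogeneity rather than sampling error.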

Language: English

Citations: 0

Evaluating Large Language Models for Burning Mouth Syndrome Diagnosis
Takayuki Suga, Osamu Uehara, Yoshihiro Abiko

et al.

Journal of Pain Research, Journal Year: 2025, Volume and Issue: 18, P. 1387 - 1405

Published: March 1, 2025

Large language models have been proposed as diagnostic aids across various medical fields, including dentistry. Burning mouth syndrome, characterized by burning sensations in the oral cavity without an identifiable cause, poses diagnostic challenges. This study explores the accuracy of large language models in identifying burning mouth syndrome, hypothesizing potential limitations. Clinical vignettes of 100 synthesized burning mouth syndrome cases were evaluated using three large language models (ChatGPT-4o, Gemini Advanced 1.5 Pro, and Claude 3.5 Sonnet). Each vignette included patient demographics, symptoms, and medical history. The models were prompted to provide a primary diagnosis, differential diagnoses, and their reasoning. Accuracy was determined by comparing model responses with expert evaluations. ChatGPT achieved an accuracy rate of 99%, while Gemini's was 89% (p < 0.001). Misdiagnoses included Persistent Idiopathic Facial Pain, combined diagnoses, and inappropriate conditions. Differences were also observed in reasoning patterns and additional data requests between models. Despite high overall accuracy, the models exhibited variations in diagnostic approaches and occasional errors, underscoring the importance of clinician oversight. Limitations include the synthetic nature of the vignettes, over-reliance on exclusionary criteria, and challenges in differentiating overlapping disorders. The findings demonstrate that large language models can serve as strong supplementary tools for diagnosis, especially in settings lacking specialist expertise. However, their reliability depends on thorough clinical assessment and verification. Integrating such models into routine diagnostics could enhance early detection and management, ultimately improving clinical decision-making for dentists and specialists alike.

Language: English

Citations: 0