Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

Cindy Ho,

Tiffany Tian,

Alessandra T. Ayers

et al.

BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1)

Published: Nov. 26, 2024

Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five criteria most frequently used for scoring outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". These criteria for defining high-quality output have been used consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess output quality can be developed to facilitate research on LLMs in healthcare.

Language: English

A scoping review of large language models for generative tasks in mental health care
Yining Hua,

Hongbin Na,

Zehan Li

et al.

npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1)

Published: April 30, 2025

Large language models (LLMs) show promise in mental health care for handling human-like conversations, but their effectiveness remains uncertain. This scoping review synthesizes existing research on LLM applications in mental health care, reviews model performance and clinical effectiveness, identifies gaps in current evaluation methods following a structured framework, and provides recommendations for future development. A systematic search identified 726 unique articles, of which 16 met the inclusion criteria. These studies, encompassing applications such as clinical assistance, counseling, therapy, and emotional support, show initial promise. However, evaluation methods were often non-standardized, with most studies relying on ad-hoc scales that limit comparability and robustness. Reliance on prompt-tuning of proprietary models, such as OpenAI's GPT series, also raises concerns about transparency and reproducibility. As the current evidence does not fully support the use of LLMs as standalone interventions, more rigorous development and evaluation guidelines are needed for their safe, effective integration.

Language: English

Citations

0

Exploring the Potential of ChatGPT-4 in Predicting Refractive Surgery Categorizations: Comparative Study
Aleksandar Ćirković, Toam Katz

JMIR Formative Research, Journal Year: 2023, Volume and Issue: 7, P. e51798 - e51798

Published: Dec. 4, 2023

Refractive surgery research aims to optimally precategorize patients by their suitability for various types of surgery. Recent advances have led to the development of artificial intelligence-powered algorithms, including machine learning approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 (OpenAI LP) have emerged as potential general intelligence tools that can assist across disciplines, possibly including refractive surgery decision-making. However, their actual capabilities in precategorizing patients based on real-world parameters remain unexplored.

Language: English

Citations

10

ChatGPT in psychiatry: promises and pitfalls
Rebecca Shin-Yee Wong

The Egyptian Journal of Neurology Psychiatry and Neurosurgery, Journal Year: 2024, Volume and Issue: 60(1)

Published: Jan. 30, 2024

ChatGPT has become a hot topic of discussion since its release in November 2022. The number of publications on its potential applications in various fields is on the rise. However, viewpoints on its use in psychiatry are lacking. This article aims to address this gap by examining the promises and pitfalls of using ChatGPT in psychiatric practice. While ChatGPT offers several opportunities, further research is warranted, as the use of chatbots like ChatGPT raises technical and ethical concerns. Some practical ways of addressing these challenges are also discussed.

Language: English

Citations

3

Artificial Intelligence in Healthcare and Psychiatry
K. Krysta,

Rachael Cullivan,

Andrew Brittlebank

et al.

Academic Psychiatry, Journal Year: 2024, Volume and Issue: unknown

Published: Sept. 23, 2024

Language: English

Citations

3
