Human versus artificial intelligence: evaluating ChatGPT’s performance in conducting published systematic reviews with meta-analysis in chronic pain research DOI Creative Commons
Anam Purewal,

Kalli Fautsch,

Johana Klasová

et al.

Regional Anesthesia & Pain Medicine, Journal Year: 2025, Volume and Issue: unknown, P. rapm - 106358

Published: Feb. 16, 2025

Introduction Artificial intelligence (AI), particularly large language models like Chat Generative Pre-Trained Transformer (ChatGPT), has demonstrated potential in streamlining research methodologies. Systematic reviews and meta-analyses, often considered the pinnacle of evidence-based medicine, are inherently time-intensive and demand meticulous planning, rigorous data extraction, thorough analysis, and careful synthesis. Despite promising applications of AI, its utility in conducting systematic reviews with meta-analysis remains unclear. This study evaluated ChatGPT’s accuracy in key tasks of a systematic review with meta-analysis. Methods This validation study used data from a published systematic review and meta-analysis on emotional functioning after spinal cord stimulation. ChatGPT-4o performed title/abstract screening, full-text selection, and data pooling for this review. Comparisons were made against the human-executed steps, which served as the gold standard. Outcomes of interest included accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for the screening tasks. We also assessed discrepancies in pooled effect estimates and forest plot generation. Results For title and abstract screening, ChatGPT achieved an accuracy of 70.4%, sensitivity of 54.9%, and specificity of 80.1%. In the full-text selection phase, accuracy was 68.4%, sensitivity 75.6%, and specificity 66.8%. ChatGPT successfully generated five forest plots, achieving 100% accuracy in calculating mean differences, 95% CIs, and heterogeneity statistics (I² scores and tau-squared values) for most outcomes, with minor deviations in tau-squared values (range 0.01–0.05). Forest plots showed no significant discrepancies. Conclusion ChatGPT demonstrates modest to moderate accuracy in study selection tasks, but performs well in meta-analytic calculations. These findings underscore the potential of AI to augment systematic review methodologies, while emphasizing the need for human oversight to ensure the integrity of research workflows.
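The meta-analytic quantities named in this abstract (pooled mean differences, 95% CIs, I², tau²) can be reproduced with a short script. Below is a minimal DerSimonian-Laird random-effects pooling sketch in Python; the study-level means, SDs, and sample sizes are invented placeholders, not data from the spinal cord stimulation review.

# Minimal DerSimonian-Laird random-effects pooling of mean differences.
# Study means/SDs/ns below are illustrative placeholders only.
import math

studies = [
    # (mean_tx, sd_tx, n_tx, mean_ctrl, sd_ctrl, n_ctrl)
    (12.0, 4.0, 30, 15.0, 5.0, 28),
    (10.5, 3.5, 45, 14.0, 4.5, 44),
    (11.0, 4.2, 25, 12.5, 4.0, 27),
]

md, var = [], []
for m1, s1, n1, m2, s2, n2 in studies:
    md.append(m1 - m2)                    # mean difference per study
    var.append(s1**2 / n1 + s2**2 / n2)   # variance of the mean difference

w_fixed = [1 / v for v in var]
md_fixed = sum(w * d for w, d in zip(w_fixed, md)) / sum(w_fixed)

# Cochran's Q, tau^2 (DerSimonian-Laird) and I^2
q = sum(w * (d - md_fixed) ** 2 for w, d in zip(w_fixed, md))
df = len(studies) - 1
c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Random-effects weights, pooled estimate, and 95% CI
w_re = [1 / (v + tau2) for v in var]
md_re = sum(w * d for w, d in zip(w_re, md)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))
ci = (md_re - 1.96 * se_re, md_re + 1.96 * se_re)

print(f"Pooled MD = {md_re:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, "
      f"tau^2 = {tau2:.3f}, I^2 = {i2:.1f}%")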

Language: English

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study DOI Creative Commons
Eddie Guo, Mehul Gupta, Jiawen Deng

et al.

Journal of Medical Internet Research, Journal Year: 2023, Volume and Issue: 26, P. e48996 - e48996

Published: Sept. 28, 2023

The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and of subsequent health care decisions. Traditional methods rely heavily on human reviewers, requiring a significant investment of time and resources.

Language: English

Citations

51

Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain DOI Creative Commons
Fabio Dennstädt, Johannes Zink, Paul Martin Putora

et al.

Systematic Reviews, Journal Year: 2024, Volume and Issue: 13(1)

Published: June 15, 2024

Abstract Background Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose. Methods LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for instruction, title, and abstract, which were provided to the LLM. The relevance of the publication was evaluated by the LLM on a Likert scale (low to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion could then be defined. The approach was tested with four openly available LLMs on ten data sets of biomedical reviews and on a newly human-created data set for a hypothetical new systematic review. Results The performance varied depending on the LLM used and on the data set being analyzed. Regarding sensitivity/specificity, the approach yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten data sets. The same models yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74% on the human-created data set. Changing the standard settings (minor adaption of the instruction prompt and/or changing the Likert scale range from 1–5 to 1–10) had a considerable impact on performance. Conclusions LLMs can be used for screening scientific publications and show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting a systematic review and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.
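As a rough illustration of the workflow this abstract describes (structured prompt built from instruction, title, and abstract; a Likert-scale relevance score; a threshold turned into an inclusion classifier), here is a minimal Python sketch. The prompt wording, threshold, and model name are assumptions, and the OpenAI client is used only as a convenient stand-in; the study itself ran openly available models.

# Likert-scale relevance screening sketch: prompt an LLM for a 1-5 score,
# then classify include/exclude by a score threshold.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTION = (
    "Rate the relevance of the following publication to a systematic review on "
    "<topic and inclusion criteria go here> on a Likert scale from 1 (clearly "
    "irrelevant) to 5 (clearly relevant). Answer with a single digit."
)

def relevance_score(title: str, abstract: str, model: str = "gpt-4o-mini") -> int:
    prompt = f"{INSTRUCTION}\n\nTitle: {title}\n\nAbstract: {abstract}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    reply = response.choices[0].message.content.strip()
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 1  # default to "irrelevant" if unparsable

def include(title: str, abstract: str, threshold: int = 3) -> bool:
    # Lower thresholds favour sensitivity; higher thresholds favour specificity.
    return relevance_score(title, abstract) >= threshold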

Language: English

Citations

21

Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses DOI
Viet-Thi Tran, Gerald Gartlehner, Sally Yaacoub

et al.

Annals of Internal Medicine, Journal Year: 2024, Volume and Issue: 177(6), P. 791 - 799

Published: May 20, 2024

Systematic reviews are performed manually despite the exponential growth of scientific literature.

Language: English

Citations

16

Generative artificial intelligence in graduate medical education DOI Creative Commons

Ravi Janumpally,

Suparna Nanua,

Andy Ngo

et al.

Frontiers in Medicine, Journal Year: 2025, Volume and Issue: 11

Published: Jan. 10, 2025

Generative artificial intelligence (GenAI) is rapidly transforming various sectors, including healthcare and education. This paper explores the potential opportunities and risks of GenAI in graduate medical education (GME). We review the existing literature and provide commentary on how GenAI could impact GME, covering five key areas of opportunity: electronic health record (EHR) workload reduction, clinical simulation, individualized education, research and analytics support, and clinical decision support. We then discuss significant risks, including inaccuracy and overreliance on AI-generated content, challenges to authenticity and academic integrity, biases in AI outputs, and privacy concerns. As the technology matures, it will likely come to have an important role in the future of GME, but its integration should be guided by a thorough understanding of both its benefits and limitations.

Language: English

Citations

2

Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews DOI Creative Commons
Kentaro Matsui, Tomohiro Utsumi, Yumi Aoki

et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e52758 - e52758

Published: Aug. 16, 2024

Background The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. Objective We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity in identifying eligible records. Methods We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. Screening was performed using prompts tailored to each study. During this process, information extraction according to each study’s inclusion criteria and prompt optimization were carried out using a GPT-4–based flow without manual adjustments. Records could be excluded at any layer, and those meeting the criteria at all layers were subsequently judged as included. Results Both models were able to screen about 110 records per minute, and the total time required for the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively, and both screenings identified all 6 records used in the previous meta-analysis. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, respectively; these sensitivities align with those of human evaluators (0.867-1.000 for the first study and 0.776-0.979 for the second study), and both screenings identified all 9 records used in the meta-analysis of the second study. After accounting for records justifiably excluded by GPT-4, its sensitivities/specificities were 0.962/0.996 and 0.943/0.855, respectively. Further investigation indicated that some records were incorrectly excluded due to a lack of domain knowledge, while others were excluded because of misinterpretations of the inclusion criteria. Conclusions The 3-layer screening method demonstrated an acceptable level of sensitivity and specificity that supports its practical application to systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
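A structural sketch of the 3-layer idea described above (a record is excluded at the first failing layer and included only if it passes all three) might look like the following; the layer questions and the ask_llm callable are placeholders, not the authors' tailored prompts.

# Sequential layered screening: check research design, target patients, and
# interventions/controls in order, stopping at the first exclusion.
from typing import Callable, Dict, List

LAYERS: List[str] = [
    "Layer 1 - research design: Is this a randomized controlled trial? Answer INCLUDE or EXCLUDE.",
    "Layer 2 - target patients: Does the study enrol patients with bipolar disorder? Answer INCLUDE or EXCLUDE.",
    "Layer 3 - interventions/controls: Does it compare the treatment of interest against a control? Answer INCLUDE or EXCLUDE.",
]

def screen_record(record: Dict[str, str], ask_llm: Callable[[str], str]) -> bool:
    """Return True only if the record passes every layer."""
    text = f"Title: {record['title']}\nAbstract: {record['abstract']}"
    for layer_prompt in LAYERS:
        verdict = ask_llm(f"{layer_prompt}\n\n{text}")
        if "EXCLUDE" in verdict.upper():
            return False  # excluded at the first failing layer
    return True

def screen_all(records: List[Dict[str, str]], ask_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    # Keep only records that meet the criteria at all layers.
    return [r for r in records if screen_record(r, ask_llm)]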

Language: English

Citations

10

Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses DOI Creative Commons
Xufei Luo,

Fengxian Chen,

Di Zhu

et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e56780 - e56780

Published: May 31, 2024

Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, such tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.

Language: English

Citations

9

Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews DOI
Christian Cao,

Jason Sang,

Rohit Arora

et al.

Annals of Internal Medicine, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 24, 2025

Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. The objective was to develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews, using a diagnostic test accuracy design. A total of 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original searches. Prompt development used GPT4-0125-preview (OpenAI). LLMs were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original author decisions after full-text screening to evaluate performance (accuracy, sensitivity, specificity). Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range lower bound, 80.7%) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% in abstract screening, 49.1% in full-text screening). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs of screening 10 000 citations differed substantially: where a single human screen was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed the task in under 1 day for $157.02 USD. Limitations include that further prompt optimizations may exist, the retrospective design, the convenience sample of SRs, and full-text evaluations limited to freely available PubMed Central articles. A generic prompt template achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to investigators and researchers conducting criteria-based tasks across the medical sciences.
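To illustrate what a generic, review-adaptable screening prompt template could look like (not the authors' published templates), here is a small Python sketch that parameterizes the scaffold by a review's inclusion and exclusion criteria; the template wording is an assumption.

# Reusable include/exclude screening scaffold, adaptable to different SRs
# and to either abstract or full-text screening via the `stage` argument.
SCREEN_TEMPLATE = """You are screening citations for a systematic review.

Inclusion criteria:
{inclusion}

Exclusion criteria:
{exclusion}

{stage} to screen:
{text}

Decide whether this record meets ALL inclusion criteria and NONE of the
exclusion criteria. Respond with exactly one word: INCLUDE or EXCLUDE."""

def build_prompt(inclusion: list[str], exclusion: list[str],
                 text: str, stage: str = "Title and abstract") -> str:
    # Fill the template with the review-specific criteria and the record text.
    return SCREEN_TEMPLATE.format(
        inclusion="\n".join(f"- {c}" for c in inclusion),
        exclusion="\n".join(f"- {c}" for c in exclusion),
        stage=stage,
        text=text,
    )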

Language: English

Citations

1

Enhancing title and abstract screening for systematic reviews with GPT-3.5 turbo DOI Creative Commons
Omid Kohandel Gargari, Mohammad Hossein Mahmoudi, Mahsa Hajisafarali

et al.

BMJ evidence-based medicine, Journal Year: 2023, Volume and Issue: 29(1), P. 69 - 70

Published: Nov. 21, 2023

Language: English

Citations

18

Artificial intelligence in clinical practice: A look at ChatGPT DOI Open Access
Jiawen Deng, Kiyan Heybati, Ye‐Jean Park

et al.

Cleveland Clinic Journal of Medicine, Journal Year: 2024, Volume and Issue: 91(3), P. 173 - 180

Published: March 1, 2024

Language: English

Citations

7

Prompting is all you need: LLMs for systematic review screening DOI Creative Commons
Christian Cao,

Jason Sang,

Rohit Arora

et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 3, 2024

Abstract Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel prompting approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), higher accuracy in one review, and comparable performance in five reviews (p>0.05). While traditional screening of 7000 articles required 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform screening with performance matching that of experts, setting the foundation for end-to-end automated SRs.
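The "reason against a predefined framework" idea could be sketched roughly as below, with the model walked through each framework element before giving a verdict; the PICO elements and wording are illustrative assumptions, not the authors' Framework Chain-of-Thought prompt.

# Build a prompt that forces criterion-by-criterion reasoning against a
# predefined framework (here PICO, as an assumption) before the final verdict.
FRAMEWORK = {
    "Population": "adults with chronic pain",
    "Intervention": "spinal cord stimulation",
    "Comparator": "conventional medical management or sham",
    "Outcome": "pain intensity or emotional functioning",
}

def framework_cot_prompt(article_text: str) -> str:
    steps = "\n".join(
        f"{i}. {element}: does the article match '{criterion}'? Explain briefly."
        for i, (element, criterion) in enumerate(FRAMEWORK.items(), start=1)
    )
    return (
        "Screen the article below for a systematic review.\n"
        "Reason step by step against each framework element, then give a verdict.\n\n"
        f"{steps}\n"
        f"{len(FRAMEWORK) + 1}. Final verdict: INCLUDE or EXCLUDE.\n\n"
        f"Article:\n{article_text}"
    )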

Language: English

Citations

6