How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models
Maria Teresa Colangelo, Stefano Guizzardi, Marco Meleti, et al.

BioMedInformatics, Year: 2025, Issue: 5(1), pp. 15–15

Published: March 11, 2025

Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall efficiency, and discuss the metrics (accuracy, F1 score) used to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate the hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage screening pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying rigorous evaluation, researchers can optimize LLM-based screening processes, enabling faster and more comprehensive evidence synthesis across biomedical disciplines.
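The soft-versus-strict distinction translates directly into prompt text. Below is a minimal Python sketch of the idea, assuming hypothetical PICO criteria for a periodontics-style review; the criteria wording, the build_prompt helper, and the prompt phrasing are illustrative, not taken from the paper.

```python
# Sketch: converting PICO criteria into "soft" vs. "strict" screening prompts.
# All criteria text and prompt wording below are illustrative assumptions.

PICO = {
    "Population": "adults with periodontitis",
    "Intervention": "non-surgical periodontal therapy",
    "Comparison": "placebo, no treatment, or another therapy",
    "Outcome": "change in probing depth or clinical attachment level",
}

def build_prompt(abstract: str, strict: bool) -> str:
    criteria = "\n".join(f"- {name}: {text}" for name, text in PICO.items())
    if strict:
        # Strict prompt: favors precision; every criterion needs explicit evidence.
        rule = ("INCLUDE only if the abstract gives explicit evidence for "
                "EVERY criterion below; otherwise EXCLUDE.")
    else:
        # Soft prompt: favors recall; exclude only on an explicit failure.
        rule = ("INCLUDE unless the abstract explicitly fails at least one "
                "criterion below; when in doubt, INCLUDE.")
    return ("You are screening abstracts for a systematic review.\n"
            f"{rule}\nCriteria:\n{criteria}\n\nAbstract:\n{abstract}\n\n"
            "Answer with exactly one word: INCLUDE or EXCLUDE.")
```

The only difference between the two variants is the decision rule, which is exactly the recall/precision lever the paper discusses.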

Language: English

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
Eddie Guo, Mehul Gupta, Jiawen Deng, et al.

Journal of Medical Internet Research, Year: 2023, Issue: 26, pp. e48996–e48996

Published: September 28, 2023

The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of subsequent health care decisions. Traditional methods rely heavily on human reviewers, requiring a significant investment of time and resources.

Language: English

Cited by

53

Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain
Fabio Dennstädt, Johannes Zink, Paul Martin Putora, et al.

Systematic Reviews, Year: 2024, Issue: 13(1)

Published: June 15, 2024

Background: Systematically screening published literature to determine which publications are relevant to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose. Methods: LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic, based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, abstract, and criteria, which were then provided to an LLM. The relevance of each publication was evaluated by the LLM on a Likert scale (low to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion could then be defined. The approach was used with four different openly available LLMs on ten data sets of biomedical literature reviews and on a newly human-created data set for a hypothetical new systematic review. Results: The performance of the classifiers varied depending on the LLM used and the data sets being analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten data sets. The same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74% on the human-created data set. Changing the standard settings of the approach (minor adaption of the instruction prompt and/or changing the Likert scale range from 1–5 to 1–10) had a considerable impact on performance. Conclusions: LLMs can be used to evaluate the relevance of scientific publications to a review topic and show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting a systematic review and what further implications this might have. However, it is likely that in the future, researchers will increasingly use LLMs for evaluating and classifying publications.
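The described pipeline (structured prompt → Likert rating → threshold classifier) can be sketched as follows. The prompt wording, the parse_rating helper, and the default threshold are assumptions for illustration, not the authors' actual script.

```python
# Sketch of a Likert-scale relevance rating turned into an include/exclude
# classifier via a threshold. Prompt text and threshold are illustrative.
import re

def build_likert_prompt(instruction: str, title: str, abstract: str) -> str:
    return (f"{instruction}\n\nTitle: {title}\nAbstract: {abstract}\n\n"
            "Rate the relevance of this publication on a scale from 1 "
            "(low relevance) to 5 (high relevance). Reply with the number only.")

def parse_rating(llm_reply: str) -> int:
    # Extract the first digit in the expected range from the model's reply.
    match = re.search(r"[1-5]", llm_reply)
    if match is None:
        raise ValueError(f"no rating found in: {llm_reply!r}")
    return int(match.group())

def classify(rating: int, threshold: int = 3) -> str:
    # Lowering the threshold yields a more sensitive (recall-oriented)
    # classifier; raising it yields a more specific one.
    return "include" if rating >= threshold else "exclude"
```

Sweeping the threshold over the rating scale is what produces the different sensitivity/specificity operating points reported in the abstract.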

Language: English

Cited by

23

Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses
Viet-Thi Tran, Gerald Gartlehner, Sally Yaacoub, et al.

Annals of Internal Medicine, Year: 2024, Issue: 177(6), pp. 791–799

Published: May 20, 2024

Systematic reviews are performed manually despite the exponential growth of scientific literature.

Language: English

Cited by

16

Generative artificial intelligence in graduate medical education

Ravi Janumpally, Suparna Nanua, Andy Ngo, et al.

Frontiers in Medicine, Year: 2025, Issue: 11

Published: January 10, 2025

Generative artificial intelligence (GenAI) is rapidly transforming various sectors, including healthcare and education. This paper explores the potential opportunities and risks of GenAI in graduate medical education (GME). We review the existing literature and provide commentary on how GenAI could impact GME, covering five key areas of opportunity: electronic health record (EHR) workload reduction, clinical simulation, individualized education, research and analytics support, and clinical decision support. We then discuss significant risks, including inaccuracy and overreliance on AI-generated content, challenges to authenticity and academic integrity, biases in AI outputs, and privacy concerns. As the technology matures, GenAI will likely come to have an important role in the future of GME, but its integration should be guided by a thorough understanding of both its benefits and its limitations.

Language: English

Cited by

2

Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews
Kentaro Matsui, Tomohiro Utsumi, Yumi Aoki, et al.

Journal of Medical Internet Research, Year: 2024, Issue: 26, pp. e52758–e52758

Published: August 16, 2024

Background: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. Objective: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity in identifying eligible records. Methods: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted with GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and only those meeting the criteria at all layers were subsequently judged as included. Results: On both studies, the models were able to process about 110 records per minute, and the total screening time required for the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both models identified all 6 records used in the meta-analysis. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. These sensitivities align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the 9 evaluators of the second study. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second. Further investigation indicated that some records were excluded incorrectly due to a lack of domain knowledge, while others were excluded due to misinterpretations of the inclusion criteria. Conclusions: The 3-layer screening method demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future studies should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
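A rough sketch of such a 3-layer flow follows: a record is included only if it passes every layer, and each layer asks one narrow question, which keeps sensitivity high. The layer questions and the call_llm placeholder are hypothetical, not the authors' exact prompts.

```python
# Sketch of a 3-layer screening pipeline. Layer questions are illustrative;
# call_llm stands in for an API call to gpt-3.5-turbo-0125 or gpt-4-0125-preview.

LAYERS = [
    ("research design", "Is this a randomized controlled trial?"),
    ("target patients", "Does the study enroll patients with bipolar disorder?"),
    ("interventions and controls",
     "Does it compare an eligible treatment against a control condition?"),
]

def call_llm(prompt: str) -> str:
    """Placeholder for the actual chat-completion API call."""
    raise NotImplementedError

def screen_record(title: str, abstract: str) -> bool:
    for name, question in LAYERS:
        prompt = (f"Screening layer: {name}\n{question}\n\n"
                  f"Title: {title}\nAbstract: {abstract}\n"
                  "Answer YES or NO. If uncertain, answer YES.")  # favor sensitivity
        if "YES" not in call_llm(prompt).upper():
            return False  # excluded at this layer
    return True  # met the criteria at all three layers
```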

Language: English

Cited by

10

Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses
Xufei Luo, Fengxian Chen, Di Zhu, et al.

Journal of Medical Internet Research, Year: 2024, Issue: 26, pp. e56780–e56780

Published: May 31, 2024

Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.

Language: English

Cited by

9

Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews
Christian Cao, Jason Sang, Rohit Arora, et al.

Annals of Internal Medicine, Year: 2025, Issue: unknown

Published: February 24, 2025

Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. The objective was to develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews. In this diagnostic test accuracy study, 48,425 citations were tested across 10 SRs. Full-text screening evaluated all 12,690 freely available articles from the original searches. Prompt development used GPT4-0125-preview (OpenAI). LLMs were prompted to include or exclude articles based on SR eligibility criteria, and model outputs were compared with original author decisions after screening to evaluate performance (accuracy, sensitivity, specificity). Optimized prompts achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% and above) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% in abstract screening, 49.1% in full-text screening). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs of screening 10,000 articles differed substantially: where a single human reviewer was estimated to require more than 83 hours and $1666.67 USD, the LLM-based approach completed screening in under 1 day for $157.02 USD. Limitations: further prompt optimizations may exist; this was a retrospective study; and the full-text evaluation relied on a convenience sample limited to freely available PubMed Central articles. Generic prompt templates achieving high sensitivity and specificity, adaptable to other SRs and LLMs, were developed. These prompting innovations may have value to investigators and researchers conducting criteria-based screening tasks in the medical sciences.
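The reported accuracy, sensitivity, and specificity follow from comparing the model's include/exclude outputs against the original author decisions. Below is a minimal sketch of that scoring step; the screening_metrics helper and the example data are illustrative, not the authors' code.

```python
# Sketch: scoring model include/exclude decisions against author decisions.
# True = include, False = exclude.

def screening_metrics(model: list[bool], authors: list[bool]) -> dict[str, float]:
    tp = sum(m and a for m, a in zip(model, authors))          # correctly included
    tn = sum(not m and not a for m, a in zip(model, authors))  # correctly excluded
    fp = sum(m and not a for m, a in zip(model, authors))      # wrongly included
    fn = sum(not m and a for m, a in zip(model, authors))      # wrongly excluded (costly)
    return {
        "accuracy": (tp + tn) / len(model),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Example: model includes records 1 and 2; authors included records 1 and 3.
print(screening_metrics([True, True, False, False], [True, False, True, False]))
```

In screening, false negatives (wrongly excluded studies) are the most damaging error, which is why these papers report sensitivity alongside accuracy.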

Language: English

Cited by

1

Enhancing title and abstract screening for systematic reviews with GPT-3.5 turbo
Omid Kohandel Gargari, Mohammad Hossein Mahmoudi, Mahsa Hajisafarali, et al.

BMJ Evidence-Based Medicine, Year: 2023, Issue: 29(1), pp. 69–70

Published: November 21, 2023

Language: English

Cited by

18

Artificial intelligence in clinical practice: A look at ChatGPT
Jiawen Deng, Kiyan Heybati, Ye‐Jean Park, et al.

Cleveland Clinic Journal of Medicine, Year: 2024, Issue: 91(3), pp. 173–180

Published: March 1, 2024

Language: English

Cited by

7

Prompting is all you need: LLMs for systematic review screening
Christian Cao, Jason Sang, Rohit Arora, et al.

medRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: June 3, 2024

Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel prompting approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, and prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), lower sensitivity in one review, and comparable sensitivity in the remaining five (p>0.05). While traditional dual-reviewer screening of 7000 articles required 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform screening with performance matching human experts, setting the foundation for end-to-end automated SRs.
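A minimal sketch of what Framework Chain-of-Thought-style prompting might look like, assuming a PICO-like framework: rather than asking for a bare include/exclude label, the prompt directs the model to reason step by step against each element of a predefined eligibility framework before deciding. The framework items, wording, and framework_cot_prompt helper are illustrative assumptions, not the authors' published template.

```python
# Sketch: a prompt that forces element-by-element reasoning against a
# predefined framework before the final screening decision.

FRAMEWORK = [
    "Population: does the study enroll the target population?",
    "Intervention: is the intervention of interest evaluated?",
    "Comparison: is an eligible comparator present?",
    "Outcome: is a prespecified outcome reported?",
]

def framework_cot_prompt(article_text: str) -> str:
    steps = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(FRAMEWORK))
    return (
        "You are screening an article for a systematic review.\n"
        "Reason through each framework element in order, quoting the "
        "evidence from the article that supports your judgment:\n"
        f"{steps}\n\n"
        f"Article:\n{article_text}\n\n"
        "After reasoning through all elements, conclude with one final line: "
        "DECISION: INCLUDE or DECISION: EXCLUDE."
    )
```

Anchoring the model's reasoning to the framework, element by element, is what distinguishes this approach from the zero-shot prompting that the related Annals paper above reports as performing poorly.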

Language: English

Cited by

6