BioMedInformatics,
Journal year: 2025, Issue 5(1), pp. 15-15
Published: March 11, 2025
Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall efficiency, and discuss metrics (accuracy, F1 score) to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate the hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying rigorous evaluation, researchers can optimize LLM-based screening processes, allowing faster and more comprehensive synthesis across biomedical disciplines.
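The soft-versus-strict trade-off described above can be sketched as a pair of prompt templates. This is a minimal illustration assuming a hypothetical PICO specification and wording; it is not the paper's actual prompt.

```python
# Sketch: converting PICO criteria into "soft" vs "strict" screening prompts.
# The PICO fields and all wording below are hypothetical examples.

PICO = {
    "Population": "adults with chronic periodontitis",
    "Intervention": "laser-assisted scaling and root planing",
    "Comparison": "conventional scaling and root planing",
    "Outcome": "change in probing pocket depth",
}

def build_prompt(pico: dict, mode: str) -> str:
    """Render PICO criteria as screening instructions for an LLM."""
    criteria = "\n".join(f"- {k}: {v}" for k, v in pico.items())
    if mode == "soft":
        # Soft rule favors recall: accept unless a criterion explicitly fails.
        rule = ("Include the abstract unless it explicitly fails one of the "
                "criteria below. When in doubt, answer INCLUDE.")
    elif mode == "strict":
        # Strict rule favors precision: demand evidence for every criterion.
        rule = ("Include the abstract only if it provides explicit evidence "
                "for every criterion below. When in doubt, answer EXCLUDE.")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{rule}\nCriteria:\n{criteria}\nAnswer INCLUDE or EXCLUDE."

soft_prompt = build_prompt(PICO, "soft")
strict_prompt = build_prompt(PICO, "strict")
```

The only difference between the two templates is the decision rule prepended to the shared criteria list, which makes the recall/precision trade-off an explicit, auditable design choice.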
Journal of Medical Internet Research,
Journal year: 2023, Issue 26, pp. e48996-e48996
Published: Sep. 28, 2023
The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of subsequent health care decisions. Traditional methods rely heavily on human reviewers, requiring a significant investment of time and resources.
Systematic Reviews,
Journal year: 2024, Issue 13(1)
Published: June 15, 2024
Abstract
Background
Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.
Methods
LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, and abstract, which were provided to the LLM. The relevance of each publication was evaluated by the LLM on a Likert scale (low to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion could then be defined. The approach was tested with four openly available LLMs on ten data sets from biomedical reviews and on a newly human-created data set for a hypothetical new systematic review.
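The prompt-construction and thresholding steps described in the Methods can be sketched in Python. The helper names and the keyword-based `fake_llm_score` are purely illustrative stand-ins for the real model call.

```python
# Sketch of the described pipeline: one structured prompt per publication,
# a Likert-scale relevance score from the LLM, and a threshold classifier.

def make_prompt(instruction: str, title: str, abstract: str) -> str:
    """Concatenate instruction, title, and abstract into one structured prompt."""
    return (f"{instruction}\n"
            f"Title: {title}\n"
            f"Abstract: {abstract}\n"
            f"Rate the relevance on a scale from 1 (low) to 5 (high).")

def fake_llm_score(prompt: str) -> int:
    # Placeholder: a real system would send `prompt` to an LLM and parse
    # the returned Likert rating. Here we score by simple keyword presence.
    return 5 if "periodontitis" in prompt.lower() else 1

def classify(score: int, threshold: int) -> str:
    """Turn a Likert score into an inclusion decision at a given threshold."""
    return "include" if score >= threshold else "exclude"

prompt = make_prompt(
    "Assess whether this publication is relevant to the review topic.",
    "Laser therapy in periodontitis",
    "A randomized trial of adjunctive laser therapy ...",
)
decision = classify(fake_llm_score(prompt), threshold=4)
```

Because the score is ordinal, sweeping the threshold yields a family of classifiers with different sensitivity/specificity trade-offs, which is exactly what the Results section compares.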
Results
The performance varied depending on the LLM and the data set being analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at a specificity of 12.58%, 4.54%, 62.47%, and 24.74%, respectively, on the human-created data set. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the Likert scale range from 1–5 to 1–10) had a considerable impact on the performance.
Conclusions
LLMs can be used to evaluate the relevance of scientific publications, and they show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting a systematic review and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.
Frontiers in Medicine,
Journal year: 2025, Issue 11
Published: Jan. 10, 2025
Generative artificial intelligence (GenAI) is rapidly transforming various sectors, including healthcare and education. This paper explores the potential opportunities and risks of GenAI in graduate medical education (GME). We review the existing literature and provide commentary on how GenAI could impact GME, focusing on five key areas of opportunity: electronic health record (EHR) workload reduction, clinical simulation, individualized education, research and analytics support, and decision support. We then discuss significant risks, including inaccuracy of and overreliance on AI-generated content, challenges to authenticity and academic integrity, biases in AI outputs, and privacy concerns. As the technology matures, it will likely come to have an important role in the future of GME, but its integration should be guided by a thorough understanding of both its benefits and limitations.
Journal of Medical Internet Research,
Journal year: 2024, Issue 26, pp. e52758-e52758
Published: Aug. 16, 2024
Background
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
Objective
We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a method that maximizes sensitivity for identifying relevant records.
Methods
We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted with GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. Screening was conducted using prompts tailored to each study. During this process, information extraction according to each study’s inclusion criteria and optimization of the prompts were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the criteria at all layers were subsequently judged as included.
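A minimal sketch of the layered flow described above: a record is included only if it passes the design, patient, and intervention/control checks in sequence. The predicate logic here is a hypothetical stand-in for the LLM judgments the Methods describe.

```python
# Hypothetical 3-layer screening flow: each layer is a yes/no gate, and a
# record is included only if every layer accepts it.

def layer_design(record: dict) -> bool:
    # Layer 1: research design check (stand-in rule).
    return record.get("design") == "RCT"

def layer_patients(record: dict) -> bool:
    # Layer 2: target patient check (stand-in rule).
    return "bipolar disorder" in record.get("population", "")

def layer_intervention(record: dict) -> bool:
    # Layer 3: interventions and controls check (stand-in rule).
    return record.get("has_control_arm", False)

LAYERS = [layer_design, layer_patients, layer_intervention]

def screen(record: dict) -> bool:
    """Include a record only if every layer accepts it."""
    return all(layer(record) for layer in LAYERS)

record = {"design": "RCT",
          "population": "adults with bipolar disorder",
          "has_control_arm": True}
```

Structuring the decision as a conjunction of narrow gates keeps each prompt short and focused, which is one plausible reason a layered design can hold sensitivity high while still filtering records.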
Results
In both screenings, GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for the first and second screenings was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of the GPT-3.5 and GPT-4 screenings were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, respectively. These sensitivities align with those of human evaluators: 0.867-1.000 in the first study and 0.776-0.979 in the second, and all 9 records used for the meta-analysis were retained. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while those excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
Conclusions
Our 3-layer screening method demonstrated acceptable sensitivity and a level of specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
Journal of Medical Internet Research,
Journal year: 2024, Issue 26, pp. e56780-e56780
Published: May 31, 2024
Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.
Annals of Internal Medicine,
Journal year: 2025, Issue unknown
Published: Feb. 24, 2025
Background
Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective
To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
Design
Diagnostic test accuracy study. 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used GPT4-0125-preview (OpenAI).
Intervention
None.
Measurements
Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with author decisions after full-text review to evaluate performance (accuracy, sensitivity, specificity).
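Comparing model include/exclude outputs against author decisions reduces to a confusion matrix. A sketch of that metric computation (function names are illustrative, not from the paper):

```python
# Compute accuracy, sensitivity, and specificity from paired include/exclude
# decisions, where True means "include".

def confusion(model: list, truth: list):
    """Count true/false positives and negatives for paired boolean decisions."""
    tp = sum(m and t for m, t in zip(model, truth))
    tn = sum(not m and not t for m, t in zip(model, truth))
    fp = sum(m and not t for m, t in zip(model, truth))
    fn = sum(not m and t for m, t in zip(model, truth))
    return tp, tn, fp, fn

def metrics(model: list, truth: list) -> dict:
    tp, tn, fp, fn = confusion(model, truth)
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),   # recall on truly included records
        "specificity": tn / (tn + fp),   # recall on truly excluded records
    }

m = metrics([True, True, False, False], [True, False, True, False])
```

For screening, sensitivity is the critical metric: a false negative silently removes a relevant study from the evidence base, whereas a false positive only costs extra review time.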
Results
Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to …) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% for abstract screening, 49.1% for full-text screening). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs of screening 10 000 citations differed substantially: where a single human reviewer was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD. Further optimizations may exist.
Limitations
Retrospective study. Convenience sample of SRs. Full-text evaluations were limited to freely available PubMed Central articles.
Conclusion
A generic prompt template for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to investigators and researchers conducting criteria-based tasks across the medical sciences.
medRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue unknown
Published: June 3, 2024
Abstract
Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, and prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), lower sensitivity in one review, and comparable sensitivity in five reviews (p>0.05). While traditional screening for 7000 articles required 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform SR screening with performance matching human experts, setting the foundation for end-to-end automated SRs.
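The "reason against a predefined framework" idea can be sketched as a prompt builder that forces the model through each framework element before a verdict. The framework items below are hypothetical examples for a diagnostic-test-accuracy question, not the paper's templates.

```python
# Hypothetical framework-guided chain-of-thought prompt: the model must state
# reasoning for each framework element, then give a final include/exclude call.

FRAMEWORK = [
    "Population matches the review question",
    "Index test is evaluated against a reference standard",
    "Outcome data allow a 2x2 accuracy table",
]

def framework_cot_prompt(title: str, abstract: str) -> str:
    """Build a prompt that walks the model through each framework criterion."""
    steps = "\n".join(f"{i}. {c} - state your reasoning, then YES/NO."
                      for i, c in enumerate(FRAMEWORK, 1))
    return (f"Title: {title}\nAbstract: {abstract}\n"
            f"Reason step by step against each criterion:\n{steps}\n"
            f"Finally answer INCLUDE only if all answers are YES.")

p = framework_cot_prompt("Ultrasound for appendicitis", "We assessed ...")
```

Requiring an explicit per-criterion verdict makes the model's decision auditable: a reviewer can check which framework element triggered an exclusion rather than trusting a bare yes/no.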