Citation screening using large language models for creating clinical practice guidelines: A protocol for a prospective study
Takehiko Oami, Yohei Okada, Taka‐aki Nakada

et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Dec. 31, 2023

Abstract Background: The development of clinical practice guidelines requires a meticulous literature search and screening process. This study aims to explore the potential of large language models in the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock (J-SSCG), focusing on enhancing search quality and reducing the citation screening workload. Methods: A prospective study will be conducted to compare the efficiency and accuracy of the conventional method and a novel approach using large language models. We will use a large language model, namely GPT-4, to conduct literature searches for predefined clinical questions and objectively measure the time required, comparing it with the time taken by the conventional method. Following the screening, we will calculate the sensitivity and specificity of the results obtained from the large language model-assisted process. The total time spent on both approaches will also be compared to assess the workload reduction. Trial registration: This research is submitted to the University hospital medical information network clinical trial registry (UMIN-CTR) [UMIN000053091]. Conflicts of interest: All authors declare no conflicts of interest. Funding: None.
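
The protocol's headline metrics are standard diagnostic-accuracy quantities. As a minimal sketch (simple set-based bookkeeping, not the study's actual code), sensitivity and specificity of the LLM-assisted screen can be computed against the conventional decisions as the reference standard:

```python
# Minimal sketch (not the study's code) of the protocol's two headline
# metrics: sensitivity and specificity of LLM-assisted screening, with the
# conventional (human) screening decisions taken as the reference standard.

def screening_accuracy(llm_included: set[str], human_included: set[str],
                       all_ids: set[str]) -> tuple[float, float]:
    """Return (sensitivity, specificity) of LLM decisions vs the reference."""
    tp = len(llm_included & human_included)     # kept correctly
    fn = len(human_included - llm_included)     # wrongly excluded
    human_excluded = all_ids - human_included
    tn = len(human_excluded - llm_included)     # excluded correctly
    fp = len(llm_included & human_excluded)     # wrongly kept
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Example: 3 of 4 truly relevant citations kept, 90 of 96 irrelevant removed.
all_ids = {f"c{i}" for i in range(100)}
human = {"c1", "c2", "c3", "c4"}
llm = {"c1", "c2", "c3"} | {f"c{i}" for i in range(10, 16)}
print(screening_accuracy(llm, human, all_ids))  # (0.75, 0.9375)
```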

Language: English

Large Language Models in Mental Health Care: A Systematic Scoping Review (Preprint)
Yining Hua, Fenglin Liu, Kailai Yang

et al.

Published: July 8, 2024

BACKGROUND: The integration of large language models (LLMs) in mental health care is an emerging field. There is a need to systematically review the application outcomes and delineate the advantages and limitations in clinical settings. OBJECTIVE: This review aims to provide a comprehensive overview of the use of LLMs in mental health care, assessing their efficacy, challenges, and potential for future applications. METHODS: A systematic search was conducted across multiple databases, including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv, in November 2023. All forms of original research, peer-reviewed or not, published or disseminated between October 1, 2019, and December 2, 2023, were included without restrictions if they used LLMs developed after T5 and directly addressed the research questions. RESULTS: From an initial pool of 313 articles, 34 met the inclusion criteria based on relevance to LLM applications in mental health care and the robustness of reported outcomes. Diverse applications were identified, including diagnosis, therapy, and patient engagement enhancement. Key challenges include data availability and reliability, the nuanced handling of mental states, and effective evaluation methods. Despite successes in accuracy and accessibility improvement, gaps in clinical applicability and ethical considerations were evident, pointing to the need for robust data, standardized evaluations, and interdisciplinary collaboration. CONCLUSIONS: LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, evaluation frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.

Language: English

Citations: 13

Performance of a Large Language Model in Screening Citations
Takehiko Oami, Yohei Okada, Taka‐aki Nakada

et al.

JAMA Network Open, Journal Year: 2024, Volume and Issue: 7(7), P. e2420496 - e2420496

Published: July 8, 2024

Importance: Large language models (LLMs) are promising as tools for citation screening in systematic reviews. However, their applicability has not yet been determined. Objective: To evaluate the accuracy and efficiency of an LLM in title and abstract literature screening. Design, Setting, and Participants: This prospective diagnostic study used the data from the citation screening process for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided to include or exclude citations based on the inclusion and exclusion criteria in terms of patient, population, problem; intervention; comparison; and study design of the selected CQ, and was compared with the conventional method, conducted from January 7 to January 15, 2024. Exposures: The LLM (GPT-4 Turbo)–assisted screening method. Main Outcomes and Measures: The sensitivity and specificity of the LLM-assisted method were calculated, with the full-text screening result using the conventional method set as the reference standard in the primary analysis. Pooled sensitivity and specificity were also estimated, and the screening times of the 2 methods were compared. Results: In the conventional screening process, 8 of 5634 publications in CQ 1, 4 of 3418 in CQ 2, … of 1038 in CQ 3, 17 of 4326 in CQ 4, and … of 2253 in CQ 5 were selected. In the primary analysis of the 5 CQs, the LLM-assisted method demonstrated an integrated sensitivity of 0.75 (95% CI, 0.43 to 0.92) and specificity of 0.99 (95% CI, 0.99 to 0.99). Post hoc modifications to the command prompt improved the sensitivity to 0.91 (95% CI, 0.77 to 0.97) without substantially compromising specificity (0.98 [95% CI, 0.96 to 0.99]). Additionally, the LLM-assisted method was associated with reduced time for processing 100 studies (1.3 minutes vs 17.2 minutes for the conventional method; mean difference, −15.25 minutes [95% CI, −17.70 to −12.79 minutes]). Conclusions and Relevance: In this prospective diagnostic study investigating the performance of LLM-assisted citation screening, the model demonstrated acceptable sensitivity and reasonably high specificity with reduced screening time. This novel method could potentially enhance efficiency and reduce workload in citation screening.
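
The study describes GPT-4 Turbo deciding include/exclude against PICO-style criteria. Below is a hypothetical sketch of such a zero-shot screening call using the OpenAI Python client; the criteria text, prompt wording, and model identifier are illustrative assumptions, not the study's actual prompt:

```python
# Hypothetical sketch of the kind of zero-shot include/exclude call the study
# describes, using the OpenAI Python client. The criteria, model name, and
# prompt wording are illustrative placeholders, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = """Population: adult patients with sepsis or septic shock.
Intervention: early enteral nutrition. Comparison: delayed or no enteral
nutrition. Study design: randomized controlled trials only."""

def screen(title: str, abstract: str) -> str:
    """Ask the model for a one-word INCLUDE/EXCLUDE decision on a citation."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model identifier
        temperature=0,        # more reproducible decisions
        messages=[
            {"role": "system",
             "content": "You screen citations for a systematic review. "
                        f"Eligibility criteria:\n{CRITERIA}\n"
                        "Answer with exactly one word: INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip().upper()
```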

Language: English

Citations: 11

Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews
Kentaro Matsui, Tomohiro Utsumi, Yumi Aoki

et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e52758 - e52758

Published: Aug. 16, 2024

Background: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. Objective: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity in identifying eligible records. Methods: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted with GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and prompt optimization were carried out using a GPT-4–based flow without manual adjustments. Records were excluded at each layer, and only those meeting the criteria at all layers were subsequently judged as included. Results: On both reviews, the method was able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used in the meta-analysis as included. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, respectively. These sensitivities align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second, in which all 9 records used in the meta-analysis were judged as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 for the first study and 0.943/0.855 for the second. Further investigation indicated that some records were incorrectly excluded due to a lack of domain knowledge, while others were excluded due to misinterpretations of the inclusion criteria. Conclusions: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application to systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
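
The 3-layer design can be read as a short-circuiting conjunction of per-criterion judgments. Here is a minimal sketch under that assumption; the keyword-based layers stand in for the LLM prompts the authors used and are purely illustrative:

```python
# Minimal sketch (assumed structure, not the authors' code) of the 3-layer
# idea: a record must pass a design layer, a patient layer, and an
# intervention/control layer; failing any single layer excludes it.
from typing import Callable

Record = dict[str, str]           # e.g. {"title": ..., "abstract": ...}
Layer = Callable[[Record], bool]  # True = record passes this layer

def three_layer_screen(record: Record, layers: list[Layer]) -> bool:
    """Include a record only if every layer votes to keep it."""
    return all(layer(record) for layer in layers)

# Placeholder layers; in the study each was an LLM prompt tailored to one
# criterion (research design, target patients, interventions/controls).
def design_layer(r: Record) -> bool:
    return "randomized" in r["abstract"].lower()

def patient_layer(r: Record) -> bool:
    return "bipolar" in r["abstract"].lower()

def intervention_layer(r: Record) -> bool:
    return "placebo" in r["abstract"].lower()

record = {"title": "...", "abstract": "A randomized, placebo-controlled "
          "trial of lithium in bipolar disorder."}
print(three_layer_screen(record, [design_layer, patient_layer,
                                  intervention_layer]))  # True
```

Because all() short-circuits, ordering cheaper or more discriminating layers first reduces the number of LLM calls per record.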

Language: English

Citations: 10

Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses
Xufei Luo, Fengxian Chen, Di Zhu

et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e56780 - e56780

Published: May 31, 2024

Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.

Language: English

Citations: 9

Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews
Christian Cao, Jason Sang, Rohit Arora

et al.

Annals of Internal Medicine, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 24, 2025

Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. Objective: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews. Design: Diagnostic test accuracy study. Setting: 48 425 citations were tested across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original searches. Prompt development used GPT4-0125-preview (OpenAI). Intervention: None. Measurements: Large language models were prompted to include or exclude citations based on SR eligibility criteria. Model outputs were compared with original author decisions after full-text screening to evaluate performance (accuracy, sensitivity, specificity). Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to …) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% in abstract screening, 49.1% in full-text screening). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs differed substantially: where a single human screen was estimated to require more than 83 hours and $1666.67 USD, the LLM-based approach completed screening in under 1 day for $157.02 USD. Limitations: Further prompt optimizations may exist; this was a retrospective study of a convenience sample of SRs, with full-text evaluations limited to free PubMed Central articles. Conclusion: Generic prompts for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs were developed. Our prompting innovations may have value to investigators and researchers conducting criteria-based tasks across the medical sciences.
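
The core idea is one reusable prompt skeleton whose slots take a given review's eligibility criteria. A sketch of what such a generic template could look like follows; the wording is a placeholder, not the published template:

```python
# Illustrative sketch of the "generic template" idea: one reusable prompt
# skeleton whose slots are filled with a given review's eligibility criteria.
SCREENING_TEMPLATE = """You are screening {stage} records for a systematic
review.

Inclusion criteria:
{inclusion}

Exclusion criteria:
{exclusion}

Apply every criterion to the record below. If any exclusion criterion
applies, or any inclusion criterion clearly fails, answer EXCLUDE;
otherwise answer INCLUDE.

Record:
{record}"""

def build_prompt(stage: str, inclusion: list[str], exclusion: list[str],
                 record: str) -> str:
    """Adapt the one generic template to a specific review's criteria."""
    return SCREENING_TEMPLATE.format(
        stage=stage,
        inclusion="\n".join(f"- {c}" for c in inclusion),
        exclusion="\n".join(f"- {c}" for c in exclusion),
        record=record,
    )

prompt = build_prompt(
    stage="abstract",
    inclusion=["adults with type 2 diabetes", "randomized controlled trial"],
    exclusion=["animal studies", "conference abstracts"],
    record="Title: ...\nAbstract: ...",
)
```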

Language: English

Citations: 1

High-performance automated abstract screening with large language model ensembles
Rohan Sanghera, Arun James Thirunavukarasu, Marc Khoury

et al.

Journal of the American Medical Informatics Association, Journal Year: 2025, Volume and Issue: unknown

Published: March 22, 2025

Abstract Objective: Screening is a labor-intensive component of systematic review involving the repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening. Materials and Methods: LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695). Results: On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, sensitivity remained consistent (range 0.756-1.000), but precision diminished (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458 on the development dataset, decreasing to 0.1450 on the comprehensive dataset, while conferring workload reductions ranging between 37.55% and 99.11%. Discussion: Automated abstract screening can reduce manual labor while maintaining review quality. Performance variation highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records. Conclusion: LLMs may reduce the labor and cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
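
One plausible reading of the ensemble result is an include-if-any-member-includes rule, which maximizes sensitivity at the cost of precision; workload reduction is then the share of records the ensemble screens out before human review. A sketch under that assumption, with made-up member names and votes:

```python
# Sketch of the ensemble idea under one plausible reading: a record is kept
# for human review if ANY member (model or human) votes to include it, which
# maximizes sensitivity. Member names and votes below are invented.
def union_ensemble(votes: dict[str, set[str]]) -> set[str]:
    """Union of per-member include sets: include if any member includes."""
    included: set[str] = set()
    for member_includes in votes.values():
        included |= member_includes
    return included

def workload_reduction(included: set[str], total: int) -> float:
    """Fraction of records the ensemble screens out before human review."""
    return 1.0 - len(included) / total

votes = {
    "gpt-4o": {"r1", "r2", "r7"},
    "claude-3.5-sonnet": {"r1", "r2", "r3"},
    "llama-3-70b": {"r2", "r9"},
}
kept = union_ensemble(votes)                       # {'r1','r2','r3','r7','r9'}
print(workload_reduction(kept, total=100))         # 0.95
```

An OR-rule deliberately trades precision for sensitivity, consistent with the abstract's report of perfect ensemble sensitivity alongside low precision on imbalanced data.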

Language: English

Citations: 1

Evaluating the OpenAI’s GPT-3.5 Turbo’s performance in extracting information from scientific articles on diabetic retinopathy
Celeste Ci Ying Gue, Noorul Dharajath Abdul Rahim, William Rojas‐Carabali

et al.

Systematic Reviews, Journal Year: 2024, Volume and Issue: 13(1)

Published: May 16, 2024

Abstract: We aimed to compare the concordance of information extracted and the time taken between a large language model (OpenAI's GPT-3.5 Turbo via API) and conventional human extraction methods in retrieving information from scientific articles on diabetic retinopathy (DR). The extraction was done using GPT-3.5 Turbo as of October 2023. OpenAI's GPT-3.5 Turbo significantly reduced the time taken for extraction. Concordance was highest at 100% for the country of the study, followed by 64.7% for significant risk factors of DR, 47.1% for exclusion and inclusion criteria, and lastly 41.2% for the odds ratio (OR) and 95% confidence interval (CI). The concordance levels seemed to indicate the complexity associated with each prompt. This suggests that the model may be adopted to extract simple information that is easily located in the text, leaving more complex information to be extracted by the researcher. It is crucial to note that foundation models are constantly improving, with new versions being released quickly. Subsequent work can focus on retrieval-augmented generation (RAG), embedding, chunking the PDF into useful sections, and prompting to improve extraction accuracy.
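
Field-level concordance as reported here can be read as the share of articles where the model's extracted value matches the human-extracted value for a field. A minimal sketch under that assumption, with invented rows:

```python
# Sketch of the comparison the study reports, under assumed data structures:
# concordance for one field is the fraction of paired extractions that agree.
def field_concordance(llm_rows: list[dict], human_rows: list[dict],
                      field: str) -> float:
    """Fraction of paired LLM/human extractions that agree on one field."""
    matches = sum(
        1 for llm, human in zip(llm_rows, human_rows)
        if llm.get(field, "").strip().lower() ==
           human.get(field, "").strip().lower()
    )
    return matches / len(human_rows)

# Invented example rows, one dict per article.
llm_rows = [{"country": "Singapore"}, {"country": "India"}]
human_rows = [{"country": "Singapore"}, {"country": "Malaysia"}]
print(field_concordance(llm_rows, human_rows, "country"))  # 0.5
```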

Language: English

Citations: 8

Diagnostic accuracy of large language models in psychiatry
Omid Kohandel Gargari, Farhad Fatehi, Ida Mohammadi

et al.

Asian Journal of Psychiatry, Journal Year: 2024, Volume and Issue: 100, P. 104168 - 104168

Published: July 25, 2024

Language: English

Citations: 7

Prompting is all you need: LLMs for systematic review screening
Christian Cao, Jason Sang, Rohit Arora

et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 3, 2024

Abstract: Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel prompting approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], mean sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity (p<0.05), with accuracy higher in one review and comparable in five (p>0.05). While traditional screening for 7000 articles required an estimated 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform screening with performance matching human experts, setting the foundation for end-to-end automated SRs.
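
Framework Chain-of-Thought, as described, directs the model to reason against each predefined criterion before deciding. The sketch below is one interpretation of that idea, not the authors' published prompt; the framework items are invented:

```python
# Illustrative sketch of "reasoning against a predefined framework": the
# prompt forces an explicit per-criterion verdict before the final decision.
FRAMEWORK = [
    "Population: community-dwelling adults aged 65 or older",
    "Outcome: prevalence of frailty measured with a validated scale",
    "Design: cross-sectional or cohort study",
]

def framework_cot_prompt(article_text: str) -> str:
    """Build a prompt that walks the model through each framework item."""
    steps = "\n".join(
        f"{i}. Quote the passage relevant to: {c}. State PASS or FAIL."
        for i, c in enumerate(FRAMEWORK, start=1)
    )
    return (
        "Screen the article against each framework item in order.\n"
        f"{steps}\n"
        "Finally, answer INCLUDE only if every item is PASS, else EXCLUDE.\n\n"
        f"Article:\n{article_text}"
    )
```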

Language: English

Citations: 6

Implementation and evaluation of an additional GPT-4-based reviewer in PRISMA-based medical systematic literature reviews
Assaf Landschaft, Dario Antweiler, Sina Mackay

et al.

International Journal of Medical Informatics, Journal Year: 2024, Volume and Issue: 189, P. 105531 - 105531

Published: June 26, 2024

PRISMA-based literature reviews require meticulous scrutiny of extensive textual data by multiple reviewers, which is associated with considerable human effort.

Language: English

Citations: 4