LLMscreen: A Python Package for Systematic Review Screening of Scientific Texts Using Prompt Engineering
Ziqian Xia, Jinquan Ye, Bo Hu

et al.

Research Square (Research Square), Journal year: 2024, Issue: unknown

Published: Sep. 11, 2024

Abstract: Systematic reviews are a cornerstone of evidence-based research, yet the process is labor-intensive and time-consuming, often requiring substantial human resources. The advent of Large Language Models (LLMs) offers a novel approach to streamlining systematic reviews, particularly in the title and abstract screening phase. This study introduces a new Python package built on LLMs to accelerate this process, evaluating its performance across three datasets using distinct prompt strategies: single-prompt, k-value setting, and zero-shot. The most effective setting achieved a precision of 0.649 and reduced the average error rate to 0.4%, significantly lower than the 10.76% typically observed among human reviewers. Moreover, the package enabled the screening of 3,000 papers in under 8 minutes at a cost of only $0.30, an over 250-fold improvement in time and a 2,000-fold improvement in cost efficiency compared with traditional methods. These findings underscore the potential of LLMs to enhance screening efficiency and accuracy, though further research is needed to address challenges related to dataset variability and model transparency. Expanding the application of LLMs to other stages, such as data extraction and synthesis, could further streamline the review process, making it more comprehensive and less burdensome for researchers.
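
The abstract does not describe the package's programming interface, so the following is only a minimal sketch of the zero-shot strategy it evaluates: each title/abstract pair is sent to an LLM together with the review's inclusion criteria, and the model returns an include/exclude decision. The OpenAI client, the model name, the criteria text, and the screen_record helper are illustrative assumptions, not part of LLMscreen.

```python
# Minimal zero-shot screening sketch (illustrative only; not the LLMscreen API).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = """Include studies that empirically evaluate large language models
for title/abstract screening in systematic reviews; exclude editorials and
opinion pieces."""  # placeholder criteria for illustration

def screen_record(title: str, abstract: str, model: str = "gpt-4o-mini") -> str:
    """Return 'INCLUDE' or 'EXCLUDE' for a single title/abstract pair."""
    prompt = (
        f"Inclusion/exclusion criteria:\n{CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decisions help reproducibility
    )
    answer = response.choices[0].message.content.strip().upper()
    return "INCLUDE" if answer.startswith("INCLUDE") else "EXCLUDE"

# Example: screen a small batch of records
records = [("LLMs for abstract screening", "We evaluate GPT-4 across 23 reviews...")]
decisions = [screen_record(title, abstract) for title, abstract in records]
```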

Language: English

Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals
Inbar Levkovich

European Journal of Investigation in Health Psychology and Education, Journal year: 2025, Issue: 15(1), pp. 9 - 9

Published: Jan. 18, 2025

Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs: Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.5 Sonnet), ChatGPT-3.5, and ChatGPT-4, using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model's diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models like ChatGPT-4 achieved higher diagnostic accuracy than human professionals. However, in more complex cases, LLM performance varied, with accuracy dropping to 55% in some instances while professionals performed better. The LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended targeted psychiatric consultations and specific medications. In terms of outcome predictions, the models were generally optimistic regarding full recovery, especially with treatment, with lower recovery rates and higher partial-recovery rates predicted for untreated cases. While the LLMs offered a broad range of suggestions and the professionals were more conservative, these differences highlight the need for professional oversight. LLMs can provide valuable support in diagnostics and treatment planning but cannot replace clinical discretion.

Language: English

Cited by

1

High-performance automated abstract screening with large language model ensembles
Rohan Sanghera, Arun James Thirunavukarasu, Marc Khoury

et al.

Journal of the American Medical Informatics Association, Journal year: 2025, Issue: unknown

Published: March 22, 2025

Abstract: Objective: Abstract screening is a labor-intensive component of systematic review, involving repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening. Materials and Methods: LLMs (GPT-3.5 Turbo, GPT-4, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best-performing LLM-prompt combinations were then validated on comprehensive replicated search results (n = 119,695). Results: On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax 1.000 vs humanmax 0.775), precision (0.927 vs 0.911), and accuracy (0.904 vs 0.865). When evaluated on the comprehensive dataset, sensitivity remained consistent (range 0.756-1.000) but precision diminished (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles achieved perfect sensitivity, with maximal precision of 0.458 decreasing to 0.1450 over the comprehensive dataset, conferring workload reductions ranging between 37.55% and 99.11%. Discussion: Automated abstract screening can reduce workload while maintaining review quality. Performance variation highlights the importance of domain-specific validation before autonomous deployment; with human oversight of all records, similar benefits may be achieved. Conclusion: LLMs may reduce the labor and cost of abstract screening with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
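
The ensemble approach described above can be illustrated with a short sketch: decisions from several screeners (LLMs or humans) are combined with an OR rule, so a record is retained if any screener includes it, which preserves sensitivity while precision and workload reduction are measured against gold-standard labels. The decision lists and helper functions below are hypothetical and are not taken from the study's code.

```python
# Illustrative OR-rule ensemble of screening decisions (hypothetical; not the authors' code).
from typing import Sequence

def or_ensemble(decision_sets: Sequence[Sequence[bool]]) -> list[bool]:
    """Include a record if ANY screener voted to include it (maximizes sensitivity)."""
    return [any(votes) for votes in zip(*decision_sets)]

def sensitivity(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    tp = sum(p and t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    return tp / (tp + fn) if (tp + fn) else float("nan")

def precision(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and (not t) for p, t in zip(pred, truth))
    return tp / (tp + fp) if (tp + fp) else float("nan")

def workload_reduction(pred: Sequence[bool]) -> float:
    """Fraction of records excluded, i.e. records a human no longer needs to read."""
    return 1 - sum(pred) / len(pred)

# Toy example: two hypothetical LLM screeners against gold-standard labels
llm_a = [True, False, True, False, False, True]
llm_b = [True, True, False, False, False, False]
truth = [True, True, True, False, False, False]
combined = or_ensemble([llm_a, llm_b])
print(sensitivity(combined, truth), precision(combined, truth), workload_reduction(combined))
```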

Language: English

Cited by

1

Integrating Artificial Intelligence into Causal Research in Epidemiology
Ellicott C. Matthay, Daniel B. Neill, Andrea R. Titus

et al.

Current Epidemiology Reports, Journal year: 2025, Issue: 12(1)

Published: March 24, 2025

Language: English

Cited by

1

Opportunities, challenges and risks of using artificial intelligence for evidence synthesis
Waldemar Siemens, Erik von Elm, Harald Binder

et al.

BMJ evidence-based medicine, Journal year: 2025, Issue: unknown, pp. bmjebm - 113320

Published: Jan. 9, 2025

Language: English

Cited by

0

GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews
Takehiko Oami, Yohei Okada, Taka‐aki Nakada

et al.

JMIR Medical Informatics, Journal year: 2025, Issue: 13, pp. e64682 - e64682

Published: March 12, 2025

Abstract: This study demonstrated that while GPT-4 Turbo had superior specificity compared with GPT-3.5 Turbo (0.98 vs 0.51) and comparable sensitivity (0.85 vs 0.83), GPT-3.5 Turbo processed 100 studies faster (0.9 min vs 1.6 min) in citation screening for systematic reviews, suggesting that GPT-4 Turbo may be more suitable due to its higher specificity and highlighting the potential of large language models for optimizing literature selection.
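
For reference, the sensitivity and specificity figures quoted above follow the standard confusion-matrix definitions; the sketch below uses hypothetical counts (not the study's data) to show the arithmetic.

```python
# Sensitivity and specificity from screening counts (hypothetical numbers for illustration).
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of truly relevant records the model included."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of truly irrelevant records the model excluded."""
    return tn / (tn + fp)

# Hypothetical counts for 100 screened studies
tp, fn, fp, tn = 17, 3, 1, 79
print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.85
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.99
```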

Language: English

Cited by

0

AI‐Empowered Evidence‐Based Research and Clinical Decision‐Making
Xufei Luo, Long Ge, Lu Zhang

et al.

Journal of Evidence-Based Medicine, Journal year: 2025, Issue: 18(1)

Published: March 1, 2025

Language: English

Cited by

0

Large Language Models and Their Applications in Drug Discovery and Development: A Primer
James Lu, Keunwoo Choi, Maksim Eremeev

et al.

Clinical and Translational Science, Journal year: 2025, Issue: 18(4)

Published: April 1, 2025

Abstract: Large language models (LLMs) have emerged as powerful tools in many fields, including clinical pharmacology and translational medicine. This paper aims to provide a comprehensive primer on the applications of LLMs in these disciplines. We explore the fundamental concepts of LLMs, their potential roles in drug discovery and development processes ranging from facilitating target identification to aiding preclinical research and clinical trial analysis, and practical use cases such as assisting with medical writing and accelerating analytical workflows in quantitative pharmacology. By the end of this paper, clinical pharmacologists and translational scientists should have a clearer understanding of how to leverage LLMs to enhance their efforts.

Language: English

Cited by

0

Large language model-generated clinical practice guideline for appendicitis

Amy Boyle, Bright Huo, Patricia Sylla

et al.

Surgical Endoscopy, Journal year: 2025, Issue: unknown

Published: April 18, 2025

Language: English

Cited by

0
