Loon Lens 1.0 Validation: Agentic AI for Title and Abstract Screening in Systematic Literature Reviews DOI Creative Commons
Ghayath Janoudi, Mara Uzun,

Mia Jurdana

et al.

medRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: Sep. 6, 2024

Abstract Introduction: Systematic literature reviews (SLRs) are critical for informing clinical research and practice, but they are time-consuming and resource-intensive, particularly during Title and Abstract (TiAb) screening. Loon Lens, an autonomous, agentic AI platform, streamlines TiAb screening without the need for human reviewers to conduct any screening. Methods: This study validates Loon Lens against human reviewer decisions across eight SLRs conducted by Canada’s Drug Agency, covering a range of drugs and eligibility criteria. A total of 3,796 citations were retrieved, with human reviewers identifying 287 (7.6%) citations for inclusion. Loon Lens autonomously screened the same citations based on the provided inclusion and exclusion criteria. Metrics such as accuracy, recall, precision, F1 score, specificity, and negative predictive value (NPV) were calculated. Bootstrapping was applied to compute 95% confidence intervals. Results: Loon Lens achieved an accuracy of 95.5% (95% CI: 94.8–96.1), with recall at 98.95% (95% CI: 97.57–100%) and specificity at 95.24% (95% CI: 94.54–95.89%). Precision was lower at 62.97% (95% CI: 58.39–67.27%), suggesting that Loon Lens included more citations for full-text screening compared to human reviewers. The F1 score was 0.770 (95% CI: 0.734–0.802), indicating a strong balance between precision and recall. Conclusion: Loon Lens demonstrates the ability to screen autonomously, with substantial potential for reducing the time and cost associated with manual or semi-autonomous screening in SLRs. While improvements in precision are needed, the platform offers a scalable, autonomous solution for systematic reviews. Access is available upon request at https://loonlens.com/ .
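The metrics and bootstrapped 95% confidence intervals described in this abstract can be sketched as follows. This is a minimal illustration of percentile bootstrapping over per-citation (truth, prediction) pairs; the labels below are hypothetical stand-ins, not the study's data.

```python
import random

def screening_metrics(pairs):
    """Screening metrics from (truth, prediction) pairs, 1 = include, 0 = exclude."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def bootstrap_ci(pairs, metric, n_boot=2000, seed=0):
    """Percentile bootstrap: resample citations with replacement and take
    the 2.5th and 97.5th percentiles of the recomputed metric."""
    rng = random.Random(seed)
    stats = sorted(
        screening_metrics([rng.choice(pairs) for _ in pairs])[metric]
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

The percentile bootstrap is one common way to obtain such intervals; the abstract does not specify which bootstrap variant was used.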

Language: English

High-performance automated abstract screening with large language model ensembles DOI Creative Commons
Rohan Sanghera, Arun James Thirunavukarasu,

Marc Khoury

et al.

Journal of the American Medical Informatics Association, Journal year: 2025, Issue: unknown

Published: Mar. 22, 2025

Abstract Objective: Abstract screening is a labor-intensive component of systematic review, involving repetitive application of inclusion and exclusion criteria to a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening. Materials and Methods: LLMs (GPT-3.5 Turbo, GPT-4, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best-performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119,695). Results: On the development dataset, LLMs exhibited performance superior to human researchers in terms of sensitivity (LLMmax 1.000, humanmax 0.775), precision (LLMmax 0.927, humanmax 0.911), and balanced accuracy (LLMmax 0.904, humanmax 0.865). When evaluated on the comprehensive dataset, sensitivity remained consistent (range 0.756-1.000) but precision diminished (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity, with maximal precision of 0.458 on the development dataset decreasing to 0.1450 over the comprehensive dataset, conferring workload reductions ranging between 37.55% and 99.11%. Discussion: Automated abstract screening can reduce workload while maintaining quality. Performance variation highlights the importance of domain-specific validation before autonomous deployment. Ensembles with human oversight of all records can achieve similar benefits. Conclusion: Automated abstract screening may reduce labor and cost with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.

Language: English

Cited

1

A comprehensive evaluation of large language models in mining gene relations and pathway knowledge DOI Open Access
Muhammad S. Azam, Yibo Chen, Micheal Olaolu Arowolo

et al.

Quantitative Biology, Journal year: 2024, Issue: 12(4), pp. 360–374

Published: Jun. 21, 2024

Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs), trained on extensive text corpora, contain rich biological information, and they can be mined as a knowledge graph. This study assesses 21 LLMs, both application programming interface (API)-based and open-source models, for their capacities in retrieving biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based GPT-4 and Claude-Pro showed superior performance, with an F1 score of 0.4448 and 0.4386 for gene regulatory relation prediction, and a Jaccard similarity index of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; Falcon-180b and llama2-7b had the highest F1 scores of 0.2787 and 0.1923 for gene regulatory relations, with KEGG pathway recognition scores of 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our work suggests that LLMs are informative for gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. The code is publicly available at the GitHub website (Muh-aza).
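The two headline metrics here, F1 over predicted regulatory relations and Jaccard similarity over pathway gene sets, are standard set comparisons. A minimal sketch, with placeholder gene names rather than the paper's KEGG data:

```python
def f1_over_relations(predicted, reference):
    """F1 between predicted and reference sets of (gene, relation, gene) triples."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def jaccard(a, b):
    """Jaccard similarity between two gene sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0
```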

Language: English

Cited

5

An Informatics Framework for Accelerating Digital Health Technology Enabled Randomized Controlled Trial Candidate Guideline Item Development DOI

Tinsley R. Harrison,

Di Hu, Heling Jia

et al.

Published: Jan. 1, 2025

Language: English

Cited

0

How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models DOI Creative Commons
Maria Teresa Colangelo, Stefano Guizzardi,

Marco Meleti

et al.

BioMedInformatics, Journal year: 2025, Issue: 5(1), pp. 15–15

Published: Mar. 11, 2025

Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall screening efficiency, and discuss metrics (accuracy, precision, recall, F1 score) to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate the hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage screening pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying rigorous evaluation, researchers can optimize LLM-based screening processes, allowing faster and more comprehensive evidence synthesis across biomedical disciplines.
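The soft/strict distinction can be made concrete as two prompt templates built from the same PICO criteria. The wording and the periodontics criteria below are illustrative assumptions, not the prompts evaluated in the paper:

```python
# Hypothetical PICO criteria for a periodontics screening question.
PICO = {
    "Population": "adults with periodontitis",
    "Intervention": "laser-assisted therapy",
    "Comparison": "conventional scaling and root planing",
    "Outcome": "probing depth reduction",
}

criteria = "\n".join(f"- {k}: {v}" for k, v in PICO.items())

# Soft prompt: favors recall; undecidable criteria do not exclude.
SOFT_PROMPT = (
    "Screen the abstract below against these PICO criteria:\n"
    f"{criteria}\n"
    "Answer INCLUDE unless the abstract explicitly fails a criterion; "
    "when a criterion is not mentioned, give the article the benefit "
    "of the doubt. Answer with INCLUDE or EXCLUDE only.\n\n"
    "Abstract: {abstract}"
)

# Strict prompt: favors precision; every criterion must be evidenced.
STRICT_PROMPT = (
    "Screen the abstract below against these PICO criteria:\n"
    f"{criteria}\n"
    "Answer INCLUDE only if the abstract provides explicit evidence "
    "for every criterion; otherwise answer EXCLUDE. "
    "Answer with INCLUDE or EXCLUDE only.\n\n"
    "Abstract: {abstract}"
)
```

In use, `SOFT_PROMPT.format(abstract=...)` fills in the candidate abstract; the only change between the two templates is the decision rule, which is what shifts the recall/precision balance.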

Language: English

Cited

0

GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews DOI Creative Commons
Takehiko Oami, Yohei Okada, Taka‐aki Nakada

et al.

JMIR Medical Informatics, Journal year: 2025, Issue: 13, pp. e64682–e64682

Published: Mar. 12, 2025

Abstract This study demonstrated that GPT-4 Turbo had superior specificity compared to GPT-3.5 Turbo (0.98 vs 0.51), comparable sensitivity (0.85 vs 0.83), and processed 100 studies faster (0.9 min vs 1.6 min) in citation screening for systematic reviews, suggesting that GPT-4 Turbo may be more suitable due to its higher specificity and highlighting the potential of large language models for optimizing literature selection.

Language: English

Cited

0

Validation of large language models (Llama 3 and ChatGPT-4o mini) for title and abstract screening in biomedical systematic reviews DOI Creative Commons
Adriana López‐Pineda, Rauf Nouni-García,

Álvaro Carbonell-Soliva

et al.

Research Synthesis Methods, Journal year: 2025, Issue: unknown, pp. 1–11

Published: Mar. 24, 2025

Abstract With the increasing volume of scientific literature, there is a need to streamline the screening process for titles and abstracts in systematic reviews, reduce the workload of reviewers, and minimize errors. This study validated artificial intelligence (AI) tools, specifically Llama 3 70B via Groq’s application programming interface (API) and ChatGPT-4o mini via OpenAI’s API, for automating this process in biomedical research. It compared these AI tools with human reviewers using 1,081 articles after duplicate removal. Each model was tested in three configurations to assess sensitivity, specificity, predictive values, and likelihood ratios. The Llama model’s LLA_2 configuration achieved 77.5% sensitivity, 91.4% specificity, and 90.2% accuracy, with a positive predictive value (PPV) of 44.3% and a negative predictive value (NPV) of 97.9%. The CHAT_2 configuration showed 56.2% sensitivity, 95.1% specificity, and 92.0% accuracy, with a PPV of 50.6% and an NPV of 96.1%. Both models demonstrated strong specificity, with CHAT_2 having higher overall accuracy. Despite promising results, manual validation remains necessary to address false positives and negatives, ensuring that no important studies are overlooked. This study suggests that AI can significantly enhance the efficiency and accuracy of screening, potentially revolutionizing not only biomedical research but also other fields requiring extensive literature reviews.
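The predictive values and likelihood ratios reported in this abstract all follow from a single 2x2 confusion matrix. A sketch with made-up counts (not the study's data):

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Screening-test summary statistics from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),        # positive predictive value
        "npv": tn / (tn + fn),        # negative predictive value
        "lr_pos": sens / (1 - spec),  # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,  # negative likelihood ratio
    }
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on how rare included articles are in the collection, which is why a highly specific screener can still have a modest PPV.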

Language: English

Cited

0

Uncovering new psychoactive substances research trends using large language model-assisted text mining (LATeM) DOI Creative Commons
Yoshiyuki Kobayashi, Takumi Uchida,

Itsuki Kageyama

et al.

Journal of Hazardous Materials Advances, Journal year: 2025, Issue: unknown, pp. 100700–100700

Published: Mar. 1, 2025

Language: English

Cited

0

Accelerating Disease Model Parameter Extraction: An LLM-Based Ranking Approach to Select Initial Studies for Literature Review Automation DOI Creative Commons

Masood Sujau,

Masako Wada, Émilie Vallée

et al.

Machine Learning and Knowledge Extraction, Journal year: 2025, Issue: 7(2), pp. 28–28

Published: Mar. 26, 2025

As climate change transforms our environment and human intrusion into natural ecosystems escalates, there is a growing demand for disease spread models to forecast and plan for the next zoonotic disease outbreak. Accurate parametrization of these models requires data from diverse sources, including the scientific literature. Despite the abundance of publications, manual data extraction via systematic literature reviews remains a significant bottleneck, requiring extensive time and resources, and is susceptible to human error. This study examines the application of a large language model (LLM) as an assessor for screening prioritisation in climate-sensitive disease research. By framing the selection criteria of articles as a question–answer task and utilising zero-shot chain-of-thought prompting, the proposed method achieves a saving of at least 70% of the work effort compared to manual screening at a recall level of 95% (NWSS@95%). This was validated across four datasets containing distinct diseases and a critical climate variable (rainfall). The approach additionally produces explainable AI rationales for each ranked article. The effectiveness of the approach across multiple datasets demonstrates its potential for broad application in systematic literature reviews. The substantial reduction in work effort, along with the provision of explainable rationales, marks an important step toward automated parameter extraction.
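The abstract reports a normalised work-saved metric, NWSS@95%. The unnormalised quantity it builds on, work saved over sampling at a target recall (WSS@95%), can be computed from a ranked list as follows; the ranking would come from the LLM assessor, and the labels below are placeholders:

```python
def wss_at_recall(ranked_relevance, target_recall=0.95):
    """Work saved over random sampling at a target recall level.

    ranked_relevance: 0/1 relevance labels in the order the model ranks
    the articles (best first). Returns the fraction of the collection
    that need not be read once the target recall is reached, minus the
    (1 - recall) a random reader would forgo anyway."""
    total_relevant = sum(ranked_relevance)
    needed = target_recall * total_relevant
    found = 0
    for read, label in enumerate(ranked_relevance, start=1):
        found += label
        if found >= needed:
            # WSS@r = (N - read) / N - (1 - r)
            return ((len(ranked_relevance) - read) / len(ranked_relevance)
                    - (1 - target_recall))
    return 0.0
```

The normalised variant used in the paper rescales this against the best achievable saving for the dataset; the exact normalisation is not given in the abstract, so it is omitted here.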

Language: English

Cited

0
