High-performance automated abstract screening with large language model ensembles
Journal of the American Medical Informatics Association,
Journal year: 2025, Issue: unknown. Published: March 22, 2025
Abstract
Objective
Abstract screening is a labor-intensive component of systematic review, involving repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.
Materials and Methods
LLMs (GPT-3.5 Turbo, GPT-4, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on comprehensive replicated search results (n = 119,695).
Results
On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, sensitivity remained consistent (range 0.756-1.000) but precision diminished (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles achieved perfect sensitivity, with maximal precision of 0.458 decreasing to 0.1450 over the comprehensive dataset, while conferring workload reductions ranging between 37.55% and 99.11%.
Discussion
Automated abstract screening can reduce workload while maintaining review quality. Performance variation highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles achieve similar benefits while retaining human oversight over all records.
Conclusion
LLMs may reduce the labor and cost of abstract screening with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
Language: English
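The LLM-human and LLM-LLM ensembling described in the abstract above amounts to a logical OR over binary include/exclude votes: a record is excluded only when every member excludes it, which drives sensitivity up at the expense of precision. A minimal sketch, with illustrative vote lists rather than the study's data:

```python
# OR-ensemble of two abstract screeners: a record advances to full-text
# review if EITHER member votes to include it. This maximizes sensitivity
# (a record is excluded only when both members exclude it) at the cost
# of precision. The vote lists below are illustrative, not the study's data.

def or_ensemble(votes_a, votes_b):
    """Combine two binary include/exclude vote lists with logical OR."""
    return [a or b for a, b in zip(votes_a, votes_b)]

def sensitivity(decisions, truth):
    """Fraction of truly relevant records that were retained."""
    retained_relevant = sum(d for d, t in zip(decisions, truth) if t)
    return retained_relevant / max(1, sum(truth))

# Illustrative votes over six records (True = include):
llm   = [True, False, True,  True, False, False]
human = [False, True, False, True, False, False]
truth = [True, True,  True,  True, False, False]

combined = or_ensemble(llm, human)
# Each member alone misses a relevant record; the OR-ensemble catches all four.
```

The same construction extends to LLM-LLM pairs; the trade-off is that every false positive from either member also survives, which is why the ensemble precision reported above falls on the imbalanced comprehensive dataset.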
A comprehensive evaluation of large language models in mining gene relations and pathway knowledge
Quantitative Biology,
Journal year: 2024, Issue: 12(4), pp. 360-374. Published: June 21, 2024
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a knowledge graph. This study assesses 21 LLMs, both application programming interface (API)-based and open-source models, in their capacities for retrieving biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based models GPT-4 and Claude-Pro showed superior performance, with an F1 score of 0.4448 and 0.4386 for gene relation prediction, and a Jaccard similarity index of 0.2778 and 0.2657 for KEGG pathway components, respectively. Open-source models lagged behind their API-based counterparts, with Falcon-180b and llama2-7b achieving the highest F1 scores of 0.2787 and 0.1923 for gene relations; for KEGG pathway recognition, the Jaccard similarity index was 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our work suggests that LLMs are informative for gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
Language: English
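The KEGG evaluation above scores a model's predicted pathway membership against the reference set with the Jaccard similarity index, i.e. intersection over union. A minimal sketch, with illustrative gene sets rather than data from the paper:

```python
# Jaccard similarity index between a predicted and a reference gene set:
# |intersection| / |union|. Gene symbols below are illustrative examples,
# not drawn from the paper's KEGG benchmark.

def jaccard(predicted, reference):
    """Jaccard similarity of two collections treated as sets."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty sets are conventionally identical
    return len(predicted & reference) / len(predicted | reference)

# 2 shared genes out of 5 distinct genes overall -> 0.4
score = jaccard({"TP53", "MDM2", "CDKN1A"}, {"TP53", "MDM2", "ATM", "CHEK2"})
```

A score near the paper's reported 0.27-0.28 thus means roughly a quarter of the combined predicted-plus-reference gene list is shared.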
An Informatics Framework for Accelerating Digital Health Technology Enabled Randomized Controlled Trial Candidate Guideline Item Development
Published: Jan. 1, 2025
Language: English
How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models
BioMedInformatics,
Journal year: 2025, Issue: 5(1), p. 15. Published: March 11, 2025
Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall screening efficiency, and discuss the metrics (accuracy, F1 score) used to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate the hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage screening pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying rigorous evaluation, researchers can optimize LLM-based screening processes, allowing faster and more comprehensive evidence synthesis across biomedical disciplines.
Language: English
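The “soft” prompting strategy the paper describes — accept an abstract unless it explicitly fails a criterion — can be sketched as a PICO-to-prompt template. The criteria and wording below are hypothetical illustrations, not the paper's actual prompts:

```python
# Sketch of converting PICO criteria into a "soft" screening prompt:
# the model is told to INCLUDE unless a criterion explicitly fails.
# The PICO entries and template wording are illustrative assumptions,
# not the paper's actual prompts.

PICO = {
    "Population": "adults with periodontitis",
    "Intervention": "non-surgical periodontal therapy",
    "Comparison": "any control or alternative treatment",
    "Outcome": "clinical attachment level or probing depth",
}

def build_soft_prompt(pico, abstract):
    """Render PICO criteria plus an abstract into a recall-oriented prompt."""
    criteria = "\n".join(f"- {key}: {value}" for key, value in pico.items())
    return (
        "You are screening abstracts for a systematic review.\n"
        f"Inclusion criteria (PICO):\n{criteria}\n\n"
        "Answer INCLUDE unless the abstract explicitly fails a criterion; "
        "if information is missing, err on the side of INCLUDE.\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer (INCLUDE or EXCLUDE):"
    )

prompt = build_soft_prompt(PICO, "A randomized trial of scaling and root planing...")
```

A “strict” variant would instead instruct the model to answer EXCLUDE unless every criterion is explicitly satisfied, trading recall for precision as the trade-off analysis above describes.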
GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews
JMIR Medical Informatics,
Journal year: 2025, Issue: 13, e64682. Published: March 12, 2025
Abstract
This study demonstrated that while GPT-4 Turbo had superior specificity when compared to GPT-3.5 Turbo (0.98 vs 0.51), as well as comparable sensitivity (0.85 vs 0.83), and processed 100 studies faster (0.9 min vs 1.6 min) in citation screening for systematic reviews, suggesting that GPT-4 Turbo may be more suitable due to its higher specificity, and highlighting the potential of large language models in optimizing literature selection.
Language: English
Validation of large language models (Llama 3 and ChatGPT-4o mini) for title and abstract screening in biomedical systematic reviews
Research Synthesis Methods,
Journal year: 2025, Issue: unknown, pp. 1-11. Published: March 24, 2025
Abstract
With the increasing volume of scientific literature, there is a need to streamline the screening process for titles and abstracts in systematic reviews, reduce the workload for reviewers, and minimize errors. This study validated artificial intelligence (AI) tools, specifically Llama 3 70B via Groq’s application programming interface (API) and ChatGPT-4o mini via OpenAI’s API, for automating this process in biomedical research. It compared these AI tools with human reviewers using 1,081 articles retained after duplicate removal. Each model was tested in three configurations to assess sensitivity, specificity, predictive values, and likelihood ratios. The Llama model’s LLA_2 configuration achieved 77.5% sensitivity, 91.4% specificity, and 90.2% accuracy, with a positive predictive value (PPV) of 44.3% and a negative predictive value (NPV) of 97.9%. The CHAT_2 configuration showed 56.2% sensitivity, 95.1% specificity, and 92.0% accuracy, with a PPV of 50.6% and an NPV of 96.1%. Both models demonstrated strong specificity, with CHAT_2 having higher overall accuracy. Despite promising results, manual validation remains necessary to address false positives and false negatives, ensuring that no important studies are overlooked. The study suggests that AI can significantly enhance the efficiency and accuracy of literature screening, potentially revolutionizing not only biomedical research but also other fields requiring extensive literature reviews.
Language: English
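The figures reported above (sensitivity, specificity, accuracy, PPV, NPV) all derive from the 2×2 confusion matrix of model decisions against the human reference. A minimal sketch, with illustrative counts rather than the study's data:

```python
# Screening metrics from a 2x2 confusion matrix of model vs. human
# reference decisions. The counts passed in below are illustrative,
# not the study's actual tallies.

def screening_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics for abstract screening."""
    return {
        "sensitivity": tp / (tp + fn),            # relevant records caught
        "specificity": tn / (tn + fp),            # irrelevant records rejected
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Illustrative counts: 100 relevant and 900 irrelevant records.
m = screening_metrics(tp=80, fp=100, fn=20, tn=800)
```

Note how class imbalance shapes the pattern reported above: with few relevant records, even a small false-positive rate yields a modest PPV while the NPV stays high.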
Uncovering new psychoactive substances research trends using large language model-assisted text mining (LATeM)
Journal of Hazardous Materials Advances,
Journal year: 2025, Issue: unknown, 100700. Published: March 1, 2025
Language: English
Accelerating Disease Model Parameter Extraction: An LLM-Based Ranking Approach to Select Initial Studies for Literature Review Automation
Machine Learning and Knowledge Extraction,
Journal year: 2025, Issue: 7(2), p. 28. Published: March 26, 2025
As climate change transforms our environment and human intrusion into natural ecosystems escalates, there is a growing demand for disease spread models to forecast and plan for the next zoonotic outbreak. Accurate parametrization of these models requires data from diverse sources, including the scientific literature. Despite the abundance of publications, manual extraction via systematic literature reviews remains a significant bottleneck, requiring extensive time and resources, and is susceptible to error. This study examines the application of a large language model (LLM) as an assessor for screening prioritisation in climate-sensitive research. By framing the selection criteria of articles as a question–answer task and utilising zero-shot chain-of-thought prompting, the proposed method achieves a saving of at least 70% of work effort compared to manual screening at a recall level of 95% (NWSS@95%). The approach was validated across four datasets containing distinct diseases and a critical climate variable (rainfall), and it additionally produces explainable AI rationales for each ranked article. Its effectiveness across multiple datasets demonstrates the potential for broad application in systematic literature reviews. The substantial reduction in work effort, along with the provision of explainable rationales, marks an important step toward automated parameter extraction.
Language: English
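The work-saved figure above can be illustrated with a simplified ranking experiment: order records by an assessor score, screen from the top, stop once 95% of the relevant records are found, and count the fraction of the list never screened. (The paper's NWSS additionally normalises against random sampling; this simplified version and the scores below are illustrative assumptions.)

```python
# Simplified work-saved-at-recall for ranked screening: screen records in
# descending score order and stop once the target recall is reached.
# The paper's NWSS further normalises against random sampling; this
# sketch and its scores are illustrative assumptions.

def work_saved_at_recall(scores, relevant, target_recall=0.95):
    """Fraction of the list never screened when stopping at target recall."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_relevant = sum(relevant)
    found, screened = 0, 0
    for i in order:
        screened += 1
        found += relevant[i]
        if found >= target_recall * total_relevant:
            break
    return 1 - screened / len(scores)

# 10 records, 2 relevant; a good ranker places them near the top,
# so only 3 of 10 records need screening to reach 95% recall.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
relevant = [1,   0,   1,   0,   0,   0,   0,   0,   0,   0]
saving = work_saved_at_recall(scores, relevant)
```

With a random ranking the expected saving at 95% recall is close to zero, which is why normalising against sampling, as the paper does, isolates the ranker's contribution.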
Loon Lens 1.0 Validation: Agentic AI for Title and Abstract Screening in Systematic Literature Reviews
medRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown. Published: Sep. 6, 2024
Abstract
Introduction
Systematic literature reviews (SLRs) are critical for informing clinical research and practice, but they are time-consuming and resource-intensive, particularly during title and abstract (TiAb) screening. Loon Lens, an autonomous, agentic AI platform, streamlines TiAb screening without the need for human reviewers to conduct any screening.
Methods
This study validates Loon Lens against human reviewer decisions across eight SLRs conducted by Canada’s Drug Agency, covering a range of drugs and eligibility criteria. A total of 3,796 citations were retrieved, with human reviewers identifying 287 (7.6%) for inclusion. Loon Lens autonomously screened the same citations based on the provided inclusion and exclusion criteria. Metrics such as accuracy, recall, precision, F1 score, specificity, and negative predictive value (NPV) were calculated. Bootstrapping was applied to compute 95% confidence intervals.
Results
Loon Lens achieved an accuracy of 95.5% (95% CI: 94.8–96.1), with recall at 98.95% (95% CI: 97.57–100%) and specificity of 95.24% (95% CI: 94.54–95.89%). Precision was lower at 62.97% (95% CI: 58.39–67.27%), suggesting that Loon Lens included more studies for full-text screening compared to human reviewers. The F1 score was 0.770 (95% CI: 0.734–0.802), indicating a strong balance between precision and recall.
Conclusion
Loon Lens demonstrates substantial potential for reducing the time and cost associated with manual or semi-autonomous TiAb screening in SLRs. While improvements in precision are needed, the platform offers a scalable, autonomous solution for systematic reviews. Access is available upon request at https://loonlens.com/.
Language: English
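The confidence intervals reported above come from bootstrapping: resampling the screening decisions with replacement and taking percentiles of the recomputed metric. A minimal percentile-bootstrap sketch for accuracy, using illustrative correctness flags rather than the study's records:

```python
# Percentile bootstrap for a screening metric (here: accuracy over a list
# of 0/1 "decision was correct" flags). Resample with replacement, recompute
# the metric, and read off the 2.5th and 97.5th percentiles. The flags
# below are illustrative, chosen to mimic a ~95.5% accuracy, not real data.
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile (1 - alpha) CI for the mean of a list of 0/1 flags."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

flags = [1] * 955 + [0] * 45  # 1000 decisions at 95.5% accuracy
lo, hi = bootstrap_ci(flags)
```

The same routine applies unchanged to recall, precision, or specificity by bootstrapping over the relevant subset of decisions.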