ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52, Published: Aug. 1, 2023
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from informatics, medicine, and computer science communities propose various PLMs trained on various biomedical datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances and applications in this field but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations for PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them systematically from various perspectives. Plus, their applications in biomedical downstream tasks are exhaustively discussed. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.
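A minimal sketch, not taken from the survey, of the basic downstream use of a biomedical PLM the abstract refers to: loading a pretrained checkpoint and extracting contextual embeddings for a biomedical sentence. It assumes the Hugging Face transformers and PyTorch libraries; the model ID is one publicly available biomedical BERT checkpoint and can be swapped for any comparable PLM.

```python
# Sketch: obtain sentence embeddings from a biomedical PLM for downstream tasks.
# Assumes `transformers` and `torch` are installed; the checkpoint below is one
# example of a public biomedical BERT model, not a recommendation from the survey.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

sentence = "The patient was prescribed metformin for type 2 diabetes."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one fixed-size vector usable by a downstream classifier.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 768])
```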
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown, Published: Dec. 8, 2023
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture, and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and variable performance across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) in a unified benchmark suite. We open source the corresponding codebase, MSAs, structures, and model predictions, and develop a user-friendly website that facilitates data access and analysis.
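A minimal sketch, under stated assumptions, of the zero-shot evaluation idea described above: rank-correlating a model's mutation scores against measured deep mutational scanning (DMS) fitness values. This is not ProteinGym's actual codebase; the file name and column names are hypothetical placeholders.

```python
# Sketch: zero-shot fitness evaluation via Spearman rank correlation.
# The CSV file and its column names are hypothetical stand-ins for a DMS assay table.
import pandas as pd
from scipy.stats import spearmanr

assay = pd.read_csv("example_dms_assay.csv")   # hypothetical standardized DMS assay
model_scores = assay["model_score"]            # zero-shot scores from some protein model
measured_fitness = assay["DMS_score"]          # experimentally measured fitness values

rho, _ = spearmanr(model_scores, measured_fitness)
print(f"Zero-shot Spearman correlation: {rho:.3f}")
```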
Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1), Published: Jan. 1, 2023
Abstract
Summary: Transformer-based language models, including the vanilla transformer, BERT, and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers, and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While the applications are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense, and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians, and biologists will be brought together to foster future development and inspire novel applications unattainable by traditional methods.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
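A minimal sketch of the scaled dot-product self-attention operation at the core of the transformer architecture the review describes; the shapes and values are toy examples, not drawn from the paper.

```python
# Sketch: scaled dot-product self-attention, the central building block of transformers.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model); attention weights sum to 1 over the keys.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(1, 5, 16)                      # toy "sequence" of 5 tokens, 16-dim embeddings
out = scaled_dot_product_attention(x, x, x)    # self-attention: queries, keys, values share the input
print(out.shape)                               # torch.Size([1, 5, 16])
```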
Proceedings of the National Academy of Sciences, Journal Year: 2023, Volume and Issue: 120(24), Published: June 8, 2023
Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current computational techniques fail to simultaneously meet these goals, often sacrificing performance on one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging the advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, the embeddings are interpretable, which enables us to visualize the embedding space and use it to characterize the function of cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly sensitive in silico screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
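A minimal sketch, not ConPLex's released code, of the core idea the abstract describes: scoring a drug-target pair by the proximity of their coembedded representations. The embedding functions below are hypothetical placeholders standing in for the pretrained protein language model and the molecule encoder projected into a shared space.

```python
# Sketch: distance-based drug-target interaction scoring in a shared embedding space.
# The two `embed_*` functions are placeholders; a real system would use learned encoders.
import numpy as np

def embed_protein(sequence: str) -> np.ndarray:
    # Placeholder for projected protein-language-model features (fixed-size vector).
    rng = np.random.default_rng(abs(hash(sequence)) % 2**32)
    return rng.standard_normal(128)

def embed_compound(smiles: str) -> np.ndarray:
    # Placeholder for a small-molecule encoder mapped into the same 128-dim space.
    rng = np.random.default_rng(abs(hash(smiles)) % 2**32)
    return rng.standard_normal(128)

def interaction_score(sequence: str, smiles: str) -> float:
    # Higher cosine similarity (smaller angular distance) -> higher predicted binding.
    p, c = embed_protein(sequence), embed_compound(smiles)
    return float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c)))

print(interaction_score("MKTAYIAKQR", "CC(=O)OC1=CC=CC=C1C(=O)O"))
```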