Molecules, Journal Year: 2024, Volume and Issue: 29(4), P. 903 - 903, Published: Feb. 18, 2024
Drug discovery plays a critical role in advancing human health by developing new medications and treatments to combat diseases. How to accelerate the pace and reduce the costs of drug discovery has long been a key concern for the pharmaceutical industry. Fortunately, by leveraging advanced algorithms, computational power and biological big data, artificial intelligence (AI) technology, especially machine learning (ML), holds the promise of making the hunt for new drugs more efficient. Recently, the Transformer-based models that have achieved revolutionary breakthroughs in natural language processing have sparked a new era of their applications in drug discovery. Herein, we introduce the latest applications of ML in drug discovery, highlight the potential of Transformer-based models, and discuss the future prospects and challenges in the field.
Nature Communications, Journal Year: 2022, Volume and Issue: 13(1), Published: July 27, 2022
Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins display natural amino acid propensities, while disorder predictions indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2 sequences yields well-folded, non-idealized structures with embodiments of large loops and reveals topologies not captured in current structure databases. ProtGPT2 generates sequences in a matter of seconds and is freely available.
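The generation step described above is autoregressive: the model draws one residue at a time, conditioning each draw on the sequence produced so far. A minimal sketch of that sampling loop, with an invented toy distribution standing in for the trained Transformer (the real ProtGPT2 computes these probabilities from its network weights):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def next_token_probs(context):
    """Toy stand-in for a language model: a probability distribution
    over the 20 amino acids given the context generated so far.
    The slight repeat bias is invented, for illustration only."""
    probs = {aa: 1.0 for aa in AMINO_ACIDS}
    if context:
        probs[context[-1]] += 2.0
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def generate(length, seed=0):
    """Autoregressive sampling: repeatedly query the model for the
    next-residue distribution and draw one residue from it."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = next_token_probs(seq)
        seq += rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
    return seq

protein = generate(60)
```

Swapping `next_token_probs` for a real model's output distribution (and adding temperature or top-k filtering to the draw) turns this skeleton into the usual language-model sampling procedure.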
Nucleic Acids Research, Journal Year: 2022, Volume and Issue: 50(W1), P. W228 - W234, Published: April 19, 2022
The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning, enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
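Multi-localization prediction as described above is a multi-label problem: a protein may belong to several compartments at once, so the final layer uses an independent sigmoid per class rather than a softmax. A minimal sketch of that decision step, with invented class names and scores (the real DeepLoc 2.0 derives its scores from a pre-trained protein language model):

```python
import math

LOCATIONS = ["Cytoplasm", "Nucleus", "Mitochondrion", "Extracellular"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_locations(logits, threshold=0.5):
    """Multi-label decision: independent sigmoid per class, keeping
    every class whose probability clears the threshold. A softmax
    would force exactly one label, which is wrong for proteins that
    occupy several compartments."""
    probs = {loc: sigmoid(z) for loc, z in zip(LOCATIONS, logits)}
    return [loc for loc, p in probs.items() if p >= threshold]

# Hypothetical per-class scores for one protein
print(predict_locations([2.1, 0.4, -1.3, -3.0]))  # → ['Cytoplasm', 'Nucleus']
```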
Nature Genetics, Journal Year: 2023, Volume and Issue: 55(9), P. 1512 - 1522, Published: Aug. 10, 2023
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and in predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish ESM1b as an effective, accurate and general approach to predicting variant effects.
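A common way a masked protein language model is used for variant scoring is a log-likelihood ratio: mask the variant position, ask the model for its amino-acid distribution there, and compare the probability of the alternate residue against the reference. A self-contained sketch of that score, with an invented probability table standing in for the model's output:

```python
import math

def variant_score(probs_at_position, ref, alt):
    """Log-likelihood ratio log P(alt) - log P(ref) at a masked
    position. More negative scores mean the model finds the
    substitution less plausible than the reference residue."""
    return math.log(probs_at_position[alt]) - math.log(probs_at_position[ref])

# Invented model output for one masked position: the reference
# leucine (L) is strongly preferred, proline (P) nearly forbidden.
probs = {"L": 0.70, "M": 0.20, "P": 0.001}

mild = variant_score(probs, ref="L", alt="M")    # modest penalty
severe = variant_score(probs, ref="L", alt="P")  # strong penalty
```

Ranking variants by this score (or thresholding it) is what allows a single forward pass per position to cover every possible substitution at once.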
Protein Science, Journal Year: 2022, Volume and Issue: 31(12), Published: Nov. 11, 2022
B-cell epitope prediction tools are of great medical and commercial interest due to their practical applications in vaccine development and disease diagnostics. The introduction of protein language models (LMs), trained on unprecedentedly large datasets of protein sequences and structures, taps into a powerful numeric representation that can be exploited to accurately predict local and global protein structural features from amino acid sequences only. In this paper, we present BepiPred-3.0, a sequence-based epitope prediction tool that, by exploiting LM embeddings, greatly improves the prediction accuracy for both linear and conformational epitopes on several independent test sets. Furthermore, by carefully selecting additional input variables and the epitope residue annotation strategy, the performance was further improved, thus achieving unprecedented predictive power. Our tool can predict epitopes across hundreds of proteins in minutes. It is freely available as a web server and a standalone package at https://services.healthtech.dtu.dk/service.php?BepiPred-3.0 with a user-friendly interface to navigate the results.
Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to directly predict a variety of protein functions – Enzyme Commission (EC) numbers and Gene Ontology (GO) terms – from an unaligned amino acid sequence. This approach provides precise predictions that complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user's personal computer, with no data uploaded to remote servers. Moreover, these models place full-length protein sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read an interactive version of this paper, please visit https://google-research.github.io/proteinfer/.
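The alignment-free approach works because a convolutional network consumes the raw sequence directly: one-hot encode the residues, then slide learned filters along the chain. A pure-Python sketch of those two steps (the filter here is a hand-built lysine detector for illustration; a trained network learns its filter weights):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode a protein as an L x 20 matrix of 0/1 rows."""
    mat = []
    for aa in seq:
        row = [0.0] * 20
        row[AA_INDEX[aa]] = 1.0
        mat.append(row)
    return mat

def conv1d(mat, kernel):
    """Slide a (width x 20) filter along the sequence; each output
    is the sum of elementwise products over one window."""
    width = len(kernel)
    out = []
    for start in range(len(mat) - width + 1):
        s = 0.0
        for k in range(width):
            for c in range(20):
                s += mat[start + k][c] * kernel[k][c]
        out.append(s)
    return out

seq = "MKTAYIAKQR"  # hypothetical query sequence
x = one_hot(seq)
# A width-3 filter that counts lysines (K) in its window
kernel = [[1.0 if aa == "K" else 0.0 for aa in AMINO_ACIDS] for _ in range(3)]
activations = conv1d(x, kernel)
```

Stacking many such filters, pooling over positions, and feeding the result to a classification head is the standard CNN recipe for per-sequence labels such as EC numbers or GO terms.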
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a new class of model: the Transformer model. We review the recent use of large-scale language models in applications of predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review the shortcomings of other models and explain how the language models have proven to be a very promising way to unravel the information hidden in amino acids.
Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1), Published: Jan. 1, 2023
Abstract
Summary: The transformer-based language models, including vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While the applications are diverse and multifaceted, we identify and discuss common challenges, including the heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development, and inspire novel applications that are unattainable by traditional methods.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
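The core of the transformer structure surveyed above is scaled dot-product attention: every token scores every other token, and the output is an attention-weighted average of value vectors. A minimal sketch in pure Python, using toy 2-dimensional vectors (real models use hundreds of dimensions and multiple heads):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to every key,
    and its output is the attention-weighted average of the values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three 2-d tokens attending over themselves (self-attention)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
```

For biological sequences the tokens are residues or nucleotides rather than words, which is exactly the similarity the surveyed applications exploit.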
ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52, Published: Aug. 1, 2023
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussions. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations of PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, their applications in biomedical downstream tasks are exhaustively discussed, respectively. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.
Abstract
Large-scale artificial intelligence (AI) models such as ChatGPT have the potential to improve performance on many benchmarks and real-world tasks. However, it is difficult to develop and maintain these models because of their complexity and resource requirements. As a result, they are still inaccessible to healthcare industries and clinicians. This situation might soon be changed by advancements in graphics processing unit (GPU) programming and parallel computing. More importantly, leveraging existing large-scale AIs such as GPT-4 and Med-PaLM and integrating them into multiagent models (e.g., Visual-ChatGPT) will facilitate real-world implementations. This review aims to raise awareness of the potential applications of these models in healthcare. We provide a general overview of several advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multiagent models and multimodal embodied models. We discuss their potential medical applications, in addition to the challenges and future directions. Importantly, we stress the need to align these models with human values and goals, for example by using reinforcement learning from human feedback, to ensure that they provide accurate and personalized insights that support human decision-making and improve healthcare outcomes.