bioRxiv (Cold Spring Harbor Laboratory),
Year: 2024, Issue: unknown
Published: July 4, 2024
Abstract
Recent advancements in Transformer-based models have spurred interest in their use for biological sequence analysis. However, adapting models like BERT is challenging due to sequence length, often requiring truncation for proteomics and genomics tasks. Additionally, the advanced tokenization and relative positional encoding techniques used for long contexts in NLP are not directly transferable to DNA/RNA sequences, which require nucleotide- or character-level encodings for tasks such as 3D torsion angle prediction. To tackle these challenges, we propose an adaptive dual tokenization scheme for bioinformatics that utilizes both nucleotide-level (NUC) and efficient BPE tokenizations. Building on this tokenization, we introduce BiRNA-BERT, a 117M parameter Transformer encoder pretrained with our proposed scheme on 28 billion nucleotides across 36 million coding and non-coding RNA sequences. The representation learned by BiRNA-BERT generalizes across a range of applications, achieves state-of-the-art results on long-sequence downstream tasks, and reaches performance comparable to 6× larger models on short-sequence tasks with 27× less pre-training compute. BiRNA-BERT can dynamically adjust its tokenization strategy based on sequence lengths, utilizing NUC for shorter sequences and switching to BPE for longer ones, thereby offering, for the first time, the capability to efficiently handle arbitrarily long sequences.
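Below is a minimal sketch of the length-based switching idea this abstract describes: nucleotide-level tokens for short sequences, BPE-style merged tokens for long ones. The threshold, the toy merge table and the helper names are illustrative assumptions, not the actual BiRNA-BERT implementation.

```python
# A minimal sketch (not the official BiRNA-BERT code) of length-based
# switching between nucleotide-level (NUC) and BPE tokenization.
from typing import Dict, List, Tuple

MAX_NUC_LEN = 512  # hypothetical threshold; the paper defines its own cutoff


def nuc_tokenize(seq: str) -> List[str]:
    """Nucleotide-level tokenization: one token per base."""
    return list(seq.upper())


def bpe_tokenize(seq: str, merges: Dict[Tuple[str, str], str]) -> List[str]:
    """Greedy BPE-style compression with a toy merge table (illustrative only)."""
    tokens = list(seq.upper())
    merged = True
    while merged:
        merged = False
        out: List[str] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in merges:
                out.append(merges[(tokens[i], tokens[i + 1])])
                i += 2
                merged = True
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def adaptive_tokenize(seq: str, merges: Dict[Tuple[str, str], str]) -> List[str]:
    """Pick the tokenizer from the sequence length, as the abstract describes."""
    if len(seq) <= MAX_NUC_LEN:
        return nuc_tokenize(seq)      # fine-grained tokens for short sequences
    return bpe_tokenize(seq, merges)  # compressed tokens keep long RNAs in context


if __name__ == "__main__":
    toy_merges = {("A", "U"): "AU", ("G", "C"): "GC"}  # toy merge rules
    print(len(adaptive_tokenize("AUGCGAUC", toy_merges)))    # 8 nucleotide tokens
    print(len(adaptive_tokenize("AUGC" * 300, toy_merges)))  # 600 merged tokens
```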
ACM Computing Surveys,
Year: 2023, Issue: 56(3), pp. 1-52
Published: Aug. 1, 2023
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of these PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances and applications but also standardizes terminology and benchmarks. This article summarizes recent progress of pre-trained language models in the biomedical domain and their applications in downstream tasks. Particularly, we discuss the motivations and introduce key concepts of pre-trained language models. We then propose a taxonomy that categorizes them from multiple perspectives systematically. Plus, their applications in various biomedical downstream tasks are exhaustively discussed. Last, we illustrate limitations and future trends, which aims to provide inspiration for future research.
MedComm – Future Medicine,
Year: 2023, Issue: 2(2)
Published: May 17, 2023
Abstract
Large-scale artificial intelligence (AI) models such as ChatGPT have the potential to improve performance on many benchmarks and real-world tasks. However, it is difficult to develop and maintain these models because of their complexity and resource requirements. As a result, they are still inaccessible to healthcare industries and clinicians. This situation might soon be changed by advancements in graphics processing unit (GPU) programming and parallel computing. More importantly, leveraging existing large-scale AIs such as GPT-4 and Med-PaLM and integrating them into multiagent models (e.g., Visual-ChatGPT) will facilitate real-world implementations. This review aims to raise awareness of the applications of such models in healthcare. We provide a general overview of several advanced AI models, including language models, vision-language models, graph learning models, language-conditioned multiagent models, and multimodal embodied models. We discuss their medical applications, in addition to challenges and future directions. Importantly, we stress the need to align these models with human values and goals, such as by using reinforcement learning from human feedback, to ensure that they provide accurate and personalized insights that support decision-making and improve outcomes.
Biomolecules,
Year: 2024, Issue: 14(3), pp. 339
Published: March 12, 2024
Recent advancements in AI-driven technologies, particularly in protein structure prediction, are significantly reshaping the landscape of drug discovery and development. This review focuses on the question of how these technological breakthroughs, exemplified by AlphaFold2, are revolutionizing our understanding of the protein structure and function changes underlying cancer and improving our approaches to counter them. By enhancing the precision and speed at which drug targets are identified and drug candidates can be designed and optimized, these technologies are streamlining the entire drug development process. We explore the use of AlphaFold2 in drug development, scrutinizing its efficacy, limitations, and potential challenges. We also compare AlphaFold2 with other algorithms like ESMFold, explaining the diverse methodologies employed in this field and the practical effects of their differences for the application of specific algorithms. Additionally, we discuss the broader applications of these technologies, including the prediction of protein complex structures and the generative design of novel proteins.
Journal of Medical Internet Research,
Year: 2024, Issue: 26, e59505
Published: Aug. 20, 2024
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links the diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medicine. We anticipate that this work will spark further discussion and inspire the development of innovative approaches for the next generation of such systems.
Computational and Structural Biotechnology Journal,
Year: 2021, Issue: 19, pp. 3198-3208
Published: Jan. 1, 2021
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze the substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, sequences are regarded as sentences, while single nucleic acids/amino acids or k-mers within these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, a representation learning approach is used for this transformation process, which can be applied to biological sequences. Vectorized sequences can then be used for function and structure estimation, or as input for other probabilistic models. Considering the importance of and the growing trend in this line of research, in the present study, we reviewed the existing knowledge on NLP-based biological sequence analysis.
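As a concrete illustration of the "sequences as sentences, k-mers as words" framing described in this abstract, the sketch below splits DNA sequences into overlapping k-mers and learns skip-gram embeddings. The use of gensim, the toy corpus and all parameter values are assumptions for illustration, not methods from any specific reviewed work.

```python
# Illustrative sketch of the "sequences as sentences, k-mers as words" idea:
# split DNA sequences into overlapping k-mers and learn skip-gram embeddings.
from typing import List

from gensim.models import Word2Vec  # pip install gensim


def kmerize(seq: str, k: int = 3) -> List[str]:
    """Turn a sequence into overlapping k-mer 'words', e.g. ATGCA -> ATG, TGC, GCA."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy corpus: each sequence becomes one "sentence" of k-mer words.
sequences = ["ATGCGTACGTTAGC", "ATGCGTTCGTTAGC", "GGCATCGATCGGAA"]
sentences = [kmerize(s, k=3) for s in sequences]

# Skip-gram (sg=1) embedding of k-mers; the resulting vectors could then feed
# downstream function/structure predictors, as the review describes.
model = Word2Vec(sentences, vector_size=16, window=5, min_count=1, sg=1, epochs=50)

print(model.wv["ATG"].shape)                 # (16,) vector for the k-mer "ATG"
print(model.wv.most_similar("ATG", topn=3))  # nearest k-mers in embedding space
```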
Briefings in Bioinformatics,
Year: 2023, Issue: 24(4)
Published: May 25, 2023
Computational analysis of RNA sequences constitutes a crucial step in the field of RNA biology. As in other domains of the life sciences, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction in recent years. Historically, thermodynamics-based methods were widely employed for the prediction of RNA secondary structures; however, machine learning-based approaches have demonstrated remarkable advancements in recent years, enabling more accurate predictions. Consequently, the precision of predictions pertaining to RNA structures and interactions, such as RNA-protein interactions, has also been enhanced, making a substantial contribution to the field. Additionally, these approaches are introducing technical innovations into the analysis of RNA-small molecule interactions for RNA-targeted drug discovery and into the design of RNA aptamers, where RNA serves as its own ligand. This review will highlight recent trends in the prediction of RNA structures and the design of RNA aptamers using machine learning, deep learning and related technologies, and will discuss potential future avenues in RNA informatics.
arXiv (Cornell University),
Year: 2024, Issue: unknown
Published: Jan. 1, 2024
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have shown great performance in various tasks, especially natural language processing (NLP). LLMs typically consist of neural networks with numerous parameters, trained on large amounts of unlabeled input data using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we present a summary of the prominent LLMs used in natural language processing, such as BERT and GPT, and focus on exploring their applications at different omics levels in bioinformatics, mainly including genomics, transcriptomics, proteomics, drug discovery and single-cell analysis. Finally, the review summarizes the prospects of LLMs for solving bioinformatic problems.
Briefings in Bioinformatics,
Year: 2023, Issue: 25(1)
Published: Nov. 22, 2023
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and to gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared within their respective research community. An overview of XAI for omics is needed to highlight promising approaches and to help detect common issues. Toward this end, we conducted a systematic mapping study. To identify the relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while studies using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many gaps still apparent for XAI in omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI in clinical practice are yet to be resolved. This study outlines extant research on the topic and provides directions for researchers and practitioners.
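To make the dominant pattern reported here concrete, the sketch below applies post-hoc feature relevance (permutation importance) to a synthetic omics-style matrix with a tree-based model. The data, model choice and scikit-learn usage are illustrative assumptions and are not drawn from any of the surveyed papers.

```python
# Minimal sketch of the most common pattern reported above: post-hoc feature
# relevance on omics-style tabular data with a tree-based model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "expression matrix": 200 samples x 50 genes; only genes 0 and 1
# carry signal for the binary phenotype.
X = rng.normal(size=(200, 50))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Post-hoc feature relevance: permutation importance on held-out data.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"gene_{idx}: importance = {result.importances_mean[idx]:.3f}")
```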