Genomic language models: opportunities and challenges
Gonzalo Benegas, Chengzhong Ye, Carlos Albors, et al.
Trends in Genetics, 2025
Published: Jan. 1, 2025
Language: English
Scientific Large Language Models: A Survey on Biological & Chemical Domains
ACM Computing Surveys, 2025
Published: Jan. 26, 2025
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
Language: English
Evaluating the representational power of pre-trained DNA language models for regulatory genomics
Ziqi Tang, Nikunj V. Somia, Yiyang Yu, et al.
bioRxiv (Cold Spring Harbor Laboratory), 2024
Published: March 4, 2024
Abstract
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Language: English
Synthetic genomes unveil the effects of synonymous recoding
bioRxiv (Cold Spring Harbor Laboratory), 2024
Published: June 16, 2024
Abstract
Engineering the genetic code of an organism provides the basis for (i) making any organism safely resistant to natural viruses and (ii) preventing genetic information flow into and out of genetically modified organisms while (iii) allowing the biosynthesis of genetically encoded unnatural polymers [1–4]. Achieving these three goals requires the reassignment of multiple of the 64 codons nature uses to encode proteins. However, synonymous codon replacement (recoding) is frequently lethal, and how recoding impacts fitness remains poorly explored. Here, we explore these effects using whole-genome synthesis, multiplexed directed evolution, and genome-transcriptome-translatome-proteome co-profiling on multiple recoded genomes. Using this information, we assemble a synthetic Escherichia coli genome in seven sections using only 57 codons to encode proteins. By discovering the rules responsible for the lethality of synonymous recoding and developing a data-driven multi-omics-based genome construction workflow that troubleshoots synthetic genomes, we overcome the lethal effects of 62,007 synonymous codon swaps and 11,108 additional genomic edits. We show that synonymous recoding induces transcriptional noise including new antisense RNAs, leading to drastic transcriptome and proteome perturbation. As the elimination of select codons from an organism's genetic code results in the widespread appearance of cryptic promoters, we show that synonymous codon choice may naturally evolve to minimize transcriptional noise. Our work provides the first genome-scale description of how synonymous codon changes influence organismal fitness and paves the way for the construction of functional genomes that provide genetic firewalls from natural ecosystems and safely produce biopolymers, drugs, and enzymes with expanded chemistry.
Language: English
DeepInterAware: Deep Interaction Interface‐Aware Network for Improving Antigen‐Antibody Interaction Prediction from Sequence Data
Yiben Xia, Zhiwei Wang, Feng Huang, et al.
Advanced Science, 2025
Published: Feb. 11, 2025
Abstract
Identifying interactions between candidate antibodies and target antigens is a key step in developing effective human therapeutics. The antigen–antibody interaction (AAI) occurs at the structural level, but the limited structure data poses a significant challenge. However, recent studies revealed that structural information can be learned from the vast amount of sequence data, indicating that interaction prediction can benefit from the abundance of antigen and antibody sequences. In this study, DeepInterAware (deep interaction interface‐aware network) is proposed, a framework dynamically incorporating interaction interface information directly learned from sequence data, along with the inherent specificity information of the sequences. Experimental results in interaction prediction demonstrate that DeepInterAware outperforms existing methods and exhibits promising inductive capabilities for predicting interactions involving unseen antigens or antibodies, and transfer capabilities for similar tasks. More notably, DeepInterAware has unique advantages that existing methods lack. First, DeepInterAware can dive into the underlying mechanisms of AAIs, offering the ability to identify potential binding sites. Second, it is proficient in detecting mutations within antigens or antibodies, and can be extended for precise predictions of the binding free energy changes upon mutations. The HER2‐targeting antibody screening experiment further underscores DeepInterAware's exceptional capability in identifying binding antibodies for target antigens, establishing it as an important tool for antibody screening.
Language: English
Large language model applications in nucleic acid research
Published: Jan. 1, 2025
Language: English
ABI and generative biology: A new paradigm for gene therapy, genome engineering, and engineered cell therapy
Adrian Woolfson
Molecular Therapy, 2025
Published: March 1, 2025
Language: English
The design and engineering of synthetic genomes
Nature Reviews Genetics, 2024
Published: Nov. 6, 2024
Language: English
Protein Set Transformer: A protein-based genome language model to power high diversity viromics
bioRxiv (Cold Spring Harbor Laboratory), 2024
Published: July 29, 2024
Abstract
Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.
Language: English
Protein Set Transformer: A protein-based genome language model to power high diversity viromics
Research Square (Research Square), 2024
Published: Sept. 23, 2024
Language: English