bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: Nov. 3, 2024.
Abstract
DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 human methylation profiles (154,063 after QC and deduplication) spanning diverse tissue types from 5,281 datasets, curated to 49,156 CpG sites and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations that capture both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R=0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared to existing methods. Analysis of the model's attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When finetuned to predict mortality and 60 major disease conditions using 18,859 samples from Generation Scotland, MethylGPT maintains strong predictive performance and enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation while preserving interpretability, suggesting broad utility for epigenetic analysis.
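The training signal described here, reconstructing held-out methylation values from the remaining sites, also explains the model's tolerance to missing data: unobserved sites are simply masked at inference. The following is a minimal sketch of that idea, not the authors' code; the embedding scheme, dimensions, and masking rate are assumptions.

```python
# Minimal sketch of masked methylation-value prediction in the spirit of
# MethylGPT (not the authors' implementation). Each CpG site is a token:
# a learned site-identity embedding plus a projection of its beta value.
# A fraction of values is masked and must be reconstructed.
import torch
import torch.nn as nn

class MaskedMethylationModel(nn.Module):
    def __init__(self, n_sites=49156, d_model=128, n_layers=4, n_heads=8):
        super().__init__()
        self.site_emb = nn.Embedding(n_sites, d_model)   # CpG identity
        self.value_proj = nn.Linear(1, d_model)          # beta value in [0, 1]
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                # reconstructed beta

    def forward(self, site_ids, values, masked):
        # site_ids: (B, L) int; values: (B, L) float; masked: (B, L) bool
        val = self.value_proj(values.unsqueeze(-1))
        val = torch.where(masked.unsqueeze(-1), self.mask_emb.expand_as(val), val)
        h = self.encoder(self.site_emb(site_ids) + val)
        return self.head(h).squeeze(-1)

model = MaskedMethylationModel()
site_ids = torch.randint(0, 49156, (2, 512))
values = torch.rand(2, 512)
masked = torch.rand(2, 512) < 0.3             # e.g. 30% of values held out
pred = model(site_ids, values, masked)
loss = ((pred - values)[masked] ** 2).mean()  # MSE on masked positions only
loss.backward()
```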
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: April 22, 2024.
Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with the potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and meta-genomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and by tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, an AI-generated gene editor, denoted OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.
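The generative step described here, sampling candidate protein sequences from an LLM trained on natural diversity, can be illustrated with a publicly available stand-in, since the paper's own models are proprietary. ProtGPT2 and the sampling parameters below are purely illustrative, not the authors' pipeline.

```python
# Sketch: sampling candidate protein sequences from an autoregressive
# protein language model. ProtGPT2 is a public model used as a stand-in
# for illustration; it is not the model from the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="nferruz/ProtGPT2")

# ProtGPT2 was trained on FASTA-like text; sampling settings follow its
# model card and are illustrative choices.
samples = generator(
    "<|endoftext|>",
    max_length=200,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for s in samples:
    print(s["generated_text"].replace("\n", ""))
```

In a design campaign of this kind, generated candidates would then be clustered (e.g., with MMseqs2) and filtered by structure prediction before any experimental testing.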
Proceedings of the National Academy of Sciences, Journal Year: 2024, Volume and Issue: 121(26). Published: June 20, 2024.
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein–protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that, for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperform state-of-the-art PPI prediction methods with comparable compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All of our model adaptation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation for groups with limited computational resources.
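The LoRA recipe applied here can be reproduced in outline with the Hugging Face peft library. The sketch below wraps a small ESM-2 checkpoint for a binary classification task; the checkpoint, rank, and target modules are illustrative assumptions, not the paper's exact configuration (see the linked repository for that).

```python
# Sketch: parameter-efficient fine-tuning of a protein language model with
# LoRA via the Hugging Face `peft` library. Checkpoint, rank, and target
# modules are illustrative choices.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 for demonstration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

config = LoraConfig(
    r=8,                                # low-rank update dimension
    lora_alpha=16,                      # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # ESM attention projections
    modules_to_save=["classifier"],     # train the task head fully
)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # typically well under 1% of total

# Training then proceeds as usual (e.g., with transformers.Trainer);
# only the LoRA adapters and the classification head receive gradients.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)             # (1, 2)
```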
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: March 4, 2024.
ABSTRACT
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues with current pre-training strategies for the non-coding genome.
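The probing protocol at issue freezes the pre-trained model and trains only a lightweight head on its embeddings, then compares against the same head on one-hot encoded sequences. The sketch below illustrates the comparison; embed_sequences is a hypothetical stand-in for any gLM encoder, and the toy data carries no signal.

```python
# Sketch of a probing comparison: frozen-model embeddings vs. one-hot
# encoded sequences, each feeding the same lightweight linear head.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

ALPHABET = "ACGT"

def one_hot(seqs):
    # (N, L, 4) one-hot encoding, flattened for a linear model
    idx = {c: i for i, c in enumerate(ALPHABET)}
    X = np.zeros((len(seqs), len(seqs[0]), 4), dtype=np.float32)
    for n, s in enumerate(seqs):
        for j, c in enumerate(s):
            X[n, j, idx[c]] = 1.0
    return X.reshape(len(seqs), -1)

def probe(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Toy data; real evaluations use functional genomics labels.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(ALPHABET), 50)) for _ in range(200)]
labels = rng.integers(0, 2, 200)

auc_onehot = probe(one_hot(seqs), labels)
# auc_glm = probe(embed_sequences(seqs), labels)  # hypothetical gLM embeddings
print(f"one-hot baseline AUROC: {auc_onehot:.3f}")
```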
Frontiers in Bioengineering and Biotechnology, Journal Year: 2025, Volume and Issue: 13. Published: Feb. 5, 2025.
The integration of artificial intelligence (AI) in protein design presents unparalleled opportunities for innovation in bioengineering and biotechnology. However, it also raises significant biosecurity concerns. This review examines the changing landscape of bioweapon risks, the dual-use potential of AI-driven tools, and the safeguards necessary to prevent misuse while fostering innovation. It highlights emerging policy frameworks, technical safeguards, and community responses aimed at mitigating risks while enabling the responsible development and application of AI in protein design.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: May 25, 2024.
Abstract
Protein design has important implications for drug discovery, personalized medicine, and biotechnology. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but alignment construction is imperfect. We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. We trained ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combined autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for the generation of novel protein sequences and for fitness prediction. ProtMamba reaches competitive performance with other protein language models despite its smaller size, which sheds light on the importance of long-context conditioning.
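The fill-in-the-middle objective mentioned here rearranges each training example so that an autoregressive model learns to infill a missing span conditioned on what comes before and after it. The sketch below shows that data transformation; the sentinel token names and single-span setup are assumptions for illustration, and ProtMamba's actual formatting may differ.

```python
# Sketch of a fill-in-the-middle (FIM) training example, combining
# autoregressive and masked-style modeling in one objective. Sentinel
# token names are illustrative assumptions.
import random

PRE, SUF, MID = "<PREFIX>", "<SUFFIX>", "<MIDDLE>"

def make_fim_example(sequence, rng):
    # Cut out a middle span, then present prefix + suffix first and the
    # middle last, so a left-to-right model learns to infill it.
    i, j = sorted(rng.sample(range(1, len(sequence)), 2))
    prefix, middle, suffix = sequence[:i], sequence[i:j], sequence[j:]
    inp = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    target = middle  # loss is computed on the infilled middle
    return inp, target

rng = random.Random(0)
homologs = ["MKTAYIAKQRQISFVKSHFSRQ", "MKSAYIAKQRQLSFVKNHFSRQ"]
# Homology-aware context: concatenate homologs, then infill one of them.
context = "".join(h + "<EOS>" for h in homologs[:-1])
inp, target = make_fim_example(homologs[-1], rng)
print(context + inp, "->", target)
```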
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: June 5, 2024.
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models when fine-tuned on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best existing DNA LMs by 1.45- to 7.23-fold. PlantCaduceus is also competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
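Deleterious mutation identification with a DNA LM is commonly done zero-shot by comparing the model's probabilities for the reference versus alternative allele at a masked position. The sketch below shows that generic recipe; the checkpoint name is a placeholder, the code assumes a character-level nucleotide tokenizer, and PlantCaduceus' own scoring may differ.

```python
# Sketch: zero-shot variant scoring with a masked DNA language model via a
# log-likelihood ratio of alternative vs. reference allele. The checkpoint
# is a hypothetical placeholder; a character-level tokenizer is assumed.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "some-org/dna-masked-lm"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def variant_score(seq, pos, ref, alt):
    # Mask the variant site and compare log-probabilities of ref vs alt;
    # a strongly negative score suggests the alternative allele is disfavored.
    tokens = list(seq)
    tokens[pos] = tokenizer.mask_token
    enc = tokenizer("".join(tokens), return_tensors="pt")
    mask_idx = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_idx]
    logp = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (logp[alt_id] - logp[ref_id]).item()

print(variant_score("ACGTACGTACGTACGTACGT", 10, "G", "A"))
```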
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: July 27, 2024.
Deciphering how the nucleotides in genomes encode regulatory instructions and molecular machines is a long-standing goal of biology. DNA language models (LMs) implicitly capture functional elements and their organization from genomic sequences alone by modeling the probabilities of each nucleotide given its sequence context. However, using DNA LMs for discovering functional genomic elements has been challenging due to the lack of interpretable methods. Here, we introduce nucleotide dependencies, which quantify how substitutions at one genomic position affect the probabilities of nucleotides at other positions. We generated genome-wide maps of pairwise nucleotide dependencies within kilobase ranges for animal, fungal, and bacterial species. We show that nucleotide dependencies indicate the deleteriousness of human genetic variants more effectively than sequence alignment and LM reconstruction. Regulatory elements appear as dense blocks in dependency maps, enabling the systematic identification of transcription factor binding sites as accurately as models trained on experimental binding data. Nucleotide dependencies also highlight bases in contact within RNA structures, including pseudoknots and tertiary structure contacts, with remarkable accuracy. This led to the discovery of four novel, experimentally validated RNA structures in Escherichia coli. Finally, we reveal critical limitations of several DNA LM architectures and training sequence selection strategies by benchmarking and visual diagnosis. Altogether, nucleotide dependency analysis opens a new avenue for studying the interactions underlying genomes.
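The central quantity here is how much substituting position i shifts the model's predicted distribution at position j. A minimal sketch follows; predict_probs stands in for any DNA LM that returns per-position nucleotide probabilities, and the max-absolute-log-odds summary is one reasonable choice, not necessarily the paper's exact definition.

```python
# Sketch of a pairwise nucleotide dependency map from a DNA LM's
# per-position probability predictions.
import numpy as np

BASES = "ACGT"

def dependency_map(seq, predict_probs):
    # dep[i, j]: how much substituting position i shifts the model's
    # predicted odds at position j (max over substitutions and bases).
    L = len(seq)
    p_ref = predict_probs(seq)                    # (L, 4), rows sum to 1
    dep = np.zeros((L, L))
    for i in range(L):
        for b in BASES.replace(seq[i], ""):       # the three substitutions
            variant = seq[:i] + b + seq[i + 1:]
            p_var = predict_probs(variant)
            log_odds = np.log(p_var / (1 - p_var)) - np.log(p_ref / (1 - p_ref))
            shift = np.abs(log_odds).max(axis=1)  # strongest effect per position
            dep[i] = np.maximum(dep[i], shift)
    np.fill_diagonal(dep, 0.0)                    # ignore the substituted site itself
    return dep

# Toy stand-in model: uniform probabilities (yields an all-zero map).
uniform = lambda seq: np.full((len(seq), 4), 0.25)
print(dependency_map("ACGTACGT", uniform).max())
```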
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: Aug. 17, 2024.
Abstract
Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in data accessibility, quality filtering, and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic repositories (JGI's IMG and EMBL's MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene encoding sequences with translated amino acids for protein coding sequences and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and genomic regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 is hosted at https://huggingface.co/tattabio/gLM2_650M.
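Deduplication in embedding space, as used here to balance the corpus, amounts to dropping items whose embeddings are near-duplicates of items already kept. The sketch below shows a greedy cosine-similarity version; the threshold and greedy strategy are illustrative assumptions, and the OMG pipeline's exact procedure may differ.

```python
# Sketch of embedding-space deduplication: greedily keep an item only if
# its embedding is not too similar to anything already kept.
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.95):
    # embeddings: (N, D) array; returns indices of retained items.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, x in enumerate(X):
        if not kept or (X[kept] @ x).max() < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 64))
near_dups = base[:20] + 0.01 * rng.normal(size=(20, 64))  # near-duplicates
emb = np.vstack([base, near_dups])
kept = dedup_by_embedding(emb)
print(f"kept {len(kept)} of {len(emb)}")  # the near-duplicates are dropped
```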