Genomic language models: opportunities and challenges
Gonzalo Benegas,
Chengzhong Ye,
Carlos Albors
et al.
Trends in Genetics
Journal Year: 2025
Volume and Issue: unknown
Published: Jan. 1, 2025
Language: English
Are protein language models the new universal key?
Current Opinion in Structural Biology
Journal Year: 2025
Volume and Issue: 91, P. 102997 - 102997
Published: Feb. 7, 2025
Protein language models (pLMs) capture some aspects of the grammar of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, they can serve as the exclusive input into downstream supervised methods for prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific families from multiple sequence alignments (MSAs) has been the most successful universal key to success. For many applications, MSA-free pLM-based predictions have now become significantly more accurate. The reason is often a combination of two aspects. Firstly, embeddings condense this information so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters, in particular in an era of exploding deep neural networks. Secondly, pLMs provide protein-specific solutions. As an additional benefit, once pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based ones. In fact, we appeal to the community to rather optimize existing foundation models than retrain new ones, and to evolve incentives for solutions that require fewer resources, even at a loss of accuracy. Although pLMs have not yet succeeded in entirely replacing the body of solutions developed over three decades, they clearly are rapidly advancing.
Language: English
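The workflow this abstract describes, pLM embeddings serving as the exclusive input to a small downstream supervised predictor, can be sketched roughly as follows. This is an illustrative sketch only, not code from the cited paper: the ESM-2 checkpoint, the mean-pooling step, and the toy sequences and labels are all assumptions made here for demonstration.

# Minimal sketch (assumptions noted above): per-protein pLM embeddings
# as the sole input to a small, few-parameter classifier.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

checkpoint = "facebook/esm2_t6_8M_UR50D"  # illustrative small pLM, not prescribed by the paper
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

def embed(seq: str) -> torch.Tensor:
    """Return one embedding per protein by mean-pooling the last hidden layer."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

# Toy example: two sequences with made-up binary labels.
seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MAHMTQIPLSSVKRPLQVRGIVLR"]
labels = [0, 1]
X = torch.stack([embed(s) for s in seqs]).numpy()

# Because the embeddings condense the signal, a small model with few free
# parameters (here logistic regression) can act as the downstream predictor.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))

Swapping in a different pLM or pooling scheme only changes the embed() helper; the downstream predictor stays small, which is the point the abstract makes about parameter counts and resource use.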
Teaching AI to speak protein
Current Opinion in Structural Biology
Journal Year: 2025
Volume and Issue: 91, P. 102986 - 102986
Published: Feb. 21, 2025
Language: English
The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2024
Volume and Issue: unknown
Published: Aug. 17, 2024
Abstract
Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in accessibility, quality filtering and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic repositories (JGI’s IMG and EMBL’s MGnify). We first document the composition of the corpus and describe the steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality sequence dataset that represents multi-gene encoding sequences with translated amino acids for protein coding sequences and nucleic acids for intergenic sequences. We train a mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG corpus is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 at https://huggingface.co/tattabio/gLM2_650M.
Language: English
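Since the abstract points to the corpus on the Hugging Face Hub, a minimal sketch of streaming it with the datasets library follows. Only the repository URLs above come from the abstract; the split name and the record fields are assumptions, not documented here.

# Minimal sketch, not taken from the paper: streaming the OMG corpus so the
# full 3.1T-bp dataset is not downloaded up front.
from datasets import load_dataset

# "train" split assumed; adjust to whatever splits the dataset card lists.
omg = load_dataset("tattabio/OMG", split="train", streaming=True)

# Inspect the structure of one mixed-modality record (field names unknown here).
first = next(iter(omg))
print(sorted(first.keys()))

Loading the gLM2 checkpoint itself would presumably go through transformers, e.g. AutoModel.from_pretrained("tattabio/gLM2_650M", trust_remote_code=True), assuming the custom architecture ships its modeling code on the Hub; check the model card for the exact usage.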