bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2025, Volume and Issue: unknown
Published: Jan. 13, 2025
Abstract
Background
Selection of individuals based on their estimated breeding values aims to maximize response to selection of the next generation in the additive model. However, when the aim is not only about short-term population-wide genetic gain but also about genetic gain over multiple generations, an optimal selection strategy is not as clear-cut, as the maintenance of genetic diversity may become an important factor. This study provides an extended comparison of existing selection strategies in a unifying testing pipeline using the simulation software MoBPS.
Results
Applying a weighting factor on SNP effects based on the frequency of the beneficial allele resulted in an increase of long-term genetic gain of 1.6% after 50 generations while reducing inbreeding rates by 16.2% compared to truncation selection based on estimated breeding values. However, this resulted in short-term losses of 1.2%, with the break-even point reached after 25 generations. In contrast, the inclusion of the average kinship of an individual to the top individuals of the population as an additional trait in the selection index, with an index weight of 17.5%, resulted in no short-term losses and increased long-term genetic gains by 4.3% while reducing inbreeding rates by 15.8%, achieving very similar efficiency to the use of optimum contribution selection. Combining the management strategies, with weights for each strategy optimized by an evolutionary algorithm, resulted in a breeding scheme with 5.1% increased long-term genetic gain and 37.3% reduced inbreeding rates. The proposed strategy included optimum contribution, weighting by allele frequency, the kinship index, avoiding matings between related individuals, and lowering the proportion of selected individuals.
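Conclusions
The combination of strategies was shown to be far superior to any singular method tested in this study. As the use of efficient selection methods does not necessarily lead to short-term losses and comes at no extra costs, it is critical for breeding companies to implement such strategies for long-term success.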
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2022, Volume and Issue: unknown
Published: April 10, 2022
Abstract
We consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.
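The native sequence recovery figures quoted above are per-position agreement rates between a designed sequence and the native one. A minimal sketch of that metric follows; the function name `sequence_recovery` and the example sequences are hypothetical and are not taken from the paper's code.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native residue.

    Illustrative helper only; assumes both sequences are aligned position by
    position and have the same, non-zero length.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical four-residue example: three of four positions agree.
recovery = sequence_recovery("ACDE", "ACDF")
```

Restricting the comparison to a subset of positions (for example, buried residues) would give a subset-specific recovery in the same way.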
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023, Volume and Issue: unknown
Published: June 12, 2023
Abstract
Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on 11 species under assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ∼20 percentage points on average, while the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
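The F1-score mentioned above is the harmonic mean of precision and recall. The sketch below computes it for sets of predicted and reference transcripts, under the assumption (ours, not stated in the abstract) that a predicted transcript counts as correct only if it exactly matches a reference transcript; the function name and the toy identifiers are hypothetical.

```python
def f1_score(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over exact-match items.

    Illustrative only: each element stands for one transcript, encoded in any
    hashable form (for example, a tuple of exon coordinates).
    """
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 references found.
score = f1_score({"t1", "t2", "t3"}, {"t1", "t2", "t4", "t5"})
```

In this toy case precision is 2/3 and recall is 1/2, so the F1-score is 4/7.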
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023, Volume and Issue: unknown
Published: Jan. 15, 2023
Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost and despite low available data regime to solve a variety of genomics applications. Despite no supervision, the transformer models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional genetic variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequence. Code and weights available at: https://github.com/instadeepai/nucleotide-transformer in Jax and https://huggingface.co/InstaDeepAI in Pytorch. Example notebooks to apply these models to any downstream task are available on https://huggingface.co/docs/transformers/notebooks#pytorch-bio.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023, Volume and Issue: unknown
Published: Oct. 2, 2023
Abstract
Large-scale protein language models (PLMs), such as the ESM family, have achieved remarkable performance in various downstream tasks related to protein structure and function by undergoing unsupervised training on residue sequences. They have become essential tools for researchers and practitioners in biology. However, a limitation of vanilla PLMs is their lack of explicit consideration for protein structure information, which suggests the potential for further improvement. Motivated by this, we introduce the concept of a “structure-aware vocabulary” that integrates residue tokens with structure tokens. The structure tokens are derived by encoding the 3D structure of proteins using Foldseek. We then propose SaProt, a large-scale general-purpose PLM trained on an extensive dataset comprising approximately 40 million protein sequences and structures. Through extensive evaluation, our SaProt model surpasses well-established and renowned baselines across 10 significant downstream tasks, demonstrating its exceptional capacity and broad applicability. We have made the code, pre-trained model, and all relevant materials available at https://github.com/westlake-repl/SaProt.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2020, Volume and Issue: unknown
Published: July 12, 2020
Abstract
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2021, Volume and Issue: unknown
Published: Nov. 11, 2021
Abstract
Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2020, Volume and Issue: unknown
Published: April 13, 2020
Abstract
A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures. We show that the attention mechanism of transformer architectures is actually the update rule of modern Hopfield networks that can store exponentially many patterns. We exploit this high storage capacity of modern Hopfield networks to solve a challenging multiple instance learning (MIL) problem in computational biology: immune repertoire classification. Accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the COVID-19 crisis. Immune repertoire classification based on the vast number of immunosequences of an individual is a MIL problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. In this work, we present our novel method DeepRC that integrates transformer-like attention, or equivalently modern Hopfield networks, into deep learning architectures for massive MIL such as immune repertoire classification. We demonstrate that DeepRC outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. Source code and datasets: https://github.com/ml-jku/DeepRC
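The equivalence between attention and the modern Hopfield update rule can be illustrated with a single retrieval step: the new state is a softmax-weighted average of the stored patterns, with weights given by scaled dot products between each pattern and the current state. The pure-Python sketch below is an illustration under our own naming (`hopfield_update`, `beta`, the toy patterns); it is not the DeepRC implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_update(patterns, query, beta=1.0):
    """One modern-Hopfield retrieval step (equivalently, one attention read).

    patterns: list of stored pattern vectors (they act as keys and values).
    query:    current state vector (acts as the attention query).
    beta:     inverse temperature; larger values sharpen retrieval.
    """
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query)) for p in patterns]
    weights = softmax(scores)
    return [sum(w * p[i] for w, p in zip(weights, patterns)) for i in range(len(query))]


# Toy example: a noisy version of the first stored pattern is pulled back to it.
stored = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0]]
retrieved = hopfield_update(stored, [0.9, 1.1, 0.8], beta=8.0)
```

With a large `beta` the softmax concentrates on the best-matching pattern, so one update essentially returns that pattern; with a small `beta` the result is closer to an average of the stored patterns.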
Circulation Research,
Journal Year: 2022, Volume and Issue: 131(1), P. 91-105
Published: May 16, 2022
Background: Cellular redox control is maintained by generation of reactive oxygen/nitrogen species balanced by activation of antioxidative pathways. Disruption of redox balance leads to oxidative stress, a central causative event in numerous diseases including heart failure. Redox control in the heart exposed to hemodynamic stress, however, remains to be fully elucidated.
Methods: Pressure overload was triggered by transverse aortic constriction in mice. Transcriptomic and metabolomic regulations were evaluated by RNA-sequencing and metabolomics, respectively. Stable isotope tracer labeling experiments were conducted to determine metabolic flux in vitro. Neonatal rat ventricular myocytes and H9c2 cells were used to examine molecular mechanisms.
Results: We show that production of cardiomyocyte NADPH, a key factor in redox regulation, is decreased in pressure overload-induced heart failure. As a consequence, the level of reduced glutathione is downregulated, a change associated with fibrosis and cardiomyopathy. We report that the pentose phosphate pathway and mitochondrial serine/glycine/folate metabolic signaling, 2 NADPH-generating pathways in the cytosol and mitochondria, respectively, are induced by transverse aortic constriction. We identify ATF4 (activating transcription factor 4) as an upstream transcription factor controlling the expression of multiple enzymes in these 2 pathways. Consistently, joint pathway analysis of transcriptomic and metabolomic data reveal that ATF4 preferably controls oxidative stress and redox-related pathways. Overexpression of ATF4 in neonatal rat ventricular myocytes increases NADPH-producing enzymes, whereas silencing of ATF4 decreases their expression. Further, stable isotope tracer experiments reveal that ATF4 overexpression augments metabolic flux within these 2 pathways. In vivo, cardiomyocyte-specific deletion of ATF4 exacerbates cardiomyopathy in the setting of transverse aortic constriction and accelerates heart failure development, attributable, at least in part, to an inability to increase the expression of NADPH-generating enzymes.
Conclusions: Our findings reveal that ATF4 plays a critical role in the heart under conditions of hemodynamic stress by governing both cytosolic and mitochondrial production of NADPH.