Genome Biology, Journal Year: 2023, Volume and Issue: 24(1)
Published: March 27, 2023
Abstract
Background
The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the extent to which those models capture genuine causal signals.
Results
Here we confront predictions of state-of-the-art models of transcription regulation against data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large, captures causal determinants of human promoters. However, models fail to capture the causal effects of enhancers on expression, notably in medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression predictions is small and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases.
Conclusions
Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights and we provide practical guidance on how to use them. Moreover, we foresee that it will require significantly more and particularly new kinds of data to train models accurately accounting for distal elements.
Frontiers in Genetics, Journal Year: 2022, Volume and Issue: 13
Published: July 25, 2022
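The same genetic variant found in different individuals can cause a range of diverse phenotypes, from no discernible clinical phenotype to severe disease, even among related individuals. Such variants can be said to display incomplete penetrance, a binary phenomenon where the genotype either causes the expected clinical phenotype or it does not, or they can be said to display variable expressivity, in which the same genotype can cause a wide range of clinical symptoms across a spectrum. Both incomplete penetrance and variable expressivity are thought to be caused by a range of factors, including common variants, variants in regulatory regions, epigenetics, environmental factors, and lifestyle. Many thousands of genetic variants have been identified as the cause of monogenic disorders, mostly determined through small clinical studies, and thus, the penetrance and expressivity of these variants may be overestimated when compared to their effect on the general population. With the wealth of population cohort data currently available, the penetrance and expressivity of such genetic variants can be investigated across a much wider contingent, potentially helping to reclassify variants that were previously thought to be completely penetrant. Research into the penetrance and expressivity of such genetic variants is important for clinical classification, both for determining causative mechanisms of disease in the affected population and for providing accurate risk information through genetic counseling. A genotype-based definition of the causes of rare diseases incorporating information from population cohorts and clinical studies is critical for our understanding of incomplete penetrance and variable expressivity. This review examines our current knowledge of the penetrance and expressivity of genetic variants in rare disease and across populations, as well as looking into the potential causes of the variation seen, including genetic modifiers, mosaicism, and polygenic factors, among others. We also considered the challenges that come with investigating penetrance and expressivity.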
Science, Journal Year: 2022, Volume and Issue: 375(6586), P. 1247 - 1254
Published: March 17, 2022
Associations between genetic variation and traits are often in noncoding regions with strong linkage disequilibrium (LD), where a single causal variant is assumed to underlie the association. We applied a massively parallel reporter assay (MPRA) to functionally evaluate genetic variants in high, local LD for independent cis-expression quantitative trait loci (eQTL). We found that 17.7% of eQTLs exhibit more than one major allelic effect in tight LD. The detected regulatory variants were highly and specifically enriched for activating chromatin structures and allelic transcription factor binding. Integration of MPRA profiles with eQTL/complex trait colocalizations across 114 human traits and diseases identified causal variant sets demonstrating how genetic association signals can manifest through multiple, tightly linked causal variants.
Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 52(D1), P. D1143 - D1154
Published: Jan. 5, 2024
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
Nature Communications, Journal Year: 2022, Volume and Issue: 13(1)
Published: Nov. 24, 2022
Machine learning and in particular deep learning (DL) are increasingly important in mass spectrometry (MS)-based proteomics. Recent DL models can predict the retention time, ion mobility and fragment intensities of a peptide just from the amino acid sequence with good accuracy. However, DL is a very rapidly developing field with new neural network architectures frequently appearing, which are challenging to incorporate for proteomics researchers. Here we introduce AlphaPeptDeep, a modular Python framework built on the PyTorch DL library that learns and predicts the properties of peptides (https://github.com/MannLabs/alphapeptdeep). It features a model shop that enables non-specialists to create models in just a few lines of code. AlphaPeptDeep represents post-translational modifications in a generic manner, even if only the chemical composition is known. Extensive use of transfer learning obviates the need for large data sets to refine models for particular experimental conditions. The AlphaPeptDeep models for predicting retention time, collisional cross sections and fragment intensities are at least on par with existing tools. Additional sequence-based properties can also be predicted by AlphaPeptDeep, as demonstrated with a HLA peptide prediction model to improve HLA peptide identification for data-independent acquisition (https://github.com/MannLabs/PeptDeep-HLA).
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Jan. 15, 2023
Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost and despite the low available data regime to solve a variety of genomics applications. Despite no supervision, the transformer models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional genetic variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequence. Code and weights available at: https://github.com/instadeepai/nucleotide-transformer in Jax and https://huggingface.co/InstaDeepAI in Pytorch. Example notebooks to apply these models to any downstream task are available on https://huggingface.co/docs/transformers/notebooks#pytorch-bio.
Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1)
Published: Jan. 1, 2023
Abstract
Summary
The transformer-based language models, including vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research from basic sequence analysis to drug discovery. While transformer-based applications in bioinformatics are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, and opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development in transformer-based language models, and inspire novel bioinformatics applications that are unattainable by traditional methods.
Supplementary information
Supplementary data are available at Bioinformatics Advances online.