bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: April 26, 2024
Abstract
Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ARG-based approaches to demographic inference in typical empirical analyses are susceptible to mis-inference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, though it could cause mis-inference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid mis-inference due to linked effects of selection.
Molecular Biology and Evolution, Journal Year: 2024, Volume and Issue: 41(7)
Published: June 14, 2024
Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ancestral recombination graph (ARG)-based approaches to demographic inference in typical empirical analyses are susceptible to misinference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying, and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, although it could cause misinference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion, which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid misinference due to linked effects of selection.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Dec. 2, 2024
Foreshadowing haplotype-based methods of the genomics era, it is an old observation that the "junction" between two distinct haplotypes produced by recombination is inherited as a Mendelian marker. In a genealogical context, this recombination-mediated information reflects the persistence of ancestral haplotypes across local genealogical trees in which they do not represent coalescences. We show how these non-coalescing haplotypes ("locally-unary nodes") may be inserted into ancestral recombination graphs (ARGs), a compact but information-rich data structure describing the genealogical relationships among recombinant sequences. The resulting ARGs are smaller, faster to compute with, and the additional ancestral information that is inserted is nearly always correct where the initial ARG is correct. We provide efficient algorithms to infer locally-unary nodes within existing ARGs, and explore some consequences for ARGs inferred from real data. To do this, we introduce new metrics of agreement and disagreement between ARGs that, unlike previous methods, consider ARGs as describing relationships between haplotypes rather than just a collection of trees.
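The idea of a locally-unary node can be made concrete with a small sketch. The snippet below is only an illustration under stated assumptions, not the algorithm from the preprint: it assumes a toy ARG stored as `(left, right, parent, child)` edges over half-open genomic intervals, and the names `EDGES` and `locally_unary_nodes` are hypothetical. It reports the nodes that appear in the local tree at a given position with exactly one child, that is, ancestral haplotypes that persist in that tree without representing a coalescence.

```python
from collections import defaultdict

# Hypothetical toy ARG: each edge is (left, right, parent, child), meaning
# `child` inherits from `parent` on the half-open interval [left, right).
# Nodes 0, 1, 2 are samples; nodes 4, 5, 6 are ancestors.
EDGES = [
    (0, 10, 4, 0),
    (0, 10, 4, 1),
    (0, 5, 5, 2),   # left tree: node 5 joins sample 2 and node 4
    (0, 5, 5, 4),
    (5, 10, 5, 4),  # right tree: node 5 persists above node 4 ...
    (5, 10, 6, 2),  # ... but sample 2 now joins at node 6 instead
    (5, 10, 6, 5),
]


def locally_unary_nodes(edges, position):
    """Return the sorted nodes that have exactly one child at `position`."""
    children = defaultdict(set)
    for left, right, parent, child in edges:
        if left <= position < right:
            children[parent].add(child)
    return sorted(node for node, kids in children.items() if len(kids) == 1)


print(locally_unary_nodes(EDGES, 2))  # local tree on [0, 5)
print(locally_unary_nodes(EDGES, 7))  # local tree on [5, 10)
```

In this toy example node 5 is a coalescence in the left-hand tree but has a single child in the right-hand tree, so it is locally unary there.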
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown
Published: Sept. 30, 2022
Abstract
The reproductive mechanism of a species is a key driver of genome evolution. The standard Wright-Fisher model for the reproduction of individuals in a population assumes that each individual produces a number of offspring negligible compared to the total population size. Yet many species of plants, invertebrates, prokaryotes or fish exhibit neutrally skewed offspring distribution or strong selection events yielding few individuals to produce a number of offspring of up to the same magnitude as the population size. As a result, the genealogy of a sample is characterized by multiple individuals (more than two) coalescing simultaneously to the same common ancestor. The current methods developed to detect such multiple merger events do not account for complex demographic scenarios or recombination, and require large sample sizes. We tackle these limitations by developing two novel and different approaches to infer multiple merger events from sequence data or the ancestral recombination graph (ARG): a sequentially Markovian coalescent (SMβC) and a graph neural network (GNNcoal). We first give proof of the accuracy of our methods to estimate the multiple merger parameter and past demographic history using simulated data under the β-coalescent model. Secondly, we show that our approaches can also recover the effect of positive selective sweeps along the genome. Finally, we are able to distinguish skewed offspring distribution from selection while simultaneously inferring the multiple merger parameter and past population size variation. Our findings stress the aptitude of neural networks to leverage information from the ARG for inference but also the urgent need for more accurate ARG inference approaches.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: June 12, 2024
Abstract
Background
Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
Results
Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for reliable and efficient conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
Conclusions
Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
Key Points
VCF is widely supported, and the underlying data model entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.
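The contrast between row-wise and field-wise storage can be sketched in a few lines. This is a toy illustration only, assuming made-up data and plain Python lists; it does not use the Zarr library or follow the VCF Zarr specification, and the names `ROWS`, `to_columns`, `chunk` and the two `mean_qual_*` functions are hypothetical. It shows that a per-field calculation over row-wise records must visit every whole record, while a column stored separately in chunks can be processed on its own.

```python
# Hypothetical toy data: three variants, two samples each.
ROWS = [
    {"pos": 101, "qual": 50.0, "genotypes": [0, 1]},
    {"pos": 202, "qual": 12.5, "genotypes": [1, 1]},
    {"pos": 303, "qual": 99.0, "genotypes": [0, 0]},
]


def to_columns(rows):
    """Transpose row-wise records into one list per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def chunk(values, size):
    """Split one column into fixed-size chunks (the last may be shorter)."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def mean_qual_rowwise(rows):
    """Row-wise access: every record is visited to pull out one field."""
    return sum(row["qual"] for row in rows) / len(rows)


def mean_qual_columnwise(qual_chunks):
    """Field-wise access: only the chunks of the 'qual' column are read."""
    total = sum(sum(c) for c in qual_chunks)
    count = sum(len(c) for c in qual_chunks)
    return total / count


COLUMNS = to_columns(ROWS)
QUAL_CHUNKS = chunk(COLUMNS["qual"], 2)
print(QUAL_CHUNKS)
print(mean_qual_rowwise(ROWS), mean_qual_columnwise(QUAL_CHUNKS))
```

Because each chunk is independent, the per-chunk sums could in principle be computed in parallel and combined afterwards, which is the property the abstract associates with massively parallel processing.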