bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: May 27, 2024
Abstract
The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and – in conjunction with recent advances in ancient DNA sequencing technology – studying human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each describes the evolutionary history of a consecutive region of the genome. The current corresponding state-of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description for each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straight-forward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson’s f, the Fixation Index, Tajima’s D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster. In conclusion, our proposed data structure compresses genealogical trees by encoding shared subtrees only once, enabling straight-forward memoization of intermediate results, yielding a substantial runtime reduction and a potentially more intuitive data representation than the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.
2012 ACM Subject Classification: Applied computing → Computational genomics; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics; Applied computing → Population genetics
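The idea described in the abstract above (identical subtrees stored once, with intermediate results memoized per stored subtree) can be illustrated with a minimal Python sketch. This is not gfkit's C++ implementation or its API: the `SubtreeDag` class, its method names, and the nested-tuple tree encoding are hypothetical, and the sketch assumes unordered trees without branch lengths or genomic intervals.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding: a leaf is a sample id (int), an internal node is a tuple of children.
Tree = Union[int, Tuple["Tree", ...]]


class SubtreeDag:
    """Stores each distinct subtree once (hash-consing). Illustrative only."""

    def __init__(self) -> None:
        self._ids: Dict[tuple, int] = {}            # canonical key -> DAG node id
        self.children: List[Tuple[int, ...]] = []   # DAG node id -> child ids (empty for leaves)
        self._leaf_count: Dict[int, int] = {}       # memo: DAG node id -> samples below it

    def _intern(self, key: tuple, kids: Tuple[int, ...]) -> int:
        # Create a new DAG node only if this subtree has not been seen before.
        if key not in self._ids:
            self._ids[key] = len(self.children)
            self.children.append(kids)
        return self._ids[key]

    def add(self, tree: Tree) -> int:
        """Insert a tree and return the DAG id of its root."""
        if isinstance(tree, int):
            return self._intern(("leaf", tree), ())
        # Sorting child ids makes the key independent of child order.
        kids = tuple(sorted(self.add(child) for child in tree))
        return self._intern(("node", kids), kids)

    def samples_below(self, node: int) -> int:
        """Number of samples under `node`, computed once per DAG node."""
        if node not in self._leaf_count:
            kids = self.children[node]
            self._leaf_count[node] = (
                1 if not kids else sum(self.samples_below(k) for k in kids)
            )
        return self._leaf_count[node]


def count_nodes(tree: Tree) -> int:
    """Node count of a single tree in the uncompressed representation."""
    return 1 if isinstance(tree, int) else 1 + sum(count_nodes(c) for c in tree)


# Three toy trees over samples 0..3; the first two differ only in child order.
trees = [((0, 1), (2, 3)), ((0, 1), (3, 2)), (((0, 1), 2), 3)]
dag = SubtreeDag()
roots = [dag.add(t) for t in trees]
total_nodes = sum(count_nodes(t) for t in trees)
print(f"{total_nodes} tree nodes stored as {len(dag.children)} DAG nodes")
```

In this toy input the three trees contain 21 nodes in total, while the DAG holds 9, and the subtree `(0, 1)` shared by all three trees has its sample count computed only once. Per-subtree counts of this kind are the sort of intermediate result that statistics such as the allele frequency spectrum build on.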
PLoS Genetics,
Journal year: 2024, Issue: 20(1), p. e1011110
Published: Jan. 18, 2024
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Peer Community Journal,
Journal year: 2024, Issue: 4
Published: March 18, 2024
The reproductive mechanism of a species is a key driver of genome evolution. The standard Wright-Fisher model for the reproduction of individuals in a population assumes that each individual produces a number of offspring negligible compared to the total population size. Yet many species of plants, invertebrates, prokaryotes or fish exhibit neutrally skewed offspring distribution or strong selection events yielding few individuals to produce a number of offspring of up to the same magnitude as the population size. As a result, the genealogy of a sample is characterized by multiple individuals (more than two) coalescing simultaneously to the same common ancestor. The current methods developed to detect such multiple merger events do not account for complex demographic scenarios or recombination, and require large sample sizes. We tackle these limitations by developing two novel and different approaches to infer multiple merger events from sequence data or the ancestral recombination graph (ARG): a sequentially Markovian coalescent (SMβC) and a graph neural network (GNNcoal). We first give proof of the accuracy of our methods to estimate the multiple merger parameter and past demographic history using simulated data under the β-coalescent model. Secondly, we show that our approaches can also recover the effect of positive selective sweeps along the genome. Finally, we are able to distinguish skewed offspring distribution from selection while simultaneously inferring the past variation of population size. Our findings stress the aptitude of neural networks to leverage information from the ARG for inference but also the urgent need for more accurate ARG inference approaches.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: March 29, 2024
Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. In humans, traditional approaches for describing population genetic variation often rely on discrete genetic ancestry labels, which, despite their utility, can obscure the complex, multi-faceted nature of human genetic history. These labels risk oversimplifying ancestry by ignoring its temporal depth and geographic continuity, and may therefore conflate notions of race, ethnicity, geography, and genetic ancestry. Here, we present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry to describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2025, Issue: unknown
Published: Feb. 15, 2025
Inference of Ancestral Recombination Graphs (ARGs) is of central interest in the analysis of genomic variation. ARGs can be specified in terms of topologies and coalescence times. The coalescence times are usually estimated using an informative prior derived from coalescent theory, but this may generate biased estimates and can also complicate downstream inferences based on ARGs. Here we introduce POLEGON, a novel approach for estimating branch lengths for ARGs which uses an uninformative prior. Using extensive simulations, we show that this method provides improved estimates of coalescence times and leads to more accurate inferences of effective population sizes under a wide range of demographic assumptions. It also improves other downstream inferences, including estimates of mutation rates. We apply the method to data from the 1000 Genomes Project to investigate population size histories and differential mutation signatures across populations. We also estimate coalescence times in the HLA region, and show that they exceed 30 million years in multiple segments.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: April 14, 2024
Abstract
Spatial patterns of genetic relatedness among samples reflect the past movements of their ancestors. Our ability to untangle this history has the potential to improve dramatically given that we can now infer the ultimate description of genetic relatedness, the ancestral recombination graph (ARG). By extending spatial theory previously applied to trees, we generalize the common model of Brownian motion to full ARGs, thereby accounting for correlations in trees along a chromosome while efficiently computing likelihood-based estimates of dispersal rate and ancestor locations, with associated uncertainties. We evaluate this model’s ability to reconstruct spatial histories using individual-based simulations and unfortunately find a clear bias in the estimates of dispersal rate and ancestor locations. We investigate the causes of this bias, pinpointing a discrepancy between the model and the true spatial process at recombination events. This highlights a key hurdle in extending the ubiquitous and analytically-tractable model of Brownian motion from trees to ARGs, which otherwise has the potential to provide an efficient method for spatial inference, with uncertainties, using all the information available in the full ARG.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: Feb. 21, 2024
ABSTRACT
As population genetics data increases in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient, but are not interchangeable with existing data structures used for many population genetic inference methodologies, such as the use of convolutional neural networks (CNNs) applied to population genetic alignments. To better utilize these new data structures we propose and implement a graph convolutional network (GCN) to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a GCN approach and can be used to perform well on these common population genetics inference tasks, with accuracies roughly matching or even exceeding that of a CNN-based method. As tree sequences become more widely used in population genetics research, we foresee developments and optimizations of this work to provide a foundation for population genetics inference moving forward.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2023, Issue: unknown
Published: June 8, 2023
Abstract
Recombination is an ongoing and increasingly important feature of circulating lineages of SARS-CoV-2, challenging how we represent the evolutionary history of this virus and giving rise to new variants of potential public health concern by combining transmission and immune evasion properties of different lineages. Detection of new recombinant strains is challenging, with most methods looking for breaks between sets of mutations that characterise distinct lineages. In addition, many basic approaches fundamental to the study of viral evolution assume that recombination is negligible, in that a single phylogenetic tree can represent the genetic ancestry of the circulating strains. Here we present an initial version of sc2ts, a method to automatically detect recombinants in real time and to cohesively integrate them into a genealogy in the form of an ancestral recombination graph (ARG), which jointly records mutation, recombination and genetic inheritance. We infer two ARGs under different sampling strategies, and study their properties. One contains 1.27 million sequences sampled up to June 30, 2021, and the second is more sparsely sampled, consisting of 657K sequences sampled up to June 30, 2022. We find that both ARGs are highly consistent with known features of SARS-CoV-2 evolution, recovering the basic backbone phylogeny, mutational spectra, and recapitulating details on the majority of known recombinant lineages. Using the well-established and feature-rich tskit library, the ARGs can also be stored concisely and processed efficiently using standard Python tools. For example, the ARG for 1.27 million sequences (encoding the inferred reticulate ancestry, genetic variation, and extensive metadata) requires 58MB of storage, and loads in less than a second. The ability to fully integrate the effects of recombination into downstream analyses, to quickly and automatically detect new recombinants, and to utilise an efficient and convenient platform for computation based on well-engineered technologies makes sc2ts a promising approach.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: March 12, 2024
Inference of demographic and evolutionary parameters from a sample of genome sequences often proceeds by first inferring identical-by-descent (IBD) genome segments. By exploiting efficient data encoding based on the ancestral recombination graph (ARG), we obtain three major advantages over current approaches: (i) no need to impose a length threshold on IBD segments, (ii) IBD can be defined without the hard-to-verify requirement of no recombination, and (iii) computation time can be reduced with little loss of statistical efficiency using only the IBD segments from a set of sequence pairs that scales linearly with sample size. We first demonstrate powerful inferences when true IBD information is available from simulated data. For IBD inferred from real data, we propose an approximate Bayesian computation inference algorithm and use it to show that poorly-inferred short IBD segments can improve estimation precision. We show estimation precision similar to a previously-published estimator despite a 4 000-fold reduction in data used for inference. Computational cost limits model complexity in our approach, but we are able to incorporate unknown nuisance parameters and model misspecification, still finding improved parameter inference.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: March 14, 2024
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer.
Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait).
Contact: [email protected]
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2023, Issue: unknown
Published: July 11, 2023
Abstract
Recent breakthroughs have enabled the inference of genealogies from large sequencing data-sets, accurately reconstructing local trees that describe genetic ancestry at each locus. These genealogies should also capture the correlation structure of local trees along the genome, reflecting historical recombination events and factors like demography and natural selection. However, whether reconstructed genealogies do this has not been rigorously explored. This is important to address, since uncovering regions that depart from expectations can drive the discovery of new biological phenomena. Addressing this is crucial, as regions that deviate from expectations can reveal new phenomena, such as the suppression of recombination allowing linked selection over broad regions, as evidenced in humans and in adaptive introgression in various species. We use a theoretical framework to characterise properties of genealogies, such as the distribution of genomic spans of clades and edges, and demonstrate that our theoretical results match observations in simulated genealogies in various scenarios. Testing reconstructed genealogies using leading approaches, we find departures from theoretical expectations for all methods. For the method Relate, a set of simple corrections results in almost complete recovery of the target distributions. Applying these corrections to Relate genealogies reconstructed for 2504 human genomes, we observe an excess of clades with unexpectedly long genomic spans (125 with p < 1 · 10^−12, clustering into 50 regions), indicating localised suppression of recombination. The strongest signal corresponds to a known inversion on chromosome 17, while the second strongest represents a previously unknown inversion on chromosome 10, which is most common (21%) in S. Asians and correlates with GWAS hits for a range of phenotypes including immunological traits. Other signals suggest additional inversions (4), copy number changes (2), complex rearrangements or other variants (12), as well as 28 with strong support but no clear classification. Our approach can be readily applied to other species, and we show that genealogies offer previously untapped potential to study structural variation and its impacts at a population level, revealing new phenomena impacting evolution.