The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics
PLoS Genetics
Journal Year: 2024
Volume and Issue: 20(1), P. e1011110
Published: Jan. 18, 2024
In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.
Language: English
A geographic history of human genetic ancestry
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2024
Volume and Issue: unknown
Published: March 29, 2024
Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. In humans, traditional approaches for describing genetic variation often rely on discrete ancestry labels, which, despite their utility, can obscure the complex, multi-faceted nature of human history. These labels risk oversimplifying ancestry by ignoring its temporal depth and geographic continuity, and may therefore conflate notions of race, ethnicity, geography, and ancestry. Here, we present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.
Language: English
A geographic history of human genetic ancestry
Science
Journal Year: 2025
Volume and Issue: 387(6741), P. 1391-1397
Published: March 27, 2025
Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. We present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We used this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatiotemporal context of genetic ancestry when describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.
Language: English
A general and efficient representation of ancestral recombination graphs
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2023
Volume and Issue: unknown
Published: Nov. 4, 2023
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Language: English
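To make the idea in the abstract above concrete (an ARG defined by specific genomes and their intervals of inheritance), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's formalism or any library's API: the `Edge` class, the helper functions, and the toy data are all hypothetical. Nodes are integer genome IDs, and each edge says that a child genome inherits from a parent genome over the half-open interval `[left, right)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical record: `child` inherits from `parent` on [left, right)."""
    left: float
    right: float
    parent: int
    child: int


def local_tree(edges, position):
    """Return the {child: parent} map of the tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def mrca(tree, a, b):
    """Most recent common ancestor of a and b in one local tree, or None."""
    seen = {a}
    node = a
    while node in tree:          # walk from a up to the root
        node = tree[node]
        seen.add(node)
    node = b
    while node not in seen:      # walk from b until we meet a's lineage
        if node not in tree:
            return None
        node = tree[node]
    return node


# Toy ARG (assumed data): samples 0, 1, 2 and ancestors 3, 4, 5 on a genome
# of length 10. A recombination at position 5 changes the local tree.
EDGES = [
    Edge(0, 5, 3, 0), Edge(0, 5, 3, 1), Edge(0, 5, 5, 3), Edge(0, 5, 5, 2),
    Edge(5, 10, 4, 1), Edge(5, 10, 4, 2), Edge(5, 10, 5, 0), Edge(5, 10, 5, 4),
]

left_tree = local_tree(EDGES, 2.0)
right_tree = local_tree(EDGES, 7.0)
```

In this toy example, samples 0 and 1 share ancestor 3 to the left of the breakpoint but only meet at ancestor 5 to the right of it, which is the sense in which genealogical trees "vary along the genome".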
Enabling efficient analysis of biobank-scale data with genotype representation graphs
Drew DeHaas, Ziqing Pan, Xinzhu Wei, et al.
Nature Computational Science
Journal Year: 2024
Volume and Issue: unknown
Published: Dec. 5, 2024
Language: English
Tsbrowse: an interactive browser for Ancestral Recombination Graphs
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2025
Volume and Issue: unknown
Published: April 23, 2025
Abstract
Ancestral Recombination Graphs (ARGs) represent the interwoven paths of genetic ancestry for a set of recombining sequences. The ability to capture the evolutionary history of samples makes ARGs valuable in a wide range of applications in population and statistical genetics. ARG-based approaches are increasingly becoming part of genetic data analysis pipelines due to breakthroughs enabling ARG inference at biobank-scale. However, there is a lack of visualisation tools, which are crucial for validating inferences and generating hypotheses. We present tsbrowse, an open-source Python web-app for the interactive visualisation of the fundamental building-blocks of ARGs, i.e., nodes, edges and mutations. We demonstrate the application of tsbrowse to various data sources and scenarios, and highlight its key features of browsability along the genome, user interactivity, and scalability to very large sample sizes.
Availability
Python package: https://pypi.org/project/tsbrowse/
Development version: https://github.com/tskit.dev/tsbrowse
Documentation: https://tskit.dev/tsbrowse/docs/
Language: English
The length of haplotype blocks and signals of structural variation in reconstructed genealogies
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2023
Volume and Issue: unknown
Published: July 11, 2023
Abstract
Recent breakthroughs have enabled the inference of genealogies from large sequencing data-sets, accurately reconstructing local trees that describe genetic ancestry at each locus. These genealogies should also capture the correlation structure of local trees along the genome, reflecting historical recombination events and factors like demography and natural selection. However, whether reconstructed genealogies do this has not been rigorously explored. This is important to address, since uncovering regions that depart from expectations can drive the discovery of new biological phenomena, such as the suppression of recombination allowing linked selection over broad regions, as evidenced in humans and in adaptive introgression in various species. We use a theoretical framework to characterise the properties of genealogies, such as the distribution of the genomic spans of clades and edges, and demonstrate that our results match observations in simulated scenarios. Testing reconstructed genealogies using leading approaches, we find departures from expectations for all methods. For the method Relate, a set of simple corrections results in almost complete recovery of the target distributions. Applying these corrections to Relate genealogies of 2504 human genomes, we observe an excess of clades with unexpectedly long genomic spans (125 with p < 1 · 10⁻¹², clustering into 50 regions), indicating localised suppression of recombination. The strongest signal corresponds to a known inversion on chromosome 17, while the second strongest represents a previously unknown inversion on chromosome 10, which is most common (21%) in S. Asians and correlates with GWAS hits for a range of phenotypes including immunological traits. Other signals suggest additional inversions (4), copy number changes (2), complex rearrangements or other variants (12), as well as 28 regions with strong support but no clear classification. Our approach can be readily applied to other species, and we show that genealogies offer untapped potential to study structural variation and its impacts at the population level, revealing new phenomena impacting evolution.
Language: English
Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data
Drew DeHaas, Ziqing Pan, Xinzhu Wei, et al.
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2024
Volume and Issue: unknown
Published: April 28, 2024
Abstract
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), which is cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly represent phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files, which scales to a million whole genomes. It takes 160GB of disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on a GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
Language: English
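The abstract above says that graph-traversal algorithms on a multitree can "trivially reuse computed values". The following is a minimal sketch of that idea in plain Python, assuming a toy graph in which mutation nodes point down, through shared internal nodes, to sample leaves. It is not the GRG library's API or file format; the `children` mapping, the node names, and `samples_below` are hypothetical. Because a multitree has at most one path between any pair of nodes, the number of samples below a node is simply the sum over its children, and each internal node's count is computed once and cached.

```python
def samples_below(node, children, memo):
    """Count sample leaves reachable from `node`, caching internal nodes."""
    if node not in children:          # a leaf is one sample haplotype
        return 1
    if node not in memo:
        memo[node] = sum(samples_below(c, children, memo) for c in children[node])
    return memo[node]


# Toy graph (assumed data): 4 sample haplotypes 0..3, one shared internal
# node "a", and two mutation nodes "m1" and "m2".
children = {
    "m1": ["a", 3],   # m1 is carried by everything under "a", plus sample 3
    "m2": ["a"],      # m2 is carried only by samples under "a"
    "a": [0, 1],
}
n_samples = 4

memo = {}
allele_freq = {
    m: samples_below(m, children, memo) / n_samples for m in ("m1", "m2")
}
```

Here the count for the shared node "a" is computed while processing "m1" and then reused for "m2", which is the kind of saving the abstract attributes to the hierarchical structure.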
Analysis-ready VCF at Biobank scale using Zarr
Eric Czech, Timothy R. Millar, Will Tyler, et al.
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2024
Volume and Issue: unknown
Published: June 12, 2024
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
Key Points: VCF is widely supported, and the underlying data model entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.
Language: English
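The contrast drawn in the abstract above, row-wise records versus fields stored separately in chunk-compressed binary form, can be sketched with the Python standard library alone. This is only an illustration of the storage idea: it does not use the Zarr library and does not follow the VCF Zarr specification's actual array names, chunk shapes, or codecs. The field names (`pos`, `qual`, `genotypes`), the helpers, and the toy records are assumptions.

```python
import zlib
from array import array


def to_chunks(values, typecode, chunk_size):
    """Pack values into fixed-size binary chunks and compress each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        raw = array(typecode, values[start:start + chunk_size]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks


def read_chunks(chunks, typecode):
    """Decompress and concatenate the chunks of a single field."""
    out = array(typecode)
    for blob in chunks:
        out.frombytes(zlib.decompress(blob))
    return list(out)


# Row-wise view (assumed toy data): one record per variant, all fields together.
rows = [
    {"pos": 101, "qual": 50, "genotypes": [0, 1, 1, 0]},
    {"pos": 205, "qual": 30, "genotypes": [1, 1, 1, 0]},
    {"pos": 330, "qual": 60, "genotypes": [0, 0, 0, 1]},
]
n_samples = 4

# Field-wise view: each field is its own list of compressed chunks.
store = {
    "pos": to_chunks([r["pos"] for r in rows], "i", 2),
    "qual": to_chunks([r["qual"] for r in rows], "i", 2),
    "genotypes": to_chunks([g for r in rows for g in r["genotypes"]], "b", 8),
}

# Reading one field touches only that field's chunks.
quals = read_chunks(store["qual"], "i")
flat = read_chunks(store["genotypes"], "b")
allele_counts = [sum(flat[i:i + n_samples]) for i in range(0, len(flat), n_samples)]
```

Computing a per-field summary such as the list of quality values never decompresses the genotype chunks, whereas a row-wise layout would require scanning every record in full.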
Compressive Pangenomics Using Mutation-Annotated Networks
bioRxiv (Cold Spring Harbor Laboratory)
Journal Year: 2024
Volume and Issue: unknown
Published: July 4, 2024
Abstract
Pangenomics is an emerging field that uses a collection of genomes of a species instead of a single reference genome to overcome reference bias and study the within-species genetic diversity. Future pangenomics applications will require analyzing large and ever-growing collections of genomes. Therefore, the choice of data representation is a key determinant of the scope, as well as the computational and memory performance, of pangenomic analyses. Current pangenome formats, while capable of storing genetic variations across multiple genomes, fail to capture the shared evolutionary and mutational histories among them, thereby limiting their applications. They are also inefficient for storage, and therefore face significant scaling challenges. In this manuscript, we propose PanMAN, a novel data structure that is information-wise richer than all existing pangenome formats: in addition to representing the alignment and genetic variation in a collection of genomes, PanMAN represents the shared mutational and evolutionary histories inferred between those genomes. By using “evolutionary compression”, PanMAN achieves 5.2 to 680-fold compression over other variation-preserving pangenomic formats. PanMAN’s relative performance generally improves with larger datasets and it is compatible with any method for inferring phylogenies and ancestral nucleotide states. Using SARS-CoV-2 as a case study, we show that PanMAN offers a detailed and accurate portrayal of the pathogen’s evolutionary and mutational history, facilitating the discovery of new biological insights. We also present panmanUtils, a software toolkit that supports common pangenomic analyses and makes PanMANs interoperable with existing tools and formats. PanMANs are poised to enhance the scale, speed, resolution, and overall scope of pangenomic analyses and data sharing.
Language: English