Biologically inspired graphs to explore massive genetic datasets
Nature Computational Science,
Journal Year:
2025,
Volume and Issue:
unknown
Published: Jan. 31, 2025
Language: English
IGD: A simple, efficient genotype data format
Drew DeHaas,
Xinzhu Wei
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: Feb. 8, 2025
Abstract
Motivation
While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable bioinformatics.
Results
We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz on Biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.
Availability
A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd
Language: English
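To illustrate what an indexable, uncompressed binary genotype layout can look like, here is a minimal sketch. The layout, the `TOY1` magic value, and the function names `write_toy` and `read_variant` are hypothetical and chosen only for illustration; this is not the IGD specification and not the pyigd or picovcf API. Each variant is stored as a list of sample indexes carrying the alternate allele, and a fixed-width offset index allows seeking straight to any variant.

```python
import io
import struct

# Hypothetical toy layout (NOT the IGD specification):
#   header : magic (4 bytes), num_variants (uint32), num_samples (uint32)
#   index  : num_variants x uint64 byte offsets into the data section
#   data   : per variant, a uint32 count followed by `count` uint32 sample indexes
MAGIC = b"TOY1"
HEADER = struct.Struct("<4sII")


def write_toy(stream, num_samples, variants):
    """Write each variant as a sorted list of sample indexes carrying the alt allele."""
    stream.write(HEADER.pack(MAGIC, len(variants), num_samples))
    index_pos = stream.tell()
    stream.write(b"\x00" * 8 * len(variants))  # reserve space for the index
    offsets = []
    for samples in variants:
        offsets.append(stream.tell())
        stream.write(struct.pack("<I", len(samples)))
        stream.write(struct.pack(f"<{len(samples)}I", *sorted(samples)))
    # Go back and fill in the index now that all offsets are known.
    stream.seek(index_pos)
    stream.write(struct.pack(f"<{len(offsets)}Q", *offsets))
    stream.seek(0, io.SEEK_END)


def read_variant(stream, i):
    """Seek directly to variant i via the index; no other variant is read."""
    stream.seek(0)
    magic, num_variants, _ = HEADER.unpack(stream.read(HEADER.size))
    if magic != MAGIC or not 0 <= i < num_variants:
        raise ValueError("bad file or variant index")
    stream.seek(HEADER.size + 8 * i)
    (offset,) = struct.unpack("<Q", stream.read(8))
    stream.seek(offset)
    (count,) = struct.unpack("<I", stream.read(4))
    return list(struct.unpack(f"<{count}I", stream.read(4 * count)))


buf = io.BytesIO()
write_toy(buf, num_samples=6, variants=[[0, 3], [], [5, 1, 2]])
print(read_variant(buf, 2))  # [1, 2, 5]
```

Because nothing is compressed and every offset is stored up front, random access costs two seeks and two small reads, which is one plausible reason a format of this kind can be both simple and fast.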
On ARGs, pedigrees, and genetic relatedness matrices
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: March 5, 2025
Abstract
Genetic relatedness is a central concept in genetics, underpinning studies of population and quantitative genetics in human, animal, and plant settings. It is typically stored as a genetic relatedness matrix (GRM), whose elements are pairwise relatedness values between individuals. This relatedness has been defined in various contexts based on pedigree, genotype, phylogeny, coalescent times, and, recently, ancestral recombination graph (ARG). ARG-based GRMs have been found to better capture the population structure and improve association studies relative to the genotype GRM. However, calculating ARG-based GRMs and further operations with them is fundamentally challenging due to inherent quadratic time and space complexity. Here, we first discuss the different definitions of relatedness in a unifying context, making use of the additive model of a quantitative trait to provide a definition of “branch relatedness” and the corresponding “branch GRM”. We explore the relationship between branch relatedness and pedigree relatedness through a case study of French-Canadian individuals that have a known pedigree. Through the tree sequence encoding of an ARG, we then derive an efficient algorithm for computing products between the branch GRM and a general vector, without explicitly forming the branch GRM. This algorithm leverages the sparse encoding of genomes with the tree sequence and hence enables large-scale computations with the branch GRM. We demonstrate the power of this algorithm by developing a randomized principal components algorithm for tree sequences that easily scales to millions of genomes. All algorithms are implemented in the open source tskit Python package. Taken together, this work consolidates the different notions of relatedness as branch relatedness, and by leveraging the tree sequence encoding of an ARG it provides efficient algorithms that enable computations with the branch GRM that scale to mega-scale genomic datasets.
Language: English
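The idea of multiplying a GRM by a vector without ever forming the matrix, and then building randomized principal components on top of that product, can be sketched as follows. This is an illustration under stated assumptions: it uses a dense, column-centred genotype matrix `Z` and the genotype GRM `Z Z^T / m` as a stand-in, whereas the abstract describes a branch GRM computed from a tree sequence. The function names `grm_matvec` and `randomized_pca` are hypothetical and are not the tskit API; the PCA step is a generic randomized subspace iteration.

```python
import numpy as np


def grm_matvec(Z, v):
    """Return (Z @ Z.T / m) @ v without forming the n x n matrix Z @ Z.T.

    Z is an n x m column-centred genotype matrix (stand-in assumption);
    v may be a vector of length n or an n x p block of vectors.
    """
    m = Z.shape[1]
    return Z @ (Z.T @ v) / m


def randomized_pca(Z, k, oversample=5, n_iter=4, seed=0):
    """Approximate top-k eigenpairs of the GRM using only grm_matvec."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    Q = rng.standard_normal((n, k + oversample))
    # Subspace iteration: repeatedly apply the GRM and re-orthonormalise.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(grm_matvec(Z, Q))
    # Project the GRM onto the small subspace spanned by Q.
    B = Q.T @ grm_matvec(Z, Q)
    evals, evecs = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], Q @ evecs[:, order]


rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(30, 4)).astype(float)  # toy 0/1/2 genotypes
Z = G - G.mean(axis=0)
evals, evecs = randomized_pca(Z, k=2)
print(evecs.shape)  # (30, 2)
```

Each product costs time proportional to the size of `Z` rather than to n squared, which is the property that makes matrix-free PCA attractive when the number of genomes is large.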
Analysis-ready VCF at Biobank scale using Zarr
Eric Czech,
Timothy R. Millar,
Will Tyler
et al.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: June 12, 2024
Abstract
Background
Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
Results
Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
Conclusions
Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
Key Points
VCF is widely supported, and the underlying data model entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.
Language: English
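The contrast drawn above between row-wise storage and separately stored, chunk-compressed fields can be made concrete with a small sketch. It uses only the Python standard library and a plain dictionary as the chunk store; the function names `store_chunked` and `read_sample` and the chunk-key scheme are hypothetical, and this is neither the Zarr library nor the VCF Zarr specification. The point it illustrates is that reading one sample's calls only requires decompressing the chunks in that sample's column block.

```python
import zlib
from array import array


def store_chunked(matrix, chunk_rows, chunk_cols):
    """Split a variants x samples matrix of small ints into compressed chunks."""
    store = {}
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r0 in range(0, n_rows, chunk_rows):
        for c0 in range(0, n_cols, chunk_cols):
            block = array("b")  # signed 8-bit values, row-major within the chunk
            for row in matrix[r0:r0 + chunk_rows]:
                block.extend(row[c0:c0 + chunk_cols])
            key = (r0 // chunk_rows, c0 // chunk_cols)
            store[key] = zlib.compress(block.tobytes())
    return store


def read_sample(store, sample, n_rows, n_cols, chunk_rows, chunk_cols):
    """Read one sample's calls, decompressing only chunks in its column block."""
    cj = sample // chunk_cols
    width = min(chunk_cols, n_cols - cj * chunk_cols)  # last block may be narrower
    local = sample - cj * chunk_cols
    out = []
    for ci in range((n_rows + chunk_rows - 1) // chunk_rows):
        block = array("b")
        block.frombytes(zlib.decompress(store[(ci, cj)]))
        out.extend(block[local::width])  # every `width`-th value is this sample
    return out


# Toy genotype field: 5 variants x 7 samples, chunks of 2 variants x 3 samples.
matrix = [[(r * 7 + c) % 3 for c in range(7)] for r in range(5)]
store = store_chunked(matrix, 2, 3)
print(read_sample(store, 4, 5, 7, 2, 3) == [row[4] for row in matrix])  # True
```

In a row-wise layout the same query would have to decode every record in full; here only three of the nine chunks are touched.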