Species delimitation is the process of distinguishing between populations of the same species and distinct species of a particular group of organisms. Various methods exist for inferring species limits, most of them rooted in coalescent theory. Their primary goal is to identify independently evolving lineages that should represent separate species. Coalescent models have improved species delimitation by enabling explicit testing of hypotheses regarding evolutionary independence among lineages. However, they have some limitations, especially in complex scenarios, with large datasets, and with varying genetic data types. In this context, machine learning (ML) can be considered a promising analytical tool that provides an effective way to explore dataset structures when species-level divergences are hypothesised. In this review, we examine the use of ML in species delimitation and provide an overview and critical appraisal of existing workflows. We also provide simple explanations of how the main types of ML approaches operate, which should help researchers and students interested in the field. While current ML methods designed to infer species limits are analytically powerful, they also present specific limitations and are not definitive alternatives to traditional coalescent methods for species delimitation. For instance, there are clear limitations in the utilisation of simulated data, in supervised and deep learning approaches, and in the type of data representation used in each approach. We then discuss the strengths and weaknesses of ML pipelines, propose best practices for species delimitation, and offer insights into potential future applications. Generative adversarial networks and domain adaptation techniques, for instance, could partially address the misspecification issue related to simulating data. Besides, integrating ML into the species hypothesis process, alongside available coalescent-based methods, could enable a more comprehensive exploration of parameters, improving the accuracy and biological interpretability of analyses. Additionally, we suggest guidelines for enhancing the accessibility, effectiveness, and objectivity of species delimitation processes, aiming to provide a transformative perspective on the subject.
Mathematics, Year: 2023, Issue: 11(14), p. 3055. Published: July 10, 2023
The evolving field of generative artificial intelligence (GenAI), particularly generative deep learning, is revolutionizing a host of scientific and technological sectors. One of the pivotal innovations within this domain is the emergence of generative adversarial networks (GANs). These models have shown remarkable capabilities in crafting synthetic data that closely emulate real-world distributions. Notably, their application to gene expression data systems is a fascinating and rapidly growing focus area. Restrictions related to ethical and logistical issues often limit the size, diversity, and data-gathering speed of gene expression data. Herein lies the potential of GANs, as they are capable of producing synthetic gene expression data, offering a potential solution to these limitations. This review provides a thorough analysis of the most recent advancements at this crossroads of GANs and gene expression data, specifically during the period from 2019 to 2023. In the context of the fast-paced progress in deep learning technologies, accurate and inclusive reviews of current practices are critical to guiding subsequent research efforts, sharing knowledge, and catalyzing the continual growth of the discipline. This review, through highlighting recent studies and seminal works, serves as a key resource for academics and professionals alike, aiding their journey through the confluence of GANs and gene expression data systems.
The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics.
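The generator and discriminator roles described in this abstract can be illustrated with a deliberately small toy. This is a sketch under stated assumptions, not the authors' method: the "evolutionary model" is a Jukes-Cantor (JC69) pairwise distance model, the neural discriminator is replaced by a single-threshold classifier, and the heuristic search is replaced by a scan over a handful of candidate branch lengths. All function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_jc69(branch_length, n_alignments, n_sites, rng):
    """Generator: fraction of differing sites per pairwise alignment under JC69."""
    p_diff = 0.75 * (1.0 - np.exp(-4.0 * branch_length / 3.0))
    return rng.binomial(n_sites, p_diff, size=n_alignments) / n_sites

def discriminator_accuracy(real, fake):
    """Toy discriminator: accuracy of the best single threshold on one feature."""
    values = np.concatenate([real, fake])
    labels = np.concatenate([np.ones(real.size), np.zeros(fake.size)])
    best = 0.5
    for threshold in np.unique(values):
        predicted = (values >= threshold).astype(float)
        accuracy = np.mean(predicted == labels)
        # Either side of the threshold may be the "real" class.
        best = max(best, accuracy, 1.0 - accuracy)
    return best

def infer_branch_length(real, candidates, n_sites, rng):
    """Pick the candidate whose simulated data the discriminator separates worst."""
    scores = {}
    for candidate in candidates:
        fake = simulate_jc69(candidate, real.size, n_sites, rng)
        scores[candidate] = discriminator_accuracy(real, fake)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
# "Real" data are simulated here with a known branch length of 0.2 (an assumption).
real = simulate_jc69(0.2, n_alignments=200, n_sites=500, rng=rng)
candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
estimate, scores = infer_branch_length(real, candidates, n_sites=500, rng=rng)
print(estimate, scores)
```

The candidate whose simulations are hardest to tell apart from the real data is taken as the inferred model. A real application would train a neural discriminator and propose new models iteratively rather than scanning a fixed list.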
Abstract
Understanding natural selection and other forms of non-neutrality is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection and other local evolutionary processes that requires relatively few selection simulations during training. We build upon a generative adversarial network trained to simulate realistic neutral data. This consists of a generator (fitted demographic model), and a discriminator (convolutional neural network) that predicts whether a genomic region is real or fake. As the generator can only generate data under neutral demographic processes, regions that the discriminator recognizes as having a high probability of being “real” do not fit the neutral demographic model and are therefore candidates for targets of selection. To incentivize identification of a specific mode of selection, we fine-tune the discriminator with a small number of custom non-neutral simulations. We show that this approach has high power to detect various forms of selection in simulations, and that it finds regions under positive selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics.
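The interpretation step mentioned at the end of this abstract, relating hidden units to known summary statistics, can be sketched as follows. This is a simplified illustration under assumptions: the activations and statistics are synthetic, and each unit is simply assigned to the statistic it correlates with most strongly, which stands in for whatever clustering procedure the authors actually used. Function names are hypothetical.

```python
import numpy as np

def correlate_units_with_stats(activations, stats):
    """Pearson correlation of every hidden unit (column) with every statistic (column)."""
    a = (activations - activations.mean(axis=0)) / activations.std(axis=0)
    s = (stats - stats.mean(axis=0)) / stats.std(axis=0)
    return a.T @ s / activations.shape[0]

def group_units_by_best_stat(corr):
    """Assign each unit to the statistic with the largest absolute correlation."""
    return np.argmax(np.abs(corr), axis=1)

rng = np.random.default_rng(1)
n_regions = 500
# Hypothetical summary statistics, one row per genomic region.
stats = rng.normal(size=(n_regions, 3))
# Synthetic hidden-unit activations that track one statistic each, plus noise.
activations = np.column_stack([
    stats[:, 0] + 0.1 * rng.normal(size=n_regions),
    -stats[:, 2] + 0.1 * rng.normal(size=n_regions),
    stats[:, 1] + 0.1 * rng.normal(size=n_regions),
    stats[:, 0] + 0.1 * rng.normal(size=n_regions),
])
corr = correlate_units_with_stats(activations, stats)
groups = group_units_by_best_stat(corr)
print(groups)
```

Units 0 and 3 fall in the same group because both track the first statistic, and unit 1 is grouped with the third statistic despite its negative correlation, since the grouping uses absolute values.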
bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown. Published: Feb. 21, 2024
Abstract
As population genetics data increases in size, new methods have been developed to store genetic information in efficient ways, such as tree sequences. These data structures are computationally and storage efficient, but are not interchangeable with existing data structures used for many population genetic inference methodologies, such as the use of convolutional neural networks (CNNs) applied to population genetic alignments. To better utilize these new data structures, we propose and implement a graph convolutional network (GCN) to directly learn from tree sequence topology and node data, allowing for the use of neural network applications without an intermediate step of converting tree sequences to population genetic alignment format. We then compare our approach to standard CNN approaches on a set of previously defined benchmarking tasks including recombination rate estimation, positive selection detection, introgression detection, and demographic model parameter inference. We show that tree sequences can be directly learned from using a GCN approach and perform well on these common population genetics inference tasks, with accuracies roughly matching or even exceeding that of a CNN-based method. As tree sequences become more widely used in population genetics research, we foresee developments and optimizations of this work to provide a foundation for population genetics inference moving forward.
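To make the idea of learning from tree topology and node data concrete, the following is a minimal sketch of a single graph convolution applied to one small tree. It assumes the widely used symmetric normalisation with self-loops; the architecture in the paper may differ. The toy tree, the node-age feature, and the weight matrix are all hypothetical.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalised = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(normalised @ features @ weights, 0.0)

# Toy tree with samples 0, 1, 3, internal node 2 and root 4, as (parent, child).
edges = [(2, 0), (2, 1), (4, 2), (4, 3)]
n_nodes = 5
adjacency = np.zeros((n_nodes, n_nodes))
for parent, child in edges:
    adjacency[parent, child] = adjacency[child, parent] = 1.0

# Node data: a single feature per node, here the node age.
node_times = np.array([[0.0], [0.0], [1.0], [0.0], [2.0]])
weights = np.ones((1, 2))  # fixed weights for illustration; normally learned
hidden = gcn_layer(adjacency, node_times, weights)
print(hidden)
```

Each output row mixes a node's own feature with those of its parent and children, so the representation is built from the tree itself rather than from an alignment matrix.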
PLoS Computational Biology, Year: 2023, Issue: 19(10), p. e1011584. Published: Oct. 30, 2023
Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a convolutional Wasserstein GAN (WGAN) model along with a conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create artificial genomes which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown. Published: Aug. 7, 2024
Abstract
Synthetic data generation via generative modeling has recently become a prominent research field in genomics, with applications ranging from functional sequence design to high-quality, privacy-preserving artificial in silico genomes. Following a body of work on Artificial Genomes (AGs) created via various generative models trained with raw genomic input, we propose a conceptually different approach to address the issues of scalability and complexity of genomic data generation in very high dimensions. Our method combines dimensionality reduction, achieved by Principal Component Analysis (PCA), and a Generative Adversarial Network (GAN) learning in this reduced space. Using this framework, we generated proxy datasets for diverse human populations around the world. We compared the quality of AGs generated by our approach with established models and report improvements in capturing population structure, linkage disequilibrium, and metrics related to privacy leakage. Furthermore, we developed a frugal model with orders of magnitude fewer parameters and comparable performance to larger models. For quality assessment, we also implemented a new evaluation metric based on information theory to measure local haplotypic diversity, showing that generated AGs yield higher diversity than real genomes. In addition, we addressed the shrinkage issue associated with PCA and generative modeling, examined its relation to the nearest neighbor resemblance metric, and proposed a resolution. Finally, we evaluated the effect of binarization methods on the quality of output AGs.
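The pipeline outlined in this abstract (reduce with PCA, generate in the reduced space, project back, binarize) can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the GAN is replaced by a Gaussian fitted to the principal component scores, the input haplotypes are random toy data, and binarization uses a fixed 0.5 threshold. All names are hypothetical.

```python
import numpy as np

def fit_pca(haplotypes, n_components):
    """Return the per-site mean and the top principal axes (rows)."""
    mean = haplotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(haplotypes - mean, full_matrices=False)
    return mean, vt[:n_components]

def to_pc_space(haplotypes, mean, components):
    """Project haplotypes onto the principal axes."""
    return (haplotypes - mean) @ components.T

def from_pc_space(scores, mean, components, threshold=0.5):
    """Map PC scores back to site space and binarize with a fixed threshold."""
    return ((scores @ components + mean) >= threshold).astype(np.int8)

rng = np.random.default_rng(2)
# Toy data: 100 haplotypes by 50 biallelic sites with random allele frequencies.
frequencies = rng.uniform(0.1, 0.9, size=50)
haplotypes = (rng.random((100, 50)) < frequencies).astype(np.int8)

mean, components = fit_pca(haplotypes, n_components=10)
scores = to_pc_space(haplotypes, mean, components)

# Stand-in for the GAN generator: sample from a Gaussian fitted to the PC scores.
synthetic_scores = rng.multivariate_normal(
    scores.mean(axis=0), np.cov(scores, rowvar=False), size=20
)
synthetic = from_pc_space(synthetic_scores, mean, components)
print(synthetic.shape)
```

Working in ten dimensions instead of fifty is what keeps the generative model small. The choice of binarization rule affects the output, which is the effect the abstract says the authors evaluated.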
bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Issue: unknown. Published: March 8, 2023
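Understanding natural selection in humans and other species is a major focus for the use of machine learning in population genetics. Existing methods rely on computationally intensive simulated training data. Unlike efficient neutral coalescent simulations for demographic inference, realistic simulations of selection typically require slow forward simulations. Because there are many possible modes of selection, a high dimensional parameter space must be explored, with no guarantee that the simulated models are close to the real processes. Mismatches between simulated training data and real test data can lead to incorrect inference. Finally, it is difficult to interpret trained neural networks, leading to a lack of understanding about what features contribute to classification. Here we develop a new approach to detect selection that requires relatively few selection simulations during training. We use a Generative Adversarial Network (GAN) trained to simulate realistic neutral data. The resulting GAN consists of a generator (fitted demographic model) and a discriminator (convolutional neural network). For a genomic region, the discriminator predicts whether it is "real" or "fake" in the sense that it could have been simulated by the generator. As the real data includes regions that experienced selection and the generator cannot produce such regions, regions with a high probability of being real are likely to have experienced selection. To further incentivize this behavior, we "fine-tune" the discriminator with a small number of selection simulations. We show that this approach has high power to detect selection in simulations, and that it finds regions under selection identified by state-of-the-art population genetic methods in three human populations. Finally, we show how to interpret the trained networks by clustering hidden units of the discriminator based on their correlation patterns with known summary statistics. In summary, our approach is a novel, efficient, and powerful way to use machine learning to detect natural selection.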
IEEE Journal on Selected Areas in Information Theory, Year: 2024, Issue: 5, pp. 221-235. Published: Jan. 1, 2024
Local differential privacy is a powerful method for privacy-preserving data collection. In this paper, we develop a framework for training Generative Adversarial Networks (GANs) on differentially privatized data. We show that entropic regularization of optimal transport, a popular regularization method in the literature that has often been leveraged for its computational benefits, enables the generator to learn the raw (unprivatized) data distribution even though it only has access to privatized samples. We prove that at the same time this leads to fast statistical convergence at the parametric rate. This shows that entropic regularization of optimal transport uniquely enables the mitigation of both the effects of privatization noise and the curse of dimensionality in statistical convergence. We provide experimental evidence to support the efficacy of our framework in practice.
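For readers unfamiliar with entropic regularization of optimal transport, the standard Sinkhorn iteration gives a feel for why it is computationally attractive. The sketch below is a generic textbook version on a tiny discrete problem, not the training procedure from the paper; the point locations, weights, and the regularization strength `epsilon` are arbitrary assumptions.

```python
import numpy as np

def sinkhorn(a, b, cost, epsilon, n_iters=500):
    """Entropy-regularized transport plan between histograms a and b."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

# Two small point clouds on the line with uniform weights.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
a = np.full(3, 1.0 / 3.0)
b = np.full(2, 0.5)
cost = (x[:, None] - y[None, :]) ** 2  # squared-distance cost

plan = sinkhorn(a, b, cost, epsilon=0.5)
print(plan)
print(plan.sum(axis=1), plan.sum(axis=0))
```

The returned plan has strictly positive entries and its row and column sums recover `a` and `b`. Smaller values of `epsilon` bring the plan closer to unregularized optimal transport at the cost of slower convergence.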