Nature Methods, Year: 2023, Volume (issue): 20(5), pp. 665-672
Published: April 10, 2023
Abstract
The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.
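The simple approach named in this abstract, a logarithm with a pseudo-count followed by principal-component analysis, can be sketched roughly as follows. This is a minimal illustration and not the authors' benchmark code: the function name `log_pca`, the size-factor definition (each cell's total count divided by the mean total) and the default pseudo-count of 1 are assumptions made for the example.

```python
import numpy as np

def log_pca(counts, n_components=2, pseudo_count=1.0):
    """Shifted-log transform followed by PCA on a genes x cells count matrix.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    # Assumed size factor per cell: total count relative to the mean total.
    totals = counts.sum(axis=0)
    size_factors = totals / totals.mean()
    # Adjust for sampling efficiency, then apply log with a pseudo-count.
    transformed = np.log(counts / size_factors + pseudo_count)
    # PCA with cells as observations: center each gene, then take the SVD.
    centered = transformed - transformed.mean(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    # Principal-component scores, one row per cell.
    return u[:, :n_components] * s[:n_components]

# Toy data: 50 genes x 20 cells of Poisson counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(50, 20))
embedding = log_pca(counts, n_components=3)
```

The result has one row per cell and one column per retained component, which is the usual input to clustering or neighbor-graph construction.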

Statistical Science, Year: 2023, Volume (issue): 38(3)
Published: March 22, 2023
The development of John Aitchison's approach to compositional data analysis is followed since his paper read to the Royal Statistical Society in 1982. Aitchison's logratio approach, which was proposed to solve the problematic aspects of working with data with a fixed-sum constraint, is summarized and reappraised. It is maintained that the properties on which this approach was originally built, the main one being subcompositional coherence, are not required to be satisfied exactly: quasi-coherence is sufficient, that is, near enough to being coherent for all practical purposes. This opens up the field to using simpler data transformations, such as power transformations, that permit zero values in the data. The additional property of exact isometry, which was subsequently introduced and not in Aitchison's original conception, imposed the use of isometric logratio transformations, but these are complicated and problematic to interpret, involving ratios of geometric means. If this property is regarded as important in certain analytical contexts, for example, unsupervised learning, it can be relaxed by showing that regular pairwise logratios, as well as the alternative quasi-coherent transformations, can also be quasi-isometric, meaning they are close enough to exact isometry for all practical purposes. It is concluded that the isometric and related logratio transformations such as pivot logratios are not a prerequisite for good practice, although many authors insist on their obligatory use. This conclusion is fully supported here by case studies in geochemistry and in genomics, where the good performance is demonstrated of pairwise logratios, as originally proposed by Aitchison, or Box–Cox power transforms of the original compositions where no zero replacements are necessary.
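To make the contrast between pairwise logratios and Box–Cox power transforms concrete, the sketch below compares the two on a toy composition. It relies only on the standard fact that the Box–Cox transform (x^λ − 1)/λ tends to log x as λ → 0. The helper names (`close`, `pairwise_logratio`, `box_cox`) and the numeric values are hypothetical and not taken from the paper.

```python
import numpy as np

def close(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def pairwise_logratio(x, i, j):
    """Logratio of parts i and j; undefined if either part is zero."""
    return np.log(x[i] / x[j])

def box_cox(x, lam):
    """Box-Cox power transform; finite at zero for lam > 0."""
    x = np.asarray(x, dtype=float)
    return (x ** lam - 1.0) / lam

comp = close([10, 30, 60])
logratio = pairwise_logratio(comp, 0, 1)

# For a small power, the difference of Box-Cox values is close to the logratio.
transformed = box_cox(comp, 0.01)
approx = transformed[0] - transformed[1]

# A composition containing a zero still transforms to finite values.
with_zero = box_cox(close([0, 40, 60]), 0.5)
```

The zero-containing composition needs no replacement step before the power transform, whereas the logratio of a zero part would be undefined.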

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Volume (issue): unknown
Published: March 29, 2023
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a standard approach to investigate molecular differences between cell states. Comparisons of bioinformatics methods for the count matrix transformation (normalization) and differential expression (DE) analysis of these data have already highlighted recommendations for effective between-sample comparisons and visualization. Here, we examine two remaining open questions: (i) What are the best combinations of data transformations and statistical test methods, and (ii) how do pseudo-bulk approaches perform in single-sample designs? We evaluated the performance of 343 DE pipelines (combinations of eight types of count matrix transformations and ten statistical tests) on simulated and real-world data, in terms of precision, sensitivity, and false discovery rate. We confirm superior performance of pseudo-bulk approaches without prior transformation. For within-sample comparisons, we advise the use of three pseudo-replicates, and provide a simple R package DElegate to facilitate application of this approach.

ACM Transactions on Intelligent Systems and Technology, Year: 2024, Volume (issue): 15(3), pp. 1-62
Published: Jan. 26, 2024
Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high dimensional, sparse, and heterogeneous and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations.

Nature Communications, Year: 2024, Volume (issue): 15(1)
Published: April 27, 2024
Abstract
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection of scRNA-seq data without manual time-consuming tuning.
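The idea of uniformizing cell vector lengths with L2 normalization after log normalization can be illustrated as below. This is a rough sketch of that single step only, not the scLENS pipeline or its API. The function names, the cells × genes orientation, and the scale factor of 10,000 are assumptions made for the example.

```python
import numpy as np

def log_normalize(counts, scale=1e4):
    """Library-size scaling followed by log(1 + x), for cells x genes counts."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * scale)

def l2_normalize_cells(x):
    """Divide each cell's expression vector by its Euclidean (L2) norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

# Toy data: 5 cells x 200 genes of sparse Poisson counts.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=1.0, size=(5, 200))
logged = log_normalize(counts)
unit = l2_normalize_cells(logged)

# Before the L2 step cell vectors differ in length; afterwards all have length 1.
lengths_before = np.linalg.norm(logged, axis=1)
lengths_after = np.linalg.norm(unit, axis=1)
```

Comparing `lengths_before` with `lengths_after` shows the effect: the log-normalized vectors generally differ in length from cell to cell, while the L2-normalized ones all have unit length.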

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Volume (issue): unknown
Published: July 25, 2023
Double dipping is a well-known pitfall in single-cell and spatial transcriptomics data analysis: after a clustering algorithm finds clusters as putative cell types or spatial domains, statistical tests are applied to the same data to identify differentially expressed (DE) genes as potential cell-type or spatial-domain markers. Because the genes that contribute to clustering are inherently likely to be identified as DE genes, double dipping can result in false-positive cell-type or spatial-domain markers, especially when clusters are spurious, leading to ambiguously defined cell types or spatial domains. To address this challenge, we propose ClusterDE, a statistical method designed to identify post-clustering DE genes as reliable markers of cell types and spatial domains, while controlling the false discovery rate (FDR) regardless of clustering quality. The core of ClusterDE involves generating synthetic null data as an in silico negative control that contains only one cell type or spatial domain, allowing for the detection and removal of spurious discoveries caused by double dipping. We demonstrate that ClusterDE controls the FDR and identifies canonical cell-type and spatial-domain markers as top DE genes, distinguishing them from housekeeping genes. ClusterDE's ability to discover reliable markers, or the absence of such markers, can be used to determine whether two ambiguous clusters should be merged. Additionally, ClusterDE is compatible with state-of-the-art analysis pipelines like Seurat and Scanpy.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume (issue): unknown
Published: Aug. 26, 2024
Abstract
Background
Selecting highly variable features is a crucial step in most analysis pipelines of single-cell RNA-sequencing (scRNA-seq) data. Despite numerous methods proposed in recent years, a systematic understanding of the best solution is still lacking.
Results
Here, we systematically evaluate 47 highly variable gene (HVG) selection methods, consisting of 21 baseline methods developed based on different data transformations and mean-variance adjustment techniques and 26 hybrid methods developed based on mixtures of baseline methods. Across 19 diverse benchmark datasets, 18 objective evaluation criteria per method, and 5,358 analysis settings, we observe that no single baseline method consistently outperforms the others across all datasets and criteria. However, hybrid methods as a group robustly outperform individual baseline methods. Based on these findings, we propose a new HVG selection approach, mixture HVG selection (mixHVG), that incorporates top-ranked features from multiple baseline methods, as a better solution to HVG selection. An open source R package mixhvg is developed to enable convenient use of mixHVG and its integration into users’ data analysis pipelines.
Conclusion
Our study not only provides a systematic comparison of existing methods, leading to a better HVG selection solution, but also creates a benchmark pipeline and resource for evaluating new methods in the future.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2021, Volume (issue): unknown
Published: Nov. 19, 2021
Abstract
Single cell RNAseq (scRNAseq) batches range from technical-replicates to multi-tissue atlases, thus requiring robust batch-correction methods that operate effectively across this spectrum of between-batch similarity. Commonly employed benchmarks quantify removal of batch effects and preservation of within-batch variation; the preservation of biologically meaningful differences between batches has been under-researched. Here, we address these gaps, quantifying batch effects at the level of cluster composition and along overlapping topologies through the introduction of two new measures. We discovered that standard approaches of scRNAseq batch-correction erase cell-type and cell-state variation in real-world biological datasets, single cell gene expression atlases, and in silico experiments. We highlight through examples showing that these issues may create the artefactual appearance of external validation/replication of findings. Our results demonstrate that either biological effects, if known, must be balanced between batches (like bulk-techniques), or technical effects that vary between batches must be explicitly modeled to prevent erasure of biological variation by unsupervised batch correction approaches.