Test flakiness (non-deterministic behavior of test cases) is an increasingly serious concern in industrial practice. However, there are relatively few research results available that systematically address the analysis and mitigation of this phenomenon. The dominant approach to handling flaky tests is still to detect them and remove them from automated executions. Yet some reports showed that the amount of flakiness is in many cases so high that we should rather start working on approaches that operate in the presence of flaky tests. In this work, we investigate how flakiness affects the effectiveness of Spectrum-Based Fault Localization (SBFL), a popular class of software Fault Localization (FL) techniques, which heavily relies on test case execution outcomes. We performed a simulation-based experiment to find out what the relationship is between the level of flakiness and fault localization effectiveness. Our results could help users of FL methods understand the implications of flakiness in this area and design novel algorithms that take flakiness into account.
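To make the dependence of SBFL on test outcomes concrete, the sketch below computes an Ochiai-style suspiciousness ranking from a toy coverage matrix and then flips a fraction of test outcomes at random to mimic flakiness. The matrix, the outcomes, the flake rate, and the choice of the Ochiai formula are assumptions made for illustration, not the authors' experimental setup.

```python
# Illustrative sketch, not the authors' setup: an Ochiai-style suspiciousness
# ranking over a toy coverage matrix, before and after some test outcomes are
# flipped at random to mimic flakiness.
import random
from math import sqrt

# coverage[t][s] is True if test t executes statement s; outcomes[t] is True if test t passed
coverage = [
    [True,  True,  False, False],
    [True,  False, True,  False],
    [False, True,  True,  True],
    [True,  True,  True,  False],
]
outcomes = [False, True, True, True]

def ochiai(coverage, outcomes):
    """Return one suspiciousness score per statement from pass/fail outcomes."""
    n_statements = len(coverage[0])
    total_failed = sum(1 for passed in outcomes if not passed)
    scores = []
    for s in range(n_statements):
        ef = sum(1 for t, row in enumerate(coverage) if row[s] and not outcomes[t])  # covered by failing tests
        ep = sum(1 for t, row in enumerate(coverage) if row[s] and outcomes[t])      # covered by passing tests
        denominator = sqrt(total_failed * (ef + ep))
        scores.append(ef / denominator if denominator else 0.0)
    return scores

def inject_flakiness(outcomes, flake_rate, seed=0):
    """Flip each test outcome with probability flake_rate to simulate flaky executions."""
    rng = random.Random(seed)
    return [(not o) if rng.random() < flake_rate else o for o in outcomes]

print("clean outcomes:", ochiai(coverage, outcomes))
print("flaky outcomes:", ochiai(coverage, inject_flakiness(outcomes, flake_rate=0.25)))
```

Flipping even a single failing test to a pass removes evidence from the numerator of every statement it covers, which is how flakiness can distort the ranking.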
Code smells represent poor implementation choices performed by developers when enhancing source code. Their negative impact on code maintainability and comprehensibility has been widely shown in the past, and several techniques to automatically detect them have been devised. Most of these techniques are based on heuristics, namely they compute a set of metrics and combine them to create detection rules; while they achieve reasonable accuracy, a recent trend is represented by the use of machine learning, where the metrics are used as predictors of the smelliness of code artefacts. Despite the advances in the field, there is still a noticeable lack of knowledge on whether machine learning can actually be more accurate than traditional heuristic-based approaches. To fill this gap, in this paper we propose a large-scale study to empirically compare the performance of heuristic-based and machine-learning-based techniques for metric-based code smell detection. We consider five code smell types and compare machine learning models with DECOR, a state-of-the-art heuristic-based approach. Key findings emphasize the need for further research aimed at improving the effectiveness of both machine learning and heuristic approaches to code smell detection: while DECOR generally achieves better performance than a machine learning baseline, its precision is still too low to make it usable in practice.
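The heuristic, metric-based style of detection discussed above can be made concrete with a small sketch. The metric names (ATFD, WMC, TCC) are common in the literature, but the rule shape and the thresholds below are assumptions for illustration, not DECOR's actual rule cards.

```python
# Illustrative sketch of a metric-based detection rule in the spirit of the
# heuristic approaches discussed above; thresholds are assumed, not DECOR's.
from dataclasses import dataclass

@dataclass
class ClassMetrics:
    name: str
    atfd: int    # Access To Foreign Data
    wmc: int     # Weighted Methods per Class
    tcc: float   # Tight Class Cohesion, in [0, 1]

def is_god_class(m: ClassMetrics, atfd_min=5, wmc_min=47, tcc_max=0.33) -> bool:
    """Flag a class that uses many foreign attributes, is complex, and has low
    cohesion: a conjunction of three threshold checks over code metrics."""
    return m.atfd > atfd_min and m.wmc >= wmc_min and m.tcc < tcc_max

candidates = [
    ClassMetrics("OrderManager", atfd=12, wmc=61, tcc=0.12),
    ClassMetrics("Money", atfd=1, wmc=9, tcc=0.80),
]
print([m.name for m in candidates if is_god_class(m)])
```

A machine-learning detector replaces the hand-written conjunction and thresholds with a model trained on labelled components that uses the same metrics as predictors.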
ACM Transactions on Software Engineering and Methodology, Journal Year: 2021, Volume and Issue: 31(1), P. 1 - 74. Published: Oct. 26, 2021.
Tests that fail inconsistently, without changes to the code under test, are described as flaky. Flaky tests do not give a clear indication of the presence of software bugs and thus limit the reliability of the test suites that contain them. A recent survey of software developers found that 59% claimed to deal with flaky tests on a monthly, weekly, or daily basis. As well as being detrimental to developers, flaky tests have also been shown to limit the applicability of useful techniques in software testing research. In general, one can think of flaky tests as a threat to the validity of any methodology that assumes the outcome of a test only depends on the source code it covers. In this article, we systematically survey the body of literature relevant to flaky test research, amounting to 76 papers. We split our analysis into four parts: addressing the causes of flaky tests, their costs and consequences, detection strategies, and approaches for their mitigation and repair. Our findings and their implications have consequences for how the software-testing community deals with test flakiness, are pertinent to practitioners, and are of interest to those wanting to familiarize themselves with the research area.
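The assumption that flaky tests violate can be illustrated with a small, hypothetical example: the test below races a background thread against a fixed sleep, so its outcome depends on scheduling and machine load rather than only on the code it covers. The Worker class and the timings are made up for the example.

```python
# Illustrative, made-up example of a flaky test: its outcome depends on thread
# scheduling, not only on the code under test.
import threading
import time
import unittest

class Worker:
    def __init__(self):
        self.result = None

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        time.sleep(0.05)          # simulated work whose real duration varies
        self.result = 42

class FlakyTest(unittest.TestCase):
    def test_worker_result(self):
        w = Worker()
        w.start()
        time.sleep(0.06)          # fragile fixed wait instead of a proper join or condition
        self.assertEqual(w.result, 42)   # fails whenever the worker thread is scheduled late

if __name__ == "__main__":
    unittest.main()
```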
Jupyter notebooks---documents that contain live code, equations, visualizations, and narrative text---are now among the most popular means to compute, present, discuss and disseminate scientific findings. In principle, notebooks should easily allow readers to reproduce and extend the computations and their findings; but in practice, this is not the case. The individual code cells of a notebook can be executed in any order, with identifier usages preceding their definitions and the results of computations depending on that order. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells.
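A minimal sketch of the out-of-order problem described above, with made-up cell contents: the name used in the first cell is only defined in the second, so a top-to-bottom replay fails while the author's original execution order succeeds.

```python
# Illustrative sketch: two notebook "cells" where the identifier usage precedes
# its definition in document order, so the replay order decides reproducibility.
cells = {
    1: "summary = sum(data) / len(data)",   # uses `data` before the cell that defines it
    2: "data = [3, 1, 4, 1, 5]",
}

def run(order):
    namespace = {}
    try:
        for i in order:
            exec(cells[i], namespace)
        return f"order {order}: ok, summary = {namespace['summary']}"
    except NameError as error:
        return f"order {order}: fails ({error})"

print(run([1, 2]))   # top-to-bottom replay: NameError, not reproducible as written
print(run([2, 1]))   # the order the author actually executed: works
```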
Empirical Software Engineering, Journal Year: 2020, Volume and Issue: 25(2), P. 1294 - 1340. Published: Feb. 4, 2020.
When identifying the origin of software bugs, many studies assume that “a bug was introduced by the lines of code that were modified to fix it”. However, this assumption does not always hold and, at least in some cases, these lines are not responsible for introducing the bug; for example, when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and, therefore, to which extent the assumption is valid. To advance in this direction, and to better understand how bugs “are born”, we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model’s criteria by carefully analyzing how 116 bugs were introduced in two different open source projects. The manual analysis helped us classify the root cause of those bugs and create manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that they are not very accurate, especially when multiple commits are found; their F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show that the prevalent assumption, “a bug was introduced by the lines of code that were modified to fix it”, is just one case of how bugs are introduced into a software system. Finding what introduced a bug is not trivial: bugs can be introduced by developers and be in the code, or be created irrespective of it. Thus, further research towards understanding the origin of bugs in software projects could help improve the design of integration tests and other procedures to make software development more robust.
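As background for the evaluation above, the sketch below shows the basic blame step that SZZ-style algorithms build on, under the assumption of a plain git repository: take the lines a bug-fixing commit removed, blame them in the fix's parent, and treat the blamed commits as candidate bug-introducing changes. This is a simplified illustration, not any of the four evaluated algorithms.

```python
# Minimal sketch of the basic SZZ blame step (a simplified assumption, not any
# of the evaluated algorithms). Must be run against an existing git repository.
import re
import subprocess

def deleted_lines(repo, fix_commit):
    """Map each changed file to the line numbers the fix deleted in its parent."""
    diff = subprocess.run(
        ["git", "-C", repo, "diff", "-U0", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True).stdout
    per_file, current = {}, None
    for line in diff.splitlines():
        if line.startswith("--- "):
            current = line[6:] if line.startswith("--- a/") else None
        elif line.startswith("@@") and current:
            # hunk header: @@ -old_start,old_count +new_start,new_count @@
            old_start, old_count = re.match(r"@@ -(\d+)(?:,(\d+))?", line).group(1, 2)
            count = int(old_count) if old_count else 1
            per_file.setdefault(current, []).extend(
                range(int(old_start), int(old_start) + count))
    return per_file

def candidate_bug_introducers(repo, fix_commit):
    """Blame the deleted lines in the fix's parent and collect the blamed commits."""
    candidates = set()
    for path, lines in deleted_lines(repo, fix_commit).items():
        for n in lines:
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "-L", f"{n},{n}",
                 f"{fix_commit}^", "--", path],
                capture_output=True, text=True, check=True).stdout
            if blame:
                candidates.add(blame.split()[0])
    return candidates

# Hypothetical usage (repository path and commit hash are placeholders):
# print(candidate_bug_introducers("/path/to/repo", "abc1234"))
```

Bugs caused by something other than the blamed lines, such as a change in an external API, are exactly the cases this step cannot capture.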
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon. We asked 21 professional developers to classify 200 flaky tests they had previously fixed, in terms of the nature of the flakiness, its origin, and the fixing effort. We complement this analysis with information on the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: (1) flakiness is due to several different causes, four of which have never been reported before, despite being among the most costly to fix; (2) flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and (3) the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause of the flakiness. Data and materials: [https://doi.org/10.5281/zenodo.3265785].
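Since reproduction of the flaky behavior is reported as the main challenge, a minimal rerun harness like the hypothetical sketch below is the usual first step: execute the same test many times on unchanged code and inspect the outcome distribution. The command shown in the usage note is a placeholder, not tooling from the study.

```python
# Hypothetical rerun harness: repeat one test command on unchanged code and
# count how often it passes or fails; a mix of both is the usual operational
# signal of flakiness.
import subprocess
from collections import Counter

def rerun(test_command, times=50):
    """Run test_command repeatedly and return the pass/fail counts."""
    outcomes = Counter()
    for _ in range(times):
        result = subprocess.run(test_command, capture_output=True)
        outcomes["pass" if result.returncode == 0 else "fail"] += 1
    return outcomes

# Hypothetical usage with a pytest test id:
# print(rerun(["pytest", "-q", "tests/test_worker.py::test_worker_result"]))
```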
Code smells can compromise software quality in the long term by inducing technical debt. For this reason, many approaches aimed at identifying these design flaws have been proposed in the last decade. Most of them are based on heuristics in which a set of metrics (e.g., code metrics, process metrics) is used to detect smelly components. However, these techniques suffer from subjective interpretation, low agreement between detectors, and threshold dependability. To overcome these limitations, previous work applied Machine Learning (ML) techniques that learn from previous datasets without needing any threshold definition. However, more recent work has shown that ML is not always suitable for code smell detection due to the highly unbalanced nature of the problem. In this study we investigate several approaches able to mitigate data unbalancing issues to understand their impact on ML-based code smell detection. Our findings highlight a number of limitations and open issues with respect to the usage of data balancing in ML-based code smell detection.
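A small sketch of what data-balancing treatments look like in this setting, under an assumed synthetic dataset rather than the study's real smell data: class weighting and random oversampling applied to a heavily skewed binary "smelly vs. non-smelly" classification task.

```python
# Illustrative sketch (synthetic data, not the study's pipeline): two common
# ways to counter class unbalance when smelly components are a small minority.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "metrics" dataset: 1000 components, roughly 5% labelled smelly.
X = rng.normal(size=(1000, 6))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1] += 1.0                                   # give the minority class some signal
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) Cost-sensitive learning: reweight classes inversely to their frequency.
weighted = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

# 2) Random oversampling: duplicate minority examples until the classes match.
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
oversampled = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

for name, model in [("class_weight", weighted), ("oversampling", oversampled)]:
    print(name, "F1 on the minority class:", round(f1_score(y_te, model.predict(X_te)), 2))
```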