2022 IEEE Symposium on Computers and Communications (ISCC),
Journal Year: 2023, Volume and Issue: unknown, P. 245 - 251
Published: July 9, 2023
Flaky tests can pose a challenge for software development, as they produce inconsistent results even when there are no changes to the code or test. This leads to unreliable outcomes and makes it difficult to diagnose and troubleshoot any issues. In this study, we aim to identify flaky test cases in development using a black-box approach. Test smells are indicators of poor quality that can cause issues in development. Our proposed model, Fast-Flaky, achieved the best cross-validation results. In per-project validation, it showed an overall increase in accuracy but decreased performance on other metrics. However, there were some projects where results improved with pre-processing techniques. These findings provide practitioners with a method for identifying flaky tests and may inspire further research on the effectiveness of different techniques and the use of additional test smells.
Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping test suites flakiness-free. Recently, the research community has been pushing towards the detection of flaky tests by suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even with the high performances reported, it remains challenging to understand the cause of flakiness. This part is crucial for researchers and developers that aim to fix it. To help the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We evaluate it on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more difficult to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights the statements influencing the categorization.
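FlakyCat's actual pipeline (CodeBERT embeddings plus Siamese networks) is beyond a short sketch, but the underlying idea — assigning a test to the root-cause category whose known examples lie closest in an embedding space — can be illustrated with plain cosine similarity over stand-in vectors. All vectors and category names below are illustrative, not the paper's data:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def prototype(vectors):
    """Mean embedding of a category's known flaky tests."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(embedding, prototypes):
    """Assign the category whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(embedding, prototypes[c]))

# Toy 3-d stand-ins for CodeBERT embeddings (illustrative only).
known = {
    "async-wait": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "concurrency": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
prototypes = {cat: prototype(vs) for cat, vs in known.items()}
print(classify([0.85, 0.15, 0.05], prototypes))  # → async-wait
```

A Siamese network learns the similarity function instead of fixing it to cosine, which is what makes few-shot classification over small flaky-test datasets feasible.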
Journal of Systems and Software,
Journal Year: 2023, Volume and Issue: 206, P. 111837
Published: Sept. 7, 2023
Flaky tests (tests with non-deterministic outcomes) pose a major challenge for software testing. They are known to cause significant issues, such as reducing the effectiveness and efficiency of testing and delaying releases. In recent years, there has been an increased interest in flaky tests, with research focusing on different aspects of flakiness, such as identifying their causes, detection methods and mitigation strategies. Test flakiness has also become a key discussion point for practitioners (in blog posts, technical magazines, etc.) as its impact is felt across the industry. This paper presents a multivocal review that investigates how researchers and practitioners have addressed the topic in both research and practice. Out of the 560 articles we reviewed, we identified and analysed a total of 200 articles focused on test flakiness (composed of 109 academic and 91 grey literature articles/posts) and structured the body of relevant knowledge using four dimensions: causes, detection, impact, and responses. For each of those dimensions, we provide a categorization to classify existing research, discussions, and tools. With this, we provide a comprehensive snapshot of current thinking on test flakiness, covering both academic views and industrial practices, and identify limitations and opportunities for future research.
Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6,356 projects written in Java and Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common among generated tests as among developer-written tests. Nevertheless, existing suppression mechanisms implemented in the tools are effective in alleviating this issue (71.7% fewer flaky tests). Compared to developer-written tests, the causes of flakiness are distributed differently: the non-deterministic behavior of generated tests is more frequently caused by randomness, rather than by networking or concurrency. With suppression enabled, the remaining flaky tests differ significantly from any previously reported, where most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help tool maintainers improve their tools, give recommendations to developers, and serve as a foundation for future research.
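The detection procedure described above — rerunning each generated test many times and flagging inconsistent outcomes — can be sketched as follows. The 200-rerun count matches the study's setup; the sample tests are contrived stand-ins:

```python
def run_repeatedly(test_fn, runs=200):
    """Execute a zero-argument test N times; record pass/fail per run."""
    outcomes = []
    for _ in range(runs):
        try:
            test_fn()
            outcomes.append("pass")
        except AssertionError:
            outcomes.append("fail")
    return outcomes

def is_flaky(test_fn, runs=200):
    """A test is flagged flaky if outcomes disagree across reruns."""
    return len(set(run_repeatedly(test_fn, runs))) > 1

# Contrived stand-in: fails on every third invocation.
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    assert calls["n"] % 3 != 0

def always_passes():
    assert True

print(is_flaky(sometimes_fails))  # → True
print(is_flaky(always_passes))    # → False
```

Rerun-based detection is probabilistic in practice: a test that fails only under rare interleavings can still look stable across any fixed number of reruns, which is one reason the study executes each test 200 times.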
2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion),
Journal Year: 2022, Volume and Issue: unknown, P. 120 - 124
Published: May 1, 2022
Developers typically run tests after code changes. Flaky tests, which are tests that can nondeterministically pass and fail when run on the same version of code, can mislead developers about their recent changes. Much prior work on flaky tests is focused on Java projects. One prominent category is order-dependent (OD) tests, which pass or fail depending on the order in which they are run. For example, our prior work proposed using other tests in the test suite to fix (or correctly set up) the state needed by OD tests to pass. Unlike Java, other programming languages have received less attention. To help with this problem, another piece of work recently studied Python projects and detected many flaky tests. Unfortunately, it did not identify tests in the suites that can be used to fix OD tests. To fill this gap, we propose iPFlakies, a framework for automatically detecting and fixing OD tests in Python. Using iPFlakies, we extend the prior work's dataset to include (1) scripts to reproduce OD test failures and (2) patches that fix them. Our study finds that reproducing the passing and failing results can be difficult, but that iPFlakies is effective at fixing OD tests. To aid future research, we make our framework, improvements, and experimental infrastructure publicly available.
Context: Test flakiness arises when test cases have a non-deterministic, intermittent behavior that leads them to either pass or fail when run against the same code. While researchers have been contributing to the detection, classification, and removal of flaky tests with several empirical studies and automated techniques, little is known about how the problem manifests in mobile applications.

Objective: We point out the lack of knowledge on: (1) the prominence and harmfulness of the problem; (2) the most frequent root causes inducing flakiness; and (3) the strategies applied by practitioners to deal with it in practice. An improved understanding of these matters may lead the software engineering research community to assess the need for tailoring existing instruments to this context or devising brand-new approaches that focus on the peculiarities identified.

Method: We address this gap by means of an empirical study into the developer's perception of test flakiness. We first perform a systematic grey literature review to elicit how developers discuss the problem in the wild. Then, we complement it through a survey that involves 130 developers and aims at analyzing their experience on the matter.

Result: The results indicate that developers are often concerned about flakiness connected to user interface elements. In addition, our study reveals that flakiness is perceived as critical by developers, who pointed out major production code- and source code design-related causes of flakiness, other than the long-term effects of recurrent flaky tests. Furthermore, it lets the limitations of the diagnosing and fixing processes currently adopted emerge.

Conclusion: We conclude by distilling lessons learned, implications, and future directions.
Information and Software Technology,
Journal Year: 2024, Volume and Issue: 168, P. 107394
Published: Jan. 6, 2024
ACM Transactions on Software Engineering and Methodology,
Journal Year: 2024, Volume and Issue: unknown
Published: Sept. 13, 2024
Asynchronous waits are a common root cause of flaky tests and a major time-influential factor in web application testing. We build a dataset of 49 reproducible asynchronous wait flaky tests and their fixes from 26 open-source projects to study their characteristics in web testing. Our study reveals that developers adjusted the wait time to address flakiness in about 63% of cases (31 out of 49), even when the underlying causes lie elsewhere. From this, we introduce TRaf, an automated time-based repair technique for web applications. TRaf determines appropriate wait times for asynchronous calls in web applications by analyzing code similarity and past change history. Its key insight is that an efficient wait time can be inferred from the current or past codebase, since developers tend to repeat similar mistakes. Our analysis shows that TRaf can statically suggest shorter wait times that alleviate async wait flakiness immediately upon detection, reducing test execution time by 11.1% compared to the timeout values initially chosen by developers. With optional dynamic tuning, it can reduce execution time by 16.8% over its initial suggestion, and its refinement of developer-written patches reduces execution time by 6.2% over the original patches. Overall, we sent 16 pull requests based on our dataset, each fixing one flaky test. So far, three have been accepted.
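The class of fixes TRaf targets can be contrasted with a common manual pattern: instead of a hard-coded sleep, use a polling wait whose timeout is tunable, so a generous bound costs little when the asynchronous work finishes early. The sketch below is a generic illustration and does not reproduce TRaf's similarity-based timeout inference:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll until the condition holds or the timeout expires.

    Unlike a fixed time.sleep(timeout), this returns as soon as the
    condition is met, so a generous timeout costs nothing when the
    asynchronous work finishes early.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

# Contrived asynchronous work: "ready" 0.2 s after start.
start = time.monotonic()
def page_loaded():
    return time.monotonic() - start > 0.2

ok = wait_until(page_loaded, timeout=5.0)
elapsed = time.monotonic() - start
print(ok, round(elapsed, 1))  # succeeds well before the 5 s timeout
```

Under this pattern, the tuning problem TRaf addresses reduces to choosing `timeout` (and, for fixed sleeps, the sleep duration) large enough to avoid flakiness but small enough not to inflate suite runtime.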