Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by developers, so previous research has proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, correct patches will not alter the program behavior in a significant way, e.g., by removing code implementing a desired functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of patches generated by test-based APR systems. Unlike existing works, the impact of patches is captured along three complementary facets, allowing more effective assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via coverage of passing tests) to separate the patches that result in programs similar to the original and that do not delete desired program elements. Shibboleth assesses patches via both ranking and classification. We evaluated the technique on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, on our data set, Shibboleth ranks a correct patch in the top-1 (top-2) positions in 43% (66%) of the cases, and when applied in classification mode, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
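The three facets the abstract names (syntactic similarity, semantic similarity, and coverage of passing tests) can be illustrated with a small ranking sketch. The function bodies and the equal-weight combined score below are illustrative assumptions for exposition, not Shibboleth's actual measures.

```python
from difflib import SequenceMatcher

def syntactic_similarity(patched: str, reference: str) -> float:
    # Token-level similarity between patched and reference code;
    # a simple stand-in for a real syntactic-similarity measure.
    return SequenceMatcher(None, patched.split(), reference.split()).ratio()

def coverage_preserved(before: set, after: set) -> float:
    # Fraction of code elements covered by passing tests before the
    # patch that are still covered afterwards; a drop suggests the
    # patch deleted desired functionality.
    return len(before & after) / len(before) if before else 1.0

def rank_patches(patches: list) -> list:
    # Each patch carries its three facet scores in [0, 1]; rank by a
    # simple (illustrative) equal-weight combination, best first.
    return sorted(
        patches,
        key=lambda p: p["syntactic"] + p["semantic"] + p["coverage"],
        reverse=True,
    )

candidates = [
    {"id": "p1", "syntactic": 0.9, "semantic": 0.8, "coverage": 1.0},
    {"id": "p2", "syntactic": 0.4, "semantic": 0.5, "coverage": 0.6},
]
ranking = rank_patches(candidates)
```

In ranking mode the top positions would be inspected first; in classification mode a threshold on the combined score would label a patch correct or overfitting.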
IEEE Transactions on Software Engineering, 2021, 48(7), P. 2658 - 2679. Published: March 18, 2021.
API misuses are well-known causes of software crashes and security vulnerabilities. However, their detection and repair is challenging given that the correct usages of (third-party) APIs might be obscure to the developers of client programs. This paper presents the first empirical study to assess the ability of existing automated bug repair tools to repair API misuses, which is a class of bugs previously unexplored. Our study examines and compares 14 Java test-suite-based repair tools (11 proposed before 2018, and three afterwards) on a manually curated benchmark (APIRepBench) consisting of 101 API misuses. We develop an extensible execution framework (APIARTy) to automatically execute multiple repair tools. Our results show that the tools are able to generate patches for 28 percent of the API misuses considered. While the 11 less recent tools are generally fast (the median time of the repair attempts is 3.87 minutes and the mean is 30.79 minutes), the most recent three are less efficient (i.e., 98 percent slower) than their predecessors. The API misuses for which the tools generate patches mostly belong to the categories of missing null check, missing value, missing exception, and missing call. Most patches generated by all tools are plausible (65 percent), but only a few of these are semantically equivalent to the human patches (25 percent). Our findings suggest that the design of future APR tools should support the localisation of complex bugs, including different handling of timeout issues, and the ability to configure large projects. Both APIRepBench and APIARTy have been made publicly available for other researchers to evaluate the capabilities of APR tools in detecting and fixing API misuses.
Proceedings of the ACM on Programming Languages, 2025, 9(OOPSLA1), P. 1831 - 1857. Published: April 9, 2025.
Automated Program Repair (APR) holds the promise of alleviating the burden of debugging and fixing software bugs. Despite this, developers still need to manually inspect each patch to confirm its correctness, which is tedious and time-consuming. This challenge is exacerbated in the presence of plausible patches, which accidentally pass the test cases but may not correctly fix the bug. To address this challenge, we propose an interactive approach called iFix to facilitate patch understanding and comparison based on their runtime difference. iFix performs static analysis to identify variables related to the buggy statement and captures their runtime values during execution for each patch. These values are then aligned across different patch candidates, allowing users to compare and contrast their runtime behavior. To evaluate iFix, we conducted a within-subjects user study with 28 participants. Compared with manual inspection and a state-of-the-art patch filtering technique, iFix reduced participants' task completion time by 36% and 33% while also improving their confidence by 50% and 20%, respectively. Besides, quantitative experiments demonstrate that iFix improves the ranking of correct patches by at least 39% compared with other methods and is generalizable to different APR tools.
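The alignment step the abstract describes, lining up captured variable values across patch candidates so users can compare their runtime behavior, can be sketched as follows. The data layout (dicts mapping each patch id to the variable values recorded at the buggy statement) is an assumption for illustration, not iFix's actual representation.

```python
def align_values(traces: dict) -> dict:
    """Align recorded variable values across patch candidates.

    `traces` maps a patch id to {variable_name: value} captured
    during execution. The result maps each variable seen in any
    trace to its value under every patch (None if not recorded),
    so behavioral differences show up side by side.
    """
    variables = sorted({v for trace in traces.values() for v in trace})
    return {
        var: {patch: trace.get(var) for patch, trace in traces.items()}
        for var in variables
    }

# Hypothetical traces from two patch candidates at the buggy statement:
traces = {
    "patch_a": {"count": 3, "total": 10},
    "patch_b": {"count": 2, "total": 10},
}
aligned = align_values(traces)
# `count` diverges between the candidates, while `total` agrees;
# the divergence is what a user would inspect to judge correctness.
```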
Automated Program Repair (APR) aims to automatically fix bugs in the source code. Recently, with advances in the Deep Learning (DL) field, there is a rise of Neural Program Repair (NPR) studies, which formulate APR as a translation task from buggy code to correct code and adopt neural networks based on the encoder-decoder architecture. Compared with other APR techniques, NPR approaches have a great advantage in applicability because they do not need any specification (i.e., a test suite). Although NPR has been a hot research direction, there is no overview of this field yet. In order to help interested readers understand the architectures, challenges, and corresponding solutions of existing NPR systems, we conduct a literature review of the latest studies in this paper. We begin by introducing the background knowledge of this field. Next, to be understandable, we decompose the NPR procedure into a series of modules and explicate the various design choices of each module. Furthermore, we identify several challenges and discuss the effect of existing solutions. Finally, we conclude and provide some promising directions for future research.
Empirical Software Engineering, 2021, 26(2). Published: Feb. 23, 2021.
In this paper, we do automatic correctness assessment for patches generated by program repair systems. We consider the human-written patch as the ground truth oracle and randomly generate tests based on it, a technique proposed by Shamshiri et al. and called Random testing with Ground Truth (RGT) in this paper. We build a curated dataset of 638 Defects4J patches generated by 14 state-of-the-art repair systems, and we evaluate automated patch assessment on this dataset. The results of this study are novel and significant: First, we improve the state-of-the-art performance of RGT by 190% by improving the ground truth oracle; Second, we show that RGT is reliable enough to help scientists perform overfitting analysis when they evaluate repair systems; Third, the external validity of this knowledge is the largest ever.
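The RGT idea, using the human-written patch as an oracle and checking whether a candidate patch diverges from it on generated tests, can be sketched as a differential check. The function names, the input format, and the toy patches below are hypothetical stand-ins; real RGT relies on automatically generated test suites rather than hand-listed inputs.

```python
def rgt_assess(candidate_fn, ground_truth_fn, generated_inputs):
    """Differential check in the spirit of RGT: classify a candidate
    patch as overfitting if any generated input makes its behavior
    diverge from the human-written (ground truth) patch."""
    for inputs in generated_inputs:
        try:
            expected = ground_truth_fn(*inputs)
        except Exception:
            continue  # skip inputs the oracle itself rejects
        if candidate_fn(*inputs) != expected:
            return "overfitting"
    return "possibly-correct"

# Toy example: the candidate passes the original tests (which never
# exercised negative inputs) but disagrees with the oracle on -3.
ground_truth = abs
candidate = lambda x: x
inputs = [(5,), (0,), (-3,)]  # stand-in for randomly generated inputs
verdict = rgt_assess(candidate, ground_truth, inputs)
```

A candidate that never diverges is only "possibly-correct": RGT can expose overfitting patches but cannot prove correctness, since the generated inputs may miss a divergence.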