Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by developers, so previous research has proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, correct patches will not alter the program behavior in a significant way, e.g., by removing code implementing the functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of patches generated by test-based APR systems. Unlike existing works, the impact of patches is captured along three complementary facets, allowing more effective assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via the coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements. Shibboleth assesses patches via both ranking and classification. We evaluated the technique on 1,871 patches generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our data set, in 43% (66%) of the cases it ranks the correct patch in the top-1 (top-2) position, and when applied in classification mode it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
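The core intuition above, that a correct patch barely perturbs the program's syntax, its semantics, and the coverage of originally passing tests, can be sketched as a simple ranking heuristic. This is an illustrative approximation only, not Shibboleth's actual scoring: the similarity functions, the line-set coverage model, and the weights are all assumptions.

```python
import difflib

def syntactic_similarity(original_src: str, patched_src: str) -> float:
    """Token-level similarity between original and patched production code."""
    return difflib.SequenceMatcher(
        None, original_src.split(), patched_src.split()).ratio()

def coverage_similarity(cov_before: set, cov_after: set) -> float:
    """Jaccard similarity of the lines covered by originally passing tests."""
    if not cov_before and not cov_after:
        return 1.0
    return len(cov_before & cov_after) / len(cov_before | cov_after)

def score_patch(original_src, patched_src, cov_before, cov_after,
                w_syn=0.5, w_cov=0.5):
    """Higher score = the patch perturbs the program less."""
    return (w_syn * syntactic_similarity(original_src, patched_src)
            + w_cov * coverage_similarity(cov_before, cov_after))

# Toy ranking: a small edit should outrank a patch that deletes functionality.
original = "if x >= 0 : return x"
cov_orig = {1, 2, 3}
patches = [
    ("p1", "if x > 0 : return x", {1, 2, 3}),  # small, behavior-preserving edit
    ("p2", "return 0", {1}),                   # deletes the guarded branch
]
ranked = sorted(patches,
                key=lambda p: score_patch(original, p[1], cov_orig, p[2]),
                reverse=True)
```

Under this scoring, the functionality-deleting patch drops to the bottom of the ranking because it loses both token similarity and test coverage.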
To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair or navigate a search space of edits to find test-suite-passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT has, however, so far been unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set QuixBugs and compare its performance with the results of other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive with the common deep learning approaches CoCoNut and Codex, and notably better than that of the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs and outperforming the state-of-the-art.
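The feedback loop this abstract describes, re-prompting the dialogue system with a failing input and its expected output, can be sketched generically. The `ask_model` callable and the toy stand-in model below are hypothetical placeholders, not the ChatGPT API; real candidates would be code strings rather than Python functions.

```python
def repair_with_feedback(ask_model, tests, max_rounds=3):
    """Iteratively prompt a dialogue model, feeding back failing-test hints.

    `ask_model(prompt)` is an injected stand-in for a chat-model call and is
    assumed to return a candidate patched function.
    """
    prompt = "Does this program contain a bug? If so, fix it."
    for _ in range(max_rounds):
        candidate = ask_model(prompt)
        failures = [(x, y) for x, y in tests if candidate(x) != y]
        if not failures:
            return candidate  # all tests pass: plausible repair found
        x, y = failures[0]    # turn the first failure into a hint
        prompt = (f"Still wrong: for input {x!r} the expected output is "
                  f"{y!r}, but got {candidate(x)!r}. Please try again.")
    return None

# Toy stand-in "model": the first answer is wrong, the second uses the hint.
def toy_model_factory():
    attempts = [lambda n: n, lambda n: n * n]  # second attempt is correct
    def ask(prompt):
        return attempts.pop(0) if len(attempts) > 1 else attempts[0]
    return ask

fixed = repair_with_feedback(toy_model_factory(), tests=[(2, 4), (3, 9)])
```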
Automatic program repair (APR) is crucial to improve software reliability. Recently, neural machine translation (NMT) techniques have been used to automatically fix bugs. While promising, these approaches have two major limitations. Their search space often does not contain the correct fix, and their search strategy ignores software knowledge such as strict code syntax. Due to these limitations, existing NMT-based techniques underperform the best template-based approaches. We propose CURE, a new APR technique with three novelties. First, CURE pre-trains a programming language (PL) model on a large codebase to learn developer-like source code before the APR task. Second, CURE designs a code-aware search strategy that finds more correct fixes by focusing the search on compilable patches that are close in length to the buggy code. Finally, CURE uses subword tokenization to generate a smaller search space that contains more correct fixes. Our evaluation on widely-used benchmarks shows that CURE correctly fixes 57 Defects4J bugs and 26 QuixBugs bugs, outperforming all existing APR techniques on both benchmarks.
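The subword-tokenization idea, composing rare identifiers out of a small vocabulary of frequent pieces, can be illustrated with a rule-based split. CURE actually learns a byte-pair-encoding vocabulary; the camelCase/underscore splitter below is only a sketch of the effect, and the identifiers are made up.

```python
import re

def subword_tokenize(code: str) -> list:
    """Naive subword split: break identifiers on camelCase and underscores.

    Rare identifiers become sequences of frequent subwords, so the model's
    vocabulary (and hence the search space) shrinks.
    """
    tokens = []
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", code):
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])",
                           tok.replace("_", " "))
        tokens.extend(parts if parts else [tok])
    return tokens

# Three whole-word identifiers collapse onto five reusable subword pieces.
word_vocab = {"getMaxValue", "setMaxValue", "getMinValue"}
sub_vocab = {s for ident in word_vocab for s in subword_tokenize(ident)}
```

With subwords, a repair model can also emit an identifier it never saw as a whole word (e.g., `setMinValue`) by recombining known pieces.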
Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLM) have been developed and are effective in tasks such as code completion, there has been little comprehensive, in-depth work to evaluate CLMs' bug fixing capabilities and to fine-tune CLMs for the APR task. Firstly, this work is the first to evaluate ten CLMs on four APR benchmarks, which shows that, surprisingly, the best CLM, as is, fixes 72% more bugs than the state-of-the-art deep-learning (DL)-based APR techniques. Secondly, one of the benchmarks was created by us in this paper to avoid data leaking and enable a fair evaluation. Thirdly, this work is the first to fine-tune CLMs with APR training data, which shows that fine-tuning brings a 31%-1,267% improvement and enables them to fix 46%-164% more bugs than existing DL-based techniques. Fourthly, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of buggy lines to fix bugs, yet fine-tuned CLMs could potentially over-rely on buggy lines. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs; it also raises awareness of fair and comprehensive evaluations of CLMs and calls for transparent reporting of the open-source repositories used in pre-training to address the data-leaking problem.
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, and the evaluations are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java repair tools and 2,141 bugs from 5 benchmarks. Our goal is to have a better understanding of the current state of repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not generalize across different benchmarks. We found that the tools 1) are able to generate patches for 21% of the bugs across the benchmarks, and 2) have better performance on Defects4J compared to the other benchmarks, generating patches for 47% of the Defects4J bugs versus 10-30% of the bugs in the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find the causes of non-patch generation. These reported causes can help repair tool designers improve their tools.
Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but affordable, approximation to program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers have pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation, with regard to the practicality of program repair. In this paper, we fill this gap in the literature by providing an extensive review of the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated with (1) the strategy for traversing the search space efficiently in order to select sensical repair attempts, (2) the strategy for minimizing the effort of identifying a plausible patch, and (3) the strategy for prioritizing the generation of a correct patch. To that end, we perform a large-scale empirical study of efficiency, in terms of the quantity of generated patch candidates, for 16 open-source repair tools on Java programs. The experiments are carefully conducted under the same configurations to limit biases.
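The efficiency measure discussed above, the number of patch candidates a tool emits before its first plausible (all-tests-passing) patch, can be sketched as a simple counter. The integer "patches" and lambda "tests" below are toy stand-ins for real program variants and test executions.

```python
def candidates_to_plausible(candidate_stream, test_suite):
    """Count patch candidates checked before the first plausible patch.

    A hardware-independent cost proxy: fewer candidates checked means a
    more efficient traversal of the patch search space.
    """
    checked = 0
    for patch in candidate_stream:
        checked += 1
        if all(test(patch) for test in test_suite):
            return checked, patch  # first candidate passing every test
    return checked, None           # search space exhausted, no plausible patch

# Toy run: candidates are integers, the suite "passes" iff the value is even.
stream = iter([1, 3, 5, 6, 8])
suite = [lambda p: p % 2 == 0]
n, patch = candidates_to_plausible(stream, suite)
```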
ACM Transactions on Software Engineering and Methodology, Journal Year: 2022, Volume and Issue: 31(3), P. 1 - 29. Published: May 18, 2022.
Despite their capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA). Nonetheless, dynamic ones (i.e., those that need to execute tests) are time-consuming, while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via token sequences extracted from the changed code of a patch. However, these techniques rarely consider the context information and program structures of a patch, which are crucial, as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding, considering program structures, for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be leveraged. We then utilize the AST path technique for representation, where the structure information from AST nodes can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict patch correctness. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., it relatively outperforms them by approximately 6%, 3%, and 16%, respectively, under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques, even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information leveraged by Cache contributed significantly to its outstanding performance.
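The AST-path representation mentioned above can be illustrated with Python's standard `ast` module. Cache operates on Java code changes and, per the AST-path literature, pairs paths between leaves; the sketch below only shows the underlying idea that node-type paths capture structural context that flat token sequences miss.

```python
import ast

def ast_paths(code: str) -> list:
    """Collect root-to-leaf paths of AST node-type names for a code snippet."""
    tree = ast.parse(code)
    paths = []

    def walk(node, prefix):
        prefix = prefix + (type(node).__name__,)
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append(prefix)  # reached a leaf: record the full path
        for child in children:
            walk(child, prefix)

    walk(tree, ())
    return paths

# Each path threads through the enclosing statement and expression nodes,
# so the same token appears with different structural context.
paths = ast_paths("x = a + 1")
```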
ACM Transactions on Software Engineering and Methodology, Journal Year: 2023, Volume and Issue: 33(2), P. 1 - 69. Published: Nov. 6, 2023.
Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software development and maintenance. With the recent advances in deep learning (DL), an increasing number of APR techniques have been proposed to leverage neural networks to learn bug-fixing patterns from massive open-source code repositories. Such learning-based techniques usually treat APR as a neural machine translation (NMT) task, where buggy code snippets (i.e., the source language) are translated into fixed code snippets (i.e., the target language) automatically. Benefiting from the powerful capability of DL to learn hidden relationships from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance. In this article, we provide a systematic survey to summarize the current state-of-the-art research in the learning-based APR community. We illustrate the general workflow of learning-based APR techniques and detail the crucial components, including the fault localization, patch generation, patch ranking, patch validation, and patch correctness phases. We then discuss the widely adopted datasets and evaluation metrics and outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques, such as repair domains, industrial deployment, and the open science issue. We highlight practical guidelines on applying these techniques in future studies, such as exploring explainable patch generation and utilizing code features. Overall, our article can help researchers gain a comprehensive understanding of the achievements of existing learning-based APR techniques and promote the practical application of these techniques. Our artifacts are publicly available at the public repository: https://github.com/iSEngLab/AwesomeLearningAPR.
Automated Program Repair (APR) improves software reliability by generating patches for a buggy program automatically. Recent APR techniques leverage deep learning (DL) to build models that learn to generate patches from existing patches and code corpora. While promising, DL-based techniques suffer from the abundant syntactically or semantically incorrect patches in the patch space. These patches often disobey the syntactic and semantic domain knowledge of source code and thus cannot be the correct patches to fix a bug. We propose a DL-based APR approach, KNOD, which incorporates domain knowledge to guide patch generation in a direct and comprehensive way. KNOD has two major novelties: (1) a novel three-stage tree decoder, which directly generates Abstract Syntax Trees of patched code according to the inherent tree structure, and (2) a novel domain-rule distillation, which leverages syntactic and semantic rules and teacher-student distributions to explicitly inject domain knowledge into the decoding procedure during both the training and inference phases. We evaluate KNOD on three widely-used benchmarks. KNOD fixes 72 bugs on Defects4J v1.2, 25 on QuixBugs, and 50 additional bugs on the Defects4J v2.0 benchmarks, outperforming all existing APR tools.
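The rule-injection idea, steering decoding away from grammar-invalid continuations, can be sketched at its simplest as masking candidate scores. KNOD's actual mechanism (teacher-student distributions over a tree decoder) is richer than this; the token scores and allowed sets below are purely illustrative.

```python
def constrained_next_token(logits, allowed):
    """Pick the highest-scoring candidate token the grammar still allows.

    `logits` maps candidate tokens to model scores; `allowed` is the set of
    tokens a syntax rule permits at this decoding position.
    """
    legal = {tok: score for tok, score in logits.items() if tok in allowed}
    return max(legal, key=legal.get)

# Even if the model ranks "return" highest, a rule permitting only an
# expression at this position forces a syntactically valid choice.
step = constrained_next_token({"return": 0.9, "x": 0.4}, allowed={"x", "("})
```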
Test-based automated program repair (APR) has attracted huge attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., a generated patch is plausible but overfitting) is still a major, long-standing challenge. Therefore, plenty of techniques have been proposed to assess the correctness of patches, either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of existing techniques has not been systematically compared, and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study in this paper. Specifically, we investigated existing automated patch correctness assessment techniques, including both static and dynamic ones, based on 902 patches automatically generated by 21 APR tools from 4 different categories. Our study revealed the following findings: (1) static code features with respect to syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can achieve high precision, while heuristics based on static features are more effective towards recall; (3) patches generated for certain projects and by certain types of APR tools are harder to assess than others; and (4) existing techniques are highly complementary to each other. For instance, a single technique can only detect at most 53.5% of the overfitting patches, while 93.3% of them can be detected by at least one technique when oracle information is available. Based on our findings, we designed an integration strategy that first integrates static code features via learning and then combines the result with other techniques via a majority voting strategy. Our experiments show that this strategy can enhance the performance of existing techniques significantly.
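The integration step described at the end, combining the verdicts of complementary assessment techniques by majority voting, can be sketched minimally. The technique names and the tie-breaking choice below are illustrative assumptions, not the paper's exact configuration.

```python
def majority_vote(verdicts: dict) -> bool:
    """Combine per-technique verdicts (True = 'patch is overfitting').

    Ties are resolved toward 'overfitting' here, a conservative but
    illustrative choice: flagging a patch costs a manual review, while
    missing an overfitting patch ships a broken fix.
    """
    votes = sum(verdicts.values())
    return votes * 2 >= len(verdicts)

# Hypothetical verdicts from three complementary assessment techniques.
verdict = majority_vote({
    "static-features-learner": True,   # learned over static code features
    "dynamic-test-behavior": False,    # e.g., a PATCH-SIM-style check
    "heuristic-antipattern": True,
})
```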