arXiv (Cornell University), 2020. Published: Jan. 1, 2020.
A large body of the literature on automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations in order to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1,000 labeled patches. This study shows that learned representations can lead to reasonable performance when compared against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.
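As an illustration of the similarity-based prediction idea described above, here is a minimal sketch: given embedding vectors for the buggy and patched code (toy vectors stand in for the BERT outputs), a cosine-similarity feature is fed to a logistic function. The weight and bias are illustrative placeholders, not coefficients from the paper's trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def correctness_probability(similarity, weight=8.0, bias=-6.0):
    """Logistic regression over the single similarity feature.
    weight and bias are illustrative placeholders."""
    return 1.0 / (1.0 + math.exp(-(weight * similarity + bias)))

# Toy 4-dimensional vectors standing in for BERT embeddings of the
# buggy code and two candidate patches.
buggy = [0.2, 0.7, 0.1, 0.4]
conservative_patch = [0.22, 0.68, 0.12, 0.41]  # small, targeted edit
drastic_patch = [0.9, -0.6, 0.5, -0.2]         # large semantic drift

p_conservative = correctness_probability(cosine_similarity(buggy, conservative_patch))
p_drastic = correctness_probability(cosine_similarity(buggy, drastic_patch))
```

Under this toy weighting, a patch whose embedding stays close to the buggy code scores higher than a drastic rewrite, which is the intuition behind making embeddings "amenable to similarity computations".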
Proceedings of the 44th International Conference on Software Engineering, 2022, P. 1506 - 1518. Published: May 21, 2022.
Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because existing works only optimize a purely syntactic loss function based on characters and tokens, without incorporating program-specific information during neural network weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair, showing that it is feasible and effective to use compilation and test execution information to optimize the underlying neural repair model. RewardRepair correctly repairs 207 bugs over four benchmarks. We report repair success on 121 bugs that are fixed for the first time in the literature. Also, RewardRepair produces up to 45.3% compilable patches, an improvement over the 39% produced by the state-of-the-art.
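The core idea, modulating a purely syntactic loss with compilation and test feedback, can be sketched as follows. The penalty schedule below is a hypothetical stand-in for illustration, not RewardRepair's actual reward function.

```python
def execution_penalty(compiles, passes_tests, overfits):
    """Map execution feedback to a penalty multiplier.
    The concrete values are illustrative, not taken from the paper."""
    if not compiles:
        return 2.0   # strongly discourage non-compilable patches
    if not passes_tests:
        return 1.5
    if overfits:
        return 1.25  # passes tests but labeled overfitting in training data
    return 1.0       # compiles, passes, not overfitting: pure syntactic loss

def reward_weighted_loss(token_nll, compiles, passes_tests, overfits=False):
    """Scale the syntactic loss (negative log-likelihood of the
    ground-truth token sequence) by the execution-based penalty."""
    return token_nll * execution_penalty(compiles, passes_tests, overfits)

good = reward_weighted_loss(0.8, compiles=True, passes_tests=True)
bad = reward_weighted_loss(0.8, compiles=False, passes_tests=False)
```

The effect is that gradient updates push the network harder away from candidate patches whose execution feedback is poor, which is the "rewarding" behavior the abstract describes.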
In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, and the evaluations are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to have a better understanding of the current state of repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of bugs might not be generalized across different benchmarks. We found that 1) the repair tools are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) the tools' performance is better on Defects4J compared to the other benchmarks, generating patches for 47% of the Defects4J bugs against 10-30% for the other benchmarks. Our dataset comprises 23,551 repair attempts, which we used to find the causes of non-patch generation. These causes, reported in this paper, can help repair tool designers improve their tools.
Deep neural networks (DNN) have been shown to be notoriously brittle to small perturbations in their input data. This problem is analogous to the over-fitting problem in test-based program synthesis and automatic program repair, which is a consequence of the incomplete specification, i.e., the limited tests or training examples, that the synthesis or repair algorithm has to learn from. Recently, test generation techniques have been successfully employed to augment existing specifications of intended behavior, to improve the generalizability of program repair. Inspired by these approaches, in this paper, we propose a technique that re-purposes software testing methods, specifically mutation-based fuzzing, to augment the training data of DNNs, with the objective of enhancing their robustness. Our technique casts the DNN data augmentation problem as an optimization problem. It uses genetic search to generate the most suitable variant of an input to use for training the DNN, while simultaneously identifying opportunities to accelerate training by skipping augmentation in many instances. We instantiate this technique in two tools, Sensei and Sensei-SA, and evaluate them on 15 DNN models spanning 5 popular image data-sets. Our evaluation shows that Sensei can improve the robust accuracy of the DNN, compared to the state of the art, on each of the 15 models, by up to 11.9% and by 5.5% on average. Further, Sensei-SA can reduce the average DNN training time by 25%, while still improving robust accuracy.
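A minimal sketch of the genetic-search augmentation loop described above, with a linear toy model and small random perturbations standing in for a real DNN and its semantics-preserving image transformations; every name and value here is illustrative.

```python
import random

random.seed(0)

def toy_loss(model_weights, x, label):
    """Hinge-style loss proxy of a linear toy model: higher means the
    example is currently harder for the model."""
    score = sum(w * xi for w, xi in zip(model_weights, x))
    margin = score if label == 1 else -score
    return max(0.0, 1.0 - margin)

def mutate(x, scale=0.1):
    """Small random perturbation, standing in for a semantics-preserving
    image transformation (rotation, shear, ...)."""
    return [xi + random.uniform(-scale, scale) for xi in x]

def most_suitable_variant(model_weights, x, label, population=8, generations=5):
    """Genetic-style search for the variant the model finds hardest,
    i.e., the most informative augmentation of this training input."""
    pool = [mutate(x) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda v: toy_loss(model_weights, v, label), reverse=True)
        survivors = pool[: population // 2]
        pool = survivors + [mutate(v) for v in survivors]
    return max(pool, key=lambda v: toy_loss(model_weights, v, label))

weights = [0.5, -0.3, 0.8]
x = [1.0, 0.2, 0.5]
variant = most_suitable_variant(weights, x, label=1)
```

The skipping optimization the abstract mentions would correspond to not running this search at all for inputs whose loss is already negligible.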
Test-based automated program repair (APR) has attracted huge attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., the generated patch is plausible but overfitting) is still a major and long-standing challenge. Therefore, plenty of techniques have been proposed to assess the correctness of patches, either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of existing techniques has not been systematically compared, and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study in this paper. Specifically, we investigated existing patch correctness assessment techniques, including both static and dynamic ones, based on 902 patches automatically generated by 21 APR tools from 4 different categories. Our study revealed the following findings: (1) code features with respect to syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can achieve high precision, while static heuristics lean more towards recall; (3) patches from certain projects and bug types are less likely to be assessed correctly than others; (4) existing techniques are highly complementary to each other. For instance, a single technique can only detect at most 53.5% of the overfitting patches, while 93.3% of them can be detected by at least one technique when oracle information is available. Based on our findings, we designed an integration strategy that first integrates static code features via learning, and then combines the result with other techniques via a majority voting strategy. Our experiments show that this strategy can enhance the performance of patch correctness assessment significantly.
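The integration strategy can be sketched as a two-stage combiner: a learned score over static code features is thresholded into one verdict, then merged with the verdicts of the dynamic techniques by majority voting. The threshold below is an illustrative placeholder, not a value from the study.

```python
def majority_vote(verdicts):
    """Combine boolean verdicts ('patch is overfitting') from several
    assessment techniques; ties count as not-overfitting."""
    return sum(verdicts) * 2 > len(verdicts)

def integrated_assessment(static_feature_score, dynamic_verdicts, threshold=0.5):
    """Two-stage integration sketch: the learned static-feature score is
    first turned into one verdict, then voted alongside the others."""
    learned_verdict = static_feature_score > threshold
    return majority_vote([learned_verdict] + list(dynamic_verdicts))

flagged = integrated_assessment(0.9, [True, False, True])   # mostly flagged
cleared = integrated_assessment(0.1, [False, False, True])  # mostly cleared
```

Majority voting is a natural fit here precisely because finding (4) says the individual techniques are highly complementary.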
Existing program repair systems modify a buggy program so that the modified program passes given tests. The repaired program may not satisfy even the most basic notion of correctness, namely crash-freedom. In other words, repair tools might generate patches which over-fit the test data driving the repair, and the automatically repaired programs may even introduce crashes or vulnerabilities. We propose an integrated approach for detecting and discarding crashing patches. Our approach fuses patch generation and test generation into a single process, in which patches are generated with the objective of passing existing tests, and new tests are generated with the objective of filtering out over-fitted patches by distinguishing candidate patches in terms of their runtime behavior. We use crash-freedom as an oracle to discard patch candidates which crash on the generated tests. At its core, our approach defines a grey-box fuzzing strategy that gives higher priority to tests that separate patches behaving equivalently on existing tests. This test generation strategy identifies semantic differences between patch candidates, and thus reduces over-fitting in program repair. We evaluated our approach on real-world vulnerabilities and open-source subjects from the Google OSS-Fuzz infrastructure. We found that our tool Fix2Fit (implementing patch-space directed test generation) produces crash-avoiding patches. While we do not give formal guarantees about crash-freedom, cross-validation of the patches with sanitizers provides greater confidence about the suggested patches.
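The prioritization idea, favoring tests that separate patches which have behaved equivalently so far, can be sketched by scoring each candidate test by the number of patch pairs it distinguishes. The behavior encoding (crash/no-crash or an output hash per patch) is an assumption made for illustration.

```python
from collections import Counter

def separation_score(behaviors):
    """Number of patch pairs a test distinguishes, given the observable
    behavior of each surviving patch candidate on that test."""
    groups = Counter(behaviors.values())
    n = len(behaviors)
    undistinguished = sum(c * (c - 1) // 2 for c in groups.values())
    return n * (n - 1) // 2 - undistinguished

def prioritize_tests(test_behaviors):
    """Order candidate tests so those separating the most patches are
    explored first by the fuzzer."""
    return sorted(test_behaviors,
                  key=lambda t: separation_score(test_behaviors[t]),
                  reverse=True)

# Three surviving patches; test_a crashes patch p2 only,
# test_b cannot tell any of them apart.
test_behaviors = {
    "test_a": {"p1": "ok", "p2": "crash", "p3": "ok"},
    "test_b": {"p1": "ok", "p2": "ok", "p3": "ok"},
}
order = prioritize_tests(test_behaviors)
```

A test with a high separation score both shrinks the candidate set (via the crash-freedom oracle) and exposes semantic differences between the remaining patches.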
IEEE Transactions on Software Engineering, 2021, 48(8), P. 2920 - 2938. Published: April 9, 2021.
Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and the buggy program in order to extract code features at the abstract syntax tree (AST) level, designed for the single programming language Java. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from the Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9 percent of the patches from 26 projects, which improves over the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different APR systems.
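A minimal sketch of the pipeline's shape, with line-diff counts standing in for ODS's AST-level code features and a hand-rolled logistic regression as the probabilistic model; the toy training data and labels are purely illustrative.

```python
import difflib
import math

def diff_features(buggy, patched):
    """Crude stand-ins for AST-level features: counts of deleted and
    inserted lines between the buggy and patched program."""
    sm = difflib.SequenceMatcher(None, buggy.splitlines(), patched.splitlines())
    inserted = deleted = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1
        if tag in ("replace", "insert"):
            inserted += j2 - j1
    return [1.0, float(inserted), float(deleted)]  # leading 1.0 is the bias term

def train_logistic(samples, labels, lr=0.1, epochs=200):
    """Tiny gradient-descent logistic regression as the learned model."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    """Probability that a patch is overfitting under the learned model."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# Toy labeled pairs: label 1 = overfitting patch.
buggy = "a = 1\nb = 2\nreturn a - b\n"
small_fix = "a = 1\nb = 2\nreturn a + b\n"
big_rewrite = "return 3\n"
X = [diff_features(buggy, small_fix), diff_features(buggy, big_rewrite)]
y = [0, 1]
w = train_logistic(X, y)
```

A real instance would replace `diff_features` with the Java AST analysis the paper describes; the supervised-learning and classification stages keep the same shape.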
Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an automatic repair experiment on a benchmark of bugs called QuixBugs that has never been studied in the context of program repair. In this study, we report on the characteristics of QuixBugs, and study five repair systems, Arja, Astor, Nopol, NPEfix and RSRepair, which are representatives of generate-and-validate repair techniques and synthesis repair techniques. We propose three patch correctness assessment techniques to comprehensively study overfitting and incorrect patches. Our key results are: 1) 15/40 buggy programs in QuixBugs can be repaired with a test-suite adequate patch; 2) a total of 64 plausible patches for those 15 repaired programs are present in the search space considered by the tools; 3) the assessment techniques discard 33/64 of the plausible patches as overfitting. This sets a baseline for future research on automatic repair with QuixBugs. Our study also highlights the major properties and challenges of how to perform automated correctness assessment. All the experimental results are publicly available on GitHub in order to facilitate future research on automatic program repair.