Proceedings of the ACM on Programming Languages,
Journal Year: 2025,
Volume and Issue: 9(OOPSLA1), P. 143 - 168
Published: April 9, 2025
Software updates, including bug repairs and feature additions, are frequent in modern applications, but they often leave test suites outdated, resulting in undetected bugs and increased chances of system failures. A recent study by Meta revealed that 14%-22% of software failures stem from outdated tests that fail to reflect changes in the codebase. This highlights the need to keep tests in sync with the code to ensure software reliability. In this paper, we present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. UTFix addresses two critical issues: assertion failures and reduced code coverage caused by changes to the focal method. Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. We evaluate UTFix on our generated synthetic benchmarks (Tool-Bench) and on real-world benchmarks. Tool-Bench includes diverse cases drawn from popular open-source Python GitHub projects, where UTFix successfully repaired 89.2% of the tests and achieved 100% coverage for 96 out of 369 tests. On the real-world benchmarks, UTFix repairs 60% of the tests while achieving 100% coverage for 19 out of 30 tests. To the best of our knowledge, this is the first comprehensive study focused on unit test repair in evolving Python projects. Our contributions include the development of UTFix, the creation of Tool-Bench and the real-world benchmarks, and the demonstration of the effectiveness of LLM-based solutions in addressing unit test failures due to software evolution.
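As a rough illustration of the kind of context-driven test repair described in this abstract, the following minimal Python sketch assembles a repair prompt from the changed focal method, a code slice, and the failure message. The prompt layout, helper names, and call_llm stub are assumptions for illustration, not UTFix's actual pipeline.

# Minimal sketch of LLM-driven unit-test repair in the spirit of UTFix.
# The prompt layout, helper names, and call_llm stub are assumptions,
# not the paper's actual implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to any code-capable LLM."""
    raise NotImplementedError("plug in your model/provider here")

def build_repair_prompt(focal_method_new: str, static_slice: str,
                        failing_test: str, failure_message: str) -> str:
    # Contextual information (changed focal method, a static slice of the
    # code the test touches, and the observed failure) is concatenated
    # into a single instruction-style prompt.
    return (
        "The focal method below has changed and the unit test now fails.\n"
        f"Updated focal method:\n{focal_method_new}\n\n"
        f"Relevant code slice:\n{static_slice}\n\n"
        f"Failing test:\n{failing_test}\n\n"
        f"Failure message:\n{failure_message}\n\n"
        "Rewrite the test so it passes and still covers the method."
    )

def repair_test(focal_method_new, static_slice, failing_test, failure_message):
    prompt = build_repair_prompt(focal_method_new, static_slice,
                                 failing_test, failure_message)
    return call_llm(prompt)  # candidate repaired test, to be re-executed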
Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or to directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially avoid this issue. Very recently, researchers have leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on this important problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the surrounding prefix and suffix, and 3) output a single line fix. We apply the LLMs under these repair settings on 5 datasets across different languages and compare the models in terms of the number of bugs fixed, generation speed, and compilation rate. We also compare the LLMs against state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied LLMs, a scaling effect exists, where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) helps in not only generating more fixes but also patches with higher compilation rates. Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking and correctness checking. Lastly, LLM-based APR can be further boosted via: increasing the sample size, and incorporating fix template information.
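To make the three repair settings above concrete, here is a hedged sketch of what each prompt shape might look like. The prompt/token formats are assumptions for illustration only; real generative and infilling models use their own special tokens and APIs.

# Illustrative sketch of the three repair settings described above
# (whole function, infill given prefix/suffix, single-line fix).

def complete_function_prompt(buggy_function: str) -> str:
    # Setting 1: ask the model to regenerate the entire function.
    return f"# Buggy function:\n{buggy_function}\n# Fixed function:\n"

def infill_prompt(prefix: str, suffix: str) -> str:
    # Setting 2: the model fills the removed buggy chunk between
    # the surrounding prefix and suffix code (token format is invented here).
    return f"<prefix>{prefix}<suffix>{suffix}<infill>"

def single_line_prompt(context_before: str, context_after: str) -> str:
    # Setting 3: the model outputs only the one replacement line.
    return (f"{context_before}\n"
            f"# TODO: replace the buggy line below with a correct one\n"
            f"{context_after}\n")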
Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help redirect the search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST and LLM-only baselines.
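The following is a minimal sketch of the stall-then-query idea described above. The search loop, stall heuristic, and llm_seed_tests stub are simplified assumptions, not the tool's actual implementation.

def llm_seed_tests(function_name: str) -> list:
    """Placeholder: ask a code LLM for example test cases for an
    under-covered function and parse them into test snippets."""
    return []

def search_based_testing(budget: int, stall_limit: int = 10):
    coverage, stalled_iters, suite = 0.0, 0, []
    for _ in range(budget):
        new_coverage = coverage  # mutate/evolve `suite` here (omitted)
        if new_coverage <= coverage:
            stalled_iters += 1
        else:
            coverage, stalled_iters = new_coverage, 0
        if stalled_iters >= stall_limit:
            # Coverage has stalled: seed the search with LLM-provided
            # examples for functions the suite has not reached yet.
            suite.extend(llm_seed_tests("under_covered_function"))
            stalled_iters = 0
    return suite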
IEEE Transactions on Software Engineering,
Journal Year: 2023,
Volume and Issue: 50(1), P. 85 - 105
Published: Nov. 28, 2023
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without requiring additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and its error message. We implement this approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Our experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we also run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 and the open StarCoder model, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% statement coverage) and somewhat worse results with the latter (54.0% statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
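Below is a simplified sketch of the adaptive prompting loop described above. TestPilot itself targets JavaScript; this Python sketch only mirrors the idea, and call_llm / run_test are placeholders, not its API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a code LLM here")

def run_test(test_code: str):
    """Placeholder: execute the generated test, return (passed, error)."""
    return False, "placeholder failure message"

def generate_test(signature: str, body: str, doc_examples: str,
                  max_refinements: int = 2) -> str:
    prompt = (f"Write a unit test for the following function.\n"
              f"Signature: {signature}\nImplementation:\n{body}\n"
              f"Usage examples from the documentation:\n{doc_examples}\n")
    test = call_llm(prompt)
    for _ in range(max_refinements):
        passed, error = run_test(test)
        if passed:
            break
        # Re-prompt with the failing test and its error message.
        test = call_llm(prompt + f"\nThis test failed:\n{test}\n"
                        f"Error: {error}\nWrite a corrected test.\n")
    return test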
Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and generate an output. However, the creation of effective prompts for code-related tasks in few-shot learning has received little attention. We present a technique that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant code demonstrations, Cedar is effective for both tasks, with 76% and 52% exact-match accuracy for assertion generation and program repair, respectively. For assertion generation, Cedar outperforms existing models by 333% and 11%, and for program repair, it yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned ones. These findings have practical implications for practitioners, as Cedar could potentially be applied to multilingual and multitask settings without language-specific training, with minimal examples and effort.
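The sketch below illustrates embedding-based retrieval of few-shot demonstrations in the spirit of the approach above. The toy embedding function and corpus layout are assumptions; the real tool may use entirely different components.

import math

def embed(text: str) -> list:
    """Placeholder embedding: a bag-of-characters vector for illustration."""
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demonstrations(query: str, corpus, k: int = 4):
    """Return the k (input, output) demonstrations most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus) -> str:
    shots = retrieve_demonstrations(query, corpus)
    demo_text = "\n\n".join(f"Input:\n{i}\nOutput:\n{o}" for i, o in shots)
    return f"{demo_text}\n\nInput:\n{query}\nOutput:\n"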
Proceedings of the ACM on Software Engineering,
Journal Year: 2024,
Volume and Issue: 1(FSE), P. 1585 - 1608
Published: July 12, 2024
Code translation tools, namely transpilers, are developed for automatic source-to-source translation. The latest learning-based transpilers have shown impressive enhancement against rule-based counterparts in both translation accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs for automated code translation tasks, finding that: although certain LLMs have outperformed current transpilers, they still have some accuracy issues, where most failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types (14.94%), and ignoring discrepancies between source and target languages (41.38%). Enlightened by these findings, we further propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the auto-generated test cases to augment the code translation and then evaluate translation correctness via execution. Afterward, it further (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six settings of translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested, and all achieve substantial improvements in terms of computational accuracy and exact match accuracy among almost all settings, showing the universal effectiveness of UniTrans in practice.
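The following high-level sketch mirrors the test-augmented translate/execute/repair workflow described above. All helpers are placeholders; this is not the UniTrans implementation.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a code LLM here")

def generate_test_cases(source_program: str) -> list:
    """Placeholder: derive input/output test cases with the source program's help."""
    return []

def run_tests(program: str, tests: list):
    """Placeholder: execute the translated program against the tests."""
    return True, ""

def translate(source_program: str, target_lang: str, max_rounds: int = 3) -> str:
    tests = generate_test_cases(source_program)
    prompt = (f"Translate this program to {target_lang}.\n{source_program}\n"
              f"It must pass these test cases:\n{tests}\n")
    candidate = call_llm(prompt)
    for _ in range(max_rounds):
        ok, feedback = run_tests(candidate, tests)
        if ok:
            break
        # Iteratively repair the translation using execution feedback.
        candidate = call_llm(prompt + f"\nThe translation failed:\n{feedback}\n"
                             f"Previous attempt:\n{candidate}\nFix it.\n")
    return candidate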
Proceedings of the ACM on Programming Languages,
Journal Year: 2024,
Volume and Issue: 8(OOPSLA1), P. 1100 - 1124
Published: April 29, 2024
Students often make mistakes in their introductory programming assignments as part of the learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering efforts or large amounts of data and training. We propose to use a large language model trained on code, Codex (a version of GPT), to build an APR system -- PyDex -- for introductory Python programming assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate PyDex on 286 real student programs and compare it to three baselines, including one that combines a state-of-the-art Python syntax repair engine, BIFI, and a state-of-the-art Python semantic repair engine for student assignments, Refactory. We find that PyDex can fix more programs and produce smaller patches on average.
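As a simplified illustration of two of the ingredients named above, iterative querying and test-based validation, here is a hedged sketch. Helper names and the selection policy are assumptions; the actual system combines several more components (e.g., multi-modal prompts and program chunking).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a code LLM here")

def passes_tests(program: str, test_cases: list) -> bool:
    """Placeholder: run the program on (input, expected_output) pairs."""
    return False

def repair_assignment(buggy_program: str, test_cases, attempts: int = 5):
    prompt = (f"Fix the following buggy Python assignment so that it passes "
              f"the given test cases.\nProgram:\n{buggy_program}\n"
              f"Test cases:\n{test_cases}\nFixed program:\n")
    for _ in range(attempts):
        candidate = call_llm(prompt)        # iterative querying
        if passes_tests(candidate, test_cases):
            return candidate                # keep only a test-passing repair
    return None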
Proceedings of the ACM on Software Engineering,
Journal Year: 2024,
Volume and Issue: 1(FSE), P. 951 - 971
Published: July 12, 2024
Testing plays a pivotal role in ensuring software quality, yet conventional Search-Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work using large language models (LLMs) for test generation has focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but uses fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open-source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
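A rough sketch of path-aligned prompting follows: enumerate simple branch conditions of the focal method and emit one test-generation prompt per path. It uses Python's ast module for brevity; the actual tool is built on TreeSitter and handles far richer path and type/dependency context.

import ast

def branch_conditions(source: str) -> list:
    """Collect the textual conditions of if-statements in the function."""
    tree = ast.parse(source)
    conds = []
    for node in ast.walk(tree):
        if isinstance(node, ast.If):
            conds.append(ast.unparse(node.test))
    return conds

def path_prompts(focal_source: str, focal_name: str) -> list:
    prompts = []
    for cond in branch_conditions(focal_source) or ["<no branches>"]:
        prompts.append(
            f"Focal method:\n{focal_source}\n"
            f"Write a unit test for {focal_name} that drives execution "
            f"down the path where `{cond}` holds.\n"
        )
    return prompts

example = "def clamp(x, lo, hi):\n    if x < lo:\n        return lo\n    if x > hi:\n        return hi\n    return x\n"
for p in path_prompts(example, "clamp"):
    print(p)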
ACM Transactions on Software Engineering and Methodology,
Journal Year: 2025,
Volume and Issue: unknown
Published: Jan. 27, 2025
The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for test oracle automation. We discuss the progress made in the field of oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we review recent studies that apply LLMs to this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about future research directions and opportunities for LLM-based oracle automation.