ACM Transactions on Software Engineering and Methodology, Journal Year: 2025, Volume and Issue: unknown. Published: Jan. 27, 2025
The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for oracle automation. We discuss the progress made in the field of oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we review recent studies that apply LLMs to this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about future directions and opportunities for LLM-based oracle automation.
Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset using a self-supervised task (e.g., filling in masked words in sentences). Once pre-trained, the model is fine-tuned on smaller and specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune the model by reusing the datasets of four previous works that used DL to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared the performance of this single model with the results reported in the original papers proposing DL-based solutions for those tasks. We show that our model, exploiting additional data in the pre-training phase, achieves improvements over the baselines.
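To make the pre-train-then-fine-tune paradigm concrete, the sketch below shows how a single text-to-text model can be fine-tuned on one code-related task (here, assert generation), assuming the HuggingFace transformers library is available; the "t5-small" checkpoint and the example input/target pair are illustrative placeholders, not the artifacts used in the paper.

# Minimal sketch of T5-style fine-tuning for a code-related task (assert generation).
# Assumes HuggingFace `transformers`; the checkpoint name and the example pair are
# illustrative placeholders, not the model or data used in the paper.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (input, target) pair: the task is cast as text-to-text, like any other T5 task.
source = 'generate assert: testAdd: cart.addItem("apple"); focal: int size()'
target = "assertEquals(1, cart.size());"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

# Fine-tuning step: the pre-trained weights are updated on the specialized dataset.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# At inference time the same model maps a new input to a predicted assert statement.
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))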
Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or to directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on this important problem is yet to be revealed.

In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent state-of-the-art LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, and 3) output a single line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare the models in the number of bugs fixed, generation speed, and compilation rate. We also compare them against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all of the APR techniques in our study on all our datasets. Among the studied LLMs, a scaling effect exists, where larger models tend to achieve better performance. Also, we show for the first time that the suffix code after the buggy line (adopted in infilling-style APR) is important in not only generating more fixes but also fixes with higher compilation rates. Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and they can even be leveraged for effective patch ranking or patch correctness checking. Lastly, we show that LLM-based APR can be further boosted via: increasing the sample size, and incorporating fix template information.
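As an illustration of the three repair settings, the sketch below builds the three kinds of inputs an LLM would receive for a toy buggy function; the prompt layouts are a plain reading of the settings described above and are not the exact templates used in the study.

# Sketch of the three repair settings above, for a toy buggy function.
# The prompt layouts are illustrative, not the study's actual templates.

buggy_function = [
    "def middle(a, b, c):",
    "    if b < c:",
    "        if a < b:",
    "            return b",
    "    return c",            # <-- buggy line (toy example)
]
buggy_line_idx = 4

# 1) Complete-function generation: give the context and ask for the whole fixed function.
complete_function_prompt = (
    "# Provide a fixed version of the following buggy function\n"
    + "\n".join(buggy_function) + "\n# Fixed function:\n"
)

# 2) Infilling: keep the code before and after the buggy chunk, let the model fill it in.
prefix = "\n".join(buggy_function[:buggy_line_idx]) + "\n"
suffix = ""  # nothing follows the buggy line in this toy example
infill_prompt = {"prefix": prefix, "suffix": suffix}

# 3) Single-line generation: ask only for the replacement of the buggy line.
single_line_prompt = (
    "\n".join(buggy_function[:buggy_line_idx])
    + "\n# The next line replaces the buggy line:\n"
)

for name, prompt in [("complete function", complete_function_prompt),
                     ("infilling", infill_prompt),
                     ("single line", single_line_prompt)]:
    print(f"--- {name} setting ---\n{prompt}\n")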
Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help redirect the search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST-only and LLM-only baselines.
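The high-level control loop can be pictured as in the sketch below; run_sbst_iteration, coverage_of, pick_undercovered_function, ask_llm_for_example_tests, and deserialize_into_test_cases are hypothetical stand-ins for the corresponding components, not CodaMosa's actual API.

# Illustrative control loop for LLM-assisted search-based test generation, in the spirit
# of the algorithm described above. All helper functions are hypothetical stand-ins.

STALL_LIMIT = 25  # iterations without coverage improvement before querying the LLM

def generate_tests(module, run_sbst_iteration, coverage_of,
                   pick_undercovered_function, ask_llm_for_example_tests,
                   deserialize_into_test_cases, budget=1000):
    population = []           # current test-case population explored by the search
    best_coverage = 0.0
    stall = 0

    for _ in range(budget):
        population = run_sbst_iteration(module, population)  # mutate/evolve test cases
        cov = coverage_of(population)

        if cov > best_coverage:
            best_coverage, stall = cov, 0
        else:
            stall += 1

        if stall >= STALL_LIMIT:
            # Coverage has stalled: ask the LLM for example calls to an
            # under-covered function and inject them into the search population.
            target = pick_undercovered_function(module, population)
            llm_samples = ask_llm_for_example_tests(target)
            population.extend(deserialize_into_test_cases(llm_samples))
            stall = 0

    return population, best_coverage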
Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, P. 2291 - 2302. Published: May 21, 2022
Code review is a practice widely adopted in open source and industrial projects. Given the non-negligible cost of such a process, researchers started investigating the possibility of automating specific code review tasks. We recently proposed Deep Learning (DL) models targeting the automation of two such tasks: the first model takes as input a code submitted for review and implements changes likely to be recommended by a reviewer; the second model takes as input a submitted code and a reviewer comment posted in natural language, and automatically implements the change required by the reviewer. While the preliminary results we achieved are encouraging, both models had been tested in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices made when designing both the technique and the experiments. In this paper, we build on top of that work by demonstrating that a pre-trained Text-To-Text Transfer Transformer (T5) model can outperform previous DL models for code review automation. Also, we conducted our experiments on a larger and more realistic (and challenging) dataset of code review activities.
Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and generate an output. However, the creation of an effective prompt for code-related tasks in few-shot learning has received little attention. We present a technique that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant demonstrations, Cedar is effective for both tasks, with an accuracy of 76% and 52% for exact matches in assertion generation and program repair, respectively. For assertion generation, Cedar outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively; for program repair, it yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as Cedar could potentially be applied to multilingual and multitask settings without language-specific training, with minimal examples and effort.
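A minimal sketch of the frequency-based retrieval variant is shown below, assuming scikit-learn for TF-IDF and cosine similarity; the demonstration pool, the query, and the prompt template are placeholders rather than Cedar's actual corpus or format.

# Sketch of retrieval-based few-shot prompt construction: rank stored demonstrations by
# similarity to the developer's query (TF-IDF / cosine similarity, i.e. a frequency-based
# retrieval), then assemble instructions + retrieved demonstrations + query into a prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative (context, assertion) pairs; not the real demonstration corpus.
demo_pool = [
    ("testPush: stack.push(1);", "assertEquals(1, stack.size());"),
    ('testName: user.setName("alice");', 'assertEquals("alice", user.getName());'),
    ("testAdd: list.add(x);", "assertTrue(list.contains(x));"),
]
query = "testEnqueue: queue.enqueue(42);"

# Rank demonstrations by similarity between their contexts and the query.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([ctx for ctx, _ in demo_pool] + [query])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
top = scores.argsort()[::-1][:2]  # keep the 2 most similar demonstrations

# Assemble the few-shot prompt.
prompt = "Generate an assertion for the given test context.\n\n"
for i in top:
    ctx, assertion = demo_pool[i]
    prompt += f"Context: {ctx}\nAssertion: {assertion}\n\n"
prompt += f"Context: {query}\nAssertion:"
print(prompt)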
IEEE Transactions on Software Engineering, Journal Year: 2023, Volume and Issue: 50(1), P. 85 - 105. Published: Nov. 28, 2023
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but so far while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of the function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and its error message. We implement our approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, the older code-cushman-002 and StarCoder, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage) and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
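The adaptive prompting loop described above can be sketched as follows; complete and run_test are hypothetical hooks for the LLM completion call and the test execution harness, and the prompt text is illustrative rather than the tool's actual template.

# Sketch of the adaptive loop: prompt with signature, implementation, and doc examples;
# if the generated test fails, re-prompt with the failing test and its error message.
# `complete` and `run_test` are hypothetical hooks, not TestPilot's real interfaces.

def generate_test(fn_signature, fn_body, doc_examples, complete, run_test,
                  max_refinements=2):
    prompt = (
        "// Write a unit test for the following function.\n"
        f"// Signature: {fn_signature}\n"
        f"// Implementation:\n{fn_body}\n"
        f"// Usage examples from the documentation:\n{doc_examples}\n"
        "// Unit test:\n"
    )
    test = complete(prompt)                     # ask the LLM for a candidate test
    passed, error_message = run_test(test)      # execute it against the package

    for _ in range(max_refinements):
        if passed:
            break
        # Refinement prompt: include the failing test and the error it produced.
        retry_prompt = (
            prompt
            + test
            + f"\n// The test above fails with: {error_message}\n"
            + "// Fixed unit test:\n"
        )
        test = complete(retry_prompt)
        passed, error_message = run_test(test)

    return test, passed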
Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While the usefulness of Copilot is evident, it is still unclear to what extent it is robust. Specifically, we do not know to what extent semantic-preserving changes in the natural language description provided to the model have an effect on the generated code function. In this paper we present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function. A negative answer would pose questions on the robustness of deep learning (DL)-based code generators, since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Then, we paraphrased the description for each method both manually and automatically, and we analyzed the extent to which the predictions generated by Copilot changed. Our results show that modifying the description results in different code recommendations in ∼46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code (±28%).
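The core comparison step of such a robustness study can be sketched as below, assuming a hypothetical generate_from_description hook around the code generator; the whitespace-insensitive normalization is a simplification for illustration, not the paper's exact analysis.

# Sketch of the robustness check: generate code from the original description and from a
# semantically equivalent paraphrase, then compare the two outputs.
# `generate_from_description` is a hypothetical hook around the code generator.
import re

def normalize(code: str) -> str:
    """Whitespace-insensitive normalization used only for this illustrative comparison."""
    return re.sub(r"\s+", " ", code).strip()

def prediction_changed(original_desc, paraphrased_desc, generate_from_description):
    original_code = generate_from_description(original_desc)
    paraphrased_code = generate_from_description(paraphrased_desc)
    return normalize(original_code) != normalize(paraphrased_code)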
IEEE Transactions on Software Engineering, Journal Year: 2024, Volume and Issue: 50(6), P. 1340 - 1359. Published: March 29, 2024
Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases compared to EvoSuite, this work provides valuable insights into the performance of LLMs in solving software engineering problems. Overall, our findings underscore the potential of LLMs and pave the way for further research in this area.
Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging due to the absence of type annotations: simple data compatibility errors proliferate, IDE support is limited, and APIs are hard to comprehend. Recent work attempts to address those issues through either static type inference or probabilistic type prediction. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic type prediction with search-based refinement of predicted types. TypeWriter's predictor learns to infer the return and argument types of functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, navigating the space of possible type combinations in a feedback-directed manner. We implement the approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 1,137 popular open-source projects. We show that TypeWriter achieves an F1 score of 0.64 (0.79) in the top-1 (top-5) predictions for return types, and 0.57 (0.80) for argument types, which clearly outperforms prior type prediction models. By combining predictions with search-based validation, TypeWriter can fully annotate between 14% and 44% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. TypeWriter currently suggests types to developers at Facebook, and several thousands of its suggestions have already been accepted with minimal changes.
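The feedback-directed validation step can be sketched as below, assuming a hypothetical apply_annotations helper that writes candidate annotations into a source file and mypy as a stand-in gradual type checker invoked through subprocess; the simple greedy pass over ranked candidates is a simplification of the navigation described above.

# Sketch of search-based validation of predicted types: try candidate annotations and
# keep only combinations accepted by a gradual type checker (mypy used as a stand-in).
# `apply_annotations` is a hypothetical helper, not part of the tool's actual API.
import subprocess

def type_checks(path: str) -> bool:
    """Return True if the gradual type checker reports no errors for the file."""
    result = subprocess.run(["mypy", path], capture_output=True, text=True)
    return result.returncode == 0

def refine_predictions(path, predictions, apply_annotations):
    """Greedy feedback-directed pass: `predictions` maps each annotation slot
    (function argument or return) to a ranked list of candidate types; a candidate
    is kept only if the file still type-checks with it applied."""
    accepted = {}
    for slot, candidates in predictions.items():
        for candidate in candidates:                      # try top-ranked types first
            apply_annotations(path, {**accepted, slot: candidate})
            if type_checks(path):
                accepted[slot] = candidate                # keep it and move on
                break
        else:
            apply_annotations(path, accepted)             # revert to last accepted set
    return accepted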
Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and a lower likelihood of introducing bugs. However, since code review is a manual activity, it comes at the cost of spending developers' time on reviewing their teammates' code. Our goal is to make the first step towards partially automating the code review process, thus possibly reducing the manual costs associated with it. We focus on both the contributor and the reviewer sides of the process, by training two different Deep Learning architectures. The first one learns code changes performed by developers during real code review activities, thus providing the contributor with a revised version of her code implementing transformations usually recommended during code reviews, before the code is even submitted for review. The second one automatically provides the reviewer commenting on a given code with the revised code implementing her comments expressed in natural language. The empirical evaluation of the two models shows that, on the contributor side, the trained model succeeds in replicating the code transformations applied during code reviews in up to 16% of cases. On the reviewer side, the model can correctly implement a comment provided in natural language in up to 31% of cases. While these results are encouraging, more research is needed to make these models usable by developers.