ACM Transactions on Software Engineering and Methodology, Journal Year: 2025, Volume and Issue: unknown, Published: Jan. 27, 2025
The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for oracle automation. We discuss the progress made in the field of oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we review recent studies that apply LLMs to this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about future directions and opportunities for LLM-based oracle automation.
IEEE Transactions on Software Engineering, Journal Year: 2021, Volume and Issue: unknown, P. 1 - 1, Published: Jan. 1, 2021
Code completion aims at speeding up code writing by predicting the next token(s) a developer is likely to write. Works in this field focused on improving the accuracy of the generated predictions, with substantial leaps forward made possible by deep learning (DL) models. However, these techniques are mostly evaluated in the scenario of predicting the next code token to type, with few exceptions pushing the boundaries to the prediction of an entire statement. Thus, little is known about the performance of state-of-the-art approaches in more challenging scenarios in which, for example, an entire code block must be generated. We present a large-scale study exploring the capabilities of Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks (e.g., the iterated block of a for loop). We experimented with several variants of two recently proposed Transformer-based models, namely RoBERTa and the Text-To-Text Transfer Transformer (T5), for the task of code completion. The achieved results show that Transformer-based models, and in particular the T5, represent a viable solution for code completion, with perfect predictions ranging from ~29%, obtained when asking the model to guess entire blocks, up to ~69%, reached in the simpler scenario of few tokens masked from the same statement.
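The masked-completion setup described above can be sketched in a few lines; the sentinel-token format and the `mask_statement` helper below are illustrative assumptions, not the paper's exact preprocessing:

```python
def mask_statement(code_lines, idx, sentinel="<extra_id_0>"):
    """Build a T5-style (input, target) pair by masking one statement.

    The model sees the surrounding code with a sentinel in place of the
    masked statement and must generate the missing text.
    """
    masked = list(code_lines)
    target = masked[idx]
    masked[idx] = sentinel
    return "\n".join(masked), target

def perfect_prediction_rate(predictions, targets):
    """Share of predictions that exactly match the developer-written code."""
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)

source = ["int total = 0;",
          "for (int i = 0; i < n; i++) {",
          "    total += values[i];",
          "}"]
model_input, target = mask_statement(source, 2)
# model_input now contains "<extra_id_0>" where the statement used to be
```

The same helper generalizes to the coarser granularities in the study by masking several consecutive lines (a statement sequence or a whole block) instead of one.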
Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. To support automated unit testing, researchers have proposed a variety of techniques to assist developers in this time-consuming task.
Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, Published: May 21, 2022
Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as a condition, documents the intended behavior of the unit under a given test prefix. Synthesizing a functional test oracle is a challenging problem, as it must capture the intended functionality rather than the implemented functionality. In this paper, we propose TOGA (a neural method for Test Oracle GenerAtion), a unified transformer-based approach to infer both exceptional and assertion test oracles based on the context of the focal method. Our approach can handle units with ambiguous or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy and functional bug-finding. Our technique improves accuracy by 33% over existing oracle generation approaches, achieving 96% overall accuracy on a held-out test dataset. Furthermore, we show that when integrated with an automated test generation tool (EvoSuite), our approach finds 57 real world bugs in large-scale Java programs, including 30 bugs that are not found by any other automated testing method in our evaluation.
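The two oracle kinds TOGA targets can be illustrated with a minimal unit test (in Python for brevity; the `withdraw` focal method is a hypothetical example, not from the paper):

```python
import unittest

def withdraw(balance, amount):
    """Hypothetical focal method: deduct amount from balance."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

class WithdrawTest(unittest.TestCase):
    def test_assertion_oracle(self):
        result = withdraw(100, 30)           # test prefix: set up and call the unit
        self.assertEqual(result, 70)         # assertion oracle: expected behavior

    def test_exceptional_oracle(self):
        with self.assertRaises(ValueError):  # exceptional oracle: expected exception
            withdraw(10, 30)

# running these with a unittest runner exercises both oracle kinds
```

An oracle generator's job is to produce the final line of each test (the assertion, or the expectation that an exception is raised) given the prefix and the focal method's context.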
IEEE Transactions on Software Engineering, Journal Year: 2022, Volume and Issue: 49(4), P. 1580 - 1598, Published: June 15, 2022
Deep learning (DL) techniques have been used to support several code-related tasks such as code summarization and bug-fixing. In particular, pre-trained transformer models are on the rise, also thanks to the excellent results they achieved in Natural Language Processing (NLP) tasks. The basic idea behind these models is to first pre-train them on a generic dataset using a self-supervised task (e.g., filling masked words in sentences). Then, they are fine-tuned on the specific task of interest (e.g., language translation). A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning. This means that knowledge acquired to solve a specific task (e.g., language translation) can be useful to boost performance on another task (e.g., sentiment classification). While transfer learning has been widely studied in NLP, limited empirical evidence is available when it comes to code-related tasks. In this paper, we assess the performance of the Text-To-Text Transfer Transformer (T5) in supporting four different code-related tasks: (i) automatic bug-fixing, (ii) injection of code mutants, (iii) generation of assert statements, and (iv) code summarization. We pay particular attention to studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that the T5 can achieve better performance as compared to state-of-the-art baselines; also, while pre-training helps the model, not all tasks benefit from multi-task fine-tuning.
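The multi-task setup can be sketched with the T5 convention of prepending a task tag to each input, so that one shared model serves all four tasks; the tag strings below are illustrative assumptions, not the paper's exact prompts:

```python
# Each fine-tuning example is an (input, target) text pair; the task prefix
# tells the single shared model which code-related task to perform.
def make_example(task, source, target):
    prefixes = {
        "bug_fixing": "fix bug: ",
        "mutant_injection": "inject mutant: ",
        "assert_generation": "generate assert: ",
        "summarization": "summarize: ",
    }
    return prefixes[task] + source, target

inp, tgt = make_example("assert_generation",
                        "void testAdd() { int r = add(2, 2); <mask> }",
                        "assertEquals(4, r);")
```

In multi-task fine-tuning, batches mix examples from all four prefixes, which is what lets knowledge transfer (or interfere) across tasks.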
While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing our understanding of these models, we perform the first systematic empirical comparison of 19 recently-developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently-developed 4-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of models and their performances.
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1703 - 1726, Published: July 12, 2024
Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques are able to generate tests with reasonable coverage, they are shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained on a massive developer-written code corpus, they are capable of generating more human-like and meaningful test code. In this work, we perform the first empirical study to evaluate the capability of ChatGPT (i.e., one of the most representative LLMs with outstanding performance in code generation and comprehension) in unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the generated tests still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); but the passing tests almost resemble manually-written tests, achieving comparable quality and even sometimes being preferred by developers. Our findings indicate that generating unit tests with ChatGPT could be very promising if its correctness issues are further improved. Inspired by our findings above, we propose ChatTester, a novel ChatGPT-based approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTester incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates its effectiveness by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT. In addition to ChatGPT, we investigate the generalization capabilities of ChatTester by applying it to two recent open-source LLMs (i.e., CodeLlama-Instruct and CodeFuse), and the results show that ChatTester can also improve the tests generated by these LLMs.
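The generate-then-refine loop behind such an iterative refiner can be sketched as follows; the `generate` callback stands in for an LLM and the feedback format is an assumption of this sketch, not the tool's actual prompt:

```python
def iterative_refine(generate, max_rounds=3):
    """Repeatedly ask a generator for a test, run it, and feed any
    compilation/execution error back as feedback for the next attempt."""
    feedback = None
    for _ in range(max_rounds):
        test_src = generate(feedback)
        try:
            exec(compile(test_src, "<test>", "exec"), {})  # run the candidate test
            return test_src                                # it compiled and passed
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"      # e.g. a wrong assertion
    return None

# Stub generator: the first attempt has a wrong assertion; once it receives
# error feedback, the "refined" second attempt passes.
def stub_generate(feedback):
    return "assert 1 + 1 == 3" if feedback is None else "assert 1 + 1 == 2"

fixed = iterative_refine(stub_generate)
```

A real refiner would append the error message to the LLM prompt; the stub merely demonstrates that the loop terminates as soon as a candidate test runs cleanly.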
Code completion is one of the main features of modern Integrated Development Environments (IDEs). Its objective is to speed up code writing by predicting the next token(s) the developer is likely to write. Research in this area has substantially bolstered the predictive performance of these techniques. However, the support to developers is still limited to the prediction of the next few tokens to type. In this work, we take a step further in this direction by presenting a large-scale empirical study aimed at exploring the capabilities of state-of-the-art deep learning (DL) models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks (e.g., the iterated block of a for loop). To this aim, we train and test several adapted variants of the recently proposed RoBERTa model, and evaluate its predictions from several perspectives, including: (i) metrics usually adopted when assessing DL generative models (i.e., BLEU score and Levenshtein distance); (ii) the percentage of perfect predictions (i.e., predicted code snippets that match those written by developers); and (iii) the "semantic" equivalence of the generated code as compared to the one written by developers. The achieved results show that BERT models represent a viable solution for code completion, with perfect predictions ranging from ~7%, obtained when asking the model to guess entire blocks, up to ~58%, reached in the simpler scenario of few tokens masked from the same statement.
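Of the metrics listed, Levenshtein distance has a simple concrete definition: the minimum number of single-character insertions, deletions, and substitutions turning a prediction into the reference. A standard dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # deleting the first i chars of a
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete from a
                            curr[j - 1] + 1,     # insert into a
                            prev[j - 1] + cost)) # substitute (or match)
        prev = curr
    return prev[-1]
```

A prediction with distance 0 to the reference is exactly a "perfect prediction"; larger distances quantify how far a near-miss completion is from the developer-written code.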
Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, P. 163 - 174, Published: May 21, 2022
Unit testing could be used to validate the correctness of the basic units of the software system under test. To reduce manual efforts in conducting unit testing, the research community has contributed with tools that automatically generate unit test cases, including test inputs and oracles (e.g., assertions). Recently, ATLAS, a deep learning (DL) based approach, was proposed to generate assertions for a unit test based on other already written tests. Despite being promising, the effectiveness of ATLAS is still limited. To improve its effectiveness, in this work, we make the first attempt to leverage Information Retrieval (IR) in assertion generation and propose an IR-based approach, including IR-based assertion retrieval and retrieved-assertion adaptation techniques. In addition, we propose an integration approach to combine our IR-based approach with a DL-based approach (e.g., ATLAS) to further improve the effectiveness. Our experimental results show that our IR-based approach outperforms the state-of-the-art DL-based approach, and integrating the two can achieve even higher accuracy. Our results convey an important message that information retrieval could be competitive and worthwhile to pursue for software engineering tasks such as assertion generation, and should be seriously considered by the research community, given that in recent years DL-based solutions have been over-popularly adopted for software engineering tasks.
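The retrieval step can be sketched with token-level Jaccard similarity over test code, returning the assertion of the most similar corpus entry; the tokenizer, similarity measure, and corpus shape here are illustrative assumptions, not the paper's exact design:

```python
def tokens(code):
    """Crude code tokenizer: split on whitespace after padding punctuation."""
    for ch in "();,":
        code = code.replace(ch, " ")
    return set(code.split())

def jaccard(a, b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_assertion(query_test, corpus):
    """Return the assertion of the corpus test most similar to the query.

    corpus: list of (test_code, assertion) pairs already written by developers.
    """
    q = tokens(query_test)
    best = max(corpus, key=lambda pair: jaccard(q, tokens(pair[0])))
    return best[1]

corpus = [
    ("int r = add(2, 3)", "assertEquals(5, r)"),
    ("String s = name.trim()", 'assertEquals("bob", s)'),
]
assertion = retrieve_assertion("int r = add(1, 4)", corpus)
```

The adaptation step would then rewrite the retrieved assertion (e.g., renaming variables and updating literals) to fit the query test, which is where the combination with a DL model becomes attractive.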
Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion: automatically completing the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model using code semantics for test completion. The key insight underlying TeCo is that predicting the next statement requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18%, which is 29% higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo can generate runnable code in 29% of the cases, compared to 18% obtained by the baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
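The functional-correctness measure (is the generated next statement runnable after the test prefix?) can be approximated, in Python rather than the study's Java, by executing the prior statements and then the candidate; a minimal sketch under that assumption:

```python
def is_runnable(prior_statements, candidate):
    """Execute the prior test statements, then the generated next
    statement; the candidate counts as runnable only if neither raises."""
    env = {}
    try:
        exec("\n".join(prior_statements), env)  # replay the test prefix
        exec(candidate, env)                    # try the generated statement
        return True
    except Exception:
        return False

prior = ["nums = [3, 1, 2]", "nums.sort()"]
ok = is_runnable(prior, "assert nums[0] == 1")   # valid next statement
bad = is_runnable(prior, "assert nums[0] == 99") # wrong oracle, fails to run
```

Exact-match accuracy is stricter (the generated statement must textually equal the developer's), which is why the runnable-code rate reported above is the higher of the two numbers.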