Unit testing has become an essential practice during software development and maintenance. Effective unit tests can help guard and improve software quality, but they require a substantial amount of time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known challenging problem. Recent studies proposed to leverage neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and obtained promising results. However, after a systematic inspection, we find there are some inappropriate settings in existing evaluation methods for NTOG. These settings could mislead the understanding of NTOG approaches' performance. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate the impacts of these settings on the measured performance of NTOG approaches. We find that unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art approach TOGA by 61.8%, that FPR (False Positive Rate) is not a realistic metric and TOGA's Precision is only 0.38%, and that a straightforward baseline, NoException, which simply expects that no exception should be raised, finds 61% of the bugs found by TOGA with twice the Precision. Furthermore, we introduce an additional ranking step into the evaluation and propose a metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug-finding. We also propose a novel unsupervised method to instantiate this ranking step, significantly improving the cost-effectiveness of TOGA. Eventually, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, and summarize seven rules of thumb to boost NTOG approaches into their practical usages.
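As a concrete illustration of the metric argument and of the baseline, the following minimal Python sketch (hypothetical helper names, not the TEval+ implementation) shows why Precision and FPR can tell very different stories for an oracle generator, and how trivially the NoException baseline can be expressed:

# Minimal sketch (assumed data model): each generated oracle is executed against
# the buggy program, and an "alarm" is a test that fails.

def precision(true_alarms: int, false_alarms: int) -> float:
    """Fraction of raised alarms that correspond to real bugs."""
    total = true_alarms + false_alarms
    return true_alarms / total if total else 0.0

def false_positive_rate(false_alarms: int, true_negatives: int) -> float:
    """FPR divides false alarms by all non-buggy cases, so a huge pool of
    correct code keeps FPR low even when almost every alarm is spurious."""
    denom = false_alarms + true_negatives
    return false_alarms / denom if denom else 0.0

def no_exception_oracle(test_prefix) -> str:
    """The straightforward NoException baseline: expect the test prefix to run
    without raising any exception; an exception is reported as a bug."""
    try:
        test_prefix()
        return "pass"
    except Exception:
        return "alarm"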
Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to their reliance on bug-fixing datasets, either to craft fix templates (traditional) or to directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially avoid this issue. Very recently, researchers have leveraged LLMs directly for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on this important problem has yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways LLMs can be used to generate patches: 1) generate the entire patched function, 2) fill in a chunk of code given its prefix and suffix, and 3) output a single-line fix. We apply the LLMs under these settings on 5 datasets across different languages and compare them in terms of the number of bugs fixed, generation speed, and compilation rate. We also compare them against recent state-of-the-art APR tools. Our study demonstrates that the LLMs can already substantially outperform all of the APR tools in our comparison. Among the studied LLMs, a scaling effect exists where larger models tend to achieve better performance. Also, we show that the code after the buggy location (as adopted in infilling-style APR) helps in not only generating more fixes but also producing fixes with higher compilation rates. Besides patch generation, the LLMs consider correct patches to be more natural than other candidates, and can even be leveraged for effective patch ranking and patch correctness checking. Lastly, LLM-based APR can be further boosted by increasing the sample size and incorporating fix template information.
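To make the three repair settings concrete, the sketch below shows one plausible way such prompts could be assembled in Python; the templates and the infill token are illustrative assumptions, not the prompts used in the study.

# Hypothetical prompt construction for the three repair settings described above.

def complete_function_prompt(buggy_fn: str) -> str:
    # Setting 1: ask the model to regenerate the entire function.
    return ("# Provide a fixed version of the following buggy function\n"
            + buggy_fn + "\n# Fixed function:\n")

def infill_prompt(prefix: str, suffix: str, infill_token: str = "<FILL>") -> str:
    # Setting 2: keep the code before and after the buggy chunk and let an
    # infilling model fill the hole (the token name is model-specific).
    return prefix + infill_token + suffix

def single_line_prompt(prefix: str) -> str:
    # Setting 3: give everything up to the buggy line and ask for one replacement line.
    return prefix + "\n# Replace the buggy line with a single corrected line:\n"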
Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to reach various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help redirect the search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to its SBST-only and LLM-only baselines.
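The stall-and-ask loop at the heart of this idea can be pictured roughly as follows; the sbst and llm objects and their methods are hypothetical stand-ins, not CodaMosa's actual Pynguin/Codex interfaces.

# Rough sketch of the stall-and-ask loop described above (hypothetical helpers).

def codamosa_style_loop(sbst, llm, budget_seconds, stall_limit=3):
    stalled_iterations = 0
    best_coverage = 0.0
    while sbst.time_used() < budget_seconds:
        sbst.evolve_one_generation()          # ordinary search-based test generation
        cov = sbst.coverage()
        if cov > best_coverage:
            best_coverage, stalled_iterations = cov, 0
        else:
            stalled_iterations += 1
        if stalled_iterations >= stall_limit:
            # Coverage has stalled: ask the LLM for example calls to functions
            # the search has not covered well, then seed them into the population.
            for fn in sbst.under_covered_functions():
                example_tests = llm.generate_examples(fn)
                sbst.add_seeds(example_tests)
            stalled_iterations = 0
    return sbst.test_suite()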
Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and they generate an output. However, the creation of effective prompts for code-related tasks in few-shot learning has received little attention. We present a technique that automatically retrieves code demonstrations similar to the developer's task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant code demonstrations, Cedar is effective at both tasks, achieving exact-match accuracy of 76% and 52% for assertion generation and repair, respectively. For assertion generation, Cedar outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively; for program repair, it yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as Cedar could potentially be applied in multilingual and multitask settings without language-specific training, with minimal examples and effort.
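A retrieval-based prompt along these lines might be sketched as follows; the frequency-based similarity and the corpus layout here are illustrative assumptions (Cedar also supports embedding-based retrieval, which is omitted).

# Minimal sketch of frequency-based demonstration retrieval over a placeholder corpus.
from collections import Counter
import math

def tokenize(code: str) -> Counter:
    return Counter(code.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demonstrations(query: str, corpus: list[tuple[str, str]], k: int = 4):
    """Pick the k (input, output) pairs whose inputs look most like the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q, tokenize(pair[0])), reverse=True)
    return ranked[:k]

def build_prompt(instruction: str, demos, query: str) -> str:
    shots = "\n\n".join(f"Input:\n{i}\nOutput:\n{o}" for i, o in demos)
    return f"{instruction}\n\n{shots}\n\nInput:\n{query}\nOutput:\n"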
IEEE Transactions on Software Engineering, Journal Year: 2023, Volume and Issue: 50(1), P. 85 - 105, Published: Nov. 28, 2023
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for the automated generation of unit tests, but so far while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and its error message. We implement our approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods of a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed test generation technique, Nessie, achieves only 51.3% statement and 25.6% branch coverage. Experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, the older code-cushman-002 and StarCoder, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% coverage) and somewhat worse results with the latter (54.0% coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
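The adaptive re-prompting loop can be sketched as below; this is a Python-flavored schematic with hypothetical helpers, whereas TestPilot itself targets JavaScript and its concrete prompt format differs.

# Schematic version of the adaptive prompt-refinement loop described above.

def generate_test(llm, run_test, signature, body, doc_examples, max_attempts=3):
    prompt = (
        "// Write a unit test for the following function.\n"
        f"// Signature: {signature}\n"
        f"// Implementation:\n{body}\n"
        f"// Usage examples from documentation:\n{doc_examples}\n"
    )
    for _ in range(max_attempts):
        test = llm.complete(prompt)
        ok, error_message = run_test(test)
        if ok:
            return test
        # Re-prompt with the failing test and the error it produced.
        prompt += (
            "\n// The following test failed:\n" + test +
            "\n// Error message: " + error_message +
            "\n// Please provide a corrected test.\n"
        )
    return None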
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1585 - 1608, Published: July 12, 2024
Code translation tools, namely transpilers, are developed for automatic source-to-source translation. The latest learning-based transpilers have shown impressive enhancement over rule-based counterparts in both accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs on automated code translation tasks, finding that although certain LLMs have outperformed current transpilers, they still suffer from accuracy issues, where most failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by these findings, we further propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the auto-generated test cases to augment the code translation and then evaluates the correctness of translated programs via execution. Afterward, UniTrans (iteratively) repairs incorrectly translated programs, prompted by the test case execution results. Extensive experiments are conducted on six settings of translation datasets among Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested, and all achieve substantial improvements in terms of computational accuracy and exact match accuracy across almost all settings, showing the universal effectiveness of UniTrans in practice.
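The three-stage flow can be pictured roughly as follows; llm.generate_test_cases, llm.translate, llm.repair, and run_against are hypothetical wrappers for illustration, not UniTrans's actual interfaces.

# Rough sketch of the translate / test / repair loop described above.

def unitrans_style_translate(llm, source_program, run_against, max_repairs=2):
    # Stage 1: craft test cases for the target program with help from the source.
    test_cases = llm.generate_test_cases(source_program)
    # Stage 2: augment the translation prompt with those test cases, then check
    # the translation by executing it against them.
    translation = llm.translate(source_program, hints=test_cases)
    failures = run_against(translation, test_cases)
    # Stage 3: iteratively repair translations that fail their tests, feeding
    # the execution results back into the prompt.
    attempts = 0
    while failures and attempts < max_repairs:
        translation = llm.repair(translation, failures)
        failures = run_against(translation, test_cases)
        attempts += 1
    return translation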
Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion: automatically completing the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model that uses code semantics for test completion. The key insight underlying TeCo is that predicting the next statement requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18, which is 29% higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo generates runnable code in a considerably larger fraction of cases than the 18% obtained by the baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
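The shape of the task can be illustrated as follows; the field names and the flattened textual encoding are assumptions for illustration and do not reproduce TeCo's six kinds of semantic data or its Java tooling.

# Schematic input/output for the test completion task (hypothetical structure).
from dataclasses import dataclass

@dataclass
class TestCompletionInput:
    method_under_test: str       # source of the focal method
    prior_statements: list[str]  # statements already written in the test method
    semantic_data: dict          # e.g. types/values observed when executing the prefix

def build_model_input(x: TestCompletionInput) -> str:
    # The model is asked to predict the next test statement from this context.
    return "\n".join([
        "FOCAL METHOD:", x.method_under_test,
        "TEST SO FAR:", *x.prior_statements,
        "SEMANTICS:", str(x.semantic_data),
        "NEXT STATEMENT:",
    ])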
Proceedings of the ACM on Programming Languages, Journal Year: 2024, Volume and Issue: 8(OOPSLA1), P. 1100 - 1124, Published: April 29, 2024
Students often make mistakes in their introductory programming assignments as part of the learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering effort or large amounts of data and training. We propose to use a large language model trained on code, Codex (a version of GPT), to build an APR system -- PyDex -- for introductory Python programming assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate PyDex on 286 real student programs and compare it to three baselines, including one that combines a state-of-the-art Python syntax repair engine, BIFI, with a state-of-the-art Python repair engine for student assignments, Refactory. We find that PyDex can fix more programs and produces smaller patches on average.
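One plausible reading of iterative querying combined with test-case-based selection is sketched below; the helpers are hypothetical, and PyDex's actual prompt construction, few-shot selection, and chunking are richer than this.

# Sketch of iterative querying with test-case-based validation of candidates.

def select_repair(llm, buggy_program, tests, num_samples=10):
    """Query the model several times and keep the first candidate that passes
    the assignment's test cases; return None if no sample passes."""
    for _ in range(num_samples):
        candidate = llm.propose_fix(buggy_program)
        if all(test(candidate) for test in tests):
            return candidate
    return None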
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 951 - 971, Published: July 12, 2024
Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work on using large language models (LLMs) for test generation has focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but uses fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLM-based test generation. SymPrompt's approach is based on recent work demonstrating that LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each stage of which is driven by a specific prompt aligned with the execution paths of the method under test and exposes relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
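A rough approximation of path-aligned, multi-stage prompting is sketched below; the path collection here uses Python's ast module over if-conditions as a stand-in, whereas SymPrompt's actual analysis uses TreeSitter and richer type and dependency context.

# Sketch of path-aligned, multi-stage prompting (hypothetical simplification).
import ast

def approximate_paths(source: str):
    """Very rough stand-in for path collection: one 'path constraint' per
    if-condition found in the focal function's source."""
    tree = ast.parse(source)
    conditions = [ast.unparse(n.test) for n in ast.walk(tree) if isinstance(n, ast.If)]
    return conditions or ["<no branches>"]

def path_aligned_prompts(focal_source: str, type_context: str):
    prompts = []
    for cond in approximate_paths(focal_source):
        prompts.append(
            "# Relevant types and dependencies:\n" + type_context + "\n"
            + focal_source + "\n"
            + f"# Write a test that drives execution down the path where {cond} holds:\n"
        )
    return prompts  # one generation stage per path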