Test Oracle Automation in the Era of LLMs
Facundo Molina, Alessandra Gorla, Marcelo d’Amorim

et al.

ACM Transactions on Software Engineering and Methodology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 27, 2025

The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for test oracle automation. We discuss the progress made in the field of oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we review recent studies that apply LLMs to this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about directions and opportunities for LLM-based oracle automation.
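
As an illustration of what LLM-based oracle automation typically involves, the following is a minimal sketch (not from the paper) that samples candidate assertions from a model and keeps only those that are at least syntactically valid. The prompt format and the `query_llm` hook are assumptions; judging whether a surviving assertion is actually useful, e.g., against mutants, is exactly the open challenge the abstract raises.

```python
# Minimal sketch of LLM-based oracle generation: given a method under test and a
# test prefix, ask a language model for candidate assert statements, then keep
# only those that parse. `query_llm` is a hypothetical completion hook.
from typing import Callable, List

ORACLE_PROMPT = """You are generating a test oracle.
Method under test:
{method}

Test prefix:
{prefix}

Complete the test with a single assert statement:"""

def generate_oracles(method: str, prefix: str,
                     query_llm: Callable[[str], str],
                     n_samples: int = 5) -> List[str]:
    """Sample candidate assertions and filter out ones that do not even parse."""
    candidates = []
    for _ in range(n_samples):
        completion = query_llm(ORACLE_PROMPT.format(method=method, prefix=prefix))
        assertion = completion.strip().splitlines()[0] if completion.strip() else ""
        if assertion.startswith("assert"):
            try:
                compile(assertion, "<oracle>", "exec")  # cheap syntactic check
                candidates.append(assertion)
            except SyntaxError:
                pass
    return candidates
```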

Language: English

Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper

et al.

Published: May 1, 2021

Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large, generic dataset using a self-supervised task (e.g., filling in masked words in sentences). Once pre-trained, the model is fine-tuned on smaller, specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune the model by reusing the datasets of four previous works that used DL to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared this single model with the results reported in the original papers proposing DL-based solutions for those four tasks. We show that our T5 model, exploiting additional data in the pre-training phase, achieves improvements over the four baselines.
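
For readers unfamiliar with the recipe, the following is a hedged sketch of a single fine-tuning step of a pre-trained T5 model framed as text-to-text learning, using the Hugging Face `transformers` API. The task prefix and the toy bug-fixing pair are illustrative assumptions, not the paper's actual datasets.

```python
# Hedged sketch of the T5 fine-tuning recipe described above: a pre-trained
# text-to-text model is adapted to one code task by framing it as
# "input text -> output text".
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One (buggy code, fixed code) pair standing in for a fine-tuning dataset.
source = "fix bug: if (x = 0) { return; }"
target = "if (x == 0) { return; }"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
print(f"fine-tuning step loss: {loss.item():.3f}")
```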

Language: English

Citations

176

Automated Program Repair in the Era of Large Pre-trained Language Models
Chunqiu Steven Xia, Yuxiang Wei, Lingming Zhang

et al.

Published: May 1, 2023

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on the important APR problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent state-of-the-art LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, 3) output a single-line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare the number of bugs fixed, generation speed and compilation rate. We also compare the LLMs against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all of our studied APR tools. Among the studied LLMs, a scaling effect exists where larger models tend to achieve better performance. Also, we show for the first time that the suffix code after the buggy line (adopted in infilling-style APR) helps in not only generating more fixes but also patches with higher compilation rates. Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking and patch correctness checking. Lastly, LLM-based APR can be further boosted via: increasing the sample size, and incorporating fix template information.
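
The three repair settings in the study boil down to three different ways of shaping the model's input. The sketch below illustrates them as plain prompt construction; the prompt formats and the `<INFILL>`/`<FIX_ME>` sentinels are assumptions for illustration, not the paper's exact inputs.

```python
# Illustrative sketch of the three repair settings: (1) regenerate the whole
# buggy function, (2) infill a chunk between a prefix and a suffix, and
# (3) produce a single-line fix for a removed buggy line.
def complete_function_prompt(signature: str, docstring: str) -> str:
    # Setting 1: the model generates the entire function body from scratch.
    return f"{signature}\n    \"\"\"{docstring}\"\"\"\n"

def infilling_prompt(prefix: str, suffix: str) -> str:
    # Setting 2: the model fills the masked chunk given the surrounding code.
    return f"{prefix}\n<INFILL>\n{suffix}"

def single_line_prompt(function_src: str, buggy_line_no: int) -> str:
    # Setting 3: the buggy line is replaced and the model proposes a new line.
    lines = function_src.splitlines()
    lines[buggy_line_no] = "<FIX_ME>"
    return "\n".join(lines)
```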

Language: English

Citations

146

CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models
Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri

et al.

Published: May 1, 2023

Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help SBST redirect its search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST and LLM-only baselines.
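
A rough sketch of the stall-then-ask loop described above is given below; `run_sbst_iteration`, `coverage_of`, `under_covered_functions`, and `query_llm` are hypothetical hooks standing in for the search engine and the model, not CodaMosa's actual API.

```python
# Run search-based test generation; when coverage stops improving, query a code
# LLM for example tests targeting under-covered functions and seed them back
# into the search population.
def codamosa_style_loop(test_suite, under_covered_functions,
                        run_sbst_iteration, coverage_of, query_llm,
                        stall_limit=10, budget=100):
    best_coverage = coverage_of(test_suite)
    stalled = 0
    for _ in range(budget):
        test_suite = run_sbst_iteration(test_suite)
        current = coverage_of(test_suite)
        if current > best_coverage:
            best_coverage, stalled = current, 0
        else:
            stalled += 1
        if stalled >= stall_limit:
            # Coverage plateau: ask the LLM for example tests and reseed the search.
            for fn in under_covered_functions(test_suite):
                example = query_llm(f"Write a test case for:\n{fn}")
                test_suite.append(example)
            stalled = 0
    return test_suite
```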

Language: English

Citations

110

Using pre-trained models to boost code review automation
Rosalia Tufano, Simone Masiero, Antonio Mastropaolo

et al.

Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, P. 2291 - 2302

Published: May 21, 2022

Code review is a practice widely adopted in open source and industrial projects. Given the non-negligible cost of such a process, researchers started investigating the possibility of automating specific code review tasks. We recently proposed Deep Learning (DL) models targeting the automation of two tasks: the first model takes as input code submitted for review and implements in it changes likely to be recommended by a reviewer; the second model takes as input the submitted code and a reviewer comment posted in natural language and automatically implements the change required by the reviewer. While the preliminary results we achieved are encouraging, both models had been tested in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices made when designing both the technique and the experiments. In this paper, we build on top of that work by demonstrating that a pre-trained Text-To-Text Transfer Transformer (T5) model can outperform previous DL models for automating code review tasks. Also, we conducted our experiments on a larger and more realistic (and challenging) dataset of code review activities.
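
The two automated tasks can be viewed as two text-to-text mappings for a sequence-to-sequence model such as T5. The sketch below is a minimal illustration of that framing; the tags and the `seq2seq` callable are assumptions, not the paper's exact input representation.

```python
# Minimal sketch of the two code review automation tasks as text-to-text problems.
from typing import Callable

def code_to_code(submitted_code: str, seq2seq: Callable[[str], str]) -> str:
    """Contributor side: predict the revised code a reviewer would likely ask for."""
    return seq2seq(f"<code> {submitted_code} </code>")

def code_and_comment_to_code(submitted_code: str, reviewer_comment: str,
                             seq2seq: Callable[[str], str]) -> str:
    """Reviewer side: implement the change requested in a natural language comment."""
    return seq2seq(f"<code> {submitted_code} </code> <comment> {reviewer_comment} </comment>")
```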

Language: English

Citations

88

Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning
Noor Nashid, Mifta Sintaha, Ali Mesbah

et al.

Published: May 1, 2023

Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and generate an output. However, the creation of an effective prompt for code-related tasks in few-shot learning has received little attention. We present a technique for prompt creation that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant demonstrations, Cedar achieves an accuracy of 76% and 52% for exact matches in test assertion generation and program repair, respectively. For assertion generation, Cedar outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively; for program repair, Cedar yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as the approach could potentially be applied in multilingual and multitask settings without language-specific training and with minimal examples and effort.
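
As a sketch of retrieval-based prompt construction in the spirit of the approach above, the snippet below ranks a pool of (input, output) demonstrations by lexical similarity to the developer's query (frequency analysis via TF-IDF here; an embedding model would slot in the same way) and prepends the top-k to the prompt. The prompt layout is an assumption for illustration.

```python
# Rank demonstrations by TF-IDF cosine similarity to the query and build a
# few-shot prompt from the k most similar ones.
from typing import List, Tuple
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_few_shot_prompt(query: str, demo_pool: List[Tuple[str, str]], k: int = 3) -> str:
    inputs = [demo_in for demo_in, _ in demo_pool]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(inputs + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    shots = "\n\n".join(f"Input:\n{demo_pool[i][0]}\nOutput:\n{demo_pool[i][1]}" for i in top)
    return f"{shots}\n\nInput:\n{query}\nOutput:\n"
```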

Language: English

Citations

69

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
Max Schäfer, Sarah Nadi, Aryaz Eghbali

et al.

IEEE Transactions on Software Engineering, Journal Year: 2023, Volume and Issue: 50(1), P. 85 - 105

Published: Nov. 28, 2023

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without such manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of the function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and the error message. We implement this approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Experiments excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 and StarCoder, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage) and somewhat worse results with the latter (54.0%), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
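
The adaptive generate-and-repair loop described above can be sketched as follows; `query_llm` and `run_test` (returning an error message or `None`) are hypothetical hooks, not TestPilot's actual API, and the prompt layout is an assumption.

```python
# Prompt the model with the function's signature, body, and doc examples; if the
# generated test fails, re-prompt with the failing test and its error message.
from typing import Callable, Optional

def generate_test(signature: str, body: str, doc_examples: str,
                  query_llm: Callable[[str], str],
                  run_test: Callable[[str], Optional[str]],
                  max_repairs: int = 2) -> Optional[str]:
    prompt = (f"// Function under test\n{signature}\n{body}\n"
              f"// Usage examples from the documentation\n{doc_examples}\n"
              f"// Write a unit test for this function\n")
    test = query_llm(prompt)
    for _ in range(max_repairs):
        error = run_test(test)
        if error is None:
            return test                      # test compiles and passes
        # Re-prompt with the failing test and the observed error message.
        test = query_llm(f"{prompt}\n// Previous attempt:\n{test}\n"
                         f"// It failed with: {error}\n// Fixed test:\n")
    return None
```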

Language: English

Citations

69

On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot
Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi

et al.

Published: May 1, 2023

Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While its usefulness is evident, it is still unclear to what extent it is robust. Specifically, we do not know whether semantic-preserving changes in the natural language description provided to the model have an effect on the generated function. In this paper we present an empirical study that aims at understanding whether different but semantically equivalent descriptions result in the same recommended function. A negative answer would pose questions on the robustness of deep learning (DL)-based code generators, since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to generate 892 Java methods starting from their original Javadoc description. Then, we created semantically equivalent descriptions for each method, both manually and automatically, and analyzed the extent to which the predictions generated by Copilot changed. Our results show that modifying the description results in different recommendations in ∼46% of cases. Also, such differences might impact the correctness of the generated code (±28%).
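
The core measurement can be illustrated with a small sketch: generate code from the original description and from semantically equivalent paraphrases, then count how often the recommendation changes. `paraphrase` and `generate_code` are placeholders for the paraphrasing step and the code generator, and simple string inequality is used here as a stand-in for the paper's more careful comparison.

```python
# Fraction of functions whose recommendation changes under at least one
# semantically equivalent rewording of the description.
from typing import Callable, Iterable, List

def recommendation_change_rate(descriptions: List[str],
                               paraphrase: Callable[[str], Iterable[str]],
                               generate_code: Callable[[str], str]) -> float:
    changed = 0
    for desc in descriptions:
        original = generate_code(desc)
        if any(generate_code(p) != original for p in paraphrase(desc)):
            changed += 1
    return changed / len(descriptions) if descriptions else 0.0
```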

Language: English

Citations

54

ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation
Yutian Tang, Zhijie Liu, Zhichao Zhou

et al.

IEEE Transactions on Software Engineering, Journal Year: 2024, Volume and Issue: 50(6), P. 1340 - 1359

Published: March 29, 2024

Recent advancements in large language models (LLMs) have demonstrated exceptional success in a wide range of general domain tasks, such as question answering and following instructions. Moreover, LLMs have shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases compared to EvoSuite, this work provides valuable insights into the performance of LLMs in solving software engineering problems. Overall, our findings underscore their potential and pave the way for further research in this area.
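
The study itself targets Java, but one of its comparison dimensions, code coverage of a generated suite, is easy to illustrate in a language-agnostic way. Below is a hedged Python sketch using coverage.py and pytest; file and module names are placeholders.

```python
# Measure the statement coverage a generated test suite achieves on the code
# under test, one of the factors compared in the study.
import coverage
import pytest

def measure_statement_coverage(test_file: str, source_module: str) -> float:
    cov = coverage.Coverage(source=[source_module])
    cov.start()
    pytest.main(["-q", test_file])          # run the generated test suite
    cov.stop()
    return cov.report(show_missing=False)   # total coverage percentage
```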

Language: English

Citations

22

TypeWriter: neural type prediction with search-based validation
Michael Pradel, Georgios Gousios, Jason Liu

et al.

Published: Nov. 8, 2020

Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging due to the absence of type annotations: simple data compatibility errors proliferate, IDE support is limited, and APIs are hard to comprehend. Recent work attempts to address those issues through either static type inference or probabilistic type prediction. Unfortunately, static inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic type prediction with search-based refinement of predicted types. TypeWriter's predictor learns to infer the return and argument types of functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, navigating the space of possible type combinations in a feedback-directed manner. We implement the approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 1,137 popular open-source projects. We show that the type predictor achieves an F1 score of 0.64 (0.79) in the top-1 (top-5) predictions for return types, and 0.57 (0.80) for argument types, which clearly outperforms prior type prediction models. By combining prediction with validation, TypeWriter can fully annotate between 14% and 44% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. TypeWriter currently suggests types to developers, and several thousands of suggested types have already been accepted with minimal changes.
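
A simplified sketch of the validation step described above is given below: try combinations of the top-k predicted types for each slot and keep the first combination the type checker accepts. `annotate` and `type_checks` stand in for rewriting the source with the chosen annotations and for invoking a gradual type checker (e.g., running mypy or Pyre on the annotated file); the exhaustive enumeration here is a simplification of TypeWriter's feedback-directed search.

```python
# Try ranked combinations of predicted types and return the first one the
# gradual type checker accepts.
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

def validate_predictions(candidates_per_slot: Sequence[Sequence[str]],
                         annotate: Callable[[Sequence[str]], str],
                         type_checks: Callable[[str], bool]) -> Optional[Tuple[str, ...]]:
    """candidates_per_slot[i] holds the ranked type predictions for slot i
    (a parameter or the return type); combinations are tried in ranked order."""
    for combo in product(*candidates_per_slot):
        annotated_source = annotate(combo)
        if type_checks(annotated_source):
            return combo            # first combination accepted by the checker
    return None                     # fall back to leaving the slots unannotated
```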

Language: English

Citations

82

Towards Automating Code Review Activities
Rosalia Tufano, Luca Pascarella, Michele Tufano

et al.

Published: May 1, 2021

Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and a lower likelihood of introducing bugs. However, since code review is a manual activity it comes at the cost of spending developers' time on reviewing their teammates' code. Our goal is to make the first step towards partially automating the code review process, thus, possibly reducing the costs associated with it. We focus on both the contributor and the reviewer sides of the process by training two different Deep Learning architectures. The first one learns code changes performed by developers during real code review activities, thus providing the contributor with a revised version of her code implementing transformations usually recommended before the code is even submitted for review. The second one automatically provides the reviewer commenting on the submitted code with the revised code implementing her comments expressed in natural language. The empirical evaluation of the two models shows that, on the contributor side, the trained model succeeds in replicating the applied code transformations in up to 16% of cases. On the reviewer side, the model can correctly implement a comment provided in natural language in up to 31% of cases. While these results are encouraging, more research is needed to make these models usable by developers.

Language: English

Citations

80