Test Oracle Automation in the Era of LLMs
Facundo Molina, Alessandra Gorla, Marcelo d’Amorim, et al.

ACM Transactions on Software Engineering and Methodology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 27, 2025

The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for test oracle automation. We discuss the progress made in the field of oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we analyze recent studies that use LLMs for this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about future directions and opportunities for LLM-based oracle automation.
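
Since the roadmap centers on test oracles, a concrete illustration may help. Below is a minimal JUnit 4 sketch (using java.util.Stack purely as an example class; the class and test names are illustrative) of the two common oracle kinds the surveyed techniques try to generate: assertion oracles and exceptional oracles.

    import static org.junit.Assert.assertEquals;

    import java.util.EmptyStackException;
    import java.util.Stack;

    import org.junit.Test;

    public class StackOracleExamples {

        // Assertion oracle: a condition documenting intended behavior.
        @Test
        public void pushThenPopReturnsThePushedElement() {
            Stack<Integer> stack = new Stack<>();
            stack.push(42);
            assertEquals(Integer.valueOf(42), stack.pop());
        }

        // Exceptional oracle: the test passes only if the expected
        // exception is raised by the unit under test.
        @Test(expected = EmptyStackException.class)
        public void popOnEmptyStackThrows() {
            new Stack<Integer>().pop();
        }
    }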

Language: English

An Empirical Study on the Usage of Transformer Models for Code Completion
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, et al.

IEEE Transactions on Software Engineering, Journal Year: 2021, Volume and Issue: unknown, P. 1 - 1

Published: Jan. 1, 2021

Code completion aims at speeding up code writing by predicting the next token(s) the developer is likely to write. Work in this field has focused on improving the accuracy of the generated predictions, with substantial leaps forward made possible by deep learning (DL) models. However, these techniques are mostly evaluated in the scenario of predicting the next token to type, with few exceptions pushing the boundaries to the prediction of an entire statement. Thus, little is known about the performance of state-of-the-art approaches in more challenging scenarios in which, for example, an entire code block must be generated. We present a large-scale study exploring the capabilities of Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, and entire code blocks (e.g., the iterated block of a for loop). We experimented with several variants of two recently proposed Transformer-based models, namely RoBERTa and the Text-To-Text Transfer Transformer (T5), for the task of code completion. The achieved results show that Transformer-based models, and in particular T5, represent a viable solution for code completion, with perfect predictions ranging from ~29%, obtained when asking the model to guess entire blocks, up to ~69%, reached in the simpler scenario of a few masked tokens from the same code statement.
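
As a hedged illustration of the three granularity levels (the paper's exact masking setup may differ), consider what a model would be asked to reconstruct in the following example method:

    public class GranularityLevels {

        // Ground-truth code; the comments mark what would be hidden
        // from the model at each completion granularity.
        public int sumOfSquares(int[] values) {
            // Token level: only the last tokens of a statement are
            // masked, e.g. complete "int n = values." with "length;".
            int n = values.length;
            int total = 0;
            for (int i = 0; i < n; i++) {
                // Block level: the entire body of the for loop is
                // masked and must be generated in full.
                total += values[i] * values[i];
            }
            // Statement level: a whole statement, such as the return
            // statement below, is masked and predicted.
            return total;
        }
    }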

Language: English

Citations: 65

Generating accurate assert statements for unit test cases using pretrained transformers
Michele Tufano, Dawn Drain, A. Svyatkovskiy, et al.

Published: May 17, 2022

Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Researchers in automated testing have proposed a variety of techniques to assist developers in this time-consuming task.

Language: English

Citations: 53

TOGA
Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, et al.

Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown

Published: May 21, 2022

Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as a condition, documents the intended behavior of a unit under a given test prefix. Synthesizing a functional test oracle is a challenging problem, as it must capture the intended functionality rather than the implemented functionality. In this paper, we propose TOGA (a neural method for Test Oracle GenerAtion), a unified transformer-based approach to infer both exceptional and assertion test oracles based on the context of the focal method. Our approach can handle units with ambiguous or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy and bug-finding. Our technique improves accuracy by 33% over existing oracle inference approaches, achieving 96% overall accuracy on a held out test dataset. Furthermore, we show that when integrated with an automated test generation tool (EvoSuite), our approach finds 57 real world bugs in large-scale Java programs, including 30 bugs that are not found by any other method in our evaluation.
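
To make the inference task concrete, here is a schematic sketch (not TOGA's actual input encoding or model output; the focal method and test names are illustrative): given a focal method and a unit test prefix, the generator must supply either an assertion or an expected-exception oracle.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class DivideOracleInference {

        // Focal method: the unit whose intended behavior the inferred
        // oracle must capture.
        static int divide(int a, int b) {
            if (b == 0) {
                throw new IllegalArgumentException("division by zero");
            }
            return a / b;
        }

        // Given only the prefix "int result = divide(10, 2);", an
        // oracle generator infers the assertion on the next line.
        @Test
        public void inferredAssertionOracle() {
            int result = divide(10, 2); // test prefix
            assertEquals(5, result);    // inferred assertion oracle
        }

        // For a prefix exercising the error path, the inferred oracle
        // is an expected-exception declaration, not an assertion.
        @Test(expected = IllegalArgumentException.class)
        public void inferredExceptionalOracle() {
            divide(10, 0);
        }
    }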

Language: English

Citations: 53

Using Transfer Learning for Code-Related Tasks
Antonio Mastropaolo, Nathan Cooper, David N. Palacio, et al.

IEEE Transactions on Software Engineering, Journal Year: 2022, Volume and Issue: 49(4), P. 1580 - 1598

Published: June 15, 2022

Deep learning (DL) techniques have been used to support several code-related tasks such as code summarization and bug-fixing. In particular, pre-trained transformer models are on the rise, also thanks to the excellent results they achieved in Natural Language Processing (NLP) tasks. The basic idea behind these models is to first pre-train them on a generic dataset using a self-supervised task (e.g., filling masked words in sentences). Then, they are fine-tuned on a specific dataset of interest (e.g., language translation). A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning. This means that knowledge acquired to solve a specific task (e.g., language translation) can be useful to boost performance on another task (e.g., sentiment classification). While transfer learning has been widely studied in NLP, limited empirical evidence is available when it comes to code-related tasks. In this paper, we assess the performance of the Text-To-Text Transfer Transformer (T5) model in supporting four different code-related tasks: (i) automatic bug-fixing, (ii) injection of code mutants, (iii) generation of assert statements, and (iv) code summarization. We pay particular attention to studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that T5 can achieve better performance compared to state-of-the-art baselines; also, while pre-training helps the model, not all tasks benefit from multi-task fine-tuning.
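
The text-to-text framing means each of the four tasks is cast as a string-to-string mapping, so a single model can serve them all. The sketch below shows hypothetical task-prefixed input/output pairs; the actual serialization, task prefixes, and placeholder tokens used by the paper are assumptions here.

    public class TextToTextTasks {

        public static void main(String[] args) {
            // Hypothetical (input -> expected output) pairs, one per task.
            String[][] examples = {
                // (i) automatic bug-fixing: buggy method -> fixed method
                { "fix bug: int sum(int a, int b) { return a - b; }",
                  "int sum(int a, int b) { return a + b; }" },
                // (ii) mutant injection: correct method -> mutated method
                { "inject mutant: int sum(int a, int b) { return a + b; }",
                  "int sum(int a, int b) { return a - b; }" },
                // (iii) assert generation: test without oracle -> assertion
                { "generate assert: int r = sum(1, 2); <AssertPlaceHolder>",
                  "assertEquals(3, r);" },
                // (iv) code summarization: method -> natural-language summary
                { "summarize: int sum(int a, int b) { return a + b; }",
                  "returns the sum of two integers" }
            };
            for (String[] e : examples) {
                System.out.println(e[0] + "  ->  " + e[1]);
            }
        }
    }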

Language: English

Citations: 41

An Empirical Comparison of Pre-Trained Models of Source Code
Changan Niu, Chuanyi Li, Vincent Ng, et al.

Published: May 1, 2023

While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing our understanding of these models, we perform the first systematic empirical comparison of 19 recently-developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently-developed 4-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of pre-trained models and their performances on different SE tasks.

Language: English

Citations: 35

A3Test: Assertion-Augmented Automated Test case generation
Saranya Alagarsamy, Chakkrit Tantithamthavorn, Aldeida Aleti, et al.

Information and Software Technology, Journal Year: 2024, Volume and Issue: 176, P. 107565 - 107565

Published: Aug. 30, 2024

Language: English

Citations: 14

Evaluating and Improving ChatGPT for Unit Test Generation
Zhiqiang Yuan, Mingwei Liu, Shiji Ding, et al.

Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1703 - 1726

Published: July 12, 2024

Unit testing plays an essential role in detecting bugs in functionally-discrete program units (e.g., methods). Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques are able to generate tests with reasonable coverage, they have been shown to exhibit low readability and still cannot be directly adopted by developers in practice. Recent work has shown the large potential of large language models (LLMs) in unit test generation. By being pre-trained on a massive developer-written code corpus, they are capable of generating more human-like and meaningful test code. In this work, we perform the first empirical study to evaluate the capability of ChatGPT (i.e., one of the most representative LLMs with outstanding performance in code generation and comprehension) in unit test generation. In particular, we conduct both a quantitative analysis and a user study to systematically investigate the quality of its generated tests in terms of correctness, sufficiency, readability, and usability. We find that the tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures (mostly caused by incorrect assertions); but the passing tests resemble manually-written tests, achieving comparable coverage, readability, and even sometimes developers' preference. Our findings indicate that ChatGPT-based unit test generation could be very promising if the correctness of the generated tests were further improved. Inspired by the findings above, we propose ChatTester, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTester incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTester by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT. In addition to ChatGPT, we investigate the generalization capabilities of ChatTester by applying it to two recent open-source LLMs (i.e., CodeLLama-Instruct and CodeFuse), and the results show that ChatTester can also improve the tests generated by these LLMs.
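
A minimal sketch of the generate-then-refine loop described above. LlmClient, its complete method, and compileAndRun are hypothetical stand-ins, not the paper's actual interfaces; a real implementation would call a ChatGPT API and invoke javac/JUnit.

    public class IterativeTestRefiner {

        // Hypothetical LLM binding: prompt in, completion out.
        interface LlmClient {
            String complete(String prompt);
        }

        static String generateTest(LlmClient llm, String focalMethod,
                                   int maxIterations) {
            // Initial test generator: ask the model for a first test.
            String test = llm.complete("Write a JUnit test for:\n" + focalMethod);
            for (int i = 0; i < maxIterations; i++) {
                String diagnostics = compileAndRun(test);
                if (diagnostics == null) {
                    return test; // compiles and passes: stop refining
                }
                // Iterative test refiner: feed the errors back to the model.
                test = llm.complete("This test fails with:\n" + diagnostics
                        + "\nFix the test:\n" + test);
            }
            return test; // best effort once the iteration budget is spent
        }

        // Placeholder: assumed to return compiler/JUnit diagnostics,
        // or null when the test compiles and passes.
        static String compileAndRun(String testSource) {
            throw new UnsupportedOperationException("environment-specific");
        }
    }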

Language: English

Citations: 11

An Empirical Study on the Usage of BERT Models for Code Completion
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, et al.

Published: May 1, 2021

Code completion is one of the main features of modern Integrated Development Environments (IDEs). Its objective is to speed up code writing by predicting the next token(s) the developer is likely to write. Research in this area has substantially bolstered the predictive performance of these techniques. However, the support to developers is still limited to the prediction of the next few tokens to type. In this work, we take a step further in this direction by presenting a large-scale empirical study aimed at exploring the capabilities of state-of-the-art deep learning (DL) models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, and entire code blocks (e.g., the iterated block of a for loop). To this aim, we train and test several adapted variants of the recently proposed RoBERTa model, and evaluate its predictions from several perspectives, including: (i) metrics usually adopted when assessing DL generative models (i.e., BLEU score and Levenshtein distance); (ii) the percentage of perfect predictions (i.e., predicted code snippets that match those written by developers); and (iii) the "semantic" equivalence of the generated code as compared to the one written by developers. The achieved results show that BERT models represent a viable solution for code completion, with perfect predictions ranging from ~7%, obtained when asking the model to guess entire blocks, up to ~58%, reached in the simpler scenario of a few masked tokens from the same code statement.
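
Of the metrics mentioned, the Levenshtein distance is easy to make concrete: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn a prediction into the reference. A standard dynamic-programming implementation, for reference:

    public class Levenshtein {

        static int distance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                                d[i][j - 1] + 1),  // insertion
                                       d[i - 1][j - 1] + cost);    // substitution
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            // Distance between a predicted and a reference statement.
            System.out.println(distance("return a + b;", "return a - b;")); // 1
        }
    }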

Language: English

Citations: 55

Automated assertion generation via information retrieval and its integration with deep learning
Hao Yu, Yiling Lou, Ke Sun, et al.

Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, P. 163 - 174

Published: May 21, 2022

Unit testing can be used to validate the correctness of the basic units of a software system under test. To reduce the manual effort of conducting unit testing, the research community has contributed tools that automatically generate unit test cases, including test inputs and test oracles (e.g., assertions). Recently, ATLAS, a deep learning (DL) based approach, was proposed to generate assertions for a unit test based on other already written tests. Despite being promising, the effectiveness of ATLAS is still limited. To improve the effectiveness, in this work, we make the first attempt to leverage information retrieval (IR) in assertion generation and propose an IR-based approach consisting of a technique for assertion retrieval and a technique for retrieved-assertion adaptation. In addition, we propose an integration approach to combine our IR-based approach with a DL-based approach (e.g., ATLAS) to further improve the effectiveness. Our experimental results show that our IR-based approach outperforms the state-of-the-art DL-based approach, and integrating it with the DL-based approach can achieve even higher accuracy. Our results convey an important message: information retrieval is competitive and worthwhile to pursue for software engineering tasks such as assertion generation, and should be seriously considered by the community, given that in recent years DL-based solutions have been over-popularly adopted for such tasks.
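
A minimal sketch of the retrieval step under stated assumptions: token-set Jaccard similarity as the measure and a map from test prefixes to assertions as the corpus. The paper's actual retrieval and adaptation techniques are more involved.

    import java.util.*;

    public class AssertionRetrieval {

        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        static Set<String> tokens(String code) {
            return new HashSet<>(Arrays.asList(code.split("\\W+")));
        }

        // Returns the assertion of the most similar test in the corpus.
        static String retrieveAssertion(String query, Map<String, String> corpus) {
            String best = null;
            double bestScore = -1;
            for (Map.Entry<String, String> e : corpus.entrySet()) {
                double s = jaccard(tokens(query), tokens(e.getKey()));
                if (s > bestScore) { bestScore = s; best = e.getValue(); }
            }
            return best; // adaptation (e.g., renaming identifiers) would follow
        }

        public static void main(String[] args) {
            Map<String, String> corpus = new HashMap<>();
            corpus.put("int r = add(1, 2);", "assertEquals(3, r);");
            corpus.put("list.clear();", "assertTrue(list.isEmpty());");
            // Retrieves the assertion of the most similar stored test.
            System.out.println(retrieveAssertion("int r = add(4, 5);", corpus));
        }
    }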

Language: English

Citations: 31

Learning Deep Semantics for Test Completion
Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, et al.

Published: May 1, 2023

Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion: automatically completing the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model that uses code semantics for test completion. The key insight underlying TeCo is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code completion models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18%, 29% higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo can generate runnable code in 29% of cases, compared to 18% obtained by the baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
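
To illustrate the task itself, the sketch below marks the prior statements and the next statement to be predicted; the comment about runtime values hints at the kind of execution-level signal TeCo exploits. The example is illustrative, not taken from the paper's corpus.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class TestCompletionExample {

        @Test
        public void appendBuildsString() {
            StringBuilder builder = new StringBuilder(); // prior statement 1
            builder.append("a");                         // prior statement 2
            builder.append("b");                         // prior statement 3
            // Next statement to be predicted. Syntax-level models see
            // only the code above; execution-aware models can also use
            // the fact that builder.toString() evaluates to "ab" here.
            assertEquals("ab", builder.toString());
        }
    }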

Language: English

Citations: 17