Towards More Realistic Evaluation for Neural Test Oracle Generation DOI Open Access
Zhongxin Liu, Kui Liu, Xin Xia

et al.

Published: July 12, 2023

Unit testing has become an essential practice during software development and maintenance. Effective unit tests can help guard and improve software quality, but they require a substantial amount of time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known challenging problem. Recent studies proposed to leverage neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and obtained promising results. However, after a systematic inspection, we find there are some inappropriate settings in existing evaluation methods for NTOG. These settings could mislead the understanding of NTOG approaches' performance. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate the impacts of these settings on evaluating and understanding the performance of NTOG approaches. We find that unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, that FPR (False Positive Rate) is not a realistic metric and the Precision of TOGA is only 0.38%, and that a straightforward baseline NoException, which simply expects no exception should be raised, can find 61% of the bugs found by TOGA with twice the Precision. Furthermore, we introduce an additional ranking step into the evaluation method and propose a metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug-finding. We propose a novel unsupervised method to instantiate this ranking step, significantly improving the cost-effectiveness of TOGA. Eventually, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, and summarize seven rules of thumb to boost NTOG approaches toward practical usage.
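
For illustration, here is a minimal sketch of how a Found@K-style measurement could be computed over ranked oracle candidates; the data layout, the `OracleCandidate` class, and the use of model confidence as the ranking score are assumptions for this sketch, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): Found@K counts the distinct
# bugs revealed within the top-K ranked oracle candidates.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OracleCandidate:
    test_name: str
    score: float                  # ranking score, e.g. model confidence (assumed)
    revealed_bug: Optional[str]   # bug id this test reveals when run, else None

def found_at_k(candidates: List[OracleCandidate], k: int) -> int:
    """Distinct bugs revealed by the top-k candidates under the given ranking."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return len({c.revealed_bug for c in ranked[:k] if c.revealed_bug})

if __name__ == "__main__":
    cands = [
        OracleCandidate("testA", 0.92, "Lang-1"),
        OracleCandidate("testB", 0.85, None),
        OracleCandidate("testC", 0.40, "Math-5"),
    ]
    print(found_at_k(cands, k=2))  # -> 1
```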

Language: English

Automated Program Repair in the Era of Large Pre-trained Language Models DOI
Chunqiu Steven Xia, Yuxiang Wei, Lingming Zhang

et al.

Published: May 1, 2023

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or to directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on the important APR problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent state-of-the-art LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, and 3) output a single-line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare the LLMs in the number of bugs fixed, generation speed and compilation rate. We also compare the LLMs against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied LLMs, a scaling effect exists where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) is important not only for generating more fixes but also for producing patches with higher compilation rates. Besides patch generation, LLMs consider correct patches to be more natural than other ones and can even be leveraged for effective patch ranking and correctness checking. Lastly, we demonstrate that LLM-based APR can be further boosted by increasing the sample size and incorporating fix template information.
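
As a rough illustration of the three repair settings, the sketch below builds the three prompt shapes (complete function, infilling with prefix/suffix, single-line continuation); the prompt wording and the infill sentinel token are assumptions, not the exact prompts used in the study.

```python
# Hedged sketch of the three prompt shapes compared in the study.
def complete_function_prompt(buggy_function: str) -> str:
    """Setting 1: ask the model to regenerate the entire (fixed) function."""
    return f"// Buggy function:\n{buggy_function}\n// Fixed function:\n"

def infilling_prompt(prefix: str, suffix: str, sentinel: str = "<infill>") -> str:
    """Setting 2: remove the buggy chunk and let an infilling model fill the hole."""
    return f"{prefix}{sentinel}{suffix}"

def single_line_prompt(prefix: str) -> str:
    """Setting 3: generate one replacement line given the preceding context (simplified)."""
    return prefix  # the model continues with a single line
```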

Language: English

Citations

146

CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models DOI
Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri

et al.

Published: May 1, 2023

Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help SBST redirect its search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST-only and LLM-only baselines, respectively.
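
A minimal sketch of the stall-then-ask control loop described above, under the assumption that the SBST engine and the LLM call are available as callables; `run_search_iteration`, `coverage`, and `ask_llm_for_tests` are hypothetical stand-ins, not CodaMosa's actual API.

```python
# Sketch: run SBST until coverage plateaus, then seed the search with
# LLM-provided example tests for under-covered functions.
def codamosa_loop(module, run_search_iteration, coverage, ask_llm_for_tests,
                  budget_iters=100, stall_limit=5):
    suite = []
    best, stalled = 0.0, 0
    for _ in range(budget_iters):
        suite = run_search_iteration(module, suite)  # mutate/evolve test cases
        cov = coverage(module, suite)
        if cov > best:
            best, stalled = cov, 0
        else:
            stalled += 1
        if stalled >= stall_limit:                   # coverage plateau detected
            # ask the LLM for example calls to functions the suite barely covers
            suite += ask_llm_for_tests(module, under_covered=True)
            stalled = 0
    return suite
```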

Language: English

Citations

110

Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning DOI
Noor Nashid, Mifta Sintaha, Ali Mesbah

et al.

Published: May 1, 2023

Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and generate an output. However, the creation of effective prompts for code-related tasks in few-shot learning has received little attention. We present a technique that automatically retrieves code demonstrations similar to the developer task at hand, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant demonstrations, Cedar is effective for both tasks, with an accuracy of 76% and 52% for exact matches in assertion generation and program repair, respectively. For assertion generation, Cedar outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively; for program repair, Cedar yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as Cedar could potentially be applied to multilingual and multitask settings without language-specific training, with minimal examples and effort.
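
A hedged sketch of retrieval-based demonstration selection: rank a pool of (input, output) demonstrations by similarity to the developer's query and keep the top few for the prompt. Cedar retrieves by embedding or frequency analysis; the bag-of-words cosine similarity below is a simple stand-in, not Cedar's implementation.

```python
# Sketch: pick the k demonstrations whose inputs are most similar to the query.
import math
from collections import Counter

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_demonstrations(query: str, pool: list, k: int = 4) -> list:
    """pool is a list of (demo_input, demo_output) pairs; returns the k most similar."""
    q = _vec(query)
    return sorted(pool, key=lambda d: _cosine(q, _vec(d[0])), reverse=True)[:k]
```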

Language: English

Citations

69

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation DOI
Max Schäfer, Sarah Nadi, Aryaz Eghbali

et al.

IEEE Transactions on Software Engineering, Journal Year: 2023, Volume and Issue: 50(1), P. 85 - 105

Published: Nov. 28, 2023

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but so far only while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of the function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and the error message. We implement TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have $\leq$ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 and StarCoder, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage) and somewhat worse results with the latter (54.0%), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM but does not fundamentally depend on the specific model.
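
A minimal sketch of the adaptive re-prompting loop described above, with `llm` and `run_test` as hypothetical stand-ins for the model API and a test runner; the prompt layout is an assumption rather than TestPilot's exact format.

```python
# Sketch: prompt with signature, body and doc usage; on failure, re-prompt
# with the failing test and its error message, up to a small refinement budget.
def generate_test(fn_signature, fn_body, usage_snippets, llm, run_test, max_refinements=3):
    prompt = (
        f"// Function under test:\n{fn_signature}\n{fn_body}\n"
        f"// Usage examples from the docs:\n{usage_snippets}\n"
        f"// Write a unit test for this function:\n"
    )
    test = llm(prompt)
    for _ in range(max_refinements):
        ok, error_message = run_test(test)
        if ok:
            return test
        prompt += (
            f"\n// The following test failed:\n{test}\n"
            f"// Error: {error_message}\n// Fixed test:\n"
        )
        test = llm(prompt)
    return test
```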

Language: English

Citations

69

Effective test generation using pre-trained Large Language Models and mutation testing DOI
Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab

et al.

Information and Software Technology, Journal Year: 2024, Volume and Issue: 171, P. 107468 - 107468

Published: April 6, 2024

Language: English

Citations

19

Exploring and Unleashing the Power of Large Language Models in Automated Code Translation DOI
Zhen Yang, Fang Liu, Zhongxing Yu

et al.

Proceedings of the ACM on software engineering., Journal Year: 2024, Volume and Issue: 1(FSE), P. 1585 - 1608

Published: July 12, 2024

Code translation tools, namely transpilers, are developed for automatic source-to-source translation. The latest learning-based transpilers have shown impressive enhancements over rule-based counterparts in both translation accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs on automated code translation tasks, finding that although certain LLMs have outperformed current transpilers, they still have accuracy issues, where most failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by these findings, we further propose UniTrans, a Unified code Translation framework applicable to various LLMs, to unleash their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the auto-generated test cases to augment the code translation and then evaluates the correctness of the translations via execution. Afterward, UniTrans (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six settings of translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested, and all achieve substantial improvements in terms of computational accuracy and exact match among almost all settings, showing the universal effectiveness of UniTrans in practice.
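
A hedged sketch of the three UniTrans phases as described: generate test cases from the source program, translate with the test cases as added context, then iteratively repair failing translations. All callables are hypothetical stand-ins, not the framework's API.

```python
# Sketch: test-case-augmented translation with execution-driven repair.
def unitrans(source_code, llm_generate_tests, llm_translate, llm_repair,
             execute_tests, max_repair_rounds=3):
    tests = llm_generate_tests(source_code)        # phase 1: craft test cases
    target = llm_translate(source_code, tests)     # phase 2: translate, tests in prompt
    for _ in range(max_repair_rounds):
        passed, failures = execute_tests(target, tests)
        if passed:
            return target
        # phase 3: repair the translation, prompted by the failing executions
        target = llm_repair(source_code, target, failures)
    return target
```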

Language: English

Citations

9

Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools DOI
S. N. Bhatia, Tarushi Gandhi, Dhruv Kumar

et al.

Published: April 20, 2024

Language: English

Citations

9

Learning Deep Semantics for Test Completion DOI
Pengyu Nie, Rahul Banerjee, Junyi Jessy Li

et al.

Published: May 1, 2023

Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion: automatically completing the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model that uses code semantics for test completion. The key insight underlying TeCo is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code models use. TeCo extracts and uses six kinds of code semantics data, including the execution results of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18, 29% higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo can generate runnable code in 29% of the cases, compared to 18% obtained by the best baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
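
As an illustration of mixing syntax-level context with execution-level signals, the sketch below assembles a model input from the method under test, prior statements, and execution data; the field names and separator tokens are assumptions for this sketch, not TeCo's actual encoding.

```python
# Sketch: build a test-completion input that includes execution-level data
# (e.g., local variable types and the outcome of running prior statements).
def build_completion_input(method_under_test: str,
                           prior_statements: list,
                           execution_data: dict) -> str:
    parts = [
        "<method>", method_under_test,
        "<prior>", "\n".join(prior_statements),
    ]
    for key in ("local_var_types", "prior_execution_result", "uncaught_exception"):
        if key in execution_data:
            parts += [f"<{key}>", str(execution_data[key])]
    return "\n".join(parts)
```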

Language: English

Citations

17

PyDex: Repairing Bugs in Introductory Python Assignments using LLMs DOI Open Access
Jialu Zhang, José Cambronero, Sumit Gulwani

et al.

Proceedings of the ACM on Programming Languages, Journal Year: 2024, Volume and Issue: 8(OOPSLA1), P. 1100 - 1124

Published: April 29, 2024

Students often make mistakes in their introductory programming assignments as part of the learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering effort or large amounts of data and training. We propose to use a large language model trained on code, Codex (a version of GPT), to build an APR system, PyDex, for introductory Python programming assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate PyDex on 286 real student programs and compare it to three baselines, including one that combines a state-of-the-art Python syntax repair engine, BIFI, with a state-of-the-art Python semantic repair engine for student assignments, Refactory. We find that PyDex can fix more programs and produce smaller patches on average.
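
A small sketch of test-case-based selection among candidate repairs, one of the ingredients listed above: run the assignment's test cases on each candidate and prefer candidates that pass more tests while staying close to the student's original program. `run_assignment_tests` is a hypothetical helper, and the tie-breaking rule is an assumption, not PyDex's exact procedure.

```python
# Sketch: pick the candidate repair that passes the most tests,
# breaking ties by similarity to the student's original code.
import difflib

def select_repair(student_code, candidates, run_assignment_tests):
    def passes(code):
        return sum(1 for ok in run_assignment_tests(code) if ok)

    def closeness(code):
        return difflib.SequenceMatcher(None, student_code, code).ratio()

    return max(candidates, key=lambda c: (passes(c), closeness(c)))
```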

Language: English

Citations

8

Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM DOI
Gabriel Ryan, Siddhartha Jain, Mingyue Shang

et al.

Proceedings of the ACM on software engineering., Journal Year: 2024, Volume and Issue: 1(FSE), P. 951 - 971

Published: July 12, 2024

Testing plays a pivotal role in ensuring software quality, yet conventional Search-Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work using large language models (LLMs) for test generation has focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but uses fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test and exposes relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
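
A hedged sketch of path-aligned, multi-stage prompting: approximate the focal method's execution paths by its branch conditions (using Python's ast module here for simplicity rather than TreeSitter) and issue one generation prompt per path constraint; `llm` is a hypothetical callable and the prompt text is an assumption, not SymPrompt's exact prompts.

```python
# Sketch: one prompt per (approximate) execution path of the focal method.
import ast

def branch_conditions(focal_source: str) -> list:
    tree = ast.parse(focal_source)
    conds = [ast.unparse(node.test)
             for node in ast.walk(tree)
             if isinstance(node, (ast.If, ast.While))]
    return conds or ["True"]  # no branches: a single straight-line path

def path_aligned_generate(focal_source: str, focal_name: str, llm) -> list:
    tests = []
    for cond in branch_conditions(focal_source):
        prompt = (
            f"# Focal method:\n{focal_source}\n"
            f"# Write a pytest test for `{focal_name}` that exercises the path "
            f"where `{cond}` holds:\n"
        )
        tests.append(llm(prompt))
    return tests
```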

Language: English

Citations

7