UTFix: Change Aware Unit Test Repairing using LLM

Shanto Rahman, Sachit Kuhar, Berk Çirişci, et al.

Proceedings of the ACM on Programming Languages, Journal Year: 2025, Volume and Issue: 9(OOPSLA1), P. 143 - 168

Published: April 9, 2025

Software updates, including bug repairs and feature additions, are frequent in modern applications, but they often leave test suites outdated, resulting in undetected bugs and increased chances of system failures. A recent study by Meta revealed that 14%-22% of software failures stem from outdated tests that fail to reflect changes in the codebase. This highlights the need to keep tests in sync with code changes to ensure software reliability. In this paper, we present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. UTFix addresses two critical issues: assertion failures and reduced code coverage caused by changes to the focal method. Our approach leverages large language models, providing them with contextual information such as static code slices, dynamic code slices, and failure messages. We evaluate UTFix on our generated synthetic benchmarks (Tool-Bench) as well as real-world benchmarks. Tool-Bench includes diverse changes from popular open-source Python GitHub projects, where UTFix successfully repaired 89.2% of failing tests and achieved 100% coverage for 96 out of 369 tests. On the real-world benchmarks, UTFix repairs 60% of failing tests while achieving 100% coverage for 19 out of 30 tests. To the best of our knowledge, this is the first comprehensive study focused on repairing unit tests in evolving Python projects. Our contributions include the development of UTFix, the creation of Tool-Bench and the real-world benchmarks, and a demonstration of the effectiveness of LLM-based test repair in addressing failures due to software evolution.
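
A minimal sketch of the kind of repair prompt such an approach could assemble from the context the abstract mentions (the updated focal method, the failing test, and the failure message); the helper name and prompt wording are illustrative assumptions, not UTFix's actual implementation.

def build_repair_prompt(focal_method_src: str, failing_test_src: str,
                        failure_message: str) -> str:
    """Assemble an LLM prompt asking for a repaired version of a failing test."""
    return (
        "The focal method below was recently changed:\n"
        f"{focal_method_src}\n\n"
        "The following unit test now fails:\n"
        f"{failing_test_src}\n\n"
        f"Failure message:\n{failure_message}\n\n"
        "Rewrite the test so that it passes against the updated method "
        "while still exercising its behavior."
    )

if __name__ == "__main__":
    prompt = build_repair_prompt(
        "def area(w, h, scale=1):\n    return w * h * scale",
        "def test_area():\n    assert area(2, 3) == 5",
        "AssertionError: assert 6 == 5",
    )
    print(prompt)  # the prompt would then be sent to a language model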

Language: English

Automated Program Repair in the Era of Large Pre-trained Language Models

Chunqiu Steven Xia, Yuxiang Wei, Lingming Zhang, et al.

Published: May 1, 2023

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to their reliance on bug-fixing datasets to craft fix templates (traditional) or to directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on the important APR problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, and 3) output a single-line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare the LLMs in the number of bugs fixed, generation speed, and compilation rate. We also compare them against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all of our studied APR tools. Among the studied LLMs, a scaling effect exists, where larger models tend to achieve better performance. Also, we show that conditioning on the code after the buggy line (adopted in infilling-style APR) helps in not only generating more fixes but also producing patches with higher compilation rates. Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and can even be effective for patch ranking and correctness checking. Lastly, LLM-based APR can be further boosted via increasing the sample size and incorporating fix template information.
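
The three repair settings named in the abstract can be illustrated as three prompt shapes for a hypothetical buggy function; the example function and prompt wording below are illustrative assumptions, not the paper's exact prompts.

BUGGY = '''def add_item(cart, item):
    cart.append(item)
    total = len(cart) - 1  # buggy line
    return total
'''

# 1) Complete-function generation: ask the model to rewrite the whole function.
complete_function_prompt = (
    "# Buggy function\n" + BUGGY + "\n# Fixed version of the same function\ndef add_item(cart, item):"
)

# 2) Infilling: give the code before and after the buggy line and let an
#    infilling model fill the hole between prefix and suffix.
prefix, _, suffix = BUGGY.partition("    total = len(cart) - 1  # buggy line\n")
infill_input = {"prefix": prefix, "suffix": suffix}

# 3) Single-line generation: ask only for a replacement of the buggy line,
#    conditioned on the preceding code.
single_line_prompt = prefix + "    # model outputs one replacement line here\n"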

Language: English

Citations: 154

CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, et al.

Published: May 1, 2023

Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST's performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI's Codex, can be used to help SBST's exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples redirect the search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST and LLM-only baselines.
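
A schematic sketch of the stall-then-ask loop the abstract describes. The callables are hypothetical hooks supplied by the caller, not CodaMosa's real API: run_sbst_round() returns the current coverage, least_covered_function() picks a target, ask_llm_for_tests(target) returns example tests, and seed_tests(tests) adds them to the search population.

def hybrid_test_generation(run_sbst_round, least_covered_function,
                           ask_llm_for_tests, seed_tests,
                           budget_rounds=100, stall_limit=10):
    best_coverage, stalled = 0.0, 0
    for _ in range(budget_rounds):
        coverage = run_sbst_round()            # one round of search-based generation
        if coverage > best_coverage:
            best_coverage, stalled = coverage, 0
        else:
            stalled += 1                       # no improvement this round
        if stalled >= stall_limit:             # coverage plateau detected
            target = least_covered_function()  # pick an under-covered function
            seed_tests(ask_llm_for_tests(target))  # redirect the search with LLM examples
            stalled = 0
    return best_coverage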

Language: English

Citations: 115

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Max Schäfer, Sarah Nadi, Aryaz Eghbali, et al.

IEEE Transactions on Software Engineering, Journal Year: 2023, Volume and Issue: 50(1), P. 85 - 105

Published: Nov. 28, 2023

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including their suggested use for automated generation of unit tests, but while requiring additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without such additional manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of the function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods of a given project's API. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Our experiments with excluding parts of the information included in the prompts show that all components contribute towards the generation of effective test suites. We also find that 92.8% of TestPilot's generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 model and StarCoder, whose training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage) and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
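
A rough sketch (written in Python for consistency with the other examples on this page, although TestPilot itself targets JavaScript) of the prompting and refinement loop the abstract describes. The generate() and run_test() callables are placeholders for the LLM call and the test runner, and the prompt wording is an assumption, not TestPilot's actual prompt.

def generate_test(signature, implementation, doc_examples,
                  generate, run_test, max_attempts=3):
    prompt = (
        f"Function signature:\n{signature}\n\n"
        f"Implementation:\n{implementation}\n\n"
        f"Usage examples from the documentation:\n{doc_examples}\n\n"
        "Write a unit test for this function."
    )
    test = generate(prompt)                          # initial LLM attempt
    for _ in range(max_attempts):
        passed, error_message = run_test(test)       # execute the candidate test
        if passed:
            return test
        # Re-prompt with the failing test and its error message.
        test = generate(prompt + f"\n\nThis test failed:\n{test}\n"
                                 f"Error message:\n{error_message}\nFix the test.")
    return None                                      # give up after max_attempts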

Language: English

Citations: 73

Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning

Noor Nashid, Mifta Sintaha, Ali Mesbah, et al.

Published: May 1, 2023

Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query, and generate an output. However, the creation of an effective prompt for code-related tasks in few-shot learning has received little attention. We present a technique for prompt creation that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, Cedar, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare Cedar with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant code demonstrations, Cedar is effective in both tasks, achieving an accuracy of 76% and 52% for exact matches in test assertion generation and program repair, respectively. For assertion generation, Cedar outperforms existing models by 333% and 11%, and for program repair, it yields 189% better accuracy than competitive recent models. These findings have practical implications for practitioners, as Cedar could potentially be applied to multilingual and multitask settings without language-specific training and with minimal examples and effort.
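
A simplified sketch of retrieval-based demonstration selection in the spirit described above: rank a pool of (input, output) demonstrations by lexical overlap with the query and place the top-k in a few-shot prompt. The similarity measure and prompt layout are illustrative assumptions; Cedar's actual retrieval (embedding or frequency analysis) is more sophisticated.

from collections import Counter

def similarity(a: str, b: str) -> float:
    ta, tb = Counter(a.split()), Counter(b.split())
    shared = sum((ta & tb).values())                 # overlapping token count
    return shared / max(1, sum(ta.values()) + sum(tb.values()))

def build_few_shot_prompt(query: str, pool: list[tuple[str, str]], k: int = 3) -> str:
    # Keep the k demonstrations whose inputs look most similar to the query.
    top = sorted(pool, key=lambda d: similarity(query, d[0]), reverse=True)[:k]
    shots = "\n\n".join(f"Input:\n{inp}\nOutput:\n{out}" for inp, out in top)
    return f"{shots}\n\nInput:\n{query}\nOutput:\n"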

Language: English

Citations: 70

Effective test generation using pre-trained Large Language Models and mutation testing

Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, et al.

Information and Software Technology, Journal Year: 2024, Volume and Issue: 171, P. 107468 - 107468

Published: April 6, 2024

Language: English

Citations: 23

Exploring and Unleashing the Power of Large Language Models in Automated Code Translation

Zhen Yang, Fang Liu, Zhongxing Yu, et al.

Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1585 - 1608

Published: July 12, 2024

Code translation tools, namely transpilers, are developed for automatic source-to-source translation. The latest learning-based transpilers have shown impressive enhancements over rule-based counterparts in both translation accuracy and readability, owing to their task-specific pre-training on extensive monolingual corpora. Nevertheless, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. Large Language Models (LLMs), pre-trained on huge amounts of human-written code/text, have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific re-training/fine-tuning. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs on automated code translation tasks, finding that although certain LLMs have outperformed current transpilers, they still suffer from accuracy issues, where most failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by these findings, we further propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the auto-generated test cases to augment the code translation and then evaluates correctness via execution. Afterward, UniTrans (iteratively) repairs incorrectly translated programs, prompted by test case execution results. Extensive experiments are conducted on six settings of translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes, including GPT-3.5 and LLaMA-13B/7B, are tested, and all achieve substantial improvements in terms of computational accuracy and exact match accuracy across almost all settings, showing the universal effectiveness of UniTrans in practice.
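
A high-level sketch of the translate, test, and repair loop described above. The craft_test_cases, llm_translate, execute_tests, and llm_repair callables are hypothetical placeholders supplied by the caller, not the framework's real interfaces.

def translate_with_tests(source_program, craft_test_cases, llm_translate,
                         execute_tests, llm_repair, max_repair_rounds=3):
    tests = craft_test_cases(source_program)          # test cases derived from the source program
    candidate = llm_translate(source_program, tests)  # tests also augment the translation prompt
    for _ in range(max_repair_rounds):
        failures = execute_tests(candidate, tests)    # run the auto-generated tests
        if not failures:
            return candidate                          # translation passes every test
        candidate = llm_repair(candidate, failures)   # re-prompt with the failing cases
    return candidate                                  # best effort after the repair budget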

Language: English

Citations: 10

Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools

S. N. Bhatia, Tarushi Gandhi, Dhruv Kumar, et al.

Published: April 20, 2024

Language: English

Citations: 10

PyDex: Repairing Bugs in Introductory Python Assignments using LLMs

Jialu Zhang, José Cambronero, Sumit Gulwani, et al.

Proceedings of the ACM on Programming Languages, Journal Year: 2024, Volume and Issue: 8(OOPSLA1), P. 1100 - 1124

Published: April 29, 2024

Students often make mistakes in their introductory programming assignments as part of their learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering efforts or large amounts of data and training. We propose to use a large language model trained on code, Codex (a version of GPT), to build an APR system, PyDex, for introductory Python assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate PyDex on 286 real student programs and compare it to three baselines, including one that combines a state-of-the-art Python syntax repair engine, BIFI, with a state-of-the-art semantic repair engine for student assignments, Refactory. We find that PyDex can fix more programs and produce smaller patches on average.
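
A generic sketch of sampling several candidate repairs from a model and keeping one that passes the assignment's test cases; this illustrates the test-driven flavor of such systems under assumed placeholder callables (sample_repair, passes), and is not PyDex's specific few-shot selection or chunking mechanism.

def select_repair(buggy_program, test_cases, sample_repair, passes, n_samples=10):
    # sample_repair(program) -> one candidate program per LLM query (placeholder)
    # passes(candidate, test) -> bool, whether the candidate passes a test (placeholder)
    candidates = [sample_repair(buggy_program) for _ in range(n_samples)]
    def score(candidate):
        return sum(passes(candidate, t) for t in test_cases)  # number of passing tests
    best = max(candidates, key=score)
    return best if score(best) == len(test_cases) else None   # accept only full passes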

Language: English

Citations: 9

Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM

Gabriel Ryan, Siddhartha Jain, Mingyue Shang, et al.

Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 951 - 971

Published: July 12, 2024

Testing plays a pivotal role in ensuring software quality, yet conventional Search-Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent work on using large language models (LLMs) for test generation has focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but uses fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLM-based test generation. SymPrompt's approach is based on recent work demonstrating that LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test, and by exposing relevant type and dependency context of the focal method to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open-source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
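
A schematic sketch of path-aligned, multi-stage prompting as described above. The collect_paths, collect_type_context, and generate callables are hypothetical placeholders, and the prompt wording is an assumption, not SymPrompt's actual implementation.

def generate_path_aware_tests(focal_method_src, collect_paths,
                              collect_type_context, generate):
    paths = collect_paths(focal_method_src)            # approximated execution paths
    context = collect_type_context(focal_method_src)   # types and dependencies of the focal method
    tests = []
    for path in paths:
        prompt = (
            f"Focal method:\n{focal_method_src}\n\n"
            f"Relevant types and dependencies:\n{context}\n\n"
            f"Write a unit test that drives execution along this path:\n{path}\n"
        )
        tests.append(generate(prompt))                  # one LLM call per path constraint
    return tests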

Language: English

Citations: 7

Test Oracle Automation in the Era of LLMs

Facundo Molina, Alessandra Gorla, Marcelo d’Amorim, et al.

ACM Transactions on Software Engineering and Methodology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 27, 2025

The effectiveness of a test suite in detecting faults highly depends on the quality of its oracles. Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks. This paper aims to present a roadmap for future research on the use of LLMs for test oracle automation. We discuss the progress made in the field of test oracle automation before the introduction of LLMs, identifying the main limitations and weaknesses of existing techniques. Additionally, we review recent studies that leverage LLMs for this task, highlighting the challenges that arise from their use, e.g., how to assess the usefulness of the generated oracles. We conclude with a discussion about future directions and opportunities for LLM-based test oracle automation.

Language: English

Citations: 1