Patch correctness assessment in automated program repair based on the impact of patches on production and test code DOI Open Access
Ali Ghanbari, Andrian Marcus

Published: July 15, 2022

Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug. The generated patches must be manually inspected by developers, so previous research proposed various techniques for automatic correctness assessment of APR-generated patches. Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, correct patches will not alter the program behavior in a significant way, e.g., by removing the code implementing correct functionality of the program. In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems. Unlike existing works, the impact of the patches is captured along three complementary facets, allowing more effective patch correctness assessment. Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements. Shibboleth assesses the correctness of patches via both ranking and classification. We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques. Specifically, in our ranking data set, in 43% (66%) of the cases Shibboleth ranks the correct patch in the top-1 (top-2) positions, and in classification mode applied on our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.

Language: English

An Analysis of the Automatic Bug Fixing Performance of ChatGPT DOI
Dominik Sobania, Martin Briesch, Carol Hanna, et al.

Published: May 1, 2023

To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair, or navigate a search space of software edits to find test-suite passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare the performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive with the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs, outperforming the state-of-the-art.

Language: English

Citations: 213

FixMiner: Mining relevant fix patterns for automated program repair DOI
Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, et al.

Empirical Software Engineering, Journal Year: 2020, Volume and Issue: 25(3), P. 1980 - 2024

Published: March 14, 2020

Language: English

Citations: 190

CURE: Code-Aware Neural Machine Translation for Automatic Program Repair DOI
Nan Jiang, Thibaud Lutellier, Lin Tan, et al.

Published: May 1, 2021

Automatic program repair (APR) is crucial to improve software reliability. Recently, neural machine translation (NMT) techniques have been used to automatically fix software bugs. While promising, these approaches have two major limitations. Their search space often does not contain the correct fix, and their search strategy ignores software knowledge such as strict code syntax. Due to these limitations, existing NMT-based techniques underperform the best template-based approaches. We propose CURE, a new NMT-based APR technique with three major novelties. First, CURE pre-trains a programming language (PL) model on a large software codebase to learn developer-like source code before the APR task. Second, CURE designs a new code-aware search strategy that finds more correct fixes by focusing on searching for compilable patches and patches that are close in length to the buggy code. Finally, CURE uses a subword tokenization technique to generate a smaller search space that contains more correct fixes. Our evaluation on two widely-used benchmarks shows that CURE correctly fixes 57 Defects4J bugs and 26 QuixBugs bugs, outperforming all existing APR techniques on both benchmarks.

Language: English

Citations: 188

Impact of Code Language Models on Automated Program Repair DOI
Nan Jiang, Kevin Liu, Thibaud Lutellier, et al.

Published: May 1, 2023

Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLMs) are developed and effective in many software tasks such as code completion, there has been little comprehensive, in-depth work to evaluate CLMs' fixing capabilities and to fine-tune CLMs for the APR task. Firstly, this work is the first to evaluate ten CLMs on four APR benchmarks, which shows that, surprisingly, the best CLM, as is, fixes 72% more bugs than the state-of-the-art deep-learning (DL)-based APR techniques. Secondly, one of the four APR benchmarks was created by us in this paper to avoid data leaking for a fair evaluation. Thirdly, it is the first work to fine-tune CLMs with APR training data, which shows that fine-tuning brings 31%-1,267% improvement to CLMs and enables them to fix 46%-164% more bugs than existing DL-based APR techniques. Fourthly, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of the buggy lines to fix bugs, yet fine-tuned CLMs could potentially over-rely on buggy lines. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs, and also raises awareness of fair and comprehensive evaluations of CLMs and calls for more transparent reporting of open-source repositories used in the pre-training data to address the data leaking problem.

Language: English

Citations: 80

Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts DOI
Thomas Durieux, Fernanda Madeiral, Matías Martínez, et al.

Published: Aug. 9, 2019

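In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, generating patches for 47% of the bugs from Defects4J compared to 10-30% of the bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers improve their tools.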

Language: English

Citations: 115

On the efficiency of test suite based program repair DOI
Kui Liu, Shangwen Wang, Anil Koyuncu, et al.

Published: June 27, 2020

Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but affordable, approximation to program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation, with regard to the practicality of program repair. In this paper, we fill this gap in the literature by providing an extensive review on the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated to (1) the strategy to traverse the search space efficiently in order to select sensical repair attempts, (2) the strategy to minimize the test effort for identifying a plausible patch, and (3) the strategy to prioritize the generation of a correct patch. To that end, we perform a large-scale empirical study on the efficiency, in terms of quantity of generated patch candidates, of 16 open-source repair tools for Java programs. The experiments are carefully conducted under the same fault localization configurations to limit biases.

Language: English

Citations: 100

Context-Aware Code Change Embedding for Better Patch Correctness Assessment DOI
Bo Lin, Shangwen Wang, Ming Wen, et al.

ACM Transactions on Software Engineering and Methodology, Journal Year: 2022, Volume and Issue: 31(3), P. 1 - 29

Published: May 18, 2022

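Despite the capability in successfully fixing more and more real-world bugs, existing Automated Program Repair (APR) techniques are still challenged by the long-standing overfitting problem (i.e., a generated patch that passes all tests is actually incorrect). Plenty of approaches have been proposed for automated patch correctness assessment (APCA). Nonetheless, dynamic ones (i.e., those that need to execute tests) are time-consuming while static ones (i.e., those built on top of static code features) are less precise. Therefore, embedding techniques have been proposed recently, which assess patch correctness via embedding token sequences extracted from the changed code of a generated patch. However, existing techniques rarely considered the context information and program structures of a generated patch, which are crucial for patch correctness assessment as revealed by existing studies. In this study, we explore the idea of context-aware code change embedding considering program structures for patch correctness assessment. Specifically, given a patch, we not only focus on the changed code but also take the correlated unchanged part into consideration, through which the context information can be extracted and leveraged. We then utilize the AST path technique for representation, where the structure information from AST nodes can be captured. Finally, based on several pre-defined heuristics, we build a deep learning based classifier to predict the correctness of the patch. We implemented this idea as Cache and performed extensive experiments to assess its effectiveness. Our results demonstrate that Cache can (1) perform better than previous representation learning based techniques (e.g., Cache relatively outperforms existing techniques by approximately 6%, 3%, and 16%, respectively, under three diverse experiment settings), and (2) achieve overall higher performance than existing APCA techniques while even being more precise than certain dynamic ones including PATCH-SIM (92.9% vs. 83.0%). Further results reveal that the context information and program structures leveraged by Cache contributed significantly to the outstanding performance.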

Language: English

Citations: 48

A Survey of Learning-based Automated Program Repair DOI Open Access
Quanjun Zhang, Chunrong Fang, Yuxiang Ma, et al.

ACM Transactions on Software Engineering and Methodology, Journal Year: 2023, Volume and Issue: 33(2), P. 1 - 69

Published: Nov. 6, 2023

Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software development and maintenance. With the recent advances in deep learning (DL), an increasing number of APR techniques have been proposed to leverage neural networks to learn bug-fixing patterns from massive open-source code repositories. Such learning-based techniques usually treat APR as a neural machine translation (NMT) task, where buggy code snippets (i.e., source language) are translated into fixed code snippets (i.e., target language) automatically. Benefiting from the powerful capability of DL to learn hidden relationships from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance. In this article, we provide a systematic survey to summarize the current state-of-the-art research in the learning-based APR community. We illustrate the general workflow of learning-based APR techniques and detail the crucial components, including fault localization, patch generation, patch ranking, patch validation, and patch correctness phases. We then discuss the widely adopted datasets and evaluation metrics and outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques, such as repair domains, industrial deployment, and the open science issue. We highlight several practical guidelines on applying DL techniques for future APR studies, such as exploring explainable patch generation and utilizing code features. Overall, our article can help researchers gain a comprehensive understanding of the achievements of the existing learning-based APR techniques and promote the practical application of these techniques. Our artifacts are publicly available at the repository: https://github.com/iSEngLab/AwesomeLearningAPR.

Language: English

Citations: 37

KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair DOI
Nan Jiang, Thibaud Lutellier, Yiling Lou, et al.

Published: May 1, 2023

Automated Program Repair (APR) improves software reliability by generating patches for a buggy program automatically. Recent APR techniques leverage deep learning (DL) to build models to learn to generate patches from existing patches and code corpora. While promising, DL-based APR techniques suffer from the abundant syntactically or semantically incorrect patches in the patch space. These patches often disobey the syntactic and semantic domain knowledge of source code and thus cannot be the correct patches to fix a bug. We propose a DL-based APR approach, KNOD, which incorporates domain knowledge to guide patch generation in a direct and comprehensive way. KNOD has two major novelties, including (1) a novel three-stage tree decoder, which directly generates Abstract Syntax Trees of patched code according to the inherent tree structure, and (2) a novel domain-rule distillation, which leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the domain knowledge into the decoding procedure during both the training and inference phases. We evaluate KNOD on three widely-used benchmarks. KNOD fixes 72 bugs on the Defects4J v1.2, 25 bugs on the QuixBugs, and 50 bugs on the additional Defects4J v2.0 benchmarks, outperforming all existing APR tools.

Language: English

Citations: 26

Automated patch correctness assessment DOI
Shangwen Wang, Ming Wen, Bo Lin, et al.

Published: Dec. 21, 2020

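Test-based automated program repair (APR) has attracted huge attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., the generated patch is plausible but overfitting) is still a major and long-standing challenge. Therefore, plenty of techniques have been proposed to assess the correctness of patches either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of existing techniques has not been systematically compared and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study in this paper. Specifically, we systematically investigated the effectiveness of existing automated patch correctness assessment techniques, including both static and dynamic ones, based on 902 patches automatically generated by 21 APR tools from 4 different categories. Our empirical study revealed the following major findings: (1) static code features with respect to patch syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can generally achieve high precision while heuristics based on static code features are more effective towards recall; (3) existing techniques are more effective towards certain projects and types of APR techniques while less effective towards the others; (4) existing techniques are highly complementary to each other. For instance, a single technique can only detect at most 53.5% of the overfitting patches while 93.3% of them can be detected by at least one technique when the oracle information is available. Based on our findings, we designed an integration strategy to first integrate static code features via learning, and then combine with others by the majority voting strategy. Our experiments show that the strategy can enhance the performance of existing patch correctness assessment techniques significantly.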

Language: English

Citations: 69