Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair
Haoye Tian, Kui Liu, Abdoul Kader Kaboré

et al.

arXiv (Cornell University), Journal Year: 2020, Volume and Issue: unknown

Published: Jan. 1, 2020

A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.
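The abstract above combines two ingredients: similarity computations over code embeddings, and a logistic regression classifier. The following is only a minimal sketch of how those two pieces fit together, not the authors' pipeline. The embedding vectors, the function names `cosine_similarity` and `correctness_probability`, and the `weight` and `bias` parameters are all hypothetical; in practice the vectors would come from a model such as BERT and the parameters would be learned from labeled patches.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def correctness_probability(similarity, weight, bias):
    """Logistic model over one feature (hypothetical, untrained parameters)."""
    return 1.0 / (1.0 + math.exp(-(weight * similarity + bias)))


# Toy embeddings of a buggy snippet and its patched version (made-up numbers).
buggy_embedding = [0.2, 0.7, 0.1]
patched_embedding = [0.25, 0.65, 0.05]
sim = cosine_similarity(buggy_embedding, patched_embedding)
prob = correctness_probability(sim, weight=4.0, bias=-3.0)
```

A single similarity feature keeps the sketch short; a real predictor would typically feed the full embedding vectors (or several derived features) to the classifier.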

Language: English

Neural program repair with execution-based backpropagation
He Ye, Matías Martínez, Martin Monperrus

et al.

Proceedings of the 44th International Conference on Software Engineering, Journal Year: 2022, Volume and Issue: unknown, P. 1506 - 1518

Published: May 21, 2022

Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens without incorporating program-specific information during neural network weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair, showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. RewardRepair correctly repairs 207 bugs over four benchmarks. We report on repair success for 121 bugs that are fixed for the first time in the literature. Also, RewardRepair produces up to 45.3% of compilable patches, an improvement over the 39% by the state-of-the-art.
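To illustrate the idea of folding compilation and test execution results into the training loss, here is a rough sketch. It is not RewardRepair's actual loss function, which the abstract does not spell out. The `PatchOutcome` categories, the `PENALTY` factors, and the function `execution_aware_loss` are assumptions made purely for illustration.

```python
from enum import Enum


class PatchOutcome(Enum):
    """Hypothetical outcome categories for a generated patch."""
    NOT_COMPILABLE = 0
    COMPILES_BUT_FAILS_TESTS = 1
    PASSES_TESTS = 2


# Assumed scaling factors: worse execution outcomes amplify the loss.
PENALTY = {
    PatchOutcome.NOT_COMPILABLE: 2.0,
    PatchOutcome.COMPILES_BUT_FAILS_TESTS: 1.5,
    PatchOutcome.PASSES_TESTS: 1.0,
}


def execution_aware_loss(token_loss, outcome):
    """Scale a purely syntactic token-level loss by the execution outcome."""
    return token_loss * PENALTY[outcome]
```

With these made-up factors, a patch that does not compile contributes twice the token-level loss of one that passes the tests, so weight updates are pushed harder away from non-compilable output.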

Language: English

Citations

93

Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts
Thomas Durieux, Fernanda Madeiral, Matías Martínez

et al.

Published: Aug. 9, 2019

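In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 2,141 bugs from 5 benchmarks. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts, which we used to find causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their tools.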

Language: English

Citations

115

Fuzz testing based data augmentation to improve robustness of deep neural networks
Xiang Gao, Ripon K. Saha, Mukul R. Prasad

et al.

Published: June 27, 2020

Deep neural networks (DNN) have been shown to be notoriously brittle to small perturbations in their input data. This problem is analogous to the over-fitting problem in test-based program synthesis and automatic program repair, which is a consequence of the incomplete specification, i.e., the limited tests or training examples, that the program synthesis or repair algorithm has to learn from. Recently, test generation techniques have been successfully employed to augment existing specifications of intended program behavior, to improve the generalizability of program synthesis and repair. Inspired by these approaches, in this paper, we propose a technique that re-purposes software testing methods, specifically mutation-based fuzzing, to augment the training data of DNNs, with the objective of enhancing their robustness. Our technique casts the DNN data augmentation problem as an optimization problem. It uses genetic search to generate the most suitable variant of an input data to use for training the DNN, while simultaneously identifying opportunities to accelerate training by skipping augmentation in many instances. We instantiate this technique in two tools, Sensei and Sensei-SA, and evaluate them on 15 DNN models spanning 5 popular image data-sets. Our evaluation shows that Sensei can improve the robust accuracy of the DNN, compared to the state of the art, on each of the 15 models, by up to 11.9% and 5.5% on average. Further, Sensei-SA can reduce the average DNN training time by 25%, while still improving robust accuracy.
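The search for "the most suitable variant" of an input can be pictured as a small evolutionary loop that keeps the variants the model currently handles worst. The sketch below is a simplified mutation-and-selection search (no crossover), not Sensei's implementation. It assumes a variant is described by a single numeric transformation parameter (for example a rotation angle), and that `loss_fn` is a caller-supplied stand-in for the model's loss on the transformed input. The function name `search_worst_variant` and all parameters are hypothetical.

```python
import random


def search_worst_variant(loss_fn, seeds, generations=5, step=1.0, rng_seed=0):
    """Return the transformation parameter with the highest loss found.

    loss_fn: maps a transformation parameter to a loss value (assumed given).
    seeds:   non-empty list of initial transformation parameters.
    """
    rng = random.Random(rng_seed)
    population = list(seeds)
    size = len(population)
    for _ in range(generations):
        # Mutation: perturb every individual by a small random amount.
        children = [p + rng.uniform(-step, step) for p in population]
        # Selection: keep the highest-loss individuals (parents survive too).
        ranked = sorted(population + children, key=loss_fn, reverse=True)
        population = ranked[:size]
    return population[0]


# Toy loss that peaks at a parameter value of 10 (purely illustrative).
def toy_loss(angle):
    return -(angle - 10.0) ** 2


worst = search_worst_variant(toy_loss, [0.0, 1.0, 2.0])
```

Because parents compete with their children in every generation, the best loss found never decreases from one generation to the next.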

Language: English

Citations

105

Evaluating representation learning of code changes for predicting patch correctness in program repair
Haoye Tian, Kui Liu, Abdoul Kader Kaboré

et al.

Published: Dec. 21, 2020

A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations in order to learn deep features that may encode the properties of patch correctness. Our empirical work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in the prediction of patch correctness on a deduplicated dataset of 1000 labeled patches. Our investigations show that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.

Language: English

Citations

72

Ultra-Large Repair Search Space with Automatically Mined Templates: The Cardumen Mode of Astor
Matías Martínez, Martin Monperrus

Lecture notes in computer science, Journal Year: 2018, Volume and Issue: unknown, P. 65 - 86

Published: Jan. 1, 2018

Language: English

Citations

76

Automated patch correctness assessment
Shangwen Wang, Ming Wen, Bo Lin

et al.

Published: Dec. 21, 2020

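Test-based automated program repair (APR) has attracted huge attention from both industry and academia. Despite the significant progress made in recent studies, the overfitting problem (i.e., the generated patch is plausible but overfitting) is still a major and long-standing challenge. Therefore, plenty of techniques have been proposed to assess the correctness of patches either in the patch generation phase or in the evaluation of APR techniques. However, the effectiveness of existing techniques has not been systematically compared and little is known about their advantages and disadvantages. To fill this gap, we performed a large-scale empirical study in this paper. Specifically, we systematically investigated the effectiveness of existing automated patch correctness assessment techniques, including both static and dynamic ones, based on 902 patches automatically generated by 21 APR tools from 4 different categories. Our empirical study revealed the following major findings: (1) static code features with respect to patch syntax and semantics are generally effective in differentiating overfitting patches from correct ones; (2) dynamic techniques can generally achieve high precision while heuristics based on static code features are more effective towards recall; (3) existing techniques are more effective for certain projects and types of APR techniques while less effective for the others; (4) existing techniques are highly complementary to each other. For instance, a single technique can only detect at most 53.5% of the overfitting patches while 93.3% of them can be detected by at least one technique when the oracle information is available. Based on our findings, we designed an integration strategy to first integrate static code features via learning, and then combine with others by the majority voting strategy. Our experiments show that the strategy can enhance the performance of existing patch correctness assessment techniques significantly.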
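The two ways of combining assessment techniques mentioned in the abstract above, "detected by at least one technique" and "majority voting", can be contrasted in a few lines. This is a minimal sketch rather than the authors' integration strategy. It assumes each technique outputs a boolean verdict (`True` meaning "overfitting"), and treating a tie as "not overfitting" is a choice made here, not something the abstract specifies.

```python
def detected_by_any(verdicts):
    """True if at least one technique flags the patch as overfitting."""
    return any(verdicts)


def majority_vote(verdicts):
    """True if strictly more than half of the techniques flag the patch.

    Ties count as 'not overfitting' (an assumption of this sketch).
    """
    flagged = sum(1 for verdict in verdicts if verdict)
    return flagged * 2 > len(verdicts)


# Verdicts of three hypothetical techniques on one patch.
verdicts = [True, False, False]
```

For these verdicts, `detected_by_any` flags the patch while `majority_vote` does not, which shows the recall-versus-precision trade-off between the two combination rules.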

Language: English

Citations

69

Crash-avoiding program repair
Xiang Gao, Sergey Mechtaev, Abhik Roychoudhury

et al.

Published: July 10, 2019

Existing program repair systems modify a buggy program so that the modified program passes given tests. The repaired program may not satisfy even the most basic notion of correctness, namely crash-freedom. In other words, repair tools might generate patches which over-fit the test data driving the repair, and the automatically repaired programs may even introduce crashes or vulnerabilities. We propose an integrated approach for detecting and discarding crashing patches. Our approach fuses test and patch generation into a single process, in which patches are generated with the objective of passing existing tests, and new tests are generated with the objective of filtering out over-fitted patches by distinguishing candidate patches in terms of behavior. We use crash-freedom as the oracle to discard patch candidates which crash on the new tests. At its core, our approach defines a grey-box fuzzing strategy that gives higher priority to new tests that separate patches behaving equivalently on existing tests. This test generation strategy identifies semantic differences between patch candidates, and reduces over-fitting in program repair. We evaluated our approach on real-world vulnerabilities and open-source subjects from the Google OSS-Fuzz infrastructure. We found that our tool Fix2Fit (implementing patch space directed test generation) produces crash-avoiding patches. While we do not give formal guarantees about crash-freedom, cross-validation with fuzzing tools and their sanitizers provides greater confidence about the crash-freedom of our suggested patches.
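Using crash-freedom as an oracle can be illustrated with a toy filter over candidate patches. This sketch is not Fix2Fit: it models patches as plain Python callables, treats any raised exception as a "crash", and takes the new test inputs as given instead of producing them by grey-box fuzzing. The name `discard_crashing` and the example patches are hypothetical.

```python
def discard_crashing(patches, new_inputs):
    """Keep only the candidate patches that run crash-free on all new inputs.

    patches:    dict mapping a patch name to a callable (the patched function).
    new_inputs: iterable of newly generated test inputs.
    """
    inputs = list(new_inputs)
    survivors = {}
    for name, patched_function in patches.items():
        try:
            for value in inputs:
                patched_function(value)
        except Exception:
            continue  # an exception stands in for a crash: discard the patch
        survivors[name] = patched_function
    return survivors


# Two candidate patches that agree on the existing test input 5,
# but behave differently on the newly generated input 0.
candidates = {
    "guarded": lambda x: 0 if x == 0 else 10 // x,
    "unguarded": lambda x: 10 // x,
}
remaining = discard_crashing(candidates, [5, 0])
```

The input `0` is the kind of test the abstract's strategy prioritizes: it separates two patches that were indistinguishable on the existing test.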

Language: English

Citations

57

Automated Classification of Overfitting Patches With Statically Extracted Code Features
He Ye, Jian Gu, Matías Martínez

et al.

IEEE Transactions on Software Engineering, Journal Year: 2021, Volume and Issue: 48(8), P. 2920 - 2938

Published: April 9, 2021

Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and a buggy program in order to extract code features at the abstract syntax tree (AST) level, for the single programming language Java. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned ODS model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9 percent of program repair patches from 26 projects, which improves the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different program repair systems.

Language: English

Citations

54

A comprehensive study of automatic program repair on the QuixBugs benchmark
He Ye, Matías Martínez, Thomas Durieux

et al.

Journal of Systems and Software, Journal Year: 2020, Volume and Issue: 171, P. 110825 - 110825

Published: Sept. 22, 2020

Language: English

Citations

48

A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark
He Ye, Matías Martínez, Thomas Durieux

et al.

Published: Feb. 1, 2019

Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an automatic repair experiment on a benchmark called QuixBugs that has never been studied in the context of program repair. In this study, we report on the characteristics of QuixBugs, and study five repair systems, Arja, Astor, Nopol, NPEfix and RSRepair, which are representatives of generate-and-validate repair techniques and synthesis repair techniques. We propose three patch correctness assessment techniques to comprehensively study overfitting and incorrect patches. Our key results are: 1) 15/40 buggy programs in QuixBugs can be repaired with a test-suite adequate patch; 2) a total of 64 plausible patches for those 15 buggy programs are present in the search space of the considered tools; 3) the three patch assessment techniques discard in total 33/64 patches that are overfitting. This sets a baseline for future research on automatic repair on QuixBugs. Our analysis also highlights the major properties and challenges of how to perform automated correctness assessment of program repair patches. All experimental results are publicly available on Github in order to facilitate future research on automatic program repair.

Language: English

Citations

41