2022 IEEE Symposium on Computers and Communications (ISCC),
Journal Year: 2023, Volume and Issue: unknown, P. 245 - 251
Published: July 9, 2023
Flaky tests can pose a challenge for software development, as they produce inconsistent results even when there are no changes to the code or test. This leads to unreliable outcomes and makes it difficult to diagnose and troubleshoot any issues. In this study, we aim to identify flaky test cases in development using a black-box approach. Test smells are indicators of poor quality that can cause issues in development. Our proposed model, Fast-Flaky, achieved the best cross-validation results. In per-project validation, it showed an overall increase in accuracy but decreased performance on other metrics. However, there were some projects where results improved with pre-processing techniques. These findings provide practitioners with a method for identifying flaky tests and may inspire further research on the effectiveness of different techniques and the use of additional test smells.
Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping test suites flakiness-free. Recently, the research community has been pushing towards the detection of flaky tests by suggesting many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even with the high performances reported, it remains challenging to understand the cause of flakiness. This part is crucial for researchers and developers that aim to fix it. To help the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We evaluate it on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more difficult to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights the statements influencing the categorization.
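FlakyCat's actual pipeline (CodeBERT embeddings plus Siamese networks) is beyond a short sketch, but the underlying idea — assigning a test to the root-cause category whose known examples lie closest in an embedding space — can be illustrated with plain cosine similarity over stand-in vectors. All vectors and category names below are illustrative, not the paper's data:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def prototype(vectors):
    """Mean embedding of a category's known flaky tests."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(embedding, prototypes):
    """Assign the category whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(embedding, prototypes[c]))

# Toy 3-d stand-ins for CodeBERT embeddings (illustrative only).
known = {
    "async-wait": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "concurrency": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
prototypes = {cat: prototype(vs) for cat, vs in known.items()}
print(classify([0.85, 0.15, 0.05], prototypes))  # → async-wait
```

A Siamese network learns the similarity function instead of fixing it to cosine, which is what makes few-shot classification over small flaky-test datasets feasible.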
Journal of Systems and Software,
Journal Year: 2023, Volume and Issue: 206, P. 111837
Published: Sept. 7, 2023
Flaky tests (tests with non-deterministic outcomes) pose a major challenge for software testing. They are known to cause significant issues, such as reducing the effectiveness and efficiency of testing and delaying releases. In recent years, there has been an increased interest in flaky tests, with research focusing on different aspects of flakiness, such as identifying their causes, detection methods and mitigation strategies. Test flakiness has also become a key discussion point for practitioners (in blog posts, technical magazines, etc.) as its impact is felt across the industry. This paper presents a multivocal review that investigates how researchers and practitioners have addressed the topic in both research and practice. Out of the 560 articles we reviewed, we identified and analysed a total of 200 articles focused on test flakiness (composed of 109 academic and 91 grey literature articles/posts) and structured the body of relevant knowledge using four dimensions: causes, detection, impact, and responses. For each of those dimensions, we provide a categorization to classify existing research, discussions, and tools. With this, we provide a comprehensive snapshot of current thinking on test flakiness, covering both academic views and industrial practices, and identify limitations and opportunities for future research.
Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6,356 projects written in Java and Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common among generated tests as among developer-written tests. Nevertheless, existing suppression mechanisms implemented in the tools are effective in alleviating this issue (71.7% fewer flaky tests). Compared to developer-written tests, the causes of flakiness are distributed differently: the non-deterministic behavior of generated tests is more frequently caused by randomness, rather than by networking or concurrency. With suppression enabled, the remaining flaky tests differ significantly from any previously reported, where most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help tool maintainers improve their tools, give recommendations to developers, and serve as a foundation for future research.
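The detection procedure described above — rerunning each generated test many times and flagging inconsistent outcomes — can be sketched as follows. The 200-rerun count matches the study's setup; the sample tests are contrived stand-ins:

```python
def run_repeatedly(test_fn, runs=200):
    """Execute a zero-argument test N times; record pass/fail per run."""
    outcomes = []
    for _ in range(runs):
        try:
            test_fn()
            outcomes.append("pass")
        except AssertionError:
            outcomes.append("fail")
    return outcomes

def is_flaky(test_fn, runs=200):
    """A test is flagged flaky if outcomes disagree across reruns."""
    return len(set(run_repeatedly(test_fn, runs))) > 1

# Contrived stand-in: fails on every third invocation.
calls = {"n": 0}
def sometimes_fails():
    calls["n"] += 1
    assert calls["n"] % 3 != 0

def always_passes():
    assert True

print(is_flaky(sometimes_fails))  # → True
print(is_flaky(always_passes))    # → False
```

Rerun-based detection is probabilistic in practice: a test that fails only under rare interleavings can still look stable across any fixed number of reruns, which is one reason the study executes each test 200 times.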
2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion),
Journal Year: 2022, Volume and Issue: unknown, P. 120 - 124
Published: May 1, 2022
Developers typically run tests after code changes. Flaky tests, which are tests that can nondeterministically pass and fail when run on the same version of code, can mislead developers about their recent changes. Much prior work on flaky tests is focused on Java projects. One prominent category is order-dependent (OD) tests, which pass or fail depending on the order in which they are run. For example, our prior work proposed using other tests in the test suite to fix (or correctly set up) the state needed by OD tests to pass. Unlike Java, other programming languages have received less attention. To help with this problem, another piece of work recently studied Python projects and detected many flaky tests. Unfortunately, it did not identify tests in the suites that can be used to fix OD tests. To fill this gap, we propose iPFlakies, a framework for automatically detecting and fixing OD tests in Python. Using iPFlakies, we extend the prior work's dataset to include (1) scripts to reproduce OD test failures and (2) patches that fix them. Our study finds that reproducing the passing and failing results can be difficult, but that iPFlakies is effective at fixing OD tests. To aid future research, we make our framework, improvements, and experimental infrastructure publicly available.
Context: Test flakiness arises when test cases have a non-deterministic, intermittent behavior that leads them to either pass or fail when run against the same code. While researchers have been contributing to the detection, classification, and removal of flaky tests with several empirical studies and automated techniques, little is known about how the problem manifests in mobile applications.

Objective: We point out the lack of knowledge on: (1) the prominence and harmfulness of the problem; (2) the most frequent root causes inducing flakiness; and (3) the strategies applied by practitioners to deal with it in practice. An improved understanding of these matters may lead the software engineering research community to assess the need for tailoring existing instruments to this context or devising brand-new approaches that focus on the peculiarities identified.

Method: We address this gap by means of an empirical study into the developer's perception of test flakiness. We first perform a systematic grey literature review to elicit how developers discuss the problem in the wild. Then, we complement it through a survey that involves 130 developers and aims at analyzing their experience on the matter.

Result: The results indicate that developers are often concerned about flakiness connected to user interface elements. In addition, our study reveals that flakiness is perceived as critical by developers, who pointed out major production code- and source code design-related causes of flakiness, other than the long-term effects of recurrent flaky tests. Furthermore, it lets the limitations of the diagnosing and fixing processes currently adopted emerge.

Conclusion: We conclude by distilling lessons learned, implications, and future directions.
Information and Software Technology,
Journal Year: 2024, Volume and Issue: 168, P. 107394
Published: Jan. 6, 2024
ACM Transactions on Software Engineering and Methodology,
Journal Year: 2024, Volume and Issue: unknown
Published: Sept. 13, 2024
Asynchronous waits are a common root cause of flaky tests and a major time-influential factor in web application testing. We build a dataset of 49 reproducible asynchronous wait flaky tests and their fixes from 26 open-source projects to study their characteristics in web testing. Our study reveals that developers adjusted the wait time to address flakiness in about 63% of cases (31 out of 49), even when the underlying causes lie elsewhere. From this, we introduce TRaf, an automated time-based repair technique for web applications. TRaf determines appropriate wait times for asynchronous calls in web applications by analyzing code similarity and past change history. Its key insight is that an efficient wait time can be inferred from the current or past codebase, since developers tend to repeat similar mistakes. Our analysis shows that TRaf can statically suggest shorter wait times that alleviate async wait flakiness immediately upon detection, reducing test execution time by 11.1% compared to the timeout values initially chosen by developers. With optional dynamic tuning, it can reduce execution time by 16.8% over its initial suggestion, and its refinement of developer-written patches reduces execution time by 6.2% over the original patches. Overall, we sent 16 pull requests based on our dataset, each fixing one flaky test. So far, three have been accepted.
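The class of fixes TRaf targets can be contrasted with a common manual pattern: instead of a hard-coded sleep, use a polling wait whose timeout is tunable, so a generous bound costs little when the asynchronous work finishes early. The sketch below is a generic illustration and does not reproduce TRaf's similarity-based timeout inference:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll until the condition holds or the timeout expires.

    Unlike a fixed time.sleep(timeout), this returns as soon as the
    condition is met, so a generous timeout costs nothing when the
    asynchronous work finishes early.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

# Contrived asynchronous work: "ready" 0.2 s after start.
start = time.monotonic()
def page_loaded():
    return time.monotonic() - start > 0.2

ok = wait_until(page_loaded, timeout=5.0)
elapsed = time.monotonic() - start
print(ok, round(elapsed, 1))  # succeeds well before the 5 s timeout
```

Under this pattern, the tuning problem TRaf addresses reduces to choosing `timeout` (and, for fixed sleeps, the sleep duration) large enough to avoid flakiness but small enough not to inflate suite runtime.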