Published: Oct. 18, 2024
Language: English
Information and Software Technology, Journal Year: 2024, Volume and Issue: 176, P. 107565 - 107565
Published: Aug. 30, 2024
Language: English
Citations: 14
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 812 - 835
Published: July 12, 2024
Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code, so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in Dafny, a verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes the method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and uses retrieval augmentation to supply generated example solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks, and that both models perform best with the CoT prompt. GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems; however, it managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We thus contribute 153 verified Dafny solutions: 50 we wrote manually and 103 synthesized by GPT-4. The benefits of program verification are now within reach of code-generating large language models...
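As a rough illustration of the prompting setup described in this abstract, the Python sketch below shows how the three prompt styles (Contextless, Signature, and Chain of Thought with retrieved examples) might be assembled. It is not the authors' code; every function and parameter name is hypothetical.

```python
# Illustrative sketch only: three hypothetical prompt builders mirroring the
# Contextless, Signature, and CoT prompt styles described in the abstract.

def contextless_prompt(problem: str) -> str:
    # Ask directly for a verified Dafny method, with no extra context.
    return f"Write a Dafny method, with pre/postconditions, that solves:\n{problem}"

def signature_prompt(problem: str, signature: str, tests: list[str]) -> str:
    # Add the target method signature and example test cases.
    cases = "\n".join(tests)
    return (f"Problem: {problem}\nMethod signature: {signature}\n"
            f"Test cases:\n{cases}\nWrite a verified Dafny implementation.")

def chain_of_thought_prompt(problem: str, retrieved_examples: list[str]) -> str:
    # Decompose the task into steps and prepend retrieved example solutions.
    examples = "\n\n".join(retrieved_examples)
    return (f"Example verified Dafny solutions:\n{examples}\n\n"
            f"Problem: {problem}\n"
            "Step 1: Restate the problem.\n"
            "Step 2: Propose preconditions and postconditions.\n"
            "Step 3: Write the Dafny method and any loop invariants.\n"
            "Step 4: Return the final verified method.")
```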
Language: English
Citations: 8
Journal of Computer Science and Technology, Journal Year: 2025, Volume and Issue: 40(1), P. 138 - 157
Published: Jan. 1, 2025
Language: English
Citations: 0
IEEE Transactions on Software Engineering, Journal Year: 2024, Volume and Issue: 50(9), P. 2337 - 2349
Published: July 25, 2024
Recently, deep learning models have shown promising results in test oracle generation. Static evaluation metrics from Natural Language Generation (NLG), such as BLEU, CodeBLEU, ROUGE-L, METEOR, and Accuracy, which are mainly based on textual comparisons, have been widely adopted to measure the performance of Neural Oracle Generation (NOG) models. However, these NLG-based metrics may not reflect the testing effectiveness of the generated oracles within a test suite, which is often measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies plus ChatGPT and empirically investigate their current standing on both kinds of metrics. Specifically, we train and run four state-of-the-art models and use five NLG-based and two test adequacy metrics for our analysis. We apply different correlation analyses between the two sets of metrics. Surprisingly, we found no significant correlation between the NLG-based and the test adequacy metrics. For instance, the project activemq-artemis had the highest scores on all NLG-based metrics among the studied NOGs; however, it had the most projects with a decrease in test adequacy compared to the other NOGs. We further conduct a qualitative analysis to explore the reasons behind these observations, and find that oracles with high NLG-based scores but low test adequacy tend to involve complex or multiple chained method invocations in the oracle's parameters, making them hard for the model to generate completely, which affects the adequacy metrics. On the other hand, oracles with low NLG-based scores but high adequacy tend to call different assertion types or functions that behave similarly to the ones in the ground truth. Overall, this work complements prior studies with an extensive evaluation across NLG-based and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software testing in the future.
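To make the metric comparison concrete, here is a minimal Python sketch of the kind of correlation analysis the abstract describes, relating a static NLG score to a dynamic mutation score per project. The numeric values are invented for demonstration, and the study's actual pipeline is not reproduced.

```python
# Illustrative sketch: correlate static NLG scores with dynamic test-adequacy scores.
# All data values below are made up for demonstration purposes.
from scipy.stats import spearmanr, pearsonr

# One entry per project: NLG-based score (e.g., BLEU) and adequacy score (e.g., mutation score).
nlg_scores = [0.72, 0.65, 0.81, 0.58, 0.90]
mutation_scores = [0.41, 0.55, 0.38, 0.60, 0.35]

rho, rho_p = spearmanr(nlg_scores, mutation_scores)
r, r_p = pearsonr(nlg_scores, mutation_scores)

# A weak or non-significant correlation (large p-value) would echo the paper's finding
# that textual similarity does not imply testing effectiveness.
print(f"Spearman rho={rho:.2f} (p={rho_p:.2f}), Pearson r={r:.2f} (p={r_p:.2f})")
```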
Language: English
Citations: 3
Published: Jan. 1, 2024
The rapid advancements in generative artificial intelligence (AI) offer multiple opportunities for its application in various domains, including software engineering (SE). This chapter explores the benefits and challenges of utilizing generative AI in the different activities of the software development cycle that involve code generation. We review approaches that leverage generative AI, either independently or in combination with traditional SE techniques, to complete a diverse set of tasks such as feature implementation, generating test cases, and repairing programs. Additionally, we discuss the potential pitfalls of using generative AI to perform such tasks, as well as the quality of the code generated by these models. Finally, we explore research directions for harnessing generative AI, with particular emphasis on tasks that require...
Language: English
Citations: 1
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1750 - 1771
Published: July 12, 2024
Unit testing is widely recognized as an essential aspect of the software development process. Generating high-quality assertions automatically is one of the most important and challenging problems in automatic unit test generation. To generate assertions, deep-learning-based approaches have been proposed in recent years. For state-of-the-art deep-learning-based approaches for assertion generation (DLAGs), the focal method (i.e., the main method under test) of a test case plays an important role, being a required part of the input to these approaches. To use DLAGs in practice, there are two ways to provide focal methods to these approaches: (1) manually providing the developer-intended focal method, or (2) identifying the likely focal method from a given test prefix (the complete test code excluding assertions) with test-to-code traceability techniques. However, all DLAGs are evaluated on the ATLAS dataset, where the focal method is assumed to be the last non-JUnit-API method invoked in both the test prefix and the assertion portion. There exist issues in the existing empirical evaluations of DLAGs, causing an inaccurate assessment of their adoption in practice. First, it is unclear whether the last call before assertion (LCBA) technique can accurately reflect developer-intended focal methods. Second, when applying DLAGs in practice, the assertion portion is not available (it is actually the output of the DLAGs); thus, the assumption made by the ATLAS dataset does not hold in practical usage scenarios of DLAGs. To address the first issue, we conduct a study of seven test-to-code traceability techniques in the focal-method identification scenario. We find that LCBA, the best among the studied techniques, can identify focal methods with only 43.38% precision and 38.42% recall; it therefore cannot accurately reflect developer-intended focal methods, raising a concern about using it for evaluation. To address the second issue, along with the concern raised by the preceding finding, we apply the traceability techniques to test prefixes, respectively, to construct new datasets named ATLAS+ by replacing the focal methods with those identified by the respective techniques. On the test sets of ATLAS+, we evaluate four DLAGs trained on the ATLAS training dataset. The approaches achieve lower accuracy than on the corresponding ATLAS test set, indicating that DLAGs should be (re)evaluated on ATLAS+, which better reflects practical usage scenarios. In addition, using the training sets of ATLAS+ helps effectively improve the T5-based approach over...
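For readers unfamiliar with the LCBA heuristic mentioned above, the following is a deliberately naive Python sketch of the idea (pick the last non-JUnit call in a test prefix). It is an assumption-laden approximation using a regular expression, not the paper's implementation.

```python
# Illustrative sketch of a last-call-before-assertion (LCBA) style heuristic,
# approximated with a regular expression over a Java-like test prefix.
import re

CALL = re.compile(r"(\w+)\s*\(")
JUNIT_APIS = {"assertEquals", "assertTrue", "assertFalse", "assertNull", "assertThrows"}

def lcba_focal_method(test_prefix: str) -> str | None:
    """Return the last non-JUnit method name invoked in the test prefix."""
    calls = [m.group(1) for m in CALL.finditer(test_prefix)]
    non_junit = [c for c in calls if c not in JUNIT_APIS]
    return non_junit[-1] if non_junit else None

prefix = """
Account acc = new Account();
acc.deposit(100);
int balance = acc.getBalance();
"""
print(lcba_focal_method(prefix))  # -> "getBalance"
```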
Language: English
Citations: 1
Published: Sept. 11, 2024
Language: English
Citations: 1
IEEE Access, Journal Year: 2024, Volume and Issue: 12, P. 63904 - 63916
Published: Jan. 1, 2024
Estimating software testability can crucially assist software managers to optimize test budgets and quality. In this paper, we propose a new approach that radically differs from the traditional approach of pursuing testability measurements based on software metrics, e.g., the size of the code or the complexity of the designs. Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases. We elaborate on the intuitions and methodological choices that underlie our proposal for estimating testability, introduce a technique and a prototype that allow us to concretely estimate testability accordingly, and discuss our findings from a set of experiments in which we compare the performance of our testability estimations both against and in combination with traditional software metrics. The results show that our estimates capture a complementary dimension of testability that can be synergistically combined with approaches based on software metrics to improve the accuracy of predictions.
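As a hedged sketch of how automatic test generation and mutation analysis could be turned into a testability estimate, the function below combines a mutation score with the effort spent generating tests. The formula and all names are illustrative assumptions, not the authors' technique.

```python
# Illustrative sketch only: a made-up testability proxy that rewards classes whose
# automatically generated tests kill many mutants with little generation effort.

def testability_score(killed_mutants: int, total_mutants: int, generated_tests: int) -> float:
    """Higher when a small generated suite kills many mutants (easier to test)."""
    if total_mutants == 0 or generated_tests == 0:
        return 0.0
    mutation_score = killed_mutants / total_mutants
    effort_penalty = 1.0 / generated_tests
    return mutation_score * effort_penalty

# Class A: 45/50 mutants killed by 10 generated tests; class B: 20/50 killed by 40 tests.
print(testability_score(45, 50, 10))  # relatively high -> easier to test
print(testability_score(20, 50, 40))  # relatively low  -> harder to test
```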
Language: English
Citations: 0
Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Journal Year: 2024, Volume and Issue: unknown, P. 1282 - 1291
Published: April 8, 2024
Researchers have recently leveraged transformer-based test code generation models to improve testing-related tasks (e.g., assert completion and test method generation). One such model, AthenaTest, has been introduced and well accepted by developers due to its ability to generate test cases similar to the ones written by developers. While the AthenaTest model provides adequate coverage and improves readability, concerns remain regarding the quality of the generated test code, particularly the presence of test smells, which could degrade the code's comprehension, performance, and maintainability. In this paper, we investigated whether AthenaTest-generated test cases contain test smells, including smells present in its fine-tuning dataset (Methods2Test). We evaluated seven test smells in AthenaTest's generated test cases and in Methods2Test's test cases. Our results reveal that 65% of the fine-tuning test cases influence the generated output, and 62% contain at least one test smell. We also examined the test design (test size and assertion frequency) of Methods2Test; our findings show that AthenaTest tends to generate more assertions than the original test case, which is reflected in an increased occurrence rate of the Assertion Roulette and Duplicate Assert smells.
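To illustrate the two smells named above, the short test below shows Assertion Roulette (several unlabeled assertions) and Duplicate Assert (a repeated check), written here in Python's unittest; the study itself analyzes Java tests, so this is only an analogy.

```python
# Illustrative sketch of the Assertion Roulette and Duplicate Assert test smells.
import unittest

class CartTest(unittest.TestCase):
    def test_cart(self):
        cart = {"apples": 2, "total": 4.0}
        # Assertion Roulette: several assertions with no explanatory messages,
        # so a failure does not say which expectation broke.
        self.assertEqual(cart["apples"], 2)
        self.assertEqual(cart["total"], 4.0)
        self.assertTrue(cart)
        # Duplicate Assert: the same check repeated within one test method.
        self.assertEqual(cart["apples"], 2)

if __name__ == "__main__":
    unittest.main()
```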
Language: English
Citations: 0
Published: Jan. 1, 2024
Generative artificial intelligence (AI), propelled by the advancements in large language models (LLMs), has exhibited remarkable capabilities in various software engineering (SE) tasks and beyond. This development has influenced research studies in this domain. This chapter offers an overview of LLMs, delving into relevant background concepts while exploring advanced techniques at the forefront of LLM research. We review LLM architectures, in addition to discussing training, fine-tuning, and in-context learning. We also discuss different adaptation approaches for LLMs and augmented LLMs. Furthermore, we delve into the evaluation of LLM research, introducing benchmark datasets and evaluation tools in this context. The chapter concludes with the limitations of leveraging LLMs for SE tasks.
Language: English
Citations: 0