Published: Oct. 18, 2024
Language: English
Information and Software Technology, Journal Year: 2024, Volume and Issue: 176, P. 107565 - 107565
Published: Aug. 30, 2024
Language: English
Citations: 14
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 812 - 835
Published: July 12, 2024
Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code, so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in Dafny, a verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes the method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and uses retrieval augmentation to supply generated example solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks, and that both models perform best with the CoT prompt. GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems; however, it managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We thus contribute 153 verified Dafny solutions: 50 we wrote manually and 103 synthesized by GPT-4. The benefits of program verification are now within reach of code-generating large language models...
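As a rough illustration of the prompting setup described in this abstract, the Python sketch below shows how the three prompt styles (Contextless, Signature, and Chain of Thought with retrieved examples) might be assembled. It is not the authors' code; every function and parameter name is hypothetical.

```python
# Illustrative sketch only: three hypothetical prompt builders mirroring the
# Contextless, Signature, and CoT prompt styles described in the abstract.

def contextless_prompt(problem: str) -> str:
    # Ask directly for a verified Dafny method, with no extra context.
    return f"Write a Dafny method, with pre/postconditions, that solves:\n{problem}"

def signature_prompt(problem: str, signature: str, tests: list[str]) -> str:
    # Add the target method signature and example test cases.
    cases = "\n".join(tests)
    return (f"Problem: {problem}\nMethod signature: {signature}\n"
            f"Test cases:\n{cases}\nWrite a verified Dafny implementation.")

def chain_of_thought_prompt(problem: str, retrieved_examples: list[str]) -> str:
    # Decompose the task into steps and prepend retrieved example solutions.
    examples = "\n\n".join(retrieved_examples)
    return (f"Example verified Dafny solutions:\n{examples}\n\n"
            f"Problem: {problem}\n"
            "Step 1: Restate the problem.\n"
            "Step 2: Propose preconditions and postconditions.\n"
            "Step 3: Write the Dafny method and any loop invariants.\n"
            "Step 4: Return the final verified method.")
```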
Language: English
Citations: 8
Journal of Computer Science and Technology, Journal Year: 2025, Volume and Issue: 40(1), P. 138 - 157
Published: Jan. 1, 2025
Language: English
Citations: 0
IEEE Transactions on Software Engineering, Journal Year: 2024, Volume and Issue: 50(9), P. 2337 - 2349
Published: July 25, 2024
Recently, deep learning models have shown promising results in test oracle generation. Static evaluation metrics from Natural Language Generation (NLG), such as BLEU, CodeBLEU, ROUGE-L, METEOR, and Accuracy, which are mainly based on textual comparisons, have been widely adopted to measure the performance of Neural Oracle Generation (NOG) models. However, these NLG-based metrics may not reflect the testing effectiveness of the generated oracles within a test suite, which is often measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies plus ChatGPT and empirically investigate their current standing on both kinds of metrics. Specifically, we train and run four state-of-the-art models and use five NLG-based and two test adequacy metrics for our analysis. We apply different correlation analyses between the two sets of metrics. Surprisingly, we found no significant correlation between the NLG-based and the test adequacy metrics. For instance, the project activemq-artemis had the highest scores on all NLG-based metrics among the studied NOGs; however, it had the most projects with a decrease in test adequacy compared to the other NOGs. We further conduct a qualitative analysis to explore the reasons behind these observations, and find that oracles with high NLG-based scores but low test adequacy tend to involve complex or multiple chained method invocations in the oracle's parameters, making them hard for the model to generate completely, which affects the adequacy metrics. On the other hand, oracles with low NLG-based scores but high adequacy tend to call different assertion types or functions that behave similarly to the ones in the ground truth. Overall, this work complements prior studies with an extensive evaluation across NLG-based and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software testing in the future.
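To make the metric comparison concrete, here is a minimal Python sketch of the kind of correlation analysis the abstract describes, relating a static NLG score to a dynamic mutation score per project. The numeric values are invented for demonstration, and the study's actual pipeline is not reproduced.

```python
# Illustrative sketch: correlate static NLG scores with dynamic test-adequacy scores.
# All data values below are made up for demonstration purposes.
from scipy.stats import spearmanr, pearsonr

# One entry per project: NLG-based score (e.g., BLEU) and adequacy score (e.g., mutation score).
nlg_scores = [0.72, 0.65, 0.81, 0.58, 0.90]
mutation_scores = [0.41, 0.55, 0.38, 0.60, 0.35]

rho, rho_p = spearmanr(nlg_scores, mutation_scores)
r, r_p = pearsonr(nlg_scores, mutation_scores)

# A weak or non-significant correlation (large p-value) would echo the paper's finding
# that textual similarity does not imply testing effectiveness.
print(f"Spearman rho={rho:.2f} (p={rho_p:.2f}), Pearson r={r:.2f} (p={r_p:.2f})")
```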
Language: English
Citations: 3
Published: Jan. 1, 2024
The rapid advancements in generative artificial intelligence (AI) offer multiple opportunities for its application in various domains, including software engineering (SE). This chapter explores the benefits and challenges of utilizing generative AI in the different activities of the software development cycle that involve code generation. We review approaches that leverage generative AI, either independently or in combination with traditional SE techniques, to complete a diverse set of tasks such as feature implementation, generating test cases, and repairing programs. Additionally, we discuss the potential pitfalls of using generative AI to perform such tasks, as well as the quality of the code generated by these models. Finally, we explore research directions for harnessing generative AI, with particular emphasis on tasks that require...
Language: English
Citations: 1
Proceedings of the ACM on Software Engineering, Journal Year: 2024, Volume and Issue: 1(FSE), P. 1750 - 1771
Published: July 12, 2024
Unit testing is widely recognized as an essential aspect of the software development process. Generating high-quality assertions automatically is one of the most important and challenging problems in automatic unit test generation. To generate assertions, deep-learning-based approaches have been proposed in recent years. For state-of-the-art deep-learning-based approaches for assertion generation (DLAGs), the focal method (i.e., the main method under test) of a test case plays an important role, being a required part of the input to these approaches. To use DLAGs in practice, there are two ways to provide focal methods to these approaches: (1) manually providing the developer-intended focal method, or (2) identifying the likely focal method from a given test prefix (the complete test code excluding assertions) with test-to-code traceability techniques. However, all DLAGs are evaluated on the ATLAS dataset, where the focal method is assumed to be the last non-JUnit-API method invoked in both the test prefix and the assertion portion. There exist issues in the existing empirical evaluations of DLAGs, causing an inaccurate assessment of their adoption in practice. First, it is unclear whether the last call before assertion (LCBA) technique can accurately reflect developer-intended focal methods. Second, when applying DLAGs in practice, the assertion portion is not available (it is actually the output of the DLAGs); thus, the assumption made by the ATLAS dataset does not hold in practical usage scenarios of DLAGs. To address the first issue, we conduct a study of seven test-to-code traceability techniques in the focal-method identification scenario. We find that LCBA, the best among the studied techniques, can identify focal methods with only 43.38% precision and 38.42% recall; it therefore cannot accurately reflect developer-intended focal methods, raising a concern about using it for evaluation. To address the second issue, along with the concern raised by the preceding finding, we apply the traceability techniques to test prefixes, respectively, to construct new datasets named ATLAS+ by replacing the focal methods with those identified by the respective techniques. On the test sets of ATLAS+, we evaluate four DLAGs trained on the ATLAS training dataset. The approaches achieve lower accuracy than on the corresponding ATLAS test set, indicating that DLAGs should be (re)evaluated on ATLAS+, which better reflects practical usage scenarios. In addition, using the training sets of ATLAS+ helps effectively improve the T5-based approach over...
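For readers unfamiliar with the LCBA heuristic mentioned above, the following is a deliberately naive Python sketch of the idea (pick the last non-JUnit call in a test prefix). It is an assumption-laden approximation using a regular expression, not the paper's implementation.

```python
# Illustrative sketch of a last-call-before-assertion (LCBA) style heuristic,
# approximated with a regular expression over a Java-like test prefix.
import re

CALL = re.compile(r"(\w+)\s*\(")
JUNIT_APIS = {"assertEquals", "assertTrue", "assertFalse", "assertNull", "assertThrows"}

def lcba_focal_method(test_prefix: str) -> str | None:
    """Return the last non-JUnit method name invoked in the test prefix."""
    calls = [m.group(1) for m in CALL.finditer(test_prefix)]
    non_junit = [c for c in calls if c not in JUNIT_APIS]
    return non_junit[-1] if non_junit else None

prefix = """
Account acc = new Account();
acc.deposit(100);
int balance = acc.getBalance();
"""
print(lcba_focal_method(prefix))  # -> "getBalance"
```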
Language: English
Citations: 1
Published: Sept. 11, 2024
Language: English
Citations: 1
IEEE Access, Journal Year: 2024, Volume and Issue: 12, P. 63904 - 63916
Published: Jan. 1, 2024
Estimating software testability can crucially assist software managers to optimize test budgets and quality. In this paper, we propose a new approach that radically differs from the traditional approach of pursuing testability measurements based on software metrics, e.g., the size of the code or the complexity of the designs. Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases. We elaborate on the intuitions and methodological choices that underlie our proposal for estimating testability, introduce a technique and a prototype that allow us to concretely estimate testability accordingly, and discuss our findings from a set of experiments in which we compare the performance of our testability estimations both against and in combination with traditional software metrics. The results show that our estimates capture a complementary dimension of testability that can be synergistically combined with approaches based on software metrics to improve the accuracy of predictions.
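As a hedged sketch of how automatic test generation and mutation analysis could be turned into a testability estimate, the function below combines a mutation score with the effort spent generating tests. The formula and all names are illustrative assumptions, not the authors' technique.

```python
# Illustrative sketch only: a made-up testability proxy that rewards classes whose
# automatically generated tests kill many mutants with little generation effort.

def testability_score(killed_mutants: int, total_mutants: int, generated_tests: int) -> float:
    """Higher when a small generated suite kills many mutants (easier to test)."""
    if total_mutants == 0 or generated_tests == 0:
        return 0.0
    mutation_score = killed_mutants / total_mutants
    effort_penalty = 1.0 / generated_tests
    return mutation_score * effort_penalty

# Class A: 45/50 mutants killed by 10 generated tests; class B: 20/50 killed by 40 tests.
print(testability_score(45, 50, 10))  # relatively high -> easier to test
print(testability_score(20, 50, 40))  # relatively low  -> harder to test
```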
Language: English
Citations: 0
Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, Journal Year: 2024, Volume and Issue: unknown, P. 1282 - 1291
Published: April 8, 2024
Researchers have recently leveraged transformer-based test code generation models to improve testing-related tasks (e.g., assert completion and test method generation). One such model, AthenaTest, has been introduced and well accepted by developers due to its ability to generate test cases similar to the ones written by developers. While the AthenaTest model provides adequate coverage and improves readability, concerns remain regarding the quality of the generated test code, particularly the presence of test smells, which could degrade the code's comprehension, performance, and maintainability. In this paper, we investigated whether AthenaTest-generated test cases contain test smells, including smells present in its fine-tuning dataset (Methods2Test). We evaluated seven test smells in AthenaTest's generated test cases and in Methods2Test's test cases. Our results reveal that 65% of the fine-tuning test cases influence the generated output, and 62% contain at least one test smell. We also examined the test design (test size and assertion frequency) of Methods2Test; our findings show that AthenaTest tends to generate more assertions than the original test case, which is reflected in an increased occurrence rate of the Assertion Roulette and Duplicate Assert smells.
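To illustrate the two smells named above, the short test below shows Assertion Roulette (several unlabeled assertions) and Duplicate Assert (a repeated check), written here in Python's unittest; the study itself analyzes Java tests, so this is only an analogy.

```python
# Illustrative sketch of the Assertion Roulette and Duplicate Assert test smells.
import unittest

class CartTest(unittest.TestCase):
    def test_cart(self):
        cart = {"apples": 2, "total": 4.0}
        # Assertion Roulette: several assertions with no explanatory messages,
        # so a failure does not say which expectation broke.
        self.assertEqual(cart["apples"], 2)
        self.assertEqual(cart["total"], 4.0)
        self.assertTrue(cart)
        # Duplicate Assert: the same check repeated within one test method.
        self.assertEqual(cart["apples"], 2)

if __name__ == "__main__":
    unittest.main()
```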
Language: English
Citations: 0
Published: Jan. 1, 2024
Generative artificial intelligence (AI), propelled by the advancements in large language models (LLMs), has exhibited remarkable capabilities in various software engineering (SE) tasks and beyond. This development has influenced research studies in this domain. This chapter offers an overview of LLMs, delving into relevant background concepts while exploring advanced techniques at the forefront of LLM research. We review LLM architectures, in addition to discussing training, fine-tuning, and in-context learning. We also discuss different adaptation approaches for LLMs and augmented LLMs. Furthermore, we delve into the evaluation of LLM research, introducing benchmark datasets and evaluation tools in this context. The chapter concludes with the limitations of leveraging LLMs for SE tasks.
Language: English
Citations: 0