Proof Repair Utilizing Large Language Models: A Case Study on the Copland Remote Attestation Proofbase DOI

Amer Tahat, David Hardin, Adam Petz

et al.

Lecture notes in computer science, Journal Year: 2024, Volume and Issue: unknown, P. 145 - 166

Published: Dec. 29, 2024

Language: English

From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs DOI Creative Commons
Jialun Cao, Yaojie Lu

Published: Feb. 18, 2025

Research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in competitions like the IMO and have made significant progress. However, they intertwine multiple skills simultaneously, i.e., problem-solving, reasoning, and writing formal specifications, making it hard to precisely identify the LLMs' strengths and weaknesses in each task. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five mainstream formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) in six formal-verification-related tasks by distilling gpt-4o, and evaluated them against ten open-sourced LLMs, including the recently popular DeepSeek-R1. We found that LLMs are good at writing proof segments when given either the code or a detailed description of proof steps. Also, fine-tuning brought about a nearly threefold improvement at most. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released to facilitate subsequent studies at https://huggingface.co/fm-universe.

Language: English

Citations

0

Challenges and Paths Towards AI for Software Engineering DOI
Alex Gu, Naman Jain, Weizhong Li

et al.

Published: April 4, 2025

AI for software engineering has made remarkable progress recently, becoming a notable success within generative AI. Despite this, there are still many challenges that need to be addressed before automated software engineering reaches its full potential. It should be possible to reach high levels of automation where humans can focus on the critical decisions of what to build and how to balance difficult tradeoffs, while most routine development effort is automated away. Reaching this level of automation will require substantial research efforts across academia and industry. In this paper, we aim to discuss the path towards it in a threefold manner. First, we provide a structured taxonomy of concrete tasks in AI for software engineering, emphasizing the many tasks beyond code generation and completion. Second, we outline several key bottlenecks that limit current approaches. Finally, we provide an opinionated list of promising research directions toward addressing these bottlenecks, hoping to inspire future research in this rapidly maturing field.

Language: English

Citations

0

Metamorph: Synthesizing Large Objects from Dafny Specifications DOI Open Access
Aleksandr Fedchin, Alexander Y. Bai, Jeffrey S. Foster

et al.

Proceedings of the ACM on Programming Languages, Journal Year: 2025, Volume and Issue: 9(OOPSLA1), P. 759 - 785

Published: April 9, 2025

Program synthesis aims to produce code that adheres to user-provided specifications. In this work, we focus on synthesizing sequences of calls to formally specified APIs to generate objects that satisfy certain properties. This problem is particularly relevant in automated test generation, where a test generation engine may need to construct an object with specific properties to trigger a given execution path. Constructing instances of complex data structures may require dozens of method calls, but reasoning about consecutive calls is computationally expensive, and existing work typically limits the number of calls in the solution. In this paper, we aim to synthesize such long call sequences in the Dafny programming language. To this end, we introduce Metamorph, a tool that uses counterexamples returned by the Dafny verifier to reason about the effects of one method call at a time, limiting the complexity of solver queries. We also aim to limit the overall number of SMT queries by comparing object states using two distance metrics we develop for guiding the synthesis process. In particular, we introduce a novel piecewise metric, which puts a provably correct lower bound on the length of the solution and allows us to frame synthesis as weighted A* search. When computing the distance, we view object states as conjunctions of atomic constraints, identify the constraints each method call can satisfy, and combine this information using integer programming. We evaluate Metamorph's ability to synthesize large objects on six benchmarks defining key data structures: linked lists, queues, arrays, binary trees, and graphs. Metamorph can successfully construct programs with up to 57 method calls per instance and compares favorably to an alternative baseline approach. Additionally, we integrate Metamorph with DTest, Dafny's test generation toolkit, and show that it can synthesize inputs for parts of the AWS Cryptographic Material Providers Library that DTest alone is not able to cover. Finally, we use Metamorph to construct executable bytecode for a simple virtual machine, demonstrating that the techniques described here are more broadly applicable in the context of specification-guided synthesis.
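The weighted A* formulation described in the abstract can be illustrated with a toy sketch. Everything here is invented for illustration (the call names, the constraint sets, and the heuristic), and the sketch deliberately omits the verifier counterexamples and integer programming the real tool relies on; it only shows how an admissible lower bound on remaining calls turns synthesis into an A* search over call sequences:

```python
import heapq
import itertools
from math import ceil

# Hypothetical model: an object state is the set of atomic goal constraints
# that currently hold; each API call is known to establish some constraints.
CALLS = {
    "Enqueue": {"nonempty"},
    "SetHead": {"head_ok"},
    "Link":    {"linked"},
}

def lower_bound(unsat, calls):
    """Admissible heuristic: no single call satisfies more than `best` of the
    outstanding constraints, so at least ceil(|unsat| / best) calls remain."""
    if not unsat:
        return 0
    best = max(len(established & unsat) for established in calls.values())
    return ceil(len(unsat) / best) if best else float("inf")

def synthesize(goal, calls=CALLS):
    """Weighted A* over call sequences: g = calls made so far,
    h = piecewise-style lower bound on calls still needed."""
    tick = itertools.count()          # tie-breaker for the heap
    start = frozenset()
    frontier = [(lower_bound(goal, calls), next(tick), 0, start, [])]
    seen = {start}
    while frontier:
        _, _, g, state, seq = heapq.heappop(frontier)
        if goal <= state:
            return seq
        for name, established in calls.items():
            nxt = frozenset(state | established)
            if nxt not in seen:
                seen.add(nxt)
                f = g + 1 + lower_bound(goal - nxt, calls)
                heapq.heappush(frontier, (f, next(tick), g + 1, nxt, seq + [name]))
    return None

print(synthesize({"nonempty", "head_ok", "linked"}))
# → ['Enqueue', 'SetHead', 'Link']
```

Because each hypothetical call satisfies exactly one goal constraint, the heuristic equals the true remaining distance here, so A* finds the three-call sequence without exploring longer candidates.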

Language: English

Citations

0

Laurel: Unblocking Automated Verification with Large Language Models DOI Open Access
Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova

et al.

Proceedings of the ACM on Programming Languages, Journal Year: 2025, Volume and Issue: 9(OOPSLA1), P. 1519 - 1545

Published: April 9, 2025

Program verifiers such as Dafny automate proofs by outsourcing them to an SMT solver. This automation is not perfect, however, and the solver often requires hints in the form of assertions, creating a burden for the proof engineer. In this paper, we propose Laurel, a tool that alleviates this burden by automatically generating assertions using large language models (LLMs). To improve the success rate of LLMs on this task, we design two domain-specific prompting techniques. First, we help the LLM determine the location of the missing assertion by analyzing the verifier's error message and inserting an assertion placeholder at that location. Second, we provide the LLM with example assertions from the same codebase, which we select based on a new similarity metric. We evaluate our techniques on a benchmark dataset of complex lemmas extracted from three real-world codebases. Our evaluation shows that Laurel is able to generate over 56.6% of the required assertions given only a few attempts, making LLMs affordable for unblocking program verifiers without human intervention.
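The first prompting technique in the abstract can be sketched in a few lines. This is a minimal illustration, not Laurel's implementation: the error-message format, the placeholder syntax, and the prompt wording below are all assumptions invented for the example:

```python
import re

# Assumed Dafny-style error location, e.g. "foo.dfy(4,2): Error: ..."
ERROR_RE = re.compile(r"\((\d+),(\d+)\): Error")

def insert_placeholder(source: str, verifier_error: str) -> str:
    """Insert an assertion placeholder at the line the verifier complains
    about, so the LLM only has to fill in the assertion body."""
    m = ERROR_RE.search(verifier_error)
    if not m:
        return source
    line = int(m.group(1))
    lines = source.splitlines()
    indent = len(lines[line - 1]) - len(lines[line - 1].lstrip())
    lines.insert(line - 1, " " * indent + "assert /* TODO: missing assertion */;")
    return "\n".join(lines)

def build_prompt(source: str, error: str, similar_example: str) -> str:
    """Combine both techniques: the placeholder-marked lemma plus an example
    assertion selected from the same codebase."""
    return (
        "The Dafny verifier failed with:\n" + error + "\n\n"
        "A similar lemma from the same codebase used this assertion:\n"
        + similar_example + "\n\n"
        "Replace the TODO placeholder so the lemma verifies:\n"
        + insert_placeholder(source, error)
    )

src = "lemma Foo(x: int)\n  ensures x + 0 == x\n{\n  Helper(x);\n}"
err = "foo.dfy(4,2): Error: a postcondition could not be proved"
print(build_prompt(src, err, "assert x + 0 == x;"))
```

Localizing the hole first is the key design point: instead of asking the model to rewrite the whole lemma, the prompt constrains it to fill a single marked position, which shrinks the space of outputs the verifier must check.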

Language: English

Citations

0

Guiding Enumerative Program Synthesis with Large Language Models DOI Creative Commons
Yixuan Li, Julian Parsert, Elizabeth Polgreen

et al.

Lecture notes in computer science, Journal Year: 2024, Volume and Issue: unknown, P. 280 - 301

Published: Jan. 1, 2024

Pre-trained Large Language Models (LLMs) are beginning to dominate the discourse around automatic code generation from natural language specifications. In contrast, the best-performing synthesizers in the domain of formal synthesis with precise logical specifications are still based on enumerative algorithms. In this paper, we evaluate the ability of LLMs to solve formal synthesis benchmarks by carefully crafting a library of prompts for the domain. When one-shot synthesis fails, we propose a novel enumerative synthesis algorithm, which integrates calls to an LLM into a weighted probabilistic search. This allows the synthesizer to provide the LLM with information about the progress of the enumerator, and the LLM to provide the enumerator with syntactic guidance in an iterative loop. We evaluate our techniques on benchmarks from the Syntax-Guided Synthesis (SyGuS) competition. We find that GPT-3.5 as a stand-alone tool is easily outperformed by state-of-the-art formal synthesis algorithms, but our approach integrating the LLM into an enumerative algorithm shows significant performance gains over both the LLM alone and the winning SyGuS competition tool.
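The iterative loop in the abstract, where LLM output reshapes the enumerator's search order, can be illustrated with a toy weighted enumerator. The grammar, weights, and string-matching reweighting below are invented for illustration and are far simpler than the SyGuS-based algorithm in the paper:

```python
import heapq
import itertools
import math

# Toy PCFG over a tiny expression grammar: production -> weight
# (higher weight = higher probability = enumerated earlier).
GRAMMAR = {"x": 1.0, "1": 1.0, "(+ _ _)": 1.0, "(* _ _)": 1.0}

def cost(prod, weights):
    """Negative log-probability of a production under the current weights."""
    return -math.log(weights[prod] / sum(weights.values()))

def enumerate_exprs(weights, limit):
    """Best-first enumeration: cheapest (most probable) expressions first."""
    tick = itertools.count()
    heap = [(cost(p, weights), next(tick), p) for p in ("x", "1")]
    heapq.heapify(heap)
    done = []
    while heap and len(done) < limit:
        c, _, expr = heapq.heappop(heap)
        done.append(expr)
        for p in ("(+ _ _)", "(* _ _)"):
            for arg in ("x", "1"):
                new = p.replace("_", expr, 1).replace("_", arg, 1)
                new_cost = c + cost(p, weights) + cost(arg, weights)
                heapq.heappush(heap, (new_cost, next(tick), new))
    return done

def reweight(weights, llm_suggestion, boost=4.0):
    """Boost productions whose operator appears in the LLM's suggestion, so
    the enumerator explores syntactically similar candidates first."""
    out = dict(weights)
    for p in weights:
        if p.split()[0].strip("(") in llm_suggestion:
            out[p] *= boost
    return out

print(enumerate_exprs(GRAMMAR, 4))
```

The loop alternates the two functions: enumerate a batch, report failures to the LLM, and fold its suggestion back in via `reweight`, so the LLM steers the search without replacing the enumerator's completeness.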

Language: English

Citations

2

Towards Combining the Cognitive Abilities of Large Language Models with the Rigor of Deductive Program Verification DOI
Bernhard Beckert, Jonas Klamroth, Wolfram Pfeifer

et al.

Lecture notes in computer science, Journal Year: 2024, Volume and Issue: unknown, P. 242 - 257

Published: Oct. 25, 2024

Language: English

Citations

1
