Benchmarking LLMs on Real-World Software Verification in Lean 4
500 theorem-proving tasks from 23 real-world projects, spanning compilers, type systems, smart contracts, separation logic, and program semantics.
¹The University of Texas at Austin · ²New York University · *Equal contribution
Software verification proofs are uniquely challenging: they require reasoning over deep chains of project-specific definitions, large library APIs, and the domain semantics of the code being verified.
VeriSoftBench provides 500 theorem-proving tasks from 23 real-world Lean 4 repositories, each with its full dependency context organized into four layers (library, project, local, theorem). The best model solves 41% of tasks with curated context—and only 35% given the raw repository, showing that real-world verification remains a major open challenge.
Model performance measured by Pass@8 (8 proof attempts per task). Curated context: dependency-traced definitions and lemmas. Full repo context: entire source file. r = repair attempts with Lean error feedback (0 = none).
| Model | Curated Context | Full Repo Context |
|---|---|---|
A subset of 100 tasks is compatible with Aristotle, enabling direct comparison between LLM generation and search-based proving.
| Model | Curated Context | Full Repo Context |
|---|---|---|
Gödel-Prover-v2 was evaluated on 496 tasks (curated) and 44 tasks (full repo) due to context window limitations. For VeriSoftBench-Aristotle, both models are evaluated on a variation of the full repo context configuration.
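The Pass@8 numbers above are presumably computed with the standard unbiased pass@k estimator (Chen et al., 2021); a sketch of that formula, assuming $n$ sampled attempts per task of which $c$ are verified correct:

```latex
\widehat{\text{pass@}k} \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```

When exactly $n = k = 8$ attempts are drawn, this reduces to the fraction of tasks for which at least one attempt is accepted by the Lean checker.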
Existing LLM proof benchmarks focus on competition mathematics or textbook exercises. Real software verification is structurally different: proofs depend on deep chains of project-specific definitions, interact with large library APIs (Mathlib, Std4), and require understanding the domain semantics of the code being verified—whether that’s a compiler IR, a type system, or a smart contract invariant.
VeriSoftBench captures this structure. Each task comes with its full dependency context, organized into layers that reflect how real projects are built.
| Proof domain | Benchmark | Language | Project defs | Multi-module ctx | Repo-level retrieval | Ground-truth proofs |
|---|---|---|---|---|---|---|
| Mathematics | MiniF2F | Lean | ✗ | ✗ | ✗ | ✗ |
| | ProofNet | Lean | ✗ | ✗ | ✗ | ✗ |
| | LeanDojo | Lean | ✓ | ✓ | ✓ | ✓ |
| | PutnamBench | Multiple | ✗ | ✗ | ✗ | ✗ |
| Formally Verified Code Generation | FVAPPS | Lean | ✗ | ✗ | ✗ | ✗ |
| | CLEVER | Lean | ✗ | ✗ | ✗ | ✗ |
| | Verina | Lean | ✓ | ✗ | ✗ | ◑ |
| Verification | MiniCodeProps | Lean | ✓ | ✗ | ✗ | ✗ |
| PL Theory & Verification | VeriSoftBench (Ours) | Lean | ✓ | ✓ | ✓ | ✓ |
While LeanDojo satisfies the listed criteria, it is limited to a single environment, Mathlib; VeriSoftBench spans 23 diverse repositories, each with its own bespoke context.
Every task provides the theorem’s dependencies at four levels of scope. This is what distinguishes real verification proofs from isolated math problems: a single theorem may draw on library types, project-wide invariants, and locally-scoped helpers, all of which a model must reason over.
- 15.8 definitions + 11.8 lemmas per task
- 33.7 definitions + 7.8 lemmas per task
- 5.6 definitions + 5.4 lemmas per task; 1,673 characters of context
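To make the layering concrete, here is a minimal hypothetical sketch (invented for this page, not a benchmark task) of how a single theorem can sit atop a project-level definition, a local helper, and the library layer:

```lean
-- Hypothetical illustration of layered context (not a benchmark task).

-- Project layer: a toy stack-machine definition shared across the project.
inductive Instr
  | push (n : Nat)
  | add

def run : List Instr → List Nat → List Nat
  | [],                      s => s
  | .push n :: is,           s => run is (n :: s)
  | .add :: is,    a :: b :: s => run is (a + b :: s)
  | .add :: is,              s => run is s  -- stack underflow: skip

-- Local layer: a helper definition scoped to this file.
def compile (n : Nat) : List Instr := [.push n]

-- Theorem layer: the task is to replace `sorry` with a proof, drawing on
-- the library layer (e.g. `List` lemmas from Std/Mathlib) as needed.
theorem run_append (p q : List Instr) (s : List Nat) :
    run (p ++ q) s = run q (run p s) := by
  sorry
```

All names here (`Instr`, `run`, `compile`, `run_append`) are invented for illustration; real tasks use the definitions of their source repositories.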
Tasks are drawn from five verification domains, covering the breadth of how Lean 4 is used in practice for software correctness.
| Domain | Tasks | Examples |
|---|---|---|
| Applied Verification | 195 | Smart contracts, crypto protocols, numerical libraries |
| Compiler | 84 | IR transformations, optimization passes, correctness |
| Type Systems | 82 | Type soundness, capability systems, metatheory |
| Frameworks | 79 | Separation logic (Iris), verification libraries |
| Semantics | 60 | Operational semantics, program logics, Hoare logic |
Each task includes the theorem, its structured dependencies, and a ground-truth proof from the original repo.
Example task: the theorem `cexec_to_reds`, which relates two definitions of program execution. Color annotations trace dependencies from the theorem to its relevant context across the local file and repository-level dependencies.
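The actual statement of `cexec_to_reds` lives in its source repository; purely to illustrate the shape such a theorem takes, a hypothetical sketch relating a big-step execution judgment to a many-step reduction relation:

```lean
-- Hypothetical shape only: all names and signatures below are invented;
-- the real `cexec_to_reds` is defined in its source repository.
axiom Code  : Type
axiom State : Type
axiom CExec : Code → State → State → Prop  -- big-step execution
axiom Reds  : Code → State → State → Prop  -- many-step reduction

-- The task: prove the big-step judgment implies the small-step closure.
theorem cexec_to_reds_sketch (c : Code) (s s' : State) :
    CExec c s s' → Reds c s s' := by
  sorry
```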
@misc{xin2026verisoftbenchrepositoryscaleformalverification,
  title={VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean},
  author={Yutong Xin and Qiaochu Chen and Greg Durrett and Işil Dillig},
  year={2026},
  eprint={2602.18307},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.18307},
}