Software verification proofs are uniquely challenging: they require reasoning over deep chains of project-specific definitions, large library APIs, and the domain semantics of the code being verified.

VeriSoftBench provides 500 theorem-proving tasks drawn from 23 real-world Lean 4 repositories, each with its full dependency context organized into four layers (library, project, local, theorem). The best model solves 41% of tasks with curated context, and only 35% when given the raw repository, showing that real-world verification remains a major open challenge.

Model Comparison

Model performance measured by Pass@8 (8 proof attempts per task). Curated context: dependency-traced definitions and lemmas. Full repo context: entire source file. r = repair attempts with Lean error feedback (0 = none).
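As a point of reference, the Codex-style unbiased estimator is one common way to compute this metric (an assumption here; the benchmark may report raw Pass@8 directly). With n sampled attempts per task, of which c verify:

```latex
% Unbiased pass@k estimator (assumption: Codex-style definition;
% n = attempts sampled, c = attempts that verify)
\widehat{\mathrm{pass@}k} \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
% With n = k = 8 this reduces to 1 iff any of the 8 attempts
% verifies, matching "8 proof attempts per task" above.
```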

VeriSoftBench-Full (500 tasks)

(Interactive results table: Pass@8 per model under curated vs. full repo context.)

VeriSoftBench-Aristotle (100 tasks)

100 tasks compatible with Aristotle, enabling direct comparison between LLM generation and search-based proving.

(Interactive results table: Pass@8 per model under curated vs. full repo context.)

Gödel-Prover-v2 was evaluated on 496 tasks (curated) and 44 tasks (full repo) due to context window limitations. For VeriSoftBench-Aristotle, both models are evaluated on a variation of the full repo context configuration.

Why software verification proofs?

Existing LLM proof benchmarks focus on competition mathematics or textbook exercises. Real software verification is structurally different: proofs depend on deep chains of project-specific definitions, interact with large library APIs (Mathlib, Std4), and require understanding the domain semantics of the code being verified—whether that’s a compiler IR, a type system, or a smart contract invariant.

VeriSoftBench captures this structure. Each task comes with its full dependency context, organized into layers that reflect how real projects are built.

Contextual dependencies comparison between (a) mathematical benchmark proofs, (b) lightweight verification tasks, and (c) repository-scale verification. Purple denotes library dependencies; yellow denotes local/repository dependencies.
Benchmarks are compared on four criteria: project-specific definitions, multi-module context, repository-level retrieval, and ground-truth proofs.

| Proof domain | Benchmark | Language |
|---|---|---|
| Mathematics | MiniF2F | Lean |
| Mathematics | ProofNet | Lean |
| Mathematics | LeanDojo | Lean |
| Mathematics | PutnamBench | Multiple |
| Formally Verified Code Generation | FVAPPS | Lean |
| Formally Verified Code Generation | CLEVER | Lean |
| Formally Verified Code Generation | Verina | Lean |
| Verification | MiniCodeProps | Lean |
| PL Theory & Verification | VeriSoftBench (Ours) | Lean |
While LeanDojo satisfies the listed criteria, its scope is confined to the single environment of Mathlib; VeriSoftBench spans 23 diverse repositories, each with its own bespoke context.

Structured dependency context

Every task provides the theorem’s dependencies at four levels of scope. This is what distinguishes real verification proofs from isolated math problems: a single theorem may draw on library types, project-wide invariants, and locally-scoped helpers, all of which a model must reason over.

- Library: definitions and lemmas from Lean’s standard library and Mathlib, referenced by name and source module. Avg. 15.8 defs + 11.8 lemmas per task.
- Project: definitions and lemmas from elsewhere in the same repository, provided with full source; all 500 tasks have project-level dependencies. Avg. 33.7 defs + 7.8 lemmas per task.
- Local: definitions and lemmas from the same file, plus the surrounding file context (namespace, variables, opens). Avg. 5.6 defs + 5.4 lemmas; 1,673 chars of context.
- Theorem: the target theorem statement to prove, with its full signature and type.

Overall: 68 avg. transitive deps · 480 max transitive deps · 5.4 avg. imports per task.
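To make the layering concrete, here is a minimal, hypothetical sketch of how the four layers might assemble into a single Lean 4 file. None of these names (`Prog`, `Demo.cost`, `cost_append`) come from a VeriSoftBench repository; only `List.length` and `List.length_append` are real library names.

```lean
-- Library layer: names resolved from Lean core / Std / Mathlib,
-- e.g. `List.length` and `List.length_append` used below.

-- Project layer (hypothetical): a definition that would live
-- elsewhere in the same repository.
abbrev Prog := List Nat

-- Local layer (hypothetical): a helper defined in the same file,
-- inside the file's namespace.
namespace Demo

def cost (p : Prog) : Nat := p.length

-- Theorem layer: the target statement a model must prove; the
-- ground-truth proof from the repository serves as the reference.
theorem cost_append (p q : Prog) : cost (p ++ q) = cost p + cost q := by
  simp [cost, List.length_append]

end Demo
```

A model sees the theorem statement plus the surrounding layers; the deeper the project and local layers, the further the task is from an isolated math problem.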

Domains

Tasks are drawn from five verification domains, covering the breadth of how Lean 4 is used in practice for software correctness.

| Domain | Tasks | Examples |
|---|---|---|
| Applied Verification | 195 | Smart contracts, crypto protocols, numerical libraries |
| Compiler | 84 | IR transformations, optimization passes, correctness |
| Type Systems | 82 | Type soundness, capability systems, metatheory |
| Frameworks | 79 | Separation logic (Iris), verification libraries |
| Semantics | 60 | Operational semantics, program logics, Hoare logic |

Sample tasks

Each task includes the theorem, its structured dependencies, and a ground-truth proof from the original repo.

An example task from VeriSoftBench. The goal is to prove cexec_to_reds, which relates two definitions of program execution. Color annotations trace dependencies from the theorem to its relevant context across local file context and repository dependencies.
Vec.le_div_iff · Applied Verification · CvxLean
Statement: lemma le_div_iff (hc : StrongLT 0 c) : a ≤ b / c ↔ a * c ≤ b
Imports: CvxLean.Lib.Math.Data.Real, Mathlib.Analysis.NormedSpace.PiLp, +2 more
Lib defs: StrongLT (Mathlib.Order.Basic)
Repo defs: 1 definition from project
Depth / deps: nesting 1 · 1 transitive dep
Proof:
  by constructor
  · intro h i; have hi := h i; simp at hi; rw [_root_.le_div_iff (hc i)] at hi; exact hi
  · intro h i; have hi := h i; simp at hi; dsimp; rw [_root_.le_div_iff (hc i)]; exact hi
Iris.BI.plainly_sep_dup · Frameworks · iris-lean
Statement: theorem plainly_sep_dup : ■ P ⊣⊢ ■ P ∗ ■ P
Imports: Iris.BI.Interface, Iris.BI.DerivedLaws, +1 more
Lib defs: references to Std4 and Lean core
Repo defs: BIBase, plainly, sep, wand, +8 more
Depth / deps: nesting 3 · 14 transitive deps
Proof:
  apply anti_symm <;> [plainly_absorb, and_self >>> plainly_and_sep_assoc]
SType.trename_trename · Type Systems · capless-lean
Statement: theorem trename_trename (S : SType n m k) : S.trename f |>.trename g = S.trename (g ∘ f)
Imports: Capless.Type, Capless.Renaming, +1 more
Repo defs: SType, CType, EType, CaptureSet, trename, +11 more (16 total)
Local lemmas: 2 lemmas from the same file
Depth / deps: nesting 4 · 15 transitive deps
Proof: 726 chars (mutual induction on the type structure)

Citation

@misc{xin2026verisoftbenchrepositoryscaleformalverification,
  title={VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean}, 
  author={Yutong Xin and Qiaochu Chen and Greg Durrett and Işil Dillig},
  year={2026},
  eprint={2602.18307},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.18307}, 
}