Benchmarking LLMs on Real-World Software Verification in Lean 4
500 theorem-proving tasks from 23 real-world projects, spanning compilers, type systems, smart contracts, separation logic, and program semantics.
¹The University of Texas at Austin · ²New York University · *Equal contribution
Software verification proofs are uniquely challenging: they require reasoning over deep chains of project-specific definitions, large library APIs, and the domain semantics of the code being verified.
VeriSoftBench provides 500 theorem-proving tasks from 23 real-world Lean 4 repositories, each with its full dependency context organized into four layers (library, project, local, theorem). The best model solves 41% of tasks with curated context—and only 35% given the raw repository, showing that real-world verification remains a major open challenge.
Model performance measured by Pass@8 (8 proof attempts per task). Curated context: dependency-traced definitions and lemmas. Full repo context: entire source file. r = repair attempts with Lean error feedback (0 = none).
| Model | Curated Context | Full Repo Context |
|---|---|---|
A subset of 100 tasks is compatible with Aristotle, enabling direct comparison between LLM generation and search-based proving.
| Model | Curated Context | Full Repo Context |
|---|---|---|
Gödel-Prover-v2 was evaluated on 496 tasks (curated) and 44 tasks (full repo) due to context window limitations. For VeriSoftBench-Aristotle, both models are evaluated on a variation of the full repo context configuration.
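The Pass@8 numbers above are presumably computed with the standard unbiased pass@k estimator (Chen et al., 2021); a sketch of that formula, assuming $n$ sampled attempts per task of which $c$ are verified correct:

```latex
\widehat{\text{pass@}k} \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```

When exactly $n = k = 8$ attempts are drawn, this reduces to the fraction of tasks for which at least one attempt is accepted by the Lean checker.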
Existing LLM proof benchmarks focus on competition mathematics or textbook exercises. Real software verification is structurally different: proofs depend on deep chains of project-specific definitions, interact with large library APIs (Mathlib, Std4), and require understanding the domain semantics of the code being verified—whether that’s a compiler IR, a type system, or a smart contract invariant.
VeriSoftBench captures this structure. Each task comes with its full dependency context, organized into layers that reflect how real projects are built.
| Proof domain | Benchmark | Language | Project defs | Multi-module ctx | Repo-level retrieval | Ground-truth proofs |
|---|---|---|---|---|---|---|
| Mathematics | MiniF2F | Lean | ✗ | ✗ | ✗ | ✗ |
| | ProofNet | Lean | ✗ | ✗ | ✗ | ✗ |
| | LeanDojo | Lean | ✓ | ✓ | ✓ | ✓ |
| | PutnamBench | Multiple | ✗ | ✗ | ✗ | ✗ |
| Formally Verified Code Generation | FVAPPS | Lean | ✗ | ✗ | ✗ | ✗ |
| | CLEVER | Lean | ✗ | ✗ | ✗ | ✗ |
| | Verina | Lean | ✓ | ✗ | ✗ | ◑ |
| Verification | MiniCodeProps | Lean | ✓ | ✗ | ✗ | ✗ |
| PL Theory & Verification | VeriSoftBench (Ours) | Lean | ✓ | ✓ | ✓ | ✓ |
While LeanDojo satisfies the listed criteria, it is limited to a single environment, Mathlib; VeriSoftBench spans 23 diverse repositories, each with its own bespoke context.
Every task provides the theorem’s dependencies at four levels of scope. This is what distinguishes real verification proofs from isolated math problems: a single theorem may draw on library types, project-wide invariants, and locally-scoped helpers, all of which a model must reason over.
- 15.8 definitions + 11.8 lemmas per task
- 33.7 definitions + 7.8 lemmas per task
- 5.6 definitions + 5.4 lemmas per task; 1,673 characters of context
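To make the layering concrete, here is a minimal hypothetical sketch (invented for this page, not a benchmark task) of how a single theorem can sit atop a project-level definition, a local helper, and the library layer:

```lean
-- Hypothetical illustration of layered context (not a benchmark task).

-- Project layer: a toy stack-machine definition shared across the project.
inductive Instr
  | push (n : Nat)
  | add

def run : List Instr → List Nat → List Nat
  | [],                      s => s
  | .push n :: is,           s => run is (n :: s)
  | .add :: is,    a :: b :: s => run is (a + b :: s)
  | .add :: is,              s => run is s  -- stack underflow: skip

-- Local layer: a helper definition scoped to this file.
def compile (n : Nat) : List Instr := [.push n]

-- Theorem layer: the task is to replace `sorry` with a proof, drawing on
-- the library layer (e.g. `List` lemmas from Std/Mathlib) as needed.
theorem run_append (p q : List Instr) (s : List Nat) :
    run (p ++ q) s = run q (run p s) := by
  sorry
```

All names here (`Instr`, `run`, `compile`, `run_append`) are invented for illustration; real tasks use the definitions of their source repositories.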
Tasks are drawn from five verification domains, covering the breadth of how Lean 4 is used in practice for software correctness.
| Domain | Tasks | Examples |
|---|---|---|
| Applied Verification | 195 | Smart contracts, crypto protocols, numerical libraries |
| Compiler | 84 | IR transformations, optimization passes, correctness |
| Type Systems | 82 | Type soundness, capability systems, metatheory |
| Frameworks | 79 | Separation logic (Iris), verification libraries |
| Semantics | 60 | Operational semantics, program logics, Hoare logic |
Each task includes the theorem, its structured dependencies, and a ground-truth proof from the original repo.
Example task: the theorem `cexec_to_reds`, which relates two definitions of program execution. Color annotations trace dependencies from the theorem to its relevant context across the local file and repository-level dependencies.
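The actual statement of `cexec_to_reds` lives in its source repository; purely to illustrate the shape such a theorem takes, a hypothetical sketch relating a big-step execution judgment to a many-step reduction relation:

```lean
-- Hypothetical shape only: all names and signatures below are invented;
-- the real `cexec_to_reds` is defined in its source repository.
axiom Code  : Type
axiom State : Type
axiom CExec : Code → State → State → Prop  -- big-step execution
axiom Reds  : Code → State → State → Prop  -- many-step reduction

-- The task: prove the big-step judgment implies the small-step closure.
theorem cexec_to_reds_sketch (c : Code) (s s' : State) :
    CExec c s s' → Reds c s s' := by
  sorry
```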
@misc{xin2026verisoftbenchrepositoryscaleformalverification,
  title={VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean},
  author={Yutong Xin and Qiaochu Chen and Greg Durrett and Işil Dillig},
  year={2026},
  eprint={2602.18307},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.18307},
}