InternLM-Step-Prover

July 25, 2024 · View on GitHub

InternLM-Math HOT

A state-of-the-art LEAN4 step prover.

💻 Github 📊Dataset

InternLM-Step-Prover is a 7B language model trained on Lean-Github and multiple sythesis datasets. InternLM-Step-Prover achieves state-of-the-art performances on MiniF2F, ProofNet, and Putnam math benchmarks, showing its formal math proving ability in multiple domains.

Dialogue Example

### Input Example
DECL MyNat.mul_pow
GOAL a b n : N
⊢ (a * b) ^ n = a ^ n * b ^ n
### Output Example
PROOFSTEP induction n with t Ht

Performance

MiniF2F

MethodModel sizePassminiF2F-validminiF2F-test
Whole-Proof Generation Methods
GPT-4-turbo 0409-6425.4%23.0%
DeepSeekMath-Base7B12825.4%27.5%
DeepSeek-Prover7B1-30.0%
64-46.3%
128-46.3%
8192-48.8%
65536-50.0%
cumulative60.2%52.0%
TheoremLlama-cumulative36.5%33.6%
Tree Search Methods
COPRA (GPT-3.5)-1-9.0%
COPRA (GPT-4)-1-26.6%
DSP(Isabelle)540B10042.6%38.9%
Proof Artifact Co-Training837M123.9%24.6%
829.3%29.2%
ReProver229M1-25.0%
Llemma7B126.2%26.2%
Llemma34B127.9%25.8%
Curriculum Learning837M133.6%29.6%
841.2%34.5%
6447.3%36.6%
Hypertree Proof Search600Mcumulative58.6%-
64-41.0%
Lean-STaR7B64-46.3%
InternLM2-Math7B129.9%30.3%
InternLM2-Math-Plus7B1-43.4%
InternLM2-Step-Prover7B159.8%48.8%
InternLM2-Step-Prover7B6463.9%54.5%

Proofnet & Putnam

MethodModel sizePassresult
ProofNet benchmark
ReProver229M113.8%
InternLM2-Step-Prover7B118.1%
Putnam benchmark
GPT-4-101/640
COPRA (GPT-4)-101/640
DSP(Isabelle)540B104/640
ReProver229M10/640
InternLM2-Step-Prover7B15/640

Citation and Tech Report

@misc{wu2024leangithubcompilinggithublean,
      title={LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover}, 
      author={Zijian Wu and Jiayu Wang and Dahua Lin and Kai Chen},
      year={2024},
      eprint={2407.17227},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.17227}, 
}