ProgressTrainer: Fine-Tuning a Step Prediction Model

ProgressTrainer turns any HuggingFace sequence-classification encoder into a regression model that estimates steps_remaining — the number of tactic steps still needed to close a proof. During training the model receives a structured input that concatenates the current proof goal, any accumulated proof prefix, and a candidate tactic; it is supervised on the integer steps_remaining label from the dataset. At inference time the resulting checkpoint can be queried by setting the LEANPROGRESS_MODEL environment variable, enabling the prover to use step-count predictions as a heuristic for proof-search ordering.

Class

lean_dojo_v2.trainer.progress_trainer.ProgressTrainer

Constructor Parameters

data_path

str

required

Path to a JSONL file where each line is a JSON object with at least "goal", "tactic", and "steps_remaining" fields. Lines missing any of these keys are silently skipped. An optional "prefix" field (accumulated proof context) is also supported.

model_name

str

default:"bert-base-uncased"

HuggingFace encoder model identifier. The model is loaded as a single-label regression head (num_labels=1, problem_type="regression") via AutoModelForSequenceClassification.

output_dir

str

default:"outputs"

Directory where the final model checkpoint and tokenizer are saved. Created automatically if it does not exist.

max_length

int

default:"512"

Maximum tokenised sequence length. Inputs longer than this are truncated.

batch_size

int

default:"8"

Per-device batch size used for both training and evaluation.

epochs

float

default:"3.0"

Total training epochs. Fractional values (e.g. 0.5) are accepted by the underlying transformers.Trainer.

learning_rate

float

default:"1e-5"

Peak learning rate for the AdamW optimiser.

eval_ratio

float

default:"0.2"

Fraction of the dataset held out for evaluation (passed to Dataset.train_test_split).

seed

int

default:"42"

Random seed for dataset splitting and the transformers.Trainer.

`train` Method

ProgressTrainer.train() -> None

Loads the JSONL file at data_path through ProgressDataset.
Splits into train / eval with Dataset.train_test_split(test_size=eval_ratio).

Tokenises both splits by building an input of the form:

Goal:
<goal>

Prefix:
<prefix>

Candidate tactic:
<tactic>

Instantiates transformers.Trainer with eval_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, and metric_for_best_model="mse".
Runs training and reports MSE and MAE at the end of each epoch.
Saves the best checkpoint and tokenizer to output_dir and prints the resolved path alongside a reminder to set LEANPROGRESS_MODEL.

Dataset Format

Each line of the JSONL file must be a JSON object. Required keys:

Key	Type	Description
`goal`	`str`	The current Lean 4 proof state / goal string.
`tactic`	`str`	The candidate tactic being evaluated.
`steps_remaining`	`int`	Ground-truth remaining step count after applying the tactic.
`prefix` (optional)	`str`	Accumulated proof context preceding the current goal.

Example line:

{
  "goal": "⊢ ∀ n : ℕ, 0 + n = n",
  "prefix": "intro n",
  "tactic": "simp",
  "steps_remaining": 1
}

Generating a Sample Dataset

The lean_dojo_v2.lean_progress.create_sample_dataset module ships a synthetic dataset generator for quick experimentation:

python -m lean_dojo_v2.lean_progress.create_sample_dataset \
    --output raid/data/sample_leanprogress_dataset.jsonl

Full Example

This mirrors examples/lean_progress.py exactly:

from pathlib import Path
from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

trainer = ProgressTrainer(
    model_name="bert-base-uncased",
    data_path="raid/data/sample_leanprogress_dataset.jsonl",
    output_dir="outputs-progress",
)

trainer.train()

After training, set the environment variable to activate step-count predictions in the prover:

export LEANPROGRESS_MODEL=outputs-progress

The synthetic sample dataset is useful for verifying that the training pipeline runs end-to-end, but it does not reflect the distribution of real Lean 4 proof states. For production-quality step predictions, generate training data from real proof traces with DynamicDatabase.export_merged_data() and adapt the loader in ProgressDataset to consume the exported JSON format. Real traces provide far more diverse goal states and accurate step counts derived from actual theorem proofs.

Advanced: Custom Hyperparameters

from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

trainer = ProgressTrainer(
    model_name="bert-base-uncased",
    data_path="raid/data/sample_leanprogress_dataset.jsonl",
    output_dir="outputs-progress",
    max_length=512,
    batch_size=16,
    epochs=5.0,
    learning_rate=2e-5,
    eval_ratio=0.1,
    seed=0,
)
trainer.train()

ProgressTrainer uses AutoModelForSequenceClassification with num_labels=1 and problem_type="regression", so the model predicts a single scalar. The loss is mean-squared error (MSE), not cross-entropy. Post-training, load the checkpoint with AutoModelForSequenceClassification.from_pretrained(output_dir) for standalone inference.

Regression models are sensitive to the scale of the target values. If steps_remaining varies widely in your dataset (e.g., 1–500), consider normalising the targets before training or using a log transform (log1p(steps_remaining)) to prevent the loss from being dominated by long-proof outliers.

Agents

Trainers

Provers

Data & Database

ProgressTrainer: Fine-Tuning a Step Prediction Model

Class

Constructor Parameters

`train` Method

Dataset Format

Generating a Sample Dataset

Full Example

Advanced: Custom Hyperparameters

Build docs developers (and LLMs) love

Agents

Trainers

Provers

Data & Database

Documentation Index

​Class

​Constructor Parameters

​train Method

​Dataset Format

​Generating a Sample Dataset

​Full Example

​Advanced: Custom Hyperparameters

Build docs developers (and LLMs) love

Class

Constructor Parameters

`train` Method

Dataset Format

Generating a Sample Dataset

Full Example

Advanced: Custom Hyperparameters