Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt

Use this file to discover all available pages before exploring further.

ProgressTrainer turns any HuggingFace sequence-classification encoder into a regression model that estimates steps_remaining — the number of tactic steps still needed to close a proof. During training the model receives a structured input that concatenates the current proof goal, any accumulated proof prefix, and a candidate tactic; it is supervised on the integer steps_remaining label from the dataset. At inference time the resulting checkpoint can be queried by setting the LEANPROGRESS_MODEL environment variable, enabling the prover to use step-count predictions as a heuristic for proof-search ordering.

Class

lean_dojo_v2.trainer.progress_trainer.ProgressTrainer

Constructor Parameters

data_path
str
required
Path to a JSONL file where each line is a JSON object with at least "goal", "tactic", and "steps_remaining" fields. Lines missing any of these keys are silently skipped. An optional "prefix" field (accumulated proof context) is also supported.
model_name
str
default:"bert-base-uncased"
HuggingFace encoder model identifier. The model is loaded as a single-label regression head (num_labels=1, problem_type="regression") via AutoModelForSequenceClassification.
output_dir
str
default:"outputs"
Directory where the final model checkpoint and tokenizer are saved. Created automatically if it does not exist.
max_length
int
default:"512"
Maximum tokenised sequence length. Inputs longer than this are truncated.
batch_size
int
default:"8"
Per-device batch size used for both training and evaluation.
epochs
float
default:"3.0"
Total training epochs. Fractional values (e.g. 0.5) are accepted by the underlying transformers.Trainer.
learning_rate
float
default:"1e-5"
Peak learning rate for the AdamW optimiser.
eval_ratio
float
default:"0.2"
Fraction of the dataset held out for evaluation (passed to Dataset.train_test_split).
seed
int
default:"42"
Random seed for dataset splitting and the transformers.Trainer.

train Method

ProgressTrainer.train() -> None
  1. Loads the JSONL file at data_path through ProgressDataset.
  2. Splits into train / eval with Dataset.train_test_split(test_size=eval_ratio).
  3. Tokenises both splits by building an input of the form:
    Goal:
    <goal>
    
    Prefix:
    <prefix>
    
    Candidate tactic:
    <tactic>
    
  4. Instantiates transformers.Trainer with eval_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, and metric_for_best_model="mse".
  5. Runs training and reports MSE and MAE at the end of each epoch.
  6. Saves the best checkpoint and tokenizer to output_dir and prints the resolved path alongside a reminder to set LEANPROGRESS_MODEL.

Dataset Format

Each line of the JSONL file must be a JSON object. Required keys:
KeyTypeDescription
goalstrThe current Lean 4 proof state / goal string.
tacticstrThe candidate tactic being evaluated.
steps_remainingintGround-truth remaining step count after applying the tactic.
prefix (optional)strAccumulated proof context preceding the current goal.
Example line:
{
  "goal": "⊢ ∀ n : ℕ, 0 + n = n",
  "prefix": "intro n",
  "tactic": "simp",
  "steps_remaining": 1
}

Generating a Sample Dataset

The lean_dojo_v2.lean_progress.create_sample_dataset module ships a synthetic dataset generator for quick experimentation:
python -m lean_dojo_v2.lean_progress.create_sample_dataset \
    --output raid/data/sample_leanprogress_dataset.jsonl

Full Example

This mirrors examples/lean_progress.py exactly:
from pathlib import Path
from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

trainer = ProgressTrainer(
    model_name="bert-base-uncased",
    data_path="raid/data/sample_leanprogress_dataset.jsonl",
    output_dir="outputs-progress",
)

trainer.train()
After training, set the environment variable to activate step-count predictions in the prover:
export LEANPROGRESS_MODEL=outputs-progress
The synthetic sample dataset is useful for verifying that the training pipeline runs end-to-end, but it does not reflect the distribution of real Lean 4 proof states. For production-quality step predictions, generate training data from real proof traces with DynamicDatabase.export_merged_data() and adapt the loader in ProgressDataset to consume the exported JSON format. Real traces provide far more diverse goal states and accurate step counts derived from actual theorem proofs.

Advanced: Custom Hyperparameters

from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

trainer = ProgressTrainer(
    model_name="bert-base-uncased",
    data_path="raid/data/sample_leanprogress_dataset.jsonl",
    output_dir="outputs-progress",
    max_length=512,
    batch_size=16,
    epochs=5.0,
    learning_rate=2e-5,
    eval_ratio=0.1,
    seed=0,
)
trainer.train()
ProgressTrainer uses AutoModelForSequenceClassification with num_labels=1 and problem_type="regression", so the model predicts a single scalar. The loss is mean-squared error (MSE), not cross-entropy. Post-training, load the checkpoint with AutoModelForSequenceClassification.from_pretrained(output_dir) for standalone inference.
Regression models are sensitive to the scale of the target values. If steps_remaining varies widely in your dataset (e.g., 1–500), consider normalising the targets before training or using a log transform (log1p(steps_remaining)) to prevent the loss from being dominated by long-proof outliers.

Build docs developers (and LLMs) love