Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lean-dojo/LeanDojo-v2/llms.txt
Use this file to discover all available pages before exploring further.
ProgressTrainer turns any HuggingFace sequence-classification encoder into a regression model that estimates steps_remaining — the number of tactic steps still needed to close a proof. During training the model receives a structured input that concatenates the current proof goal, any accumulated proof prefix, and a candidate tactic; it is supervised on the integer steps_remaining label from the dataset. At inference time the resulting checkpoint can be queried by setting the LEANPROGRESS_MODEL environment variable, enabling the prover to use step-count predictions as a heuristic for proof-search ordering.
Class
Constructor Parameters
Path to a JSONL file where each line is a JSON object with at least
"goal",
"tactic", and "steps_remaining" fields. Lines missing any of these keys
are silently skipped. An optional "prefix" field (accumulated proof context)
is also supported.HuggingFace encoder model identifier. The model is loaded as a single-label
regression head (
num_labels=1, problem_type="regression") via
AutoModelForSequenceClassification.Directory where the final model checkpoint and tokenizer are saved. Created
automatically if it does not exist.
Maximum tokenised sequence length. Inputs longer than this are truncated.
Per-device batch size used for both training and evaluation.
Total training epochs. Fractional values (e.g.
0.5) are accepted by the
underlying transformers.Trainer.Peak learning rate for the AdamW optimiser.
Fraction of the dataset held out for evaluation (passed to
Dataset.train_test_split).Random seed for dataset splitting and the
transformers.Trainer.train Method
-
Loads the JSONL file at
data_paththroughProgressDataset. -
Splits into train / eval with
Dataset.train_test_split(test_size=eval_ratio). -
Tokenises both splits by building an input of the form:
-
Instantiates
transformers.Trainerwitheval_strategy="epoch",save_strategy="epoch",load_best_model_at_end=True, andmetric_for_best_model="mse". - Runs training and reports MSE and MAE at the end of each epoch.
-
Saves the best checkpoint and tokenizer to
output_dirand prints the resolved path alongside a reminder to setLEANPROGRESS_MODEL.
Dataset Format
Each line of the JSONL file must be a JSON object. Required keys:| Key | Type | Description |
|---|---|---|
goal | str | The current Lean 4 proof state / goal string. |
tactic | str | The candidate tactic being evaluated. |
steps_remaining | int | Ground-truth remaining step count after applying the tactic. |
prefix (optional) | str | Accumulated proof context preceding the current goal. |
Generating a Sample Dataset
Thelean_dojo_v2.lean_progress.create_sample_dataset module ships a synthetic
dataset generator for quick experimentation:
Full Example
This mirrorsexamples/lean_progress.py exactly:
Advanced: Custom Hyperparameters
ProgressTrainer uses AutoModelForSequenceClassification with
num_labels=1 and problem_type="regression", so the model predicts a
single scalar. The loss is mean-squared error (MSE), not cross-entropy.
Post-training, load the checkpoint with
AutoModelForSequenceClassification.from_pretrained(output_dir) for
standalone inference.