The training configuration classes control all aspects of model training, including optimization, hardware setup, checkpointing, and experiment tracking.

TrainingConfig

The TrainingConfig class defines parameters for training Vision-Language-Action models.

Output configuration

output_dir
str
default:"./outputs"
Directory where model checkpoints, logs, and outputs are saved.
experiment_name
str | None
default:"None"
Optional name for the experiment. Used for organizing outputs and tracking.

Basic training parameters

max_steps
int
default:"30000"
Total number of training steps to run. This overrides num_epochs.
global_batch_size
int
default:"1024"
Total effective batch size across all GPUs and accumulation steps.
batch_size
int | None
default:"None"
Per-device batch size. If None, calculated from global_batch_size.
gradient_accumulation_steps
int
default:"1"
Number of micro-batches (forward/backward passes) to accumulate before performing an optimizer update step.
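
The three batch-size fields are tied together: by the usual convention, the global batch size equals the per-device batch size times the number of GPUs times the accumulation steps, and batch_size is derived from global_batch_size when left as None. A minimal sketch of that arithmetic, assuming this convention holds for these fields:

# Sketch only: assumes global_batch_size = batch_size * num_gpus * gradient_accumulation_steps.
global_batch_size = 1024
num_gpus = 8
gradient_accumulation_steps = 4

# Per-device batch size when batch_size is left as None.
batch_size = global_batch_size // (num_gpus * gradient_accumulation_steps)
print(batch_size)  # 32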

Optimization parameters

learning_rate
float
default:"1e-4"
Initial learning rate for the optimizer.
lr_scheduler_type
str
default:"cosine"
Learning rate scheduler type (e.g., cosine, linear, constant).
weight_decay
float
default:"1e-5"
Weight decay coefficient for the optimizer (decoupled weight decay in AdamW variants).
warmup_ratio
float
default:"0.05"
Proportion of total training steps used for learning rate warm-up.
warmup_steps
int
default:"0"
Number of warm-up steps. Overrides warmup_ratio if set.
max_grad_norm
float
default:"1.0"
Maximum gradient norm for gradient clipping.
optim
str
default:"adamw_torch_fused"
Optimizer choice. Options include:
  • adamw_torch: Standard AdamW from PyTorch
  • adamw_torch_fused: Fused AdamW (faster)
  • paged_adamw_32bit: Paged AdamW 32-bit (requires bitsandbytes)
  • paged_adamw_8bit: Paged AdamW 8-bit (requires bitsandbytes)
  • adafactor: Adafactor optimizer
start_from_checkpoint
str | None
default:"None"
Path to a checkpoint to resume training from.
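
warmup_steps takes precedence over warmup_ratio, and warmup_ratio is expressed as a fraction of the total training steps. A small sketch of how the effective warm-up length might be resolved, assuming "if set" means a non-zero warmup_steps:

# Sketch only: assumes a non-zero warmup_steps overrides warmup_ratio,
# and that warmup_ratio is applied to max_steps.
max_steps = 30000
warmup_ratio = 0.05
warmup_steps = 0

effective_warmup = warmup_steps if warmup_steps > 0 else int(max_steps * warmup_ratio)
print(effective_warmup)  # 1500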

Mixed precision training

tf32
bool
default:"True"
Enable TF32 mode for NVIDIA Ampere GPUs and later.
fp16
bool
default:"False"
Enable FP16 mixed precision training.
bf16
bool
default:"True"
Enable BF16 mixed precision training.
eval_bf16
bool
default:"True"
Use BF16 for evaluation.
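
BF16 and FP16 are alternative mixed-precision modes, so enable at most one of them; TF32 can stay on alongside BF16 on Ampere-class GPUs. An illustrative configuration using only fields documented on this page (values are examples, not recommendations):

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    tf32=True,        # TF32 matmuls on Ampere and later
    bf16=True,        # BF16 mixed precision for training
    fp16=False,       # keep disabled when bf16 is enabled
    eval_bf16=True,   # also run evaluation in BF16
)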

Logging and checkpointing

logging_steps
int
default:"10"
Frequency (in training steps) at which to log training metrics.
save_steps
int
default:"1000"
Frequency (in training steps) at which to save checkpoints.
save_total_limit
int
default:"5"
Maximum number of checkpoints to keep before older ones are deleted.
save_vl_model
bool
default:"False"
Whether to save the VL model and processor in checkpoint callbacks.

Checkpoint uploading

upload_checkpoints
bool
default:"False"
Enable automatic checkpoint uploading.
upload_every
int
default:"1000"
Upload checkpoints every N steps.
upload_last_n_checkpoints
int
default:"5"
Number of most recent checkpoints to keep uploaded.
max_concurrent_uploads
int
default:"2"
Maximum number of concurrent checkpoint uploads.
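
Saving and uploading are driven by separate step counters: checkpoints are written every save_steps and pruned to save_total_limit, while uploads happen every upload_every steps. An illustrative combination that keeps the two aligned, assuming both counters refer to the same training steps:

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    save_steps=1000,
    save_total_limit=5,
    upload_checkpoints=True,
    upload_every=1000,            # matches save_steps so each saved checkpoint is uploaded
    upload_last_n_checkpoints=5,
    max_concurrent_uploads=2,
)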

Evaluation parameters

eval_strategy
str
default:"no"
Evaluation strategy: no, steps, or epoch.
eval_steps
int
default:"500"
Frequency (in steps) at which to run evaluation.
eval_set_split_ratio
float
default:"0.1"
Ratio of data to use for evaluation split.
eval_batch_size
int
default:"2"
Batch size for evaluation.
save_best_eval_metric_name
str
default:""
Name of the metric to use for saving best checkpoints.
save_best_eval_metric_greater_is_better
bool
default:"True"
Whether higher values of the eval metric are better.
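
To evaluate periodically during training and keep the best checkpoint, switch eval_strategy to steps and name a metric to track. The metric name "eval_loss" below is a placeholder, not necessarily a metric this codebase reports:

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    eval_strategy="steps",
    eval_steps=500,
    eval_set_split_ratio=0.1,
    eval_batch_size=2,
    save_best_eval_metric_name="eval_loss",          # placeholder metric name
    save_best_eval_metric_greater_is_better=False,   # lower loss is better
)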

DeepSpeed configuration

deepspeed_stage
int
default:"2"
ZeRO optimization stage (1, 2, or 3).
gradient_checkpointing
bool
default:"False"
Enable gradient checkpointing to reduce memory usage.
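
Higher ZeRO stages shard more optimizer and model state across GPUs, and gradient checkpointing trades extra compute for lower activation memory. An illustrative memory-saving setup:

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    deepspeed_stage=3,             # ZeRO-3 shards parameters, gradients, and optimizer states
    gradient_checkpointing=True,   # recompute activations during the backward pass
)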

Transformers loading parameters

transformers_trust_remote_code
bool
default:"True"
Trust remote code when loading models from Hugging Face Hub.
transformers_local_files_only
bool
default:"False"
Only use local files (no downloads from Hugging Face Hub).
transformers_cache_dir
str | None
default:"None"
Directory for caching Hugging Face models.
transformers_access_token
str | None
default:"None"
Access token for Hugging Face Hub (for private models).
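
For private Hub models, supply an access token rather than hard-coding it; reading it from an environment variable is one common pattern. The variable name HF_TOKEN and the cache path below are assumptions for illustration:

import os

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    transformers_trust_remote_code=True,
    transformers_local_files_only=False,
    transformers_cache_dir="/data/hf_cache",                 # illustrative cache location
    transformers_access_token=os.environ.get("HF_TOKEN"),    # assumed env var name
)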

DDP configuration

use_ddp
bool
default:"False"
Use DistributedDataParallel instead of DeepSpeed.
ddp_bucket_cap_mb
int
default:"100"
DDP bucket capacity in MB for gradient communication.
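
Setting use_ddp swaps DeepSpeed for plain DistributedDataParallel; ddp_bucket_cap_mb only applies in that mode. Illustrative values:

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    use_ddp=True,            # DistributedDataParallel instead of DeepSpeed
    ddp_bucket_cap_mb=100,   # gradient bucket size for communication
    num_gpus=4,
)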

Hardware configuration

num_gpus
int
default:"1"
Number of GPUs to use for training.
dataloader_num_workers
int
default:"2"
Number of parallel worker processes for data loading.

Data handling

remove_unused_columns
bool
default:"False"
Whether to remove unused columns from the dataset.

Experiment tracking

use_wandb
bool
default:"False"
Enable Weights & Biases (wandb) logging.
wandb_project
str
default:"finetune-gr00t-n1d6"
Wandb project name for tracking experiments.

Performance profiling

enable_profiling
bool
default:"False"
Enable PyTorch profiler for performance analysis.

Fault tolerance

max_retries
int
default:"3"
Maximum number of times a failed training run is automatically retried, for fault tolerance.

Testing

assert_loss_less_than
float | None
default:"None"
For testing: assert that loss is less than this value.

Reinforcement learning

add_rl_callback
bool
default:"False"
Add reinforcement learning callback during training.

Open-loop evaluation

enable_open_loop_eval
bool
default:"False"
Enable open-loop evaluation on saved checkpoints.
open_loop_eval_traj_ids
list[int]
default:"[0]"
List of trajectory IDs to evaluate.
open_loop_eval_steps_per_traj
int
default:"100"
Number of steps to evaluate per trajectory.
open_loop_eval_plot_indices
list[int] | None
default:"None"
List of action indices to plot. If None, plots all indices.
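
Open-loop evaluation runs saved checkpoints over a fixed set of recorded trajectories. An illustrative setup that evaluates two trajectories and plots only the first two action indices (the IDs and indices are examples):

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./outputs",
    enable_open_loop_eval=True,
    open_loop_eval_traj_ids=[0, 1],         # example trajectory IDs
    open_loop_eval_steps_per_traj=100,
    open_loop_eval_plot_indices=[0, 1],     # plot only these action indices; None plots all
)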

FinetuneConfig

The FinetuneConfig class is a simplified configuration specifically designed for single-node fine-tuning. See launch_finetune.py for detailed parameter descriptions.

Key differences from TrainingConfig

  • Focused on single-node training scenarios
  • Includes embodiment-specific parameters
  • Provides granular control over which model components to tune
  • Includes data augmentation parameters
  • Simplified parameter set compared to full TrainingConfig

Usage example

from gr00t.configs.training.training_config import TrainingConfig

config = TrainingConfig(
    output_dir="./my_experiment",
    experiment_name="robot_v1",
    max_steps=50000,
    global_batch_size=512,
    learning_rate=5e-5,
    num_gpus=4,
    use_wandb=True,
    wandb_project="my-gr00t-project"
)
