Overview
The LlaSMol training script uses LoRA (Low-Rank Adaptation) to efficiently fine-tune large language models on chemistry tasks. It supports distributed training, Weights & Biases integration, and custom data collation.
Running Training
Basic Command
Distributed Training
Training Parameters
Model & Data
The base model to fine-tune. Supported models:
- mistralai/Mistral-7B-v0.1
- facebook/galactica-6.7b
- meta-llama/Llama-2-7b-hf
- codellama/CodeLlama-7b-hf
Path to the training dataset (Hugging Face dataset path or local path).
Directory where model checkpoints will be saved.
Name of the training split in the dataset.
Name of the validation split in the dataset.
Specific tasks to train on. If None, trains on all tasks in the dataset.
Training Hyperparameters
Total batch size across all devices (effective batch size).
Batch size per device. Gradient accumulation steps are calculated as batch_size / micro_batch_size.
Number of training epochs.
Learning rate for the optimizer.
Maximum sequence length. Longer sequences will be truncated.
Optimizer to use. Options include:
- adamw_bnb_8bit: 8-bit AdamW (memory efficient)
- adamw_torch: Standard PyTorch AdamW
- Other HuggingFace-supported optimizers
Learning rate scheduler type. Options: cosine, linear, constant, etc.
Number of warmup steps for the learning rate scheduler.
Whether to use a validation set for evaluation during training.
If False, input tokens are masked out of the loss calculation, so the loss is computed only on output tokens.
Whether to group training samples by length. Can speed up training but may produce odd loss curves.
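The relationship between the total batch size and the per-device batch size described above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the function name `gradient_accumulation_steps` and the `world_size` argument are hypothetical, and dividing by the number of processes in a multi-GPU run is an assumption about how a distributed setup would typically keep the effective batch size constant.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int,
                                world_size: int = 1) -> int:
    """Return how many micro-batches to accumulate before an optimizer step.

    batch_size and micro_batch_size mirror the parameters documented above.
    world_size is a hypothetical argument (number of training processes).
    """
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size should be a multiple of micro_batch_size")

    # Documented rule: batch_size / micro_batch_size.
    steps = batch_size // micro_batch_size

    # Assumption: with several processes, each one accumulates proportionally
    # fewer micro-batches so the effective batch size stays the same.
    if world_size > 1:
        steps = max(1, steps // world_size)
    return steps


# Illustrative values only, not the script's defaults.
single_gpu = gradient_accumulation_steps(batch_size=512, micro_batch_size=4)
four_gpus = gradient_accumulation_steps(batch_size=512, micro_batch_size=4, world_size=4)
print(single_gpu, four_gpus)  # 128 32
```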
LoRA Hyperparameters
LoRA rank (dimensionality of the low-rank matrices).
LoRA scaling parameter.
Dropout probability for LoRA layers.
Model modules to apply LoRA to. Common options:
- Attention projections (query, key, value, output): q_proj, k_proj, v_proj, o_proj
- MLP layers: gate_proj, down_proj, up_proj
- Embeddings: wte
Additional modules to save in full, without LoRA adaptation. Useful for saving embeddings or classification heads.
Weights & Biases Integration
Weights & Biases project name for logging.
Name for this specific W&B run.
W&B watch mode. Options: false, gradients, all.
Whether to log model checkpoints to W&B. Options: false, true.
Checkpointing
Path to a checkpoint to resume training from.
Log metrics every N steps.
Save checkpoint every N steps.
Run evaluation every N steps (if validation set is enabled).
Maximum number of checkpoints to keep. Older checkpoints are deleted.
Precision & Device
Training precision. Currently only bf16 (bfloat16) is tested and supported.
Whether to load the model in 8-bit mode (requires bitsandbytes).
Training Function
LoRA Fine-Tuning
LlaSMol uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA to reduce memory requirements:
- Low-Rank Adaptation: Instead of fine-tuning all model parameters, LoRA adds trainable low-rank matrices to attention layers
- Memory Efficient: Typically requires 3-4x less GPU memory than full fine-tuning
- Fast Training: Fewer parameters to update means faster training iterations
- Modular: LoRA adapters can be easily swapped or combined
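To make the "fewer parameters" point concrete, the arithmetic for a single projection matrix can be worked through directly. The sketch below assumes a 4096 x 4096 attention projection (the hidden size of the 7B Llama-2 and Mistral models) and an illustrative rank of 16; the rank is not taken from the script's defaults, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple:
    """Compare trainable parameters for full fine-tuning vs. a LoRA adapter.

    Full fine-tuning updates the whole d_out x d_in weight matrix.
    LoRA instead trains two low-rank factors: A (r x d_in) and B (d_out x r).
    """
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora


# Illustrative values: one 4096 x 4096 projection, rank 16.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=16)
print(full, lora, full // lora)  # 16777216 131072 128
```

The ratio covers trainable weights in one adapted layer only. Overall GPU memory savings are smaller, because the frozen base weights and activations still have to be held in memory.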
LoRA Configuration
The LoRA configuration is defined using PEFT’s LoraConfig.
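As a rough sketch of what such a configuration might look like, the helper below assembles keyword arguments for `peft.LoraConfig`. The field names (`r`, `lora_alpha`, `lora_dropout`, `target_modules`, `modules_to_save`, `bias`, `task_type`) are real `LoraConfig` fields, but every value here is an illustrative assumption rather than the script's actual default, and the helper functions themselves are hypothetical.

```python
from typing import Any, Dict, List, Optional


def lora_config_kwargs(r: int = 16,
                       lora_alpha: int = 16,
                       lora_dropout: float = 0.05,
                       target_modules: Optional[List[str]] = None,
                       modules_to_save: Optional[List[str]] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for peft.LoraConfig.

    All default values are placeholders chosen for illustration only.
    """
    if target_modules is None:
        # Attention projection names listed under "LoRA Hyperparameters".
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
    kwargs: Dict[str, Any] = {
        "r": r,
        "lora_alpha": lora_alpha,
        "lora_dropout": lora_dropout,
        "target_modules": list(target_modules),
        "bias": "none",
        "task_type": "CAUSAL_LM",
    }
    if modules_to_save:
        kwargs["modules_to_save"] = list(modules_to_save)
    return kwargs


def build_lora_config(**overrides: Any):
    """Create a LoraConfig; peft is imported lazily so the helper above works without it."""
    from peft import LoraConfig
    return LoraConfig(**lora_config_kwargs(**overrides))


print(lora_config_kwargs(r=8)["target_modules"])
```

With `peft` installed, the resulting object would typically be passed to `get_peft_model(model, build_lora_config())` to wrap the base model with trainable adapters.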
Custom Training Components
CustomTrainer
The training script uses a CustomTrainer class (defined in trainer.py) that extends HuggingFace’s Trainer with:
- Core Loss Tracking: Tracks loss on “core” tokens (important chemical entities)
- Core Mask Support: Uses core_mask to weight loss on specific tokens
- Custom Data Collation: Handles chemistry-specific data formatting
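The idea of tracking a separate loss over "core" tokens can be illustrated without any framework code. The sketch below is not the CustomTrainer implementation, which is not shown here. It assumes per-token losses are already available and that `core_mask` marks core positions with 1; the function name `overall_and_core_loss` is hypothetical.

```python
from typing import List, Tuple


def overall_and_core_loss(token_losses: List[float],
                          core_mask: List[int]) -> Tuple[float, float]:
    """Return (mean loss over all tokens, mean loss over core tokens only).

    core_mask[i] == 1 marks token i as a core token (assumed convention).
    """
    if not token_losses or len(token_losses) != len(core_mask):
        raise ValueError("token_losses and core_mask must be non-empty and equal in length")

    overall = sum(token_losses) / len(token_losses)

    # Keep only the losses at positions flagged as core.
    core = [loss for loss, flag in zip(token_losses, core_mask) if flag]
    core_loss = sum(core) / len(core) if core else 0.0
    return overall, core_loss


# Toy per-token losses; the last two positions are treated as core tokens.
overall, core = overall_and_core_loss([0.5, 1.5, 2.0, 4.0], [0, 0, 1, 1])
print(overall, core)  # 2.0 3.0
```

Logging both numbers side by side shows whether the model is learning the chemically important tokens or mostly the surrounding boilerplate text.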
Data Processing
The generate_and_tokenize_prompt() function:
- Generates chat-formatted prompts from input/output pairs
- Tokenizes with truncation to cutoff_len
- Optionally masks input tokens (when train_on_inputs=False)
- Generates core masks for important chemical tokens
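The truncation and input-masking steps above can be sketched with a toy tokenizer. This is an illustration under stated assumptions, not the script's function: `toy_tokenize` and `build_example` are hypothetical, the prompt is a plain string rather than a chat template, and core-mask generation is omitted. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def build_example(prompt: str, response: str, cutoff_len: int,
                  train_on_inputs: bool = False) -> Dict[str, List[int]]:
    """Tokenize prompt + response, truncate, and optionally mask the prompt."""
    prompt_ids = toy_tokenize(prompt)
    # Truncate the concatenated sequence to cutoff_len tokens.
    input_ids = (prompt_ids + toy_tokenize(response))[:cutoff_len]

    labels = list(input_ids)
    if not train_on_inputs:
        # Mask the prompt so only response tokens contribute to the loss.
        n_prompt = min(len(prompt_ids), len(input_ids))
        labels[:n_prompt] = [IGNORE_INDEX] * n_prompt

    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


example = build_example("Q:", "CCO", cutoff_len=4)
print(example["labels"])  # [-100, -100, 67, 67]
```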
Usage Examples
Basic Training
Training with W&B Logging
Custom LoRA Configuration
Resume from Checkpoint
Task-Specific Training
File Location
LLM4Chem/finetune.py
Related Components
- trainer.py: CustomTrainer and CustomDataCollator classes
- model.py: Model loading utilities
- config.py: Task configurations and base model mappings
- utils/general_prompter.py: Prompt formatting
- utils/core_tagger.py: Core token masking for chemistry entities
For best results with chemistry tasks, use the default LoRA configuration, which has been optimized for molecular understanding and generation.