Training

Overview

The LlaSMol training script uses LoRA (Low-Rank Adaptation) to efficiently fine-tune large language models on chemistry tasks. It supports distributed training, Weights & Biases integration, and custom data collation.

Running Training

Basic Command

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./checkpoints/llasmol" \
  --batch_size=512 \
  --micro_batch_size=4 \
  --num_epochs=3 \
  --learning_rate=1e-4

Distributed Training

torchrun --nproc_per_node=4 finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./checkpoints/llasmol" \
  --batch_size=512 \
  --micro_batch_size=4

Training Parameters

Model & Data

base_model (string, required)

The base model to fine-tune. Supported models:

mistralai/Mistral-7B-v0.1
facebook/galactica-6.7b
meta-llama/Llama-2-7b-hf
codellama/CodeLlama-7b-hf

data_path (string, required)

Path to the training dataset (Hugging Face dataset path or local path).

output_dir (string, default: "checkpoint")

Directory where model checkpoints will be saved.

train_split (string, default: "train")

Name of the training split in the dataset.

dev_split (string, default: "validation")

Name of the validation split in the dataset.

tasks (list[string], default: None)

Specific tasks to train on. If None, trains on all tasks in the dataset.
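As an illustration of what restricting training to a subset of tasks amounts to, the sketch below filters a list of samples by task name. The `task` field name and the `filter_by_tasks` helper are assumptions made for this example; they are not taken from finetune.py.

```python
from typing import Optional


def filter_by_tasks(samples: list[dict], tasks: Optional[list[str]] = None) -> list[dict]:
    """Keep only samples whose (assumed) 'task' field is listed in `tasks`.

    With tasks=None every sample is kept, mirroring the documented default.
    """
    if tasks is None:
        return list(samples)
    wanted = set(tasks)
    return [sample for sample in samples if sample["task"] in wanted]


# Placeholder samples; the task names are the ones used elsewhere on this page.
samples = [
    {"task": "name_conversion-i2s", "input": "in-1", "output": "out-1"},
    {"task": "name_conversion-s2i", "input": "in-2", "output": "out-2"},
    {"task": "some_other_task", "input": "in-3", "output": "out-3"},
]

all_samples = filter_by_tasks(samples)
subset = filter_by_tasks(samples, ["name_conversion-i2s", "name_conversion-s2i"])
print(len(all_samples), len(subset))
```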

Training Hyperparameters

batch_size (int, default: 512)

Total batch size across all devices (effective batch size).

micro_batch_size (int, default: 4)

Batch size per device. Gradient accumulation steps are calculated as batch_size / micro_batch_size.
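The arithmetic can be sketched as follows. With the defaults, 512 / 4 gives 128 accumulation steps on a single device. The extra division by the number of processes is an assumption about how a multi-GPU run would keep the effective batch size equal to batch_size; check finetune.py before relying on it.

```python
def gradient_accumulation_steps(batch_size: int, micro_batch_size: int, world_size: int = 1) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    steps = batch_size // micro_batch_size
    # Assumption: with several processes, each one accumulates a proportional share.
    return max(1, steps // world_size)


single_gpu = gradient_accumulation_steps(512, 4)
four_gpus = gradient_accumulation_steps(512, 4, world_size=4)

# Effective batch size = micro batch * accumulation steps * number of processes.
effective = 4 * four_gpus * 4
print(single_gpu, four_gpus, effective)
```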

num_epochs (int, default: 3)

Number of training epochs.

learning_rate (float, default: 1e-4)

Learning rate for the optimizer.

cutoff_len (int, default: 512)

Maximum sequence length. Longer sequences will be truncated.

optim (string, default: "adamw_bnb_8bit")

Optimizer to use. Options include:

adamw_bnb_8bit: 8-bit AdamW (memory efficient)
adamw_torch: Standard PyTorch AdamW
Other HuggingFace-supported optimizers

lr_scheduler (string, default: "cosine")

Learning rate scheduler type. Options: cosine, linear, constant, etc.

warmup_steps (int, default: 1000)

Number of warmup steps for the learning rate scheduler.
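To show how warmup interacts with the default cosine scheduler, here is a minimal sketch of a linear-warmup, cosine-decay schedule. The `total_steps` value is a hypothetical run length chosen for the example; in a real run it follows from the dataset size, batch size, and number of epochs.

```python
import math


def lr_at_step(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000,
               total_steps: int = 10000) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 500, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

The rate rises linearly during the first 1000 steps, peaks at the configured learning rate, and then decays along a half cosine.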

use_val_set (bool, default: True)

Whether to use a validation set for evaluation during training.

train_on_inputs (bool, default: True)

If False, input tokens are masked out of the loss, so the loss is computed only on output tokens.

group_by_length (bool, default: False)

Whether to group training samples by length. Can speed up training but may produce odd loss curves.
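The speed-up comes from reduced padding: batches of similar-length sequences waste fewer pad tokens. The sketch below compares padding for batches taken in random order against batches taken from a fully sorted list. A full sort is a simplification for illustration; practical samplers usually keep some shuffling, and the lengths here are synthetic.

```python
import random


def padding_tokens(lengths: list[int], batch_size: int) -> int:
    """Total pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


rng = random.Random(0)  # fixed seed so the comparison is repeatable
lengths = [rng.randint(16, 512) for _ in range(1024)]

random_order = padding_tokens(lengths, batch_size=4)
grouped = padding_tokens(sorted(lengths), batch_size=4)
print(random_order, grouped)
```

Because consecutive batches then differ systematically in length, the per-step loss can look less smooth than with fully random batches.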

LoRA Hyperparameters

lora_r (int, default: 16)

LoRA rank (dimensionality of the low-rank matrices).

lora_alpha (int, default: 16)

LoRA scaling parameter.
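In standard LoRA the low-rank update is scaled by lora_alpha / lora_r before being added to the frozen layer's output. The pure-Python sketch below shows that forward pass on tiny hand-written matrices; the helper names and the values are illustrative only.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha: int, lora_r: int) -> list[float]:
    """Compute W x + (lora_alpha / lora_r) * B (A x)."""
    scaling = lora_alpha / lora_r
    base = matvec(W, x)              # frozen pretrained weight
    delta = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scaling * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # r x d_in, with r = 1
B = [[0.5], [0.0]]            # d_out x r
x = [2.0, 3.0]

y = lora_forward(W, A, B, x, lora_alpha=2, lora_r=1)
print(y)
```

With the defaults (lora_r=16, lora_alpha=16) the scaling factor is 1.0; raising lora_alpha relative to lora_r amplifies the adapter's contribution.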

lora_dropout (float, default: 0.05)

Dropout probability for LoRA layers.

lora_target_modules (list[string])

Model modules to apply LoRA to. Common options:

Attention projections (query, key, value, output): q_proj, k_proj, v_proj, o_proj
MLP layers: gate_proj, down_proj, up_proj
Embeddings: wte

modules_to_save (list[string], default: [])

Additional modules to train in full (without LoRA) and save with the adapter. Useful for embeddings or classification heads.

Weights & Biases Integration

wandb_project (string, default: "")

Weights & Biases project name for logging.

wandb_run_name (string, default: "")

Name for this specific W&B run.

wandb_watch (string, default: "")

W&B watch mode. Options: false, gradients, all.

wandb_log_model (string, default: "")

Whether to log model checkpoints to W&B. Options: false, true.

Checkpointing

resume_from_checkpoint (string, default: None)

Path to a checkpoint to resume training from.

logging_steps (int, default: 10)

Log metrics every N steps.

save_steps (int, default: 200)

Save checkpoint every N steps.

eval_steps (int, default: 200)

Run evaluation every N steps (if validation set is enabled).

save_total_limit (int, default: None)

Maximum number of checkpoints to keep. Older checkpoints are deleted.
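The interaction between save_steps and save_total_limit can be sketched as a simple rotation. The `checkpoints_kept` helper is hypothetical and ignores special cases such as a trainer retaining a best-scoring checkpoint; the `checkpoint-<step>` naming follows the resume example on this page.

```python
from typing import Optional


def checkpoints_kept(total_steps: int, save_steps: int = 200,
                     save_total_limit: Optional[int] = None) -> list[str]:
    """Checkpoint directories left on disk after `total_steps` training steps."""
    saved = [f"checkpoint-{step}" for step in range(save_steps, total_steps + 1, save_steps)]
    if save_total_limit is None or save_total_limit <= 0:
        return saved  # no limit: every checkpoint is kept
    return saved[-save_total_limit:]  # only the most recent ones survive


unlimited = checkpoints_kept(1000)
limited = checkpoints_kept(1000, save_total_limit=2)
print(unlimited)
print(limited)
```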

Precision & Device

precision (string, default: "bf16")

Training precision. Currently only bf16 (bfloat16) is tested and supported.

use_int8 (bool, default: False)

Whether to load the model in 8-bit mode (requires bitsandbytes).

Training Function

from finetune import train

train(
    base_model="mistralai/Mistral-7B-v0.1",
    data_path="osunlp/SMolInstruct",
    output_dir="./checkpoints/llasmol",
    batch_size=512,
    micro_batch_size=4,
    num_epochs=3,
    learning_rate=1e-4,
    lora_r=16,
    lora_alpha=16,
    wandb_project="llasmol-training",
    wandb_run_name="mistral-7b-chemistry"
)

LoRA Fine-Tuning

LlaSMol uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA to reduce memory requirements:

Low-Rank Adaptation: Instead of fine-tuning all model parameters, LoRA freezes them and adds trainable low-rank matrices to selected layers (the attention and MLP projections in the configuration below)
Memory Efficient: Typically requires 3-4x less GPU memory than full fine-tuning
Fast Training: Fewer parameters to update means faster training iterations
Modular: LoRA adapters can be easily swapped or combined
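To give a sense of scale, the sketch below counts trainable parameters for one weight matrix under full fine-tuning versus LoRA. The 4096 x 4096 shape is an illustrative assumption for a single projection, not a figure taken from this page.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in       # every entry of W is trainable
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, r=16)
print(full, lora, full // lora)
```

For this shape and rank 16, the adapter trains 128 times fewer parameters than the full matrix. Overall memory savings are smaller than that ratio because the frozen weights and activations still occupy GPU memory.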

LoRA Configuration

The LoRA configuration is defined using PEFT’s LoraConfig:

from peft import LoraConfig

config = LoraConfig(
    r=16,                    # Rank of LoRA matrices
    lora_alpha=16,           # Scaling factor
    target_modules=[         # Modules to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj"
    ],
    lora_dropout=0.05,       # Dropout for regularization
    bias="none",             # Don't train bias parameters
    task_type="CAUSAL_LM"    # Causal language modeling
)

Custom Training Components

CustomTrainer

The training script uses a CustomTrainer class (defined in trainer.py) that extends HuggingFace’s Trainer with:

Core Loss Tracking: Tracks loss on “core” tokens (important chemical entities)
Core Mask Support: Uses core_mask to weight loss on specific tokens
Custom Data Collation: Handles chemistry-specific data formatting
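One plausible reading of core loss tracking is an average of per-token losses restricted to positions flagged in core_mask. The sketch below illustrates that idea in plain Python; it is an assumption about the general technique, not the code in trainer.py, and the loss values are made up.

```python
from typing import Optional


def mean_loss(token_losses: list[float], labels: list[int],
              mask: Optional[list[int]] = None, ignore_index: int = -100) -> float:
    """Average per-token loss over supervised tokens, optionally only where mask == 1."""
    selected = [
        loss
        for i, (loss, label) in enumerate(zip(token_losses, labels))
        if label != ignore_index and (mask is None or mask[i] == 1)
    ]
    return sum(selected) / len(selected) if selected else 0.0


token_losses = [0.0, 0.5, 1.5, 2.0]  # hypothetical per-token cross-entropy values
labels = [-100, 11, 12, 13]          # first position is a masked input token
core_mask = [0, 0, 1, 1]             # last two tokens flagged as "core"

overall = mean_loss(token_losses, labels)
core = mean_loss(token_losses, labels, core_mask)
print(overall, core)
```

Reporting both numbers separates how well the model predicts the chemically important tokens from how well it predicts the surrounding text.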

Data Processing

The generate_and_tokenize_prompt() function:

Generates chat-formatted prompts from input/output pairs
Tokenizes with truncation to cutoff_len
Optionally masks input tokens (when train_on_inputs=False)
Generates core masks for important chemical tokens

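def generate_and_tokenize_prompt(data_point, add_core_mask=True):
    # Simplified sketch: tokenize, prompter, generate_chat, and train_on_inputs
    # come from the enclosing training scope in finetune.py.
    input_text = data_point['input']
    output_text = data_point['output']

    # Create chat format
    chat = generate_chat(input_text, output_text)
    full_prompt = prompter.generate_prompt(chat)

    # Tokenize (truncated to cutoff_len)
    tokenized = tokenize(full_prompt)

    # Mask inputs if specified
    if not train_on_inputs:
        # Assumption: passing None as the output yields the prompt without the answer
        input_only_prompt = prompter.generate_prompt(generate_chat(input_text, None))
        # Count prompt tokens, not the keys of the tokenizer's output dict
        user_prompt_len = len(tokenize(input_only_prompt)["input_ids"])
        # -100 is the label value ignored by the loss
        tokenized["labels"] = [-100] * user_prompt_len + tokenized["labels"][user_prompt_len:]

    # Core-mask generation (when add_core_mask is True) is omitted from this sketch
    return tokenized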
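The masking step can be tried in isolation with a stand-in tokenizer. Everything in the sketch below is hypothetical (`toy_tokenize` maps each whitespace-separated word to its length), but the label handling mirrors the function above: prompt positions are set to -100 so that only the answer contributes to the loss.

```python
def toy_tokenize(text: str, cutoff_len: int = 512) -> dict:
    """Stand-in tokenizer: one id per word (its length), truncated to cutoff_len."""
    ids = [len(word) for word in text.split()][:cutoff_len]
    return {"input_ids": ids, "labels": list(ids)}


def mask_prompt_labels(user_prompt: str, full_prompt: str,
                       train_on_inputs: bool = False, cutoff_len: int = 512) -> dict:
    """Tokenize the full prompt and hide the user-prompt tokens from the loss."""
    tokenized = toy_tokenize(full_prompt, cutoff_len)
    if not train_on_inputs:
        prompt_len = len(toy_tokenize(user_prompt, cutoff_len)["input_ids"])
        tokenized["labels"] = [-100] * prompt_len + tokenized["labels"][prompt_len:]
    return tokenized


user_prompt = "Convert this IUPAC name to SMILES : ethanol"
full_prompt = user_prompt + " CCO"

example = mask_prompt_labels(user_prompt, full_prompt)
print(example["input_ids"])
print(example["labels"])
```

With train_on_inputs=True the labels would stay identical to input_ids, so the prompt tokens would also be trained on.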

Usage Examples

Basic Training

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./output/mistral-chemistry" \
  --num_epochs=3

Training with W&B Logging

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./output/mistral-chemistry" \
  --wandb_project="chemistry-llm" \
  --wandb_run_name="mistral-llasmol-v1" \
  --wandb_watch="gradients"

Custom LoRA Configuration

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./output/mistral-chemistry" \
  --lora_r=32 \
  --lora_alpha=64 \
  --lora_dropout=0.1 \
  --lora_target_modules q_proj k_proj v_proj o_proj

Resume from Checkpoint

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./output/mistral-chemistry" \
  --resume_from_checkpoint="./output/mistral-chemistry/checkpoint-1000"

Task-Specific Training

python finetune.py \
  --base_model="mistralai/Mistral-7B-v0.1" \
  --data_path="osunlp/SMolInstruct" \
  --output_dir="./output/name-conversion" \
  --tasks name_conversion-i2s name_conversion-s2i

File Location

LLM4Chem/finetune.py

Related Components

trainer.py: CustomTrainer and CustomDataCollator classes
model.py: Model loading utilities
config.py: Task configurations and base model mappings
utils/general_prompter.py: Prompt formatting
utils/core_tagger.py: Core token masking for chemistry entities

The training script uses torch.distributed and requires proper initialization. Make sure to use torchrun for multi-GPU training.
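torchrun passes the process layout to each worker through environment variables (WORLD_SIZE, RANK, and LOCAL_RANK). The sketch below shows one way a script might read them and fall back to single-process defaults; the `distributed_env` helper is hypothetical and is not taken from finetune.py.

```python
import os
from typing import Mapping, Optional


def distributed_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the process layout that torchrun exposes via environment variables."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "world_size": world_size,                    # total number of processes
        "rank": int(env.get("RANK", 0)),             # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this machine
        "distributed": world_size > 1,
    }


# Launched with plain `python`, none of the variables are set.
single = distributed_env({})
# Values a worker would see under `torchrun --nproc_per_node=4`.
worker = distributed_env({"WORLD_SIZE": "4", "RANK": "1", "LOCAL_RANK": "1"})
print(single["distributed"], worker["distributed"])
```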

For best results with chemistry tasks, use the default LoRA configuration which has been optimized for molecular understanding and generation.
