Overview

The LlaSMol evaluation pipeline consists of three main steps:
  1. Generation: Generate predictions on test datasets
  2. Extraction: Extract and format predictions from model outputs
  3. Metrics: Compute task-specific evaluation metrics
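
The three steps can be chained from a single driver script. The following is a minimal sketch, assuming the three scripts sit in the current working directory and accept the flags documented on this page. The helper names build_pipeline_commands and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_pipeline_commands(model_name, task, output_dir="./eval",
                            prediction_dir="./pred", split="test"):
    """Return the generation, extraction, and metrics commands as argument lists."""
    python = sys.executable
    return [
        [python, "generate_on_dataset.py",
         f"--model_name={model_name}", f"--split={split}",
         f"--tasks={task}", f"--output_dir={output_dir}"],
        [python, "extract_prediction.py",
         f"--output_dir={output_dir}", f"--prediction_dir={prediction_dir}",
         f"--tasks={task}"],
        [python, "compute_metrics.py",
         f"--prediction_dir={prediction_dir}", f"--tasks={task}"],
    ]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_pipeline_commands("osunlp/LlaSMol-Mistral-7B", "name_conversion-i2s")
for command in commands:
    # Print the script name and flags without the interpreter path.
    print(" ".join(command[1:]))
```

Calling run_pipeline(commands) would execute the steps sequentially. It is left out of the example so that the snippet can be inspected without loading a model.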

Evaluation Pipeline

1. Generate Predictions

Use generate_on_dataset.py to generate predictions for one or more tasks.
python generate_on_dataset.py \
  --model_name="osunlp/LlaSMol-Mistral-7B" \
  --data_path="osunlp/SMolInstruct" \
  --split="test" \
  --tasks="name_conversion-i2s" \
  --output_dir="./eval" \
  --batch_size=1

Parameters

  • model_name (string, required): Name or path of the fine-tuned LlaSMol model.
  • base_model (string, default: None): Base model architecture. Auto-detected from model_name if not specified.
  • data_path (string, default: "osunlp/SMolInstruct"): Dataset to evaluate on (Hugging Face dataset or local path).
  • split (string, default: "test[:2]"): Dataset split to evaluate. Use slice notation (e.g., test[:100]) to evaluate on a subset.
  • tasks (list[string], default: None): Tasks to evaluate. If None, evaluates on all available tasks.
  • output_dir (string, default: "eval"): Directory where prediction files will be saved (one .jsonl file per task).
  • batch_size (int, default: 1): Number of samples to process in parallel.
  • max_input_tokens (int, default: None): Maximum input token length. Uses task-specific or default settings if None.
  • max_new_tokens (int, default: None): Maximum tokens to generate. Uses task-specific or default settings if None.
  • print_out (bool, default: False): Whether to print predictions during generation.
  • device (string, default: None): Device to run on (cuda or cpu). Auto-detected if None.
  • **generation_kargs (dict): Additional generation parameters (e.g., num_beams, num_return_sequences).

Output Format

Writes one .jsonl file per task to the output directory. Each line is one JSON record with the following fields (pretty-printed here for readability):
{
  "input": "Convert IUPAC name to SMILES: benzene",
  "gold": "c1ccccc1",
  "output": ["c1ccccc1"],
  "task": "name_conversion-i2s",
  "split": "test",
  "target": null,
  "input_text": "Convert IUPAC name to SMILES: benzene",
  "real_input_text": "[INST] Convert IUPAC name to SMILES: benzene [/INST]"
}
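
Generation outputs can be inspected with a few lines of Python. This is a minimal sketch, assuming each record occupies a single line, as is standard for .jsonl. The read_jsonl helper is a hypothetical name, and the sample record is a stand-in written to a temporary file so that the example runs without a model.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Stand-in record mirroring the fields documented above.
sample = {
    "input": "Convert IUPAC name to SMILES: benzene",
    "gold": "c1ccccc1",
    "output": ["c1ccccc1"],
    "task": "name_conversion-i2s",
    "split": "test",
    "target": None,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "name_conversion-i2s.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    records = read_jsonl(path)

first = records[0]
# "output" is a list because several sequences may be returned per input.
print(first["task"], first["gold"], first["output"])
```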

2. Extract Predictions

Use extract_prediction.py to extract predictions from generation outputs.
python extract_prediction.py \
  --output_dir="./eval" \
  --prediction_dir="./pred" \
  --tasks="name_conversion-i2s"

Parameters

  • output_dir (string, default: "eval"): Directory containing generation outputs from step 1.
  • prediction_dir (string, default: "pred"): Directory where extracted predictions will be saved.
  • tasks (list[string] | string, default: None): Tasks to process. If None, processes all tasks found in output_dir.

Functionality

The extraction script:
  • Reads generation outputs from output_dir
  • Extracts predictions based on task-specific tags (if configured)
  • Saves formatted predictions to prediction_dir
  • Handles multiple prediction sequences per input
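
Tag-based extraction can be pictured as a regular-expression search applied to every returned sequence. The sketch below is illustrative only: the <SMILES> tag and the helper names extract_tagged and extract_all are assumptions, and the actual logic in extract_prediction.py may differ.

```python
import re


def extract_tagged(text, tag):
    """Return the stripped content of the first <tag> ... </tag> span, or None if absent."""
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_all(outputs, tag):
    """Apply the extraction to every returned sequence, preserving order."""
    return [extract_tagged(output, tag) for output in outputs]


# Two hypothetical model outputs for the same input.
raw_outputs = [
    "The molecule is <SMILES> c1ccccc1 </SMILES> .",
    "No tagged answer here.",
]
preds = extract_all(raw_outputs, "SMILES")
print(preds)  # ['c1ccccc1', None]
```

Keeping a None placeholder for sequences without a tagged answer preserves the alignment between candidates and their rank, which matters for top-k metrics.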

Output Format

Writes .jsonl files to prediction_dir. Each line is one JSON record with the following fields (pretty-printed here for readability):
{
  "input": "Convert IUPAC name to SMILES: benzene",
  "gold": "c1ccccc1",
  "pred": ["c1ccccc1"],
  "task": "name_conversion-i2s",
  "split": "test",
  "target": null
}
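
Because pred is a list of candidates, ranked metrics such as top-k accuracy can be computed directly from these records. The following is a minimal sketch using plain string comparison. The top_k_match function is a hypothetical helper, and a real SMILES comparison would normally canonicalize both strings first.

```python
def top_k_match(pred_lists, gold_list, k):
    """Fraction of samples whose gold answer appears among the first k candidates."""
    if len(pred_lists) != len(gold_list):
        raise ValueError("pred_lists and gold_list must have the same length")
    if not gold_list:
        return 0.0
    hits = sum(1 for preds, gold in zip(pred_lists, gold_list) if gold in preds[:k])
    return hits / len(gold_list)


pred_lists = [
    ["c1ccccc1", "C1=CC=CC=C1"],  # gold is the first candidate
    ["CCO", "CO"],                # gold is the second candidate
    ["CCC", "CCCC"],              # gold is missing
]
gold_list = ["c1ccccc1", "CO", "CC"]

top1 = top_k_match(pred_lists, gold_list, k=1)  # 1 of 3 samples hit
top2 = top_k_match(pred_lists, gold_list, k=2)  # 2 of 3 samples hit
print(round(top1, 4), round(top2, 4))
```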

3. Compute Metrics

Use compute_metrics.py to calculate evaluation metrics.
python compute_metrics.py \
  --prediction_dir="./pred" \
  --tasks="name_conversion-i2s"

Parameters

  • prediction_dir (string, default: "pred"): Directory containing prediction files from step 2.
  • tasks (list[string], default: all tasks from config.TASKS): Tasks to compute metrics for.

Task-Specific Metrics

Different tasks use different evaluation metrics:
SMILES Tasks (forward_synthesis, molecule_generation, name_conversion-i2s):
  • Exact Match
  • Validity
  • Fingerprint Similarity (Tanimoto)
  • MACCS FTS
  • RDK FTS
  • Morgan FTS
Retrosynthesis:
  • Exact Match
  • Fingerprint Similarity
  • Multiple Match (top-k accuracy)
Text Tasks (molecule_captioning):
  • BLEU-2, BLEU-4
  • ROUGE-1, ROUGE-2, ROUGE-L
  • METEOR
Formula Tasks (name_conversion-i2f, name_conversion-s2f):
  • Element Match: Compares molecular formulas element-by-element
IUPAC Name Tasks (name_conversion-s2i):
  • Split Match: Tokenized comparison of IUPAC names
Numerical Property Prediction (property_prediction-esol, property_prediction-lipo):
  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Square Error)
Boolean Property Prediction (property_prediction-bbbp, property_prediction-clintox, property_prediction-hiv, property_prediction-sider):
  • Accuracy
  • Precision
  • Recall
  • F1 Score
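
The numerical and boolean metrics listed above follow their standard definitions. The sketch below spells them out in plain Python for reference. It is not the code in utils/metrics.py, and the function names mae_rmse and binary_metrics are hypothetical.

```python
import math


def mae_rmse(preds, golds):
    """Mean absolute error and root mean square error over paired values."""
    errors = [p - g for p, g in zip(preds, golds)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


def binary_metrics(preds, golds):
    """Accuracy, precision, recall, and F1 with True as the positive class."""
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    fp = sum(1 for p, g in zip(preds, golds) if p and not g)
    fn = sum(1 for p, g in zip(preds, golds) if not p and g)
    tn = sum(1 for p, g in zip(preds, golds) if not p and not g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(golds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Errors are 0, -1, and 2, so MAE = 1.0 and RMSE = sqrt(5/3).
mae, rmse = mae_rmse([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(mae, round(rmse, 4))

# One true positive, one false positive, one false negative, one true negative.
scores = binary_metrics([True, True, False, False], [True, False, True, False])
print(scores)
```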

Output Format

Prints metrics to console:
===== name_conversion-i2s =====
Altogether 1000 samples.
exact_match:  0.8523
validity:     0.9876
maccs_fts:    0.8912
rdk_fts:      0.8745
morgan_fts:   0.8834

Supported Tasks

The following chemistry tasks are supported:

Forward Synthesis

Predict reaction products from reactants

Retrosynthesis

Predict reactants from products

Molecule Captioning

Generate text descriptions of molecules

Molecule Generation

Generate SMILES from text descriptions

IUPAC to Formula

Convert IUPAC names to molecular formulas

IUPAC to SMILES

Convert IUPAC names to SMILES

SMILES to Formula

Convert SMILES to molecular formulas

SMILES to IUPAC

Convert SMILES to IUPAC names

Property Prediction

Predict molecular properties (ESOL, Lipo, BBBP, ClinTox, HIV, SIDER)

Task Configuration

Task-specific generation settings are defined in config.py:
TASKS_GENERATION_SETTINGS = {
    "name_conversion-i2s": {
        "generation_kargs": {"num_return_sequences": 5, "num_beams": 8}
    },
    "forward_synthesis": {
        "generation_kargs": {"num_return_sequences": 5, "num_beams": 8}
    },
    "retrosynthesis": {
        "max_new_tokens": 960,
        "generation_kargs": {"num_return_sequences": 10, "num_beams": 13}
    },
    ...
}
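
One way to picture how these settings interact with explicit arguments is a simple merge in which explicit values take precedence over the task-specific ones. The sketch below is an assumption about that precedence rather than the repository's actual logic, and resolve_settings is a hypothetical helper. The dictionary repeats only two of the entries shown above.

```python
# Subset of the settings shown above, repeated so the example is self-contained.
TASKS_GENERATION_SETTINGS = {
    "name_conversion-i2s": {
        "generation_kargs": {"num_return_sequences": 5, "num_beams": 8}
    },
    "retrosynthesis": {
        "max_new_tokens": 960,
        "generation_kargs": {"num_return_sequences": 10, "num_beams": 13},
    },
}


def resolve_settings(task, max_new_tokens=None, **generation_kargs):
    """Merge task-specific settings with explicit overrides (explicit values win)."""
    task_settings = TASKS_GENERATION_SETTINGS.get(task, {})
    merged_kargs = dict(task_settings.get("generation_kargs", {}))
    merged_kargs.update(generation_kargs)
    if max_new_tokens is None:
        # Fall back to the task-specific value, which may also be absent.
        max_new_tokens = task_settings.get("max_new_tokens")
    return {"max_new_tokens": max_new_tokens, "generation_kargs": merged_kargs}


overridden = resolve_settings("retrosynthesis", num_beams=20)
print(overridden)

# A task without an entry falls through to empty settings.
unconfigured = resolve_settings("molecule_captioning")
print(unconfigured)
```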

Complete Evaluation Workflow

Example: Evaluate on Name Conversion

# Step 1: Generate predictions
python generate_on_dataset.py \
  --model_name="osunlp/LlaSMol-Mistral-7B" \
  --tasks="name_conversion-i2s" \
  --split="test" \
  --output_dir="./eval"

# Step 2: Extract predictions
python extract_prediction.py \
  --output_dir="./eval" \
  --prediction_dir="./pred" \
  --tasks="name_conversion-i2s"

# Step 3: Compute metrics
python compute_metrics.py \
  --prediction_dir="./pred" \
  --tasks="name_conversion-i2s"

Example: Evaluate All Tasks

# Generate for all tasks
python generate_on_dataset.py \
  --model_name="osunlp/LlaSMol-Mistral-7B" \
  --split="test" \
  --output_dir="./eval"

# Extract all predictions
python extract_prediction.py \
  --output_dir="./eval" \
  --prediction_dir="./pred"

# Compute all metrics
python compute_metrics.py \
  --prediction_dir="./pred"

Example: Evaluate with Custom Settings

python generate_on_dataset.py \
  --model_name="osunlp/LlaSMol-Mistral-7B" \
  --tasks="forward_synthesis" \
  --split="test" \
  --output_dir="./eval" \
  --batch_size=8 \
  --num_beams=10 \
  --num_return_sequences=5

Programmatic Usage

Generate Function

from generate_on_dataset import generate
from generation import LlaSMolGeneration

generator = LlaSMolGeneration("osunlp/LlaSMol-Mistral-7B")

generate(
    generator=generator,
    data_path="osunlp/SMolInstruct",
    split="test",
    task="name_conversion-i2s",
    output_file="./eval/name_conversion-i2s.jsonl",
    batch_size=1,
    print_out=True
)

Extract Predictions

from extract_prediction import extract_prediction

extract_prediction(
    output_file="./eval/name_conversion-i2s.jsonl",
    prediction_file="./pred/name_conversion-i2s.jsonl",
    task="name_conversion-i2s"
)

Compute Metrics

from compute_metrics import read_result
from utils.metrics import calculate_smiles_metrics

pred_list, gold_list = read_result(
    prediction_dir="./pred",
    task="name_conversion-i2s",
    replace_semicolon=True
)

metrics = calculate_smiles_metrics(pred_list, gold_list)
print(metrics)

Metrics Utilities

The utils/metrics.py module provides metric calculation functions:
  • calculate_smiles_metrics(): SMILES validity, exact match, fingerprint similarity
  • calculate_text_metrics(): BLEU, ROUGE, METEOR for text generation
  • calculate_formula_metrics(): Element matching for molecular formulas
  • calculate_number_metrics(): MAE, RMSE for numerical predictions
  • calculate_boolean_metrics(): Accuracy, precision, recall, F1 for binary classification
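
To illustrate what element matching means, the sketch below compares two molecular formulas by their per-element atom counts. It is not the implementation behind calculate_formula_metrics(). The helper names parse_formula and element_match are hypothetical, and the parser handles only flat formulas without parentheses or charges.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms per element in a flat formula such as 'C6H6' or 'C2H5OH'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def element_match(pred, gold):
    """True when both formulas contain the same elements with the same counts."""
    return parse_formula(pred) == parse_formula(gold)


print(element_match("C2H5OH", "C2H6O"))  # True: same atoms, different notation
print(element_match("C6H5", "C6H6"))     # False: hydrogen counts differ
```

Comparing counts instead of raw strings means that two formulas written in a different element order still match.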

File Locations

  • LLM4Chem/generate_on_dataset.py: Dataset generation script
  • LLM4Chem/extract_prediction.py: Prediction extraction script
  • LLM4Chem/compute_metrics.py: Metrics computation script
  • LLM4Chem/config.py: Task configurations
  • LLM4Chem/utils/metrics.py: Metric calculation functions

The evaluation pipeline automatically applies the task-specific settings from config.py, including generation parameters and metric selection.

When evaluating on large datasets, consider raising batch_size and distributing generation across multiple GPUs to reduce evaluation time.
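
One simple way to distribute generation is to assign tasks to GPUs round-robin and launch one generate_on_dataset.py process per task, pinned to a device through the CUDA_VISIBLE_DEVICES environment variable. The sketch below only builds the jobs. The helper names are hypothetical, and the repository itself may offer no built-in multi-GPU launcher.

```python
import os
import sys


def assign_tasks_to_gpus(tasks, num_gpus):
    """Round-robin tasks over GPU indices; returns {gpu_index: [tasks]}."""
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for i, task in enumerate(tasks):
        assignment[i % num_gpus].append(task)
    return assignment


def build_jobs(model_name, tasks, num_gpus, output_dir="./eval"):
    """Build (env, command) pairs, one generation command per task, pinned to one GPU."""
    jobs = []
    for gpu, gpu_tasks in assign_tasks_to_gpus(tasks, num_gpus).items():
        # Each process sees only its assigned GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        for task in gpu_tasks:
            command = [
                sys.executable, "generate_on_dataset.py",
                f"--model_name={model_name}", f"--tasks={task}",
                "--split=test", f"--output_dir={output_dir}",
            ]
            jobs.append((env, command))
    return jobs


tasks = ["forward_synthesis", "retrosynthesis", "molecule_captioning"]
jobs = build_jobs("osunlp/LlaSMol-Mistral-7B", tasks, num_gpus=2)
for env, command in jobs:
    print(env["CUDA_VISIBLE_DEVICES"], command[3])
```

Each (env, command) pair can then be passed to subprocess.Popen(command, env=env), running the jobs for one GPU sequentially and the GPUs in parallel. Because every task writes its own .jsonl file, the extraction and metrics steps can run unchanged afterwards.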
