LlaSMol (Large Language Model for Small Molecules) is a family of fine-tuned models specifically trained for chemistry tasks using the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.

Overview

LlaSMol models are instruction-tuned on SMolInstruct, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.

Model Variants

All LlaSMol models are available on Hugging Face:

LlaSMol-Mistral-7B

Recommended; best overall performance.
Hugging Face ID: osunlp/LlaSMol-Mistral-7B

LlaSMol-Llama2-7B

Meta’s Llama2 base.
Hugging Face ID: osunlp/LlaSMol-Llama2-7B

LlaSMol-CodeLlama-7B

Code-specialized base.
Hugging Face ID: osunlp/LlaSMol-CodeLlama-7B

LlaSMol-Galactica-6.7B

Science-focused base.
Hugging Face ID: osunlp/LlaSMol-Galactica-6.7B

Model Initialization

ChemAgent uses the Mistral variant by default:
from LLM4Chem.generation import LlaSMolGeneration

generator = LlaSMolGeneration(
    "osunlp/LlaSMol-Mistral-7B", 
    device="cuda"
)
See plan_execute_agent/chem_tools.py:118 for integration.
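Loading the model is the slow part of startup (the latency figures on this page list roughly 10 extra seconds for the first load), so it can help to build the generator lazily and reuse it. The following is a minimal sketch, not the code in chem_tools.py; the names make_cached_loader, _load_llasmol, and get_generator are hypothetical and introduced only for illustration.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(factory: Callable[[], Any]) -> Callable[[], Any]:
    """Return a zero-argument loader that calls `factory` once and reuses the result."""

    @lru_cache(maxsize=1)
    def load() -> Any:
        return factory()

    return load


def _load_llasmol() -> Any:
    # Hypothetical helper: the import is deferred so that importing this
    # module does not pull in the model code until it is actually needed.
    from LLM4Chem.generation import LlaSMolGeneration

    return LlaSMolGeneration("osunlp/LlaSMol-Mistral-7B", device="cuda")


# Hypothetical module-level accessor: the first call loads the model,
# later calls return the same instance.
get_generator = make_cached_loader(_load_llasmol)
```

With this pattern, tools would call get_generator() instead of sharing a module-level variable, and the model is only loaded if a chemistry query is actually made.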

Hardware Requirements

LlaSMol models require GPU acceleration and sufficient VRAM:
  • Minimum: 8GB GPU memory
  • Recommended: 16GB+ for optimal performance
  • CPU-only: Set LOW_VRAM=True in configuration to disable

Supported Tasks

LlaSMol is trained on 14 chemistry tasks across 4 categories:

1. Name Conversion (4 tasks)

Converts between different molecular representations:
Task: Convert IUPAC name to molecular formula
query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"
response = generator.generate(query)
# Output: "<MOLFORMULA> C15H11NO </MOLFORMULA>"
See LLM4Chem/README.md:23 for more examples.

2. Property Prediction (6 tasks)

Predicts molecular properties from SMILES:
Predicts log solubility in mol/L:
query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
response = generator.generate(query)
# Output: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
See LLM4Chem/README.md:52 for details.
Predicts octanol/water distribution coefficient (logD at pH 7.4):
query = "Predict the logD for <SMILES> NC(=O)C1=CC=CC=C1O </SMILES>."
response = generator.generate(query)
# Output: "<NUMBER> 1.090 </NUMBER>"
See LLM4Chem/README.md:59 for details.
Predicts whether a molecule can penetrate the blood-brain barrier (BBBP, boolean):
query = "Is BBBP a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"
response = generator.generate(query)
# Output: "<BOOLEAN> Yes </BOOLEAN>"
See LLM4Chem/README.md:66 for details.
Predicts whether a molecule is toxic (boolean):
query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:73 for details.
Predicts whether a molecule inhibits HIV replication (boolean):
query = "Can <SMILES> CC1=CN(C2C=CCCC2O)C(=O)NC1=O </SMILES> inhibit HIV?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:80 for details.
Predicts organ-specific side effects (boolean):
query = "Are there side effects of <SMILES> CC1=CC(C)=C(NC(=O)CN(CC(=O)O)CC(=O)O)C(C)=C1Br </SMILES> affecting the heart?"
response = generator.generate(query)
# Output: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:87 for details.
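The six property queries above share one pattern: a fixed question with the SMILES wrapped in tags. A small template table can keep that phrasing consistent. This is an illustrative sketch only; PROPERTY_TEMPLATES and build_property_query are hypothetical names, the task keys are made up for this example, and the wording is copied from the example queries above rather than from any official prompt list.

```python
# Query phrasings copied from the examples on this page; the dictionary keys
# are hypothetical labels chosen for this sketch.
PROPERTY_TEMPLATES = {
    "solubility": "How soluble is <SMILES> {smiles} </SMILES>?",
    "logd": "Predict the logD for <SMILES> {smiles} </SMILES>.",
    "bbbp": "Is BBBP a property of <SMILES> {smiles} </SMILES>?",
    "toxicity": "Is <SMILES> {smiles} </SMILES> toxic?",
    "hiv": "Can <SMILES> {smiles} </SMILES> inhibit HIV?",
    "side_effects": (
        "Are there side effects of <SMILES> {smiles} </SMILES> "
        "affecting the {organ}?"
    ),
}


def build_property_query(task: str, smiles: str, organ: str = "heart") -> str:
    """Fill the template for `task`; `organ` is only used by "side_effects"."""
    try:
        template = PROPERTY_TEMPLATES[task]
    except KeyError:
        raise ValueError(f"Unknown property task: {task!r}") from None
    return template.format(smiles=smiles.strip(), organ=organ)


query = build_property_query("solubility", "CC(C)Cl")
# query == "How soluble is <SMILES> CC(C)Cl </SMILES>?"
```

The resulting string can be passed to generator.generate(query) exactly as in the examples above.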

3. Molecule Description (2 tasks)

Generates or interprets molecular descriptions:

Molecule Captioning

Describes a molecule from its SMILES:
query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
response = generator.generate(query)
# Output: "The molecule is an imidazole derivative with short-acting sedative, 
#          hypnotic, and general anesthetic properties. Etomidate appears to have 
#          gamma-aminobutyric acid (GABA) like effects..."
See LLM4Chem/README.md:96 for examples.

Molecule Generation

Generates SMILES from a text description:
query = """Give me a molecule that satisfies: The molecule is a member of the class of 
tripyrroles that is a red-coloured pigment with antibiotic properties produced by Serratia 
marcescens. It has a role as an antimicrobial agent..."""

response = generator.generate(query)
# Output: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"
For molecule generation, tags are not required in the input description.
See LLM4Chem/README.md:103 for examples.

4. Chemical Reactions (2 tasks)

Predicts reaction products or reactants:
Predicts products from reactants:
query = "<SMILES> NC1=CC=C2OCOC2=C1.O=CO </SMILES> Based on the reactants and reagents given above, suggest a possible product."
response = generator.generate(query)
# Output: "A possible product can be <SMILES> O=CNC1=CC=C2OCOC2=C1 </SMILES>."
See LLM4Chem/README.md:115 for examples.

SMolInstruct Dataset

LlaSMol models are trained on SMolInstruct, a large-scale chemistry instruction dataset:

Key Features

  • Scale: More than three million instruction-response pairs
  • Coverage: All 14 chemistry tasks with balanced distribution
  • Quality: Curated and validated chemical data
  • Format: Instruction-tuning format with special tags

Training Details

Fine-tuning uses LoRA (Low-Rank Adaptation):
MODELNAME=LlaSMol-Mistral-7B
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch finetune.py \
  --data_path osunlp/SMolInstruct \
  --base_model mistralai/Mistral-7B-v0.1 \
  --output_dir checkpoint/$MODELNAME
See LLM4Chem/README.md:131 for training instructions.

Tag System

LlaSMol uses specialized tags to structure chemistry information:

Input Tags

<SMILES> (string)
Wraps SMILES representations in queries.
Example: <SMILES> CC(C)Cl </SMILES>

<IUPAC> (string)
Wraps IUPAC names in queries.
Example: <IUPAC> 2-acetyloxybenzoic acid </IUPAC> (the IUPAC name of aspirin)

Output Tags

<MOLFORMULA> (string)
Molecular formula in model responses.
Example: <MOLFORMULA> C9H8O4 </MOLFORMULA>

<NUMBER> (string)
Numerical predictions (solubility, logD, etc.).
Example: <NUMBER> -1.41 </NUMBER>

<BOOLEAN> (string)
Yes/No predictions (toxicity, BBB, etc.).
Example: <BOOLEAN> Yes </BOOLEAN>
See LLM4Chem/README.md:157 for complete tag documentation.
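Because responses embed their values in tags, downstream code usually needs to pull those values out. The sketch below does this with a regular expression. It is an assumption-laden helper, not part of LLM4Chem: extract_tags, first_number, and first_boolean are hypothetical names, and the tag list covers only the five tags documented in this section.

```python
import re
from typing import Dict, List, Optional

# Matches <TAG> value </TAG> for the five tags listed in this section.
TAG_PATTERN = re.compile(
    r"<(SMILES|IUPAC|MOLFORMULA|NUMBER|BOOLEAN)>\s*(.*?)\s*</\1>",
    re.DOTALL,
)


def extract_tags(text: str) -> Dict[str, List[str]]:
    """Group every tagged value in `text` by tag name, in order of appearance."""
    found: Dict[str, List[str]] = {}
    for tag, value in TAG_PATTERN.findall(text):
        found.setdefault(tag, []).append(value)
    return found


def first_number(text: str) -> Optional[float]:
    """Return the first <NUMBER> value as a float, or None if there is none."""
    values = extract_tags(text).get("NUMBER")
    return float(values[0]) if values else None


def first_boolean(text: str) -> Optional[bool]:
    """Map the first <BOOLEAN> value to True/False, or None if there is none."""
    values = extract_tags(text).get("BOOLEAN")
    if not values:
        return None
    return values[0].lower() == "yes"


print(first_number("Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."))  # -1.41
print(first_boolean("<BOOLEAN> No </BOOLEAN>"))  # False
```

Returning None for a missing tag lets the caller distinguish "the model answered No" from "the model produced no tagged answer".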

SMILES Canonicalization

LlaSMol automatically canonicalizes SMILES strings using RDKit:
from rdkit import Chem

def canonicalize_smiles(smiles: str) -> str:
    """Convert SMILES to canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return smiles
    return Chem.MolToSmiles(mol)
Why canonicalization matters:
  • CCO and OCC represent the same molecule (ethanol)
  • Canonical form ensures consistent training and inference
  • Improves model accuracy and reduces ambiguity
Canonicalization happens automatically when SMILES are wrapped in <SMILES> tags.
See LLM4Chem/README.md:168 for details.
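To make the "canonicalize what is inside the tags" idea concrete, the sketch below rewrites every <SMILES> span in a query using a canonicalizer passed in by the caller. It illustrates the concept only and is not LlaSMol's internal code; canonicalize_tagged_smiles is a hypothetical name, and the stand-in canonicalizer in the usage lines exists solely so the example runs without RDKit.

```python
import re
from typing import Callable

SMILES_SPAN = re.compile(r"<SMILES>\s*(.*?)\s*</SMILES>", re.DOTALL)


def canonicalize_tagged_smiles(query: str, canonicalize: Callable[[str], str]) -> str:
    """Apply `canonicalize` to the contents of every <SMILES> ... </SMILES> span."""

    def _replace(match: "re.Match[str]") -> str:
        # A function replacement is inserted literally, so backslashes in
        # SMILES (used for double-bond stereochemistry) are left untouched.
        return f"<SMILES> {canonicalize(match.group(1))} </SMILES>"

    return SMILES_SPAN.sub(_replace, query)


# Stand-in canonicalizer for demonstration; in practice pass the RDKit-based
# canonicalize_smiles function shown above instead.
def fake_canonicalize(smiles: str) -> str:
    return {"OCC": "CCO"}.get(smiles, smiles)


print(canonicalize_tagged_smiles("How soluble is <SMILES> OCC </SMILES>?", fake_canonicalize))
# How soluble is <SMILES> CCO </SMILES>?
```

Text outside the tags is left as written, which matches the rule that only tagged SMILES are canonicalized.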

Usage in ChemAgent

Direct Generation

The answer_chemistry_query tool wraps LlaSMol:
@tool
def answer_chemistry_query(query: str) -> str:
    """Answer a chemistry-related query using LlaSMol."""
    response = generator.generate(query)
    return response[0]["output"][0]
See plan_execute_agent/chem_tools.py:124 for implementation.

Query Format

Queries must be properly tagged before being sent to LlaSMol:
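# 2-acetyloxybenzoic acid is the IUPAC name of aspirin
# ❌ Incorrect - no tags
query = "What is the SMILES for 2-acetyloxybenzoic acid?"

# ✅ Correct - IUPAC name wrapped in <IUPAC> tags
query = "What is the SMILES for <IUPAC> 2-acetyloxybenzoic acid </IUPAC>?"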
The structure_chem_prompt tool handles automatic tagging.
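A cheap pre-flight check can catch untagged or half-tagged queries before they reach the model. The following sketch is a hypothetical helper, not part of structure_chem_prompt; find_tag_problems and its require_tag parameter are names invented for this example. Since molecule-generation prompts do not need tags, the missing-tag check can be switched off.

```python
from typing import List

INPUT_TAGS = ("SMILES", "IUPAC")


def find_tag_problems(query: str, require_tag: bool = True) -> List[str]:
    """Return a list of human-readable tagging problems (empty if none)."""
    problems: List[str] = []
    for tag in INPUT_TAGS:
        opens = query.count(f"<{tag}>")
        closes = query.count(f"</{tag}>")
        if opens != closes:
            problems.append(
                f"unbalanced <{tag}> tags: {opens} opening, {closes} closing"
            )
    # Molecule-generation prompts need no tags; pass require_tag=False for them.
    if require_tag and not any(f"<{tag}>" in query for tag in INPUT_TAGS):
        problems.append("no <SMILES> or <IUPAC> tag found")
    return problems


print(find_tag_problems("How soluble is CC(C)Cl?"))
# ['no <SMILES> or <IUPAC> tag found']
print(find_tag_problems("How soluble is <SMILES> CC(C)Cl </SMILES>?"))
# []
```

The check is purely syntactic: it does not verify that the tagged text is a valid SMILES string or IUPAC name.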

Response Handling

LlaSMol responses are stored for validation and error tracking:
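import plan_execute_agent.llasmol_response as llasmol_response

# Illustrative wrapper (hypothetical name); in ChemAgent this logic lives inside a tool function.
def generate_and_store(query: str) -> str:
    response = generator.generate(query)  # generator: the LlaSMolGeneration instance
    llasmol_response.model_response = response  # Store for later access
    return response[0]["output"][0]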
See plan_execute_agent/chem_tools.py:159 for response management.
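The indexing response[0]["output"][0] raises an exception if the response is empty or has an unexpected shape. A defensive accessor is sketched below. The response layout (a list of dicts with an "output" list) is inferred from that indexing expression alone; whether "output" can actually be empty or missing is an assumption, and first_output is a hypothetical name.

```python
from typing import Any, Optional


def first_output(response: Any) -> Optional[str]:
    """Return response[0]["output"][0], or None if any level is missing or empty."""
    try:
        outputs = response[0]["output"]
    except (IndexError, KeyError, TypeError):
        # Empty list, missing "output" key, or a response that is not indexable.
        return None
    if not outputs:
        return None
    return outputs[0]


print(first_output([{"output": ["<BOOLEAN> Yes </BOOLEAN>"]}]))  # <BOOLEAN> Yes </BOOLEAN>
print(first_output([]))  # None
```

A None result gives the agent a clear signal to retry or report an error instead of failing with an IndexError.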

Performance Characteristics

Accuracy

Performance varies by task:
  • Name conversions: 80-90% exact match
  • Property predictions: 60-85% task-dependent
  • Molecule captioning: Qualitative, high coherence
  • Reaction prediction: 50-70% exact match

Latency

  • Single query: 2-5 seconds on GPU
  • Batch processing: ~1 second per query
  • First load: +10 seconds model initialization

Memory

  • Model size: ~14GB (7B parameters)
  • Peak memory: ~16GB during inference
  • Batch size 1: ~8GB VRAM minimum
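The ~14GB model size follows from simple arithmetic: 7 billion parameters at 2 bytes each (16-bit weights). The sketch below reproduces that estimate for a few common precisions. It counts weights only; activations and other runtime buffers come on top. The function name and the dtype table are illustrative assumptions, and this page does not state which precision ChemAgent actually loads the model in.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16", "int8"):
    print(dtype, weight_memory_gb(7e9, dtype))
# float32 28.0
# float16 14.0
# int8 7.0
```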

Evaluation Pipeline

LlaSMol includes tools for evaluation on SMolInstruct:

Step 1: Generate Responses

python generate_on_dataset.py \
  --model_name osunlp/LlaSMol-Mistral-7B \
  --output_dir eval/LlaSMol-Mistral-7B/output

Step 2: Extract Predictions

python extract_prediction.py \
  --output_dir eval/LlaSMol-Mistral-7B/output \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction

Step 3: Compute Metrics

python compute_metrics.py \
  --prediction_dir eval/LlaSMol-Mistral-7B/prediction
See LLM4Chem/README.md:172 for evaluation documentation.
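The three steps can be chained from one script so that the output and prediction directories stay consistent between steps. This is a sketch under assumptions: it uses only the script names and flags shown above, assumes it runs from the directory containing those scripts, and build_eval_commands and run_eval are hypothetical helper names.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def build_eval_commands(model_name: str, eval_root: str = "eval") -> List[List[str]]:
    """Build the generate, extract, and metrics commands for one model."""
    short_name = model_name.split("/")[-1]  # "osunlp/LlaSMol-Mistral-7B" -> "LlaSMol-Mistral-7B"
    base = PurePosixPath(eval_root) / short_name
    output_dir = str(base / "output")
    prediction_dir = str(base / "prediction")
    return [
        ["python", "generate_on_dataset.py",
         "--model_name", model_name, "--output_dir", output_dir],
        ["python", "extract_prediction.py",
         "--output_dir", output_dir, "--prediction_dir", prediction_dir],
        ["python", "compute_metrics.py",
         "--prediction_dir", prediction_dir],
    ]


def run_eval(model_name: str) -> None:
    """Run the three steps in order, stopping at the first failure."""
    for command in build_eval_commands(model_name):
        subprocess.run(command, check=True)
```

Calling run_eval("osunlp/LlaSMol-Mistral-7B") would issue the same three commands listed in the steps above; check=True makes a failed step raise instead of silently continuing to the next one.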

Limitations

Known Limitations:
  1. Task Scope: Only supports the 14 trained tasks
  2. Complex Queries: May struggle with multi-step reasoning
  3. Novel Compounds: Less accurate for molecules not in training data
  4. Numerical Precision: Property predictions are approximate
  5. Context Length: Limited to standard transformer context window

Next Steps

Chemistry Tags

Learn the tag system in detail

Agent Workflow

See how LlaSMol fits into the workflow
