LlaSMol (Large Language Model for Small Molecules) is a family of fine-tuned models specifically trained for chemistry tasks using the SMolInstruct dataset. ChemAgent uses LlaSMol as its core chemistry reasoning engine.
Overview
LlaSMol models are instruction-tuned on SMolInstruct, a comprehensive dataset covering 14 essential chemistry tasks across 4 major categories.
Model Variants
All LlaSMol models are available on Hugging Face:
- LlaSMol-Mistral-7B (osunlp/LlaSMol-Mistral-7B): recommended, best overall performance
- LlaSMol-Llama2-7B (osunlp/LlaSMol-Llama2-7B): Meta’s Llama2 base
- LlaSMol-CodeLlama-7B (osunlp/LlaSMol-CodeLlama-7B): code-specialized base
- LlaSMol-Galactica-6.7B (osunlp/LlaSMol-Galactica-6.7B): science-focused base
Model Initialization
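A minimal sketch of how the available variants might be represented in code, with the Mistral variant as the default. The `LLASMOL_VARIANTS` table and the `resolve_model_id` helper are hypothetical names used for illustration, not part of ChemAgent or LLM4Chem; only the Hugging Face repository IDs come from this page.

```python
# Hypothetical lookup table; only the repository IDs come from this page.
LLASMOL_VARIANTS = {
    "mistral": "osunlp/LlaSMol-Mistral-7B",
    "llama2": "osunlp/LlaSMol-Llama2-7B",
    "codellama": "osunlp/LlaSMol-CodeLlama-7B",
    "galactica": "osunlp/LlaSMol-Galactica-6.7B",
}
DEFAULT_VARIANT = "mistral"


def resolve_model_id(variant: str = DEFAULT_VARIANT) -> str:
    """Return the Hugging Face repository ID for a variant key."""
    try:
        return LLASMOL_VARIANTS[variant.lower()]
    except KeyError:
        known = ", ".join(sorted(LLASMOL_VARIANTS))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None


print(resolve_model_id())
```

Keeping the IDs in one table means a different base model can be swapped in by changing a single key.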
ChemAgent uses the Mistral variant by default.
Hardware Requirements
Supported Tasks
LlaSMol is trained on 14 chemistry tasks across 4 categories:
1. Name Conversion (4 tasks)
Converts between different molecular representations:
- IUPAC → Formula
- IUPAC → SMILES
- SMILES → Formula
- SMILES → IUPAC
Task: Convert IUPAC name to molecular formula
See LLM4Chem/README.md:23 for more examples.
2. Property Prediction (6 tasks)
Predicts molecular properties from SMILES:
ESOL - Aqueous Solubility
Predicts log solubility, with solubility measured in mol/L. See LLM4Chem/README.md:52 for details.
LIPO - Lipophilicity
Predicts the octanol/water distribution coefficient (logD at pH 7.4). See LLM4Chem/README.md:59 for details.
BBBP - Blood-Brain Barrier
Predicts whether a molecule can penetrate the blood-brain barrier (boolean). See LLM4Chem/README.md:66 for details.
Clintox - Toxicity
Predicts whether a molecule is toxic (boolean). See LLM4Chem/README.md:73 for details.
HIV - HIV Inhibition
Predicts whether a molecule inhibits HIV replication (boolean). See LLM4Chem/README.md:80 for details.
SIDER - Side Effects
Predicts organ-specific side effects (boolean). See LLM4Chem/README.md:87 for details.
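The six property tasks return either a number (ESOL, LIPO) or a Yes/No answer (BBBP, Clintox, HIV, SIDER), wrapped in the <NUMBER> and <BOOLEAN> output tags described under Tag System. The following is a minimal parsing sketch, assuming each response contains one such tag. The `PROPERTY_OUTPUT` table and the `parse_property` helper are hypothetical names, and the sample strings are the tag examples from this page rather than full model responses.

```python
import re

# Hypothetical mapping from task name to the output tag it is expected to use.
PROPERTY_OUTPUT = {
    "ESOL": "NUMBER",
    "LIPO": "NUMBER",
    "BBBP": "BOOLEAN",
    "Clintox": "BOOLEAN",
    "HIV": "BOOLEAN",
    "SIDER": "BOOLEAN",
}


def parse_property(task: str, response: str):
    """Extract a float or bool from a tagged response; None if absent or malformed."""
    tag = PROPERTY_OUTPUT[task]
    match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", response, re.DOTALL)
    if match is None:
        return None
    value = match.group(1)
    if tag == "NUMBER":
        try:
            return float(value)
        except ValueError:
            return None
    answer = value.lower()
    if answer in ("yes", "no"):
        return answer == "yes"
    return None


print(parse_property("ESOL", "<NUMBER> -1.41 </NUMBER>"))  # -1.41
print(parse_property("BBBP", "<BOOLEAN> Yes </BOOLEAN>"))  # True
```

Returning None for a missing or malformed tag lets the caller tell an unparseable response apart from a valid "No".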
3. Molecule Description (2 tasks)
Generates or interprets molecular descriptions:
Molecule Captioning
Describes a molecule from its SMILES.
Molecule Generation
Generates SMILES from a text description. For molecule generation, tags are not required in the input description.
4. Chemical Reactions (2 tasks)
Predicts reaction products or reactants:
- Forward Synthesis: predicts products from reactants
- Retrosynthesis: predicts reactants from a product
See LLM4Chem/README.md:115 for examples.
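In SMILES, a dot separates disconnected components, so several reactants or products can share one string. The sketch below composes a forward-synthesis query on that basis. The instruction wording and the helper names (`join_components`, `split_components`, `build_forward_query`) are illustrative assumptions, not the templates used by SMolInstruct.

```python
def join_components(smiles_list):
    """Join individual molecule SMILES into one dot-separated SMILES string."""
    cleaned = [s.strip() for s in smiles_list if s.strip()]
    if not cleaned:
        raise ValueError("at least one SMILES string is required")
    return ".".join(cleaned)


def split_components(smiles):
    """Split a dot-separated SMILES string back into its components."""
    return [part for part in smiles.strip().split(".") if part]


def build_forward_query(reactants):
    """Wrap the reactants in <SMILES> tags and append a hypothetical instruction."""
    joined = join_components(reactants)
    return f"<SMILES> {joined} </SMILES> What product do these reactants form?"


print(build_forward_query(["CCO", "CC(=O)O"]))
```

The same `split_components` helper could be applied to a retrosynthesis answer to recover the individual predicted reactants.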
SMolInstruct Dataset
LlaSMol models are trained on SMolInstruct, a large-scale chemistry instruction dataset.
Key Features
- Scale: Millions of instruction-response pairs
- Coverage: All 14 chemistry tasks with balanced distribution
- Quality: Curated and validated chemical data
- Format: Instruction-tuning format with special tags
Training Details
Fine-tuning uses LoRA (Low-Rank Adaptation).
Tag System
LlaSMol uses specialized tags to structure chemistry information:
Input Tags
<SMILES>: wraps SMILES representations in queries. Example:
<SMILES> CC(C)Cl </SMILES>
<IUPAC>: wraps IUPAC names in queries. Example:
<IUPAC> aspirin </IUPAC>
Output Tags
<MOLFORMULA>: molecular formula in model responses. Example:
<MOLFORMULA> C9H8O4 </MOLFORMULA>
<NUMBER>: numerical predictions (solubility, logD, etc.). Example:
<NUMBER> -1.41 </NUMBER>
<BOOLEAN>: Yes/No predictions (toxicity, BBB, etc.). Example:
<BOOLEAN> Yes </BOOLEAN>
SMILES Canonicalization
LlaSMol automatically canonicalizes SMILES strings using RDKit:
- CCO and OCC represent the same molecule (ethanol)
- Canonical form ensures consistent training and inference
- Improves model accuracy and reduces ambiguity
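The equivalence of CCO and OCC can be checked directly with RDKit. This is a minimal sketch using RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`; it illustrates canonicalization in general and is not taken from LlaSMol's own preprocessing code. The `canonicalize` helper name is hypothetical, and the import is guarded so the snippet still loads where RDKit is not installed.

```python
try:
    from rdkit import Chem
except ImportError:  # RDKit is treated as an optional dependency in this sketch
    Chem = None


def canonicalize(smiles: str) -> str:
    """Return RDKit's canonical SMILES for the input, or raise on invalid input."""
    if Chem is None:
        raise RuntimeError("RDKit is required for canonicalization")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when it cannot parse the string
        raise ValueError(f"Invalid SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


if Chem is not None:
    # Both spellings of ethanol map to the same canonical string.
    print(canonicalize("CCO") == canonicalize("OCC"))
```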
Canonicalization happens automatically when SMILES are wrapped in <SMILES> tags.
Usage in ChemAgent
Direct Generation
The answer_chemistry_query tool wraps LlaSMol.
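The source of `answer_chemistry_query` is not reproduced on this page, so the following is only a sketch of what such a wrapper could look like. It assumes a generator object exposing a `generate(query)` method that returns text. That interface, the `EchoGenerator` stand-in, the `answer_chemistry_query_sketch` name, and the returned dictionary layout are all hypothetical.

```python
from typing import Protocol


class Generator(Protocol):
    """Assumed interface: any object with generate(query) -> str."""

    def generate(self, query: str) -> str: ...


class EchoGenerator:
    """Stand-in used here so the sketch runs without loading a model."""

    def generate(self, query: str) -> str:
        return f"(model output for: {query})"


def answer_chemistry_query_sketch(query: str, generator: Generator) -> dict:
    """Send a tagged query to the generator and report success or failure."""
    if not query.strip():
        return {"ok": False, "query": query, "response": None, "error": "empty query"}
    try:
        response = generator.generate(query)
    except Exception as exc:  # report model failures to the caller instead of crashing
        return {"ok": False, "query": query, "response": None, "error": str(exc)}
    return {"ok": True, "query": query, "response": response, "error": None}


result = answer_chemistry_query_sketch(
    "How soluble is <SMILES> CC(C)Cl </SMILES> ?", EchoGenerator()
)
print(result["ok"], result["response"])
```

Passing the generator in as an argument keeps the wrapper testable without a GPU, since any object with a matching `generate` method can stand in for the model.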
Query Format
Queries must be properly tagged before being sent to LlaSMol. The structure_chem_prompt tool handles automatic tagging.
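As an illustration of what tagging involves, here is a minimal sketch that wraps values in the input tags described under Tag System. It is not the `structure_chem_prompt` implementation; the `wrap` helper, the `INPUT_TAGS` set, and the question wording are hypothetical.

```python
INPUT_TAGS = {"SMILES", "IUPAC"}


def wrap(tag: str, value: str) -> str:
    """Wrap a value in an input tag, e.g. wrap("SMILES", "CCO")."""
    tag = tag.upper()
    if tag not in INPUT_TAGS:
        raise ValueError(f"Unsupported input tag: {tag}")
    return f"<{tag}> {value.strip()} </{tag}>"


query = f"What is the molecular formula of {wrap('IUPAC', 'aspirin')} ?"
print(query)
```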
Response Handling
LlaSMol responses are stored for validation and error tracking.
Performance Characteristics
Performance varies by task.
Accuracy:
- Name conversions: 80-90% exact match
- Property predictions: 60-85%, depending on the task
- Molecule captioning: qualitative, high coherence
- Reaction prediction: 50-70% exact match
Speed:
- Single query: 2-5 seconds on GPU
- Batch processing: ~1 second per query
- First load: +10 seconds for model initialization
Resources:
- Model size: ~14GB (7B parameters)
- Peak memory: ~16GB during inference
- Batch size 1: ~8GB VRAM minimum
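The ~14GB model size is consistent with storing 7 billion parameters at 2 bytes each (16-bit weights). A back-of-the-envelope sketch of that arithmetic follows. Reading the ~8GB minimum as 8-bit weights is an assumption, not something this page states, and the estimate covers weights only, ignoring activations and the KV cache.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9  # 7B parameters

print(weight_memory_gb(N_PARAMS, 2))  # 16-bit weights: 14.0
print(weight_memory_gb(N_PARAMS, 1))  # 8-bit weights (assumed quantization): 7.0
```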
Evaluation Pipeline
LlaSMol includes tools for evaluation on SMolInstruct:
Step 1: Generate Responses
Step 2: Extract Predictions
Step 3: Compute Metrics
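For tasks scored by exact match, this final step reduces to comparing each extracted prediction with its reference. A minimal sketch of that computation follows. The `exact_match_accuracy` helper is hypothetical and is not the metric code shipped with LLM4Chem, which may normalize answers differently (for example, by canonicalizing SMILES first).

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace.

    A prediction of None (nothing could be extracted) counts as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred is not None and pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


preds = ["C9H8O4", None, "CCO ", "OCC"]
refs = ["C9H8O4", "C2H6O", "CCO", "CCO"]
print(exact_match_accuracy(preds, refs))  # 0.5
```

In this toy data the last pair is the same molecule written two ways, which plain string comparison scores as a miss; that is why SMILES answers are usually canonicalized before matching.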
Limitations
Next Steps
- Chemistry Tags: learn the tag system in detail
- Agent Workflow: see how LlaSMol fits into the workflow