ChemAgent uses a specialized tag system to precisely identify and structure chemical information in queries. Understanding these tags is essential for effective interaction with the system.
Tags serve three critical purposes:
Precision: Clearly identify which strings are chemical entities and which are natural language
Context: Inform the LlaSMol model about the type of chemical information provided
Parsing: Enable automatic SMILES canonicalization and validation
Two tags are used to wrap chemical information in queries:
<SMILES> Tag
Wraps SMILES (Simplified Molecular Input Line Entry System) strings:
Usage:
query = "What is the molecular formula of <SMILES> CC(C)Cl </SMILES>?"
Characteristics:
Triggers automatic canonicalization via RDKit
Must contain valid SMILES syntax
Can represent any molecular structure
See plan_execute_agent/chem_tools.py:26 for SMILES handling.
<IUPAC> Tag
Wraps IUPAC (International Union of Pure and Applied Chemistry) names:
Usage:
query = "Provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."
Characteristics:
Accepts systematic IUPAC names
Also accepts common names (aspirin, caffeine, etc.)
Whitespace and capitalization are preserved
See plan_execute_agent/chem_tools.py:22 for IUPAC handling.
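When the type of each chemical entity is already known, the two input tags can be applied programmatically. The following is a minimal sketch, not ChemAgent code: `wrap_tag` and `INPUT_TAGS` are hypothetical names, and the single-space padding simply mirrors the usage examples above.

```python
# Hypothetical helper (not part of ChemAgent): wrap an entity in an input tag.
INPUT_TAGS = ("SMILES", "IUPAC")


def wrap_tag(tag: str, content: str) -> str:
    """Wrap content in <TAG> ... </TAG>, padded with single spaces."""
    if tag not in INPUT_TAGS:
        raise ValueError(f"unsupported input tag: {tag}")
    return f"<{tag}> {content.strip()} </{tag}>"


# Build a query around a known SMILES string.
query = f"What is the molecular formula of {wrap_tag('SMILES', 'CC(C)Cl')}?"
print(query)
```

Rejecting any tag other than the two input tags keeps output-only tags such as `<NUMBER>` from leaking into queries by mistake.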
Three tags appear in model responses from LlaSMol:
<MOLFORMULA> Tag
Wraps molecular formulas:
<MOLFORMULA> C9H8O4 </MOLFORMULA>
Example Response:
Query: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
Response: "<MOLFORMULA> C9H8O4 </MOLFORMULA>"
See LLM4Chem/README.md:27 for examples.
<NUMBER> Tag
Wraps numerical predictions (solubility, logD, etc.):
<NUMBER> -1.41 </NUMBER>
Example Response:
Query: "How soluble is <SMILES> CC(C)Cl </SMILES>?"
Response: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
See LLM4Chem/README.md:56 for examples.
<BOOLEAN> Tag
Wraps yes/no predictions (toxicity, BBB permeability, etc.):
<BOOLEAN> Yes </BOOLEAN>
or
<BOOLEAN> No </BOOLEAN>
Example Response:
Query: "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
Response: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:77 for examples.
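Because each output tag implies a value type, a response can be converted into a Python value once the tag is recognized. The sketch below is an illustration under stated assumptions: `parse_output_tag` is a hypothetical name, only the first tag in a response is read, and `<BOOLEAN>` content is assumed to be "Yes" or "No" as in the examples above.

```python
import re
from typing import Optional, Union

# Matches any of the three output tags; captures the tag name and its content.
OUTPUT_TAG_RE = re.compile(r"<(MOLFORMULA|NUMBER|BOOLEAN)>\s*(.*?)\s*</\1>", re.DOTALL)


def parse_output_tag(response: str) -> Optional[Union[str, float, bool]]:
    """Return the first tagged value in a response, converted to a Python type."""
    match = OUTPUT_TAG_RE.search(response)
    if match is None:
        return None  # No recognized output tag in the response
    tag, content = match.group(1), match.group(2)
    if tag == "NUMBER":
        return float(content)  # Raises ValueError if the content is not numeric
    if tag == "BOOLEAN":
        return content.lower() == "yes"
    return content  # MOLFORMULA is kept as a string


print(parse_output_tag("Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."))
print(parse_output_tag("<BOOLEAN> No </BOOLEAN>"))
print(parse_output_tag("<MOLFORMULA> C9H8O4 </MOLFORMULA>"))
```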
Automatic Tagging
ChemAgent includes the structure_chem_prompt tool that automatically adds tags to unstructured queries:
How It Works
import json

@tool
def structure_chem_prompt(original_prompt):
    """Structure and tag IUPAC or SMILES chemical information."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": SYSTEM_TAG_PROMPT,
        }, {
            "role": "user",
            "content": f"Structure the input query: {original_prompt}",
        }],
        response_format=StructuredPrompt,
    )
    # Assumption: the excerpt omits this step; decode the JSON reply into a dict.
    simplified_prompt = json.loads(response.choices[0].message.content)
    return {"new_prompt": simplified_prompt.get("new_prompt")}
See plan_execute_agent/chem_tools.py:57 for implementation.
System Prompt
The tagging tool uses a detailed system prompt:
SYSTEM_TAG_PROMPT = """
You are an EXPERT chemical information tagger. Your task is to format the input query
based on the information below. You MUST return ONLY the formatted input query!
When processing chemical information, use only two tags in the input query:
<SMILES> for SMILES representations and <IUPAC> for IUPAC names.
Tag Definitions:
SMILES: <SMILES> ... </SMILES> for chemical structure in SMILES notation.
IUPAC: <IUPAC> ... </IUPAC> for the IUPAC name of the compound.
Instructions:
1. In the input query, use only the <SMILES> and <IUPAC> tags to wrap the appropriate information.
2. Ensure no extra characters or spaces are present within the tags.
"""
See plan_execute_agent/chem_tools.py:7 for the complete prompt.
Tagging Examples
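Input - No Tags
"What is the molecular formula of aspirin?"
Output - Tagged
"What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
Input - Mixed
"Can you tell me the IUPAC name of C1CCOC1?"
Output - Tagged
"Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES>?"
Input - Complex
"What is the molecular formula of 2,5-diphenyl-1,3-oxazole and what is the name of C1CCOC1?"
Output - Tagged
"What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> and what is the name of <SMILES> C1CCOC1 </SMILES>?"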
SMILES Canonicalization
When SMILES are wrapped in tags, they are automatically canonicalized using RDKit:
What is Canonicalization?
SMILES strings can represent the same molecule in multiple ways:
# All represent ethanol
"CCO" # Start from carbon
"OCC" # Start from oxygen
"C(O)C" # Explicit branching
Canonicalization converts all variants to a standard form:
from rdkit import Chem
mol = Chem.MolFromSmiles("OCC")
canonical = Chem.MolToSmiles(mol)
print(canonical)  # Output: CCO
Automatic Process
The LlaSMol generation pipeline handles this automatically:
import re
from rdkit import Chem

def canonicalize_smiles_in_text(text: str) -> str:
    """Find <SMILES> tags and canonicalize their contents."""
    pattern = r'<SMILES>\s*([^<]+)\s*</SMILES>'

    def replace_smiles(match):
        smiles = match.group(1).strip()
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return match.group(0)  # Keep original if invalid
        canonical = Chem.MolToSmiles(mol)
        return f"<SMILES> {canonical} </SMILES>"

    return re.sub(pattern, replace_smiles, text)
See LLM4Chem/utils/smiles_canonicalization.py for implementation.
Benefits
The model receives standardized input regardless of how users write SMILES
Training and inference use the same canonical form, improving predictions
Canonicalization fails for invalid SMILES, providing early error detection
Here are complete examples showing correct tag usage:
Name Conversion Queries
IUPAC to SMILES
query = "Please provide the SMILES representation for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."
# Expected response: "Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES>."
SMILES to IUPAC
query = "Translate <SMILES> CCC(C)C1CNCCCNC1 </SMILES> to its IUPAC name."
# Expected response: "<IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>"
SMILES to Formula
query = "What is the molecular formula for <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>?"
# Expected response: "It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA>."
IUPAC to Formula
query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"
# Expected response: "<MOLFORMULA> C15H11NO </MOLFORMULA>"
Property Prediction Queries
Solubility
query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
# Expected response: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
Toxicity
query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
# Expected response: "<BOOLEAN> No </BOOLEAN>"
BBB Permeability
query = "Is blood-brain barrier permeability a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"
# Expected response: "<BOOLEAN> Yes </BOOLEAN>"
Molecule Description Queries
query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
# Expected response: "The molecule is an imidazole derivative with short-acting sedative, hypnotic,
# and general anesthetic properties..."
query = """Give me a molecule that satisfies: The molecule is a red-coloured pigment
with antibiotic properties. It has a role as an antimicrobial agent."""
# No tags needed in description for molecule generation
# Expected response: "Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"
Tag Validation
The system validates tags at multiple stages:
1. Structure Validation
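GPT-4o's structured output is constrained to a single string field. The schema itself does not check the tags, so their well-formedness depends on the model following the system prompt:
from pydantic import BaseModel

class StructuredPrompt(BaseModel):
    new_prompt: str  # Expected to contain properly closed tags (not enforced by the type)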
See plan_execute_agent/chem_tools.py:51 for validation.
2. SMILES Validation
RDKit validates SMILES content:
@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate SMILES with detailed error reporting."""
    parsing_details = parse_smiles(smiles_string)
    return {
        "valid": parsing_details["valid"],
        "error_message": parsing_details["validity_vector"],
    }
See plan_execute_agent/chem_tools.py:180 for validation logic.
The system extracts content between tags in responses:
import re

def extract_smiles(response: str) -> str:
    """Extract SMILES from tagged response."""
    match = re.search(r'<SMILES>\s*([^<]+)\s*</SMILES>', response)
    return match.group(1).strip() if match else ""
See LLM4Chem/extract_prediction.py for extraction logic.
Common Mistakes
Avoid These Common Errors:
❌ Incorrect: Untagged Name
query = "What is the SMILES for aspirin?"
# LlaSMol won't know "aspirin" is a chemical name
✅ Correct: Tagged IUPAC
query = "What is the SMILES for <IUPAC> aspirin </IUPAC>?"
❌ Incorrect: Malformed Tags
query = "Formula of < SMILES > CCO < /SMILES >?"
# Tag parser will fail
✅ Correct: Well-Formed Tags
query = "Formula of <SMILES> CCO </SMILES>?"
❌ Incorrect: Wrong Tag Type
query = "What is the IUPAC name of <IUPAC> CCO </IUPAC>?"
# CCO is SMILES, not IUPAC
✅ Correct: Proper Tag Type
query = "What is the IUPAC name of <SMILES> CCO </SMILES>?"
❌ Incorrect: Unclosed Tag
query = "Is <SMILES> CC(C)Cl toxic?"
# Missing closing tag
✅ Correct: Closed Tag
query = "Is <SMILES> CC(C)Cl </SMILES> toxic?"
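Two of the mistakes above, tags written with stray spaces and tags that are never closed, can be caught mechanically before a query reaches the model. The following is a rough sketch rather than ChemAgent's own validation: `lint_tags` and `KNOWN_TAGS` are hypothetical names, and the check only covers the two input tags.

```python
import re
from typing import List

KNOWN_TAGS = ("SMILES", "IUPAC")


def lint_tags(query: str) -> List[str]:
    """Return a list of human-readable problems found in a tagged query."""
    problems = []
    for name in KNOWN_TAGS:
        # Find anything that looks like this tag, allowing stray whitespace.
        loose = re.compile(rf"<\s*/?\s*{name}\s*>")
        for match in loose.finditer(query):
            if match.group(0) not in (f"<{name}>", f"</{name}>"):
                problems.append(f"malformed tag {match.group(0)!r}")
        # Compare the number of well-formed opening and closing tags.
        opens = query.count(f"<{name}>")
        closes = query.count(f"</{name}>")
        if opens != closes:
            problems.append(f"{name}: {opens} opening vs {closes} closing tags")
    return problems


print(lint_tags("Formula of < SMILES > CCO < /SMILES >?"))
print(lint_tags("Is <SMILES> CC(C)Cl toxic?"))
print(lint_tags("Is <SMILES> CC(C)Cl </SMILES> toxic?"))
```

A check like this cannot detect a wrong tag type (a SMILES string inside `<IUPAC>`); that requires inspecting the content itself.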
Best Practices
Use Automatic Tagging
Let structure_chem_prompt handle tagging when possible:
# Agent automatically calls:
structured = structure_chem_prompt("What is aspirin's formula?")
# Result: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
Verify Tag Types
Ensure chemical entities use the correct tag:
Use <SMILES> for: CCO, c1ccccc1, CC(=O)O
Use <IUPAC> for: ethanol, benzene, acetic acid
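A cheap first-pass check for the tag type is to test whether a string consists entirely of SMILES tokens. The sketch below is a heuristic under stated assumptions, not a SMILES parser: `guess_tag` is a hypothetical name, the token set is deliberately simplified, and short names made up only of element letters can be misclassified, so RDKit validation should still have the final say.

```python
import re

# One simplified SMILES token: bracket atom, two-letter halogen, organic-subset
# atom, aromatic atom, two-digit ring closure, digit, bond, branch, or dot.
SMILES_TOKEN = r"(?:\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[-=#:/\\().])"
SMILES_RE = re.compile(rf"{SMILES_TOKEN}+")


def guess_tag(entity: str) -> str:
    """Guess "SMILES" or "IUPAC" for an entity string (heuristic only)."""
    return "SMILES" if SMILES_RE.fullmatch(entity.strip()) else "IUPAC"


for entity in ["CCO", "c1ccccc1", "CC(=O)O", "ethanol", "benzene", "acetic acid"]:
    print(f"{entity!r} -> <{guess_tag(entity)}>")
```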
Validate SMILES
Always validate SMILES outputs:
response = answer_chemistry_query(query)
validation = validate_smiles_rdkit(extract_smiles(response))
if not validation["valid"]:
    ...  # Handle error and replan
Handle Edge Cases
Common names (aspirin) → Use <IUPAC>
Systematic names (2-acetoxybenzoic acid) → Use <IUPAC>
Abbreviations (EtOH) → Convert to SMILES first
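For abbreviations, one simple approach is a lookup table that swaps each known abbreviation for a tagged SMILES string before the query is sent. This is an illustrative sketch only: `ABBREVIATION_SMILES` and `expand_abbreviations` are hypothetical names, and the four entries are examples rather than a list ChemAgent ships with.

```python
import re

# Illustrative table only; extend it with the abbreviations your queries use.
ABBREVIATION_SMILES = {
    "EtOH": "CCO",      # ethanol
    "MeOH": "CO",       # methanol
    "THF": "C1CCOC1",   # tetrahydrofuran
    "DMSO": "CS(C)=O",  # dimethyl sulfoxide
}

# Match any known abbreviation as a whole word.
ABBREVIATION_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATION_SMILES)) + r")\b"
)


def expand_abbreviations(query: str) -> str:
    """Replace known abbreviations with <SMILES>-tagged structures."""
    return ABBREVIATION_RE.sub(
        lambda m: f"<SMILES> {ABBREVIATION_SMILES[m.group(1)]} </SMILES>", query
    )


print(expand_abbreviations("How soluble is EtOH?"))
```

Matching is case-sensitive and whole-word, so ordinary words that merely contain an abbreviation are left untouched.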
Tag System in Context
The complete tag workflow in ChemAgent:
Next Steps
LlaSMol Model: Learn about the model that uses these tags
Agent Workflow: See how tags fit into the agent cycle