ChemAgent uses a specialized tag system to precisely identify and structure chemical information in queries. Understanding these tags is essential for effective interaction with the system.

Why Tags Matter

Tags serve three critical purposes:

Precision

Clearly identify which strings are chemical entities vs. natural language

Context

Inform the LlaSMol model about the type of chemical information provided

Parsing

Enable automatic SMILES canonicalization and validation

Input Tags

Two tags are used to wrap chemical information in queries:

<SMILES> Tag

Wraps SMILES (Simplified Molecular Input Line Entry System) strings:
<SMILES> CCO </SMILES>
Usage:
query = "What is the molecular formula of <SMILES> CC(C)Cl </SMILES>?"
Characteristics:
  • Triggers automatic canonicalization via RDKit
  • Must contain valid SMILES syntax
  • Can represent any molecular structure
See plan_execute_agent/chem_tools.py:26 for SMILES handling.

<IUPAC> Tag

Wraps IUPAC (International Union of Pure and Applied Chemistry) names:
<IUPAC> aspirin </IUPAC>
Usage:
query = "Provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."
Characteristics:
  • Accepts systematic IUPAC names
  • Also accepts common names (aspirin, caffeine, etc.)
  • Whitespace and capitalization are preserved
See plan_execute_agent/chem_tools.py:22 for IUPAC handling.
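Building tagged queries by hand is easy to get slightly wrong, so a small helper can apply the two input tags consistently. The following is a minimal sketch and not part of ChemAgent: the names `wrap`, `tag_smiles`, and `tag_iupac` are hypothetical, and the single-space padding simply mirrors the examples above.

```python
def wrap(tag: str, value: str) -> str:
    """Wrap a value as '<TAG> value </TAG>' with single-space padding."""
    return f"<{tag}> {value.strip()} </{tag}>"


def tag_smiles(smiles: str) -> str:
    """Hypothetical helper: wrap a SMILES string in <SMILES> tags."""
    return wrap("SMILES", smiles)


def tag_iupac(name: str) -> str:
    """Hypothetical helper: wrap an IUPAC or common name in <IUPAC> tags."""
    return wrap("IUPAC", name)


query = f"What is the molecular formula of {tag_smiles('CC(C)Cl')}?"
print(query)
```

The helper only formats text; it does not check that the SMILES is valid or that the name is a real compound.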

Output Tags

Besides <SMILES>, which also appears in answers (for example, in name conversion), three further tags appear in LlaSMol responses:

<MOLFORMULA> Tag

Wraps molecular formulas:
<MOLFORMULA> C9H8O4 </MOLFORMULA>
Example Response:
Query: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
Response: "<MOLFORMULA> C9H8O4 </MOLFORMULA>"
See LLM4Chem/README.md:27 for examples.

<NUMBER> Tag

Wraps numerical predictions (solubility, logD, etc.):
<NUMBER> -1.41 </NUMBER>
Example Response:
Query: "How soluble is <SMILES> CC(C)Cl </SMILES>?"
Response: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."
See LLM4Chem/README.md:56 for examples.

<BOOLEAN> Tag

Wraps yes/no predictions (toxicity, BBB permeability, etc.):
<BOOLEAN> Yes </BOOLEAN>
or
<BOOLEAN> No </BOOLEAN>
Example Response:
Query: "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
Response: "<BOOLEAN> No </BOOLEAN>"
See LLM4Chem/README.md:77 for examples.
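Downstream code usually wants typed values rather than tagged strings. The sketch below shows one way to convert the first output tag in a response into a Python value. The function name `parse_output` is hypothetical, and treating any value other than "yes" as `False` is a simplifying assumption rather than documented LlaSMol behavior.

```python
import re
from typing import Optional, Union

# Matches the first <MOLFORMULA>, <NUMBER>, or <BOOLEAN> tag and its content.
_OUTPUT_TAG = re.compile(r"<(MOLFORMULA|NUMBER|BOOLEAN)>\s*([^<]+?)\s*</\1>")


def parse_output(response: str) -> Optional[Union[str, float, bool]]:
    """Return the first tagged value as str, float, or bool (None if no tag)."""
    match = _OUTPUT_TAG.search(response)
    if match is None:
        return None
    tag, value = match.group(1), match.group(2)
    if tag == "NUMBER":
        return float(value)  # raises ValueError if the content is not numeric
    if tag == "BOOLEAN":
        return value.lower() == "yes"  # assumption: anything else counts as False
    return value  # MOLFORMULA is returned as plain text


print(parse_output("Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."))
```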

Automatic Tagging

ChemAgent includes the structure_chem_prompt tool that automatically adds tags to unstructured queries:

How It Works

import json

@tool
def structure_chem_prompt(original_prompt):
    """Structure and tag IUPAC or SMILES chemical information."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": SYSTEM_TAG_PROMPT,
        }, {
            "role": "user",
            "content": f"Structure the input query: {original_prompt}",
        }],
        response_format=StructuredPrompt,
    )
    # Assumed step (missing from this excerpt): decode the JSON reply into a dict.
    simplified_prompt = json.loads(response.choices[0].message.content)
    return {"new_prompt": simplified_prompt.get("new_prompt")}
See plan_execute_agent/chem_tools.py:57 for implementation.

System Prompt

The tagging tool uses a detailed system prompt:
SYSTEM_TAG_PROMPT = """
You are an EXPERT chemical information tagger. Your task is to format the input query 
based on the information below. You MUST return ONLY the formatted input query!

When processing chemical information, use only two tags in the input query: 
<SMILES> for SMILES representations and <IUPAC> for IUPAC names.

Tag Definitions:
SMILES: <SMILES> ... </SMILES> for chemical structure in SMILES notation.
IUPAC: <IUPAC> ... </IUPAC> for the IUPAC name of the compound.

Instructions:
1. In the input query, use only the <SMILES> and <IUPAC> tags to wrap the appropriate information.
2. Ensure no extra characters or spaces are present within the tags.
"""
See plan_execute_agent/chem_tools.py:7 for the complete prompt.
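Because the prompt asks the model to return only the formatted query, one cheap sanity check is to confirm that the tagged output differs from the input by tags alone. This is a sketch under stated assumptions: `only_tags_added` is a hypothetical name, and the comparison ignores whitespace because the tagger may add or remove spaces around tags.

```python
import re

# Opening or closing <SMILES> / <IUPAC> tags.
_INPUT_TAG = re.compile(r"</?(?:SMILES|IUPAC)>")


def _squash(text: str) -> str:
    """Drop all whitespace so spacing differences do not matter."""
    return "".join(text.split())


def only_tags_added(original: str, tagged: str) -> bool:
    """True if removing the input tags from `tagged` recovers `original`."""
    return _squash(_INPUT_TAG.sub("", tagged)) == _squash(original)


original = "What is the molecular formula of aspirin?"
tagged = "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
print(only_tags_added(original, tagged))
```

A `False` result suggests the model rephrased the query or added commentary, which may be worth rejecting before the prompt reaches LlaSMol.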

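Tagging Examples

Each untagged input is shown with the tagged form the tool is expected to return:

Input: "What is the molecular formula of aspirin?"
Tagged: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"

Input: "Can you tell me the IUPAC name of C1CCOC1?"
Tagged: "Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES>?"

Input: "What is the molecular formula of 2,5-diphenyl-1,3-oxazole and what is the name of C1CCOC1?"
Tagged: "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> and what is the name of <SMILES> C1CCOC1 </SMILES>?"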

SMILES Canonicalization

When SMILES are wrapped in tags, they are automatically canonicalized using RDKit:

What is Canonicalization?

SMILES strings can represent the same molecule in multiple ways:
# All represent ethanol
"CCO"     # Start from carbon
"OCC"     # Start from oxygen  
"C(O)C"   # Explicit branching
Canonicalization converts all variants to a standard form:
from rdkit import Chem

mol = Chem.MolFromSmiles("OCC")
canonical = Chem.MolToSmiles(mol)
print(canonical)  # Output: CCO

Automatic Process

The LlaSMol generation pipeline handles this automatically:
import re

from rdkit import Chem


def canonicalize_smiles_in_text(text: str) -> str:
    """Find <SMILES> tags and canonicalize their contents."""
    pattern = r'<SMILES>\s*([^<]+)\s*</SMILES>'
    
    def replace_smiles(match):
        smiles = match.group(1).strip()
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return match.group(0)  # Keep original if invalid
        canonical = Chem.MolToSmiles(mol)
        return f"<SMILES> {canonical} </SMILES>"
    
    return re.sub(pattern, replace_smiles, text)
See LLM4Chem/utils/smiles_canonicalization.py for implementation.

Benefits

  • The model receives standardized input regardless of how users write SMILES
  • Training and inference use the same canonical form, which improves predictions
  • Invalid SMILES fail to parse (Chem.MolFromSmiles returns None), so malformed input can be detected early
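Note that canonicalize_smiles_in_text keeps the original text when parsing fails, so an invalid SMILES passes through silently. To surface such cases, a caller could collect them separately. The sketch below is illustrative only: `find_invalid_smiles` is a hypothetical name, and the validity check is injected as a callable so the example runs without RDKit. With RDKit installed, `lambda s: Chem.MolFromSmiles(s) is not None` could be passed instead of the stand-in.

```python
import re
from typing import Callable, List

SMILES_TAG = re.compile(r"<SMILES>\s*([^<]+?)\s*</SMILES>")


def find_invalid_smiles(text: str, is_valid: Callable[[str], bool]) -> List[str]:
    """Return the contents of every <SMILES> tag that fails `is_valid`."""
    return [s for s in SMILES_TAG.findall(text) if not is_valid(s)]


def balanced_parens(smiles: str) -> bool:
    """Stand-in validator for the demo only: checks parentheses balance."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


text = "Compare <SMILES> CC(C)Cl </SMILES> and <SMILES> CC(C </SMILES>."
print(find_invalid_smiles(text, balanced_parens))
```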

Properly Formatted Query Examples

Here are complete examples showing correct tag usage:

Name Conversion Queries

query = "Please provide the SMILES representation for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."
Expected response:
"Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES>."

Property Prediction Queries

query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"
Expected response:
"Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."

Molecule Description Queries

query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"
Expected response:
"The molecule is an imidazole derivative with short-acting sedative, hypnotic, 
and general anesthetic properties..."

Tag Validation

The system validates tags at multiple stages:

1. Structure Validation

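GPT-4o's reply is constrained to a structured-output schema with a single string field. The schema itself does not check tag syntax; correct tagging relies on the system prompt:
from pydantic import BaseModel

class StructuredPrompt(BaseModel):
    new_prompt: str  # Expected to contain properly closed tags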
See plan_execute_agent/chem_tools.py:51 for validation.

2. SMILES Validation

RDKit validates SMILES content:
@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate SMILES with detailed error reporting."""
    parsing_details = parse_smiles(smiles_string)
    return {
        "valid": parsing_details["valid"],
        "error_message": parsing_details["validity_vector"],
    }
See plan_execute_agent/chem_tools.py:180 for validation logic.

3. Output Extraction

The system extracts content between tags in responses:
import re

def extract_smiles(response: str) -> str:
    """Extract SMILES from tagged response."""
    match = re.search(r'<SMILES>\s*([^<]+)\s*</SMILES>', response)
    return match.group(1).strip() if match else ""
See LLM4Chem/extract_prediction.py for extraction logic.
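The function above returns only the first match. When a response may contain several tagged values, or tags other than <SMILES>, a generalized variant could look like the following sketch. The name `extract_all` is hypothetical and is not taken from LLM4Chem.

```python
import re
from typing import List


def extract_all(response: str, tag: str) -> List[str]:
    """Return the stripped content of every <tag> ... </tag> pair, in order."""
    name = re.escape(tag)
    pattern = rf"<{name}>\s*([^<]+?)\s*</{name}>"
    return re.findall(pattern, response)


response = "Options: <SMILES> CCO </SMILES> or <SMILES> C1CCOC1 </SMILES>."
print(extract_all(response, "SMILES"))
```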

Common Mistakes

Avoid These Common Errors:

❌ Incorrect: Missing Tags

query = "What is the SMILES for aspirin?"
# LlaSMol won't know "aspirin" is a chemical name

✅ Correct: Tagged IUPAC

query = "What is the SMILES for <IUPAC> aspirin </IUPAC>?"

❌ Incorrect: Spaces in Tags

query = "Formula of < SMILES > CCO < /SMILES >?"
# Tag parser will fail

✅ Correct: No Extra Spaces

query = "Formula of <SMILES> CCO </SMILES>?"

❌ Incorrect: Wrong Tag Type

query = "What is the IUPAC name of <IUPAC> CCO </IUPAC>?"
# CCO is SMILES, not IUPAC

✅ Correct: Proper Tag Type

query = "What is the IUPAC name of <SMILES> CCO </SMILES>?"

❌ Incorrect: Unclosed Tags

query = "Is <SMILES> CC(C)Cl toxic?"
# Missing closing tag

✅ Correct: Closed Tags

query = "Is <SMILES> CC(C)Cl </SMILES> toxic?"
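Several of these mistakes are purely syntactic and can be caught before a query is sent. The sketch below flags stray spaces inside tags, unknown tag names, and unclosed tags. It is an illustration rather than ChemAgent's own parser, `lint_query` is a hypothetical name, and it cannot detect a wrong tag type, since that requires chemical knowledge.

```python
import re
from typing import List

KNOWN_TAGS = {"SMILES", "IUPAC"}
# Any angle-bracket token, tolerating stray spaces such as "< /SMILES >".
_TOKEN = re.compile(r"<\s*(/?)\s*([A-Za-z]+)\s*>")


def lint_query(query: str) -> List[str]:
    """Return a list of human-readable problems found in the query's tags."""
    problems = []
    open_tag = None  # name of the tag currently awaiting its closing tag
    for m in _TOKEN.finditer(query):
        closing, name = m.group(1) == "/", m.group(2)
        if m.group(0) != f"<{'/' if closing else ''}{name}>":
            problems.append(f"spaces inside tag: {m.group(0)!r}")
        if name not in KNOWN_TAGS:
            problems.append(f"unknown input tag: {name}")
            continue
        if closing:
            if open_tag != name:
                problems.append(f"unexpected closing tag: </{name}>")
            open_tag = None
        else:
            if open_tag is not None:
                problems.append(f"unclosed tag: <{open_tag}>")
            open_tag = name
    if open_tag is not None:
        problems.append(f"unclosed tag: <{open_tag}>")
    return problems


print(lint_query("Is <SMILES> CC(C)Cl toxic?"))
```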

Best Practices

1. Use Automatic Tagging

Let structure_chem_prompt handle tagging when possible:
# Agent automatically calls:
structured = structure_chem_prompt("What is aspirin's formula?")
# Result: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"

2. Verify Tag Types

Ensure chemical entities use the correct tag:
  • Use <SMILES> for: CCO, c1ccccc1, CC(=O)O
  • Use <IUPAC> for: ethanol, benzene, acetic acid

3. Validate SMILES

Always validate SMILES outputs:
response = answer_chemistry_query(query)
validation = validate_smiles_rdkit(extract_smiles(response))
if not validation["valid"]:
    # Handle the error and replan
    pass

4. Handle Edge Cases

  • Common names (aspirin) → Use <IUPAC>
  • Systematic names (2-acetoxybenzoic acid) → Use <IUPAC>
  • Abbreviations (EtOH) → Convert to SMILES first
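The validate-then-replan idea from practice 3 can be sketched as a bounded retry loop. Everything here is an assumption made for illustration: `answer_with_validation` is a hypothetical name, and the model call, extractor, and validator are passed in as callables so the example runs without LlaSMol or RDKit. A real agent would typically replan or rephrase rather than resend the same query.

```python
from typing import Callable, Optional


def answer_with_validation(
    query: str,
    answer: Callable[[str], str],
    extract: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first valid SMILES produced within max_attempts, else None."""
    for _ in range(max_attempts):
        smiles = extract(answer(query))
        if smiles and is_valid(smiles):
            return smiles
    return None


# Stub components for demonstration: the first reply is malformed.
replies = iter(["<SMILES> CC( </SMILES>", "<SMILES> CCO </SMILES>"])
result = answer_with_validation(
    "What is the SMILES for <IUPAC> ethanol </IUPAC>?",
    answer=lambda q: next(replies),
    extract=lambda r: r.split()[1],
    is_valid=lambda s: "(" not in s,
)
print(result)
```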

Tag System in Context

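The complete tag workflow in ChemAgent: an unstructured query is tagged by structure_chem_prompt, SMILES inside tags are canonicalized with RDKit, LlaSMol answers using output tags, and the tagged content is then extracted and validated.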

Next Steps

LlaSMol Model

Learn about the model that uses these tags

Agent Workflow

See how tags fit into the agent cycle
