Chemistry Tags - ChemAgent

ChemAgent uses a specialized tag system to precisely identify and structure chemical information in queries. Understanding these tags is essential for effective interaction with the system.

Why Tags Matter

Tags serve three critical purposes:

Precision: Clearly identify which strings are chemical entities and which are natural language.
Context: Inform the LlaSMol model about the type of chemical information provided.
Parsing: Enable automatic SMILES canonicalization and validation.

Input Tags

Two tags are used to wrap chemical information in queries:

`<SMILES>` Tag

Wraps SMILES (Simplified Molecular Input Line Entry System) strings:

<SMILES> CCO </SMILES>

Usage:

query = "What is the molecular formula of <SMILES> CC(C)Cl </SMILES>?"

Characteristics:

Triggers automatic canonicalization via RDKit
Must contain valid SMILES syntax
Can represent any molecular structure

See plan_execute_agent/chem_tools.py:26 for SMILES handling.

`<IUPAC>` Tag

Wraps IUPAC (International Union of Pure and Applied Chemistry) names, as well as common names:

<IUPAC> aspirin </IUPAC>

Usage:

query = "Provide the SMILES for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."

Characteristics:

Accepts systematic IUPAC names
Also accepts common names (aspirin, caffeine, etc.)
Whitespace and capitalization are preserved

See plan_execute_agent/chem_tools.py:22 for IUPAC handling.
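When queries are assembled in code, a small helper keeps the tag spelling and spacing uniform. The sketch below is illustrative only: wrap_tag, smiles, and iupac are hypothetical names, not part of ChemAgent, and the single space of padding simply mirrors the examples on this page.

```python
def wrap_tag(tag: str, content: str) -> str:
    """Wrap content in an opening and closing tag, padded by single spaces."""
    return f"<{tag}> {content.strip()} </{tag}>"


def smiles(content: str) -> str:
    """Hypothetical helper: wrap a SMILES string in <SMILES> tags."""
    return wrap_tag("SMILES", content)


def iupac(content: str) -> str:
    """Hypothetical helper: wrap a systematic or common name in <IUPAC> tags."""
    return wrap_tag("IUPAC", content)


query = f"What is the molecular formula of {smiles('CC(C)Cl')}?"
print(query)
```

The helper does not check that the content is valid SMILES or a real name; it only handles the wrapping.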

Output Tags

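Three additional tags appear in LlaSMol responses. Responses can also contain <SMILES> and <IUPAC> tags when the answer is itself a structure or a name.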

`<MOLFORMULA>` Tag

Wraps molecular formulas:

<MOLFORMULA> C9H8O4 </MOLFORMULA>

Example Response:

Query: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"
Response: "<MOLFORMULA> C9H8O4 </MOLFORMULA>"

See LLM4Chem/README.md:27 for examples.

`<NUMBER>` Tag

Wraps numerical predictions (solubility, logD, etc.):

<NUMBER> -1.41 </NUMBER>

Example Response:

Query: "How soluble is <SMILES> CC(C)Cl </SMILES>?"
Response: "Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."

See LLM4Chem/README.md:56 for examples.

`<BOOLEAN>` Tag

Wraps yes/no predictions (toxicity, BBB permeability, etc.):

<BOOLEAN> Yes </BOOLEAN>

<BOOLEAN> No </BOOLEAN>

Example Response:

Query: "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"
Response: "<BOOLEAN> No </BOOLEAN>"

See LLM4Chem/README.md:77 for examples.
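Downstream code usually wants typed values rather than tagged strings. The following is a minimal sketch of how the three output tags could be converted; parse_output is a hypothetical helper, and treating any BOOLEAN content other than "Yes" as False is an assumption made for brevity.

```python
import re
from typing import Optional, Tuple, Union

# Matches the first output tag and captures its name and trimmed content.
OUTPUT_TAG = re.compile(r"<(MOLFORMULA|NUMBER|BOOLEAN)>\s*(.*?)\s*</\1>", re.DOTALL)


def parse_output(response: str) -> Optional[Tuple[str, Union[str, float, bool]]]:
    """Return (tag name, typed value) for the first output tag, or None."""
    match = OUTPUT_TAG.search(response)
    if match is None:
        return None
    tag, raw = match.group(1), match.group(2)
    if tag == "NUMBER":
        return tag, float(raw)  # raises ValueError if the content is not numeric
    if tag == "BOOLEAN":
        return tag, raw.lower() == "yes"  # assumption: anything else means False
    return tag, raw  # MOLFORMULA stays a string


print(parse_output("Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."))
```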

Automatic Tagging

ChemAgent includes the structure_chem_prompt tool, which automatically adds tags to unstructured queries.

How It Works

@tool
def structure_chem_prompt(original_prompt):
    """Structure and tag IUPAC or SMILES chemical information."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": SYSTEM_TAG_PROMPT,
        }, {
            "role": "user",
            "content": f"Structure the input query: {original_prompt}",
        }],
        response_format=StructuredPrompt,
    )
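    # Assumption: the structured reply is read from the parsed message,
    # which is a StructuredPrompt instance when response_format is set.
    simplified_prompt = response.choices[0].message.parsed
    return {"new_prompt": simplified_prompt.new_prompt}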

See plan_execute_agent/chem_tools.py:57 for implementation.

System Prompt

The tagging tool uses a detailed system prompt:

SYSTEM_TAG_PROMPT = """
You are an EXPERT chemical information tagger. Your task is to format the input query 
based on the information below. You MUST return ONLY the formatted input query!

When processing chemical information, use only two tags in the input query: 
<SMILES> for SMILES representations and <IUPAC> for IUPAC names.

Tag Definitions:
SMILES: <SMILES> ... </SMILES> for chemical structure in SMILES notation.
IUPAC: <IUPAC> ... </IUPAC> for the IUPAC name of the compound.

Instructions:
1. In the input query, use only the <SMILES> and <IUPAC> tags to wrap the appropriate information.
2. Ensure no extra characters or spaces are present within the tags.
"""

See plan_execute_agent/chem_tools.py:7 for the complete prompt.
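Because the prompt asks the model to return only the formatted query, one way to spot unwanted rewording is to strip the tags from the result and compare it with the original. This is a sketch under stated assumptions: only_tags_added is a hypothetical helper, and whitespace differences are deliberately ignored.

```python
import re

# Opening or closing input tags, together with any padding around them.
INPUT_TAG = re.compile(r"\s*</?(?:SMILES|IUPAC)>\s*")


def _normalize(text: str) -> str:
    """Drop input tags and all whitespace so only the wording is compared."""
    return "".join(INPUT_TAG.sub("", text).split())


def only_tags_added(original: str, tagged: str) -> bool:
    """True if the tagged query differs from the original only by tags and spacing."""
    return _normalize(original) == _normalize(tagged)


print(only_tags_added(
    "What is the molecular formula of aspirin?",
    "What is the molecular formula of <IUPAC> aspirin </IUPAC>?",
))
```

A False result does not necessarily mean the tagging is wrong, only that the wording changed and may deserve a second look.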

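Tagging Examples

Input: "What is the molecular formula of aspirin?"
Tagged: "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"

Input: "Can you tell me the IUPAC name of C1CCOC1?"
Tagged: "Can you tell me the IUPAC name of <SMILES> C1CCOC1 </SMILES>?"

Input: "What is the molecular formula of 2,5-diphenyl-1,3-oxazole and what is the name of C1CCOC1?"
Tagged: "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC> and what is the name of <SMILES> C1CCOC1 </SMILES>?"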

SMILES Canonicalization

SMILES strings wrapped in tags are automatically canonicalized using RDKit.

What is Canonicalization?

SMILES strings can represent the same molecule in multiple ways:

# All represent ethanol
"CCO"     # Start from carbon
"OCC"     # Start from oxygen  
"C(O)C"   # Explicit branching

Canonicalization converts all variants to a standard form:

from rdkit import Chem

mol = Chem.MolFromSmiles("OCC")
canonical = Chem.MolToSmiles(mol)
print(canonical)  # Output: CCO

Automatic Process

The LlaSMol generation pipeline handles this automatically:

import re

from rdkit import Chem


def canonicalize_smiles_in_text(text: str) -> str:
    """Find <SMILES> tags and canonicalize their contents."""
    pattern = r'<SMILES>\s*([^<]+)\s*</SMILES>'
    
    def replace_smiles(match):
        smiles = match.group(1).strip()
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return match.group(0)  # Keep original if invalid
        canonical = Chem.MolToSmiles(mol)
        return f"<SMILES> {canonical} </SMILES>"
    
    return re.sub(pattern, replace_smiles, text)

See LLM4Chem/utils/smiles_canonicalization.py for implementation.
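To unit-test the tag handling without installing RDKit, the canonicalizer can be passed in as a function. The sketch below is a variation on the routine above, not the project's implementation: canonicalize_tagged and toy_canonicalize are hypothetical names, and the lookup table stands in for a real RDKit call.

```python
import re
from typing import Callable, Optional

SMILES_TAG = re.compile(r"<SMILES>\s*([^<]+?)\s*</SMILES>")


def canonicalize_tagged(text: str, canonicalize: Callable[[str], Optional[str]]) -> str:
    """Rewrite every <SMILES> span using the supplied canonicalizer.

    The canonicalizer returns None for input it cannot parse, in which
    case the original span is left untouched.
    """
    def replace(match: "re.Match[str]") -> str:
        canonical = canonicalize(match.group(1))
        if canonical is None:
            return match.group(0)
        return f"<SMILES> {canonical} </SMILES>"

    return SMILES_TAG.sub(replace, text)


# Stand-in for RDKit, covering only the ethanol spellings shown earlier.
ETHANOL_FORMS = {"CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO"}


def toy_canonicalize(smiles: str) -> Optional[str]:
    return ETHANOL_FORMS.get(smiles)


print(canonicalize_tagged("Formula of <SMILES> OCC </SMILES>?", toy_canonicalize))
```

In real use, the callable would wrap Chem.MolFromSmiles and Chem.MolToSmiles and return None when parsing fails.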

Benefits

Consistency: The model receives standardized input regardless of how users write SMILES.
Accuracy: Training and inference use the same canonical form, which improves predictions.
Validation: RDKit cannot parse invalid SMILES (MolFromSmiles returns None), so malformed input is detected early.

Properly Formatted Query Examples

Here are complete examples showing correct tag usage:

Name Conversion Queries

IUPAC to SMILES

query = "Please provide the SMILES representation for <IUPAC> 4-ethyl-4-methyloxolan-2-one </IUPAC>."

Expected response:

"Of course. It's <SMILES> CCC1(C)COC(=O)C1 </SMILES>."

SMILES to IUPAC

query = "Translate <SMILES> CCC(C)C1CNCCCNC1 </SMILES> to its IUPAC name."

Expected response:

"<IUPAC> 3-butan-2-yl-1,5-diazocane </IUPAC>"

SMILES to Formula

query = "What is the molecular formula for <SMILES> S=P1(N(CCCl)CCCl)NCCCO1 </SMILES>?"

Expected response:

"It is <MOLFORMULA> C7H15Cl2N2OPS </MOLFORMULA>."

IUPAC to Formula

query = "What is the molecular formula of <IUPAC> 2,5-diphenyl-1,3-oxazole </IUPAC>?"

Expected response:

"<MOLFORMULA> C15H11NO </MOLFORMULA>"

Property Prediction Queries

Solubility

query = "How soluble is <SMILES> CC(C)Cl </SMILES>?"

Expected response:

"Its log solubility is <NUMBER> -1.41 </NUMBER> mol/L."

Toxicity

query = "Is <SMILES> COC[C@@H](NC(C)=O)C(=O)NCC1=CC=CC=C1 </SMILES> toxic?"

Expected response:

"<BOOLEAN> No </BOOLEAN>"

BBB Permeability

query = "Is blood-brain barrier permeability a property of <SMILES> CCNC(=O)/C=C/C1=CC=CC(Br)=C1 </SMILES>?"

Expected response:

"<BOOLEAN> Yes </BOOLEAN>"

Molecule Description Queries

Captioning

query = "Describe this molecule: <SMILES> CCOC(=O)C1=CN=CN1[C@H](C)C1=CC=CC=C1 </SMILES>"

Expected response:

"The molecule is an imidazole derivative with short-acting sedative, hypnotic, 
and general anesthetic properties..."

Generation

query = """Give me a molecule that satisfies: The molecule is a red-coloured pigment 
with antibiotic properties. It has a role as an antimicrobial agent."""

No tags are needed in the description for molecule generation.

Expected response:

"Here is a potential molecule: <SMILES> CCCCCC1=C(C)NC(/C=C2\N=C(C3=CC=CN3)C=C2OC)=C1 </SMILES>"

Tag Validation

The system validates tags at multiple stages:

1. Structure Validation

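GPT-4o returns the tagged query through a structured response schema. The schema guarantees only that new_prompt is a string; correctly closed tags are requested by the system prompt rather than enforced by the type:

class StructuredPrompt(BaseModel):
    new_prompt: str  # Tagged query; tags should be properly closed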

See plan_execute_agent/chem_tools.py:51 for validation.
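A string-typed field cannot by itself confirm that every tag is closed, so an explicit check can be useful before a query reaches the model. The following is a minimal sketch of such a check; tag_errors is a hypothetical helper, and it assumes input tags are never nested.

```python
import re
from typing import List, Optional

TAG_TOKEN = re.compile(r"<(/?)(SMILES|IUPAC)>")


def tag_errors(prompt: str) -> List[str]:
    """Return a list of tag-pairing problems; an empty list means well-formed."""
    errors: List[str] = []
    open_tag: Optional[str] = None
    for match in TAG_TOKEN.finditer(prompt):
        is_closing, name = match.group(1) == "/", match.group(2)
        if not is_closing:
            if open_tag is not None:
                errors.append(f"<{name}> opened before <{open_tag}> was closed")
            open_tag = name
        elif open_tag is None:
            errors.append(f"</{name}> has no matching opening tag")
        else:
            if open_tag != name:
                errors.append(f"</{name}> closes <{open_tag}>")
            open_tag = None
    if open_tag is not None:
        errors.append(f"<{open_tag}> is never closed")
    return errors


print(tag_errors("Is <SMILES> CC(C)Cl toxic?"))
```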

2. SMILES Validation

RDKit validates SMILES content:

@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate SMILES with detailed error reporting."""
    parsing_details = parse_smiles(smiles_string)
    return {
        "valid": parsing_details["valid"],
        "error_message": parsing_details["validity_vector"],
    }

See plan_execute_agent/chem_tools.py:180 for validation logic.

3. Output Extraction

The system extracts content between tags in responses:

import re

def extract_smiles(response: str) -> str:
    """Extract SMILES from tagged response."""
    match = re.search(r'<SMILES>\s*([^<]+)\s*</SMILES>', response)
    return match.group(1).strip() if match else ""

See LLM4Chem/extract_prediction.py for extraction logic.

Common Mistakes

Avoid These Common Errors:

❌ Incorrect: Missing Tags

query = "What is the SMILES for aspirin?"
# LlaSMol won't know "aspirin" is a chemical name

✅ Correct: Tagged IUPAC

query = "What is the SMILES for <IUPAC> aspirin </IUPAC>?"

❌ Incorrect: Spaces in Tags

query = "Formula of < SMILES > CCO < /SMILES >?"
# Tag parser will fail

✅ Correct: No Extra Spaces

query = "Formula of <SMILES> CCO </SMILES>?"

❌ Incorrect: Wrong Tag Type

query = "What is the IUPAC name of <IUPAC> CCO </IUPAC>?"
# CCO is SMILES, not IUPAC

✅ Correct: Proper Tag Type

query = "What is the IUPAC name of <SMILES> CCO </SMILES>?"

❌ Incorrect: Unclosed Tags

query = "Is <SMILES> CC(C)Cl toxic?"
# Missing closing tag

✅ Correct: Closed Tags

query = "Is <SMILES> CC(C)Cl </SMILES> toxic?"
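Of the mistakes above, stray spaces inside the angle brackets are the easiest to repair mechanically. Here is a minimal sketch of a normalizer; normalize_tags is a hypothetical helper, not part of ChemAgent, and it only touches the two input tags.

```python
import re

# An input tag with optional whitespace after "<", around "/", and before ">".
SPACED_TAG = re.compile(r"<\s*(/?)\s*(SMILES|IUPAC)\s*>")


def normalize_tags(query: str) -> str:
    """Rewrite tags such as '< SMILES >' or '< /SMILES >' to their compact form."""
    return SPACED_TAG.sub(r"<\1\2>", query)


print(normalize_tags("Formula of < SMILES > CCO < /SMILES >?"))
```

Missing or mismatched tags cannot be fixed this way, because that requires knowing which part of the query is the chemical entity.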

Best Practices

Use Automatic Tagging

Let structure_chem_prompt handle tagging when possible:

# Agent automatically calls:
structured = structure_chem_prompt("What is aspirin's formula?")
# Result: {"new_prompt": "What is the molecular formula of <IUPAC> aspirin </IUPAC>?"}

Verify Tag Types

Ensure chemical entities use the correct tag:

Use <SMILES> for: CCO, c1ccccc1, CC(=O)O
Use <IUPAC> for: ethanol, benzene, acetic acid

Validate SMILES

Always validate SMILES outputs:

response = answer_chemistry_query(query)
validation = validate_smiles_rdkit(extract_smiles(response))
if not validation["valid"]:
    # Handle error and replan
    ...
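One simple way to handle a failed validation is to ask again a bounded number of times. The sketch below assumes a plain retry is acceptable, which may differ from how the agent actually replans; first_valid_answer is a hypothetical helper, and answer and is_valid are caller-supplied stand-ins for the model call and the RDKit check.

```python
from typing import Callable, Optional


def first_valid_answer(
    query: str,
    answer: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call answer(query) until is_valid accepts the response or attempts run out."""
    for _ in range(max_attempts):
        response = answer(query)
        if is_valid(response):
            return response
    return None  # caller decides how to replan


# Demonstration with canned replies instead of a real model.
replies = iter(["<SMILES> C1CC </SMILES>", "<SMILES> CCO </SMILES>"])
result = first_valid_answer(
    "Give the SMILES for <IUPAC> ethanol </IUPAC>.",
    answer=lambda q: next(replies),
    is_valid=lambda r: "C1CC" not in r,  # placeholder for a real SMILES check
)
print(result)
```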

Handle Edge Cases

Common names (aspirin) → Use <IUPAC>
Systematic names (2-acetoxybenzoic acid) → Use <IUPAC>
Abbreviations (EtOH) → Convert to SMILES first
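The three rules above can be sketched as a single function. This is illustrative only: tag_name and the small abbreviation table are hypothetical, and a real system would need a far larger mapping or a lookup service.

```python
# Hypothetical abbreviation table; extend as needed.
ABBREVIATION_TO_SMILES = {
    "EtOH": "CCO",      # ethanol
    "MeOH": "CO",       # methanol
    "THF": "C1CCOC1",   # tetrahydrofuran
}


def tag_name(name: str) -> str:
    """Tag a name: known abbreviations become SMILES, everything else IUPAC."""
    cleaned = name.strip()
    smiles = ABBREVIATION_TO_SMILES.get(cleaned)
    if smiles is not None:
        return f"<SMILES> {smiles} </SMILES>"
    return f"<IUPAC> {cleaned} </IUPAC>"


for entity in ("aspirin", "2-acetoxybenzoic acid", "EtOH"):
    print(tag_name(entity))
```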

Tag System in Context

[Diagram: the complete tag workflow in ChemAgent]

Next Steps

LlaSMol Model

Learn about the model that uses these tags

Agent Workflow

See how tags fit into the agent cycle
