ChemAgent provides comprehensive SMILES validation using RDKit and a custom chemistry parser that identifies syntax and semantic errors with character-level precision.

Overview

The validation system detects:

Syntax Errors

Invalid characters, unclosed rings, mismatched parentheses

Semantic Errors

Valence issues, stereochemistry problems, aromaticity errors

Basic Usage

Using the Tool

from plan_execute_agent.chem_tools import validate_smiles_rdkit

# Valid SMILES
result = validate_smiles_rdkit.invoke({"smiles_string": "CCO"})
print(result)
# Output: {'valid': True, 'error_message': '[1, 1, 1]'}

# Invalid SMILES
result = validate_smiles_rdkit.invoke({"smiles_string": "C1CCQQ1"})
print(result)
# Output: {'valid': False, 'error_message': 'Invalid Character with Validity Vector: 1111001'}

Direct Parser Usage

from plan_execute_agent.chemistry_parser import parse_smiles

# Parse and get detailed results
result = parse_smiles("C1CCCCC")
print(result)
# Output:
# {
#   'valid': False,
#   'details': 'Syntax',
#   'error_message': 'RDKit error message',
#   'validity_vector': 'Unclosed Ring with Validity Vector: 0111111'
# }

Validity Vectors

The parser returns binary validity vectors where:
  • 1 = valid character
  • 0 = invalid/problematic character
This provides character-level error localization:
from plan_execute_agent.chemistry_parser import parse_smiles

# Example: Invalid character
result = parse_smiles("C1CCQQ1")
print(f"SMILES: C1CCQQ1")
print(f"Valid:  {result['validity_vector'].split(': ')[1]}")
# Output:
# SMILES: C1CCQQ1
# Valid:  1111001
#             ↑↑ ← positions 5-6 are invalid (Q characters)
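
Because a validity vector is a plain string of 0s and 1s aligned with the input, the flagged positions can be recovered in a few lines. The helper below is a minimal sketch and is not part of ChemAgent: invalid_positions is a hypothetical name, and it assumes the vector has exactly one digit per SMILES character.

```python
from typing import List, Tuple


def invalid_positions(smiles: str, vector: str) -> List[Tuple[int, str]]:
    """Return (0-based index, character) pairs for every position flagged '0'."""
    if len(smiles) != len(vector):
        # The vector is only meaningful when it lines up with the input.
        raise ValueError("validity vector and SMILES differ in length")
    return [(i, ch) for i, (ch, bit) in enumerate(zip(smiles, vector)) if bit == "0"]


print(invalid_positions("C1CCQQ1", "1111001"))
# [(4, 'Q'), (5, 'Q')]
```

Indices are 0-based here, so index 4 corresponds to the fifth character of the string.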

Error Categories

The validation system categorizes errors into four types (plan_execute_agent/chemistry_parser.py:229):
error_categories = [
    "Unclosed Ring",       # Unclosed Ring
    "Invalid Character",   # Invalid Characters, Syntax Errors, Atom Not Recognized
    "Invalid Parentheses", # Invalid Closure of Parentheses
    "Semantic Issues",     # Flagged By DetectChemistryProblems
]
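
The syntax-error examples on this page report the category and the vector in a single string such as 'Invalid Character with Validity Vector: 1111001'. The sketch below splits such a string into its two parts. It assumes that exact separator text, and split_validity_message is a hypothetical helper rather than a ChemAgent function.

```python
from typing import Optional, Tuple

# Separator as it appears in the example outputs on this page (an assumption).
SEPARATOR = " with Validity Vector: "


def split_validity_message(message: str) -> Tuple[Optional[str], str]:
    """Split 'Category with Validity Vector: bits' into (category, bits).

    If the separator is absent, the category is None and the message is
    returned unchanged, e.g. for the '[1, 1, 1]' string of a valid result.
    """
    category, sep, bits = message.partition(SEPARATOR)
    if not sep:
        return None, message
    return category, bits.strip()


print(split_validity_message("Unclosed Ring with Validity Vector: 0111111"))
# ('Unclosed Ring', '0111111')
```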

1. Unclosed Rings

Detect rings that aren’t properly closed:
from plan_execute_agent.chemistry_parser import detect_unclosed_ring

# Valid ring
print(detect_unclosed_ring("C1CCCCC1"))
# Output: '11111111' (all valid)

# Unclosed ring
print(detect_unclosed_ring("C1CCCCC"))
# Output: '0111111' (ring marker 1 is invalid)

# Multiple rings
print(detect_unclosed_ring("C1CCCC1C2CCCCC2"))
# Output: '111111111111111' (both rings closed)
Ring-closure labels can be a single digit (0-9) or a two-digit number written as %nn (%10 through %99), matching the token pattern used by the parser:
  • C1CCCCC1 - valid
  • C%10CCCCCC%10 - valid
  • C1CCCCC - invalid (unclosed)
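
To illustrate the idea of ring-closure matching, here is a standalone sketch that pairs up ring labels and flags any label left open. It is not the implementation of detect_unclosed_ring: unclosed_ring_vector is a hypothetical name, it handles single digits and the two-digit %nn form only, and it flags the unmatched label itself, so its output may differ from the library's.

```python
DIGITS = "0123456789"


def unclosed_ring_vector(smiles: str) -> str:
    """Return a 0/1 string with 0 on every ring label that is never closed."""
    valid = [1] * len(smiles)
    open_labels = {}      # ring label -> range of indices where it was opened
    in_bracket = False    # digits inside [...] are isotopes, charges or H counts
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        width = 0         # length of a ring label starting at i (0 = none)
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            two = smiles[i + 1:i + 3]
            if ch == "%" and len(two) == 2 and all(c in DIGITS for c in two):
                label, width = two, 3
            elif ch in DIGITS:
                label, width = ch, 1
        if width:
            if label in open_labels:
                del open_labels[label]          # second occurrence closes the ring
            else:
                open_labels[label] = range(i, i + width)
            i += width
        else:
            i += 1
    for span in open_labels.values():           # anything still open is an error
        for j in span:
            valid[j] = 0
    return "".join(map(str, valid))


print(unclosed_ring_vector("C1CCCCC1"))  # 11111111
print(unclosed_ring_vector("C1CCCCC"))   # 1011111
```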

2. Invalid Characters

Identify characters not part of SMILES syntax:
from plan_execute_agent.chemistry_parser import detect_invalid_characters

# Valid SMILES
print(detect_invalid_characters("CCO"))
# Output: '111'

# Invalid characters
print(detect_invalid_characters("C1CCQQ1"))
# Output: '1111001' (Q is not a valid atom symbol)

# Invalid charge specification
print(detect_invalid_characters("[C+2-]"))
# Output: '000000' (contradictory charge)
Valid SMILES tokens are defined by the pattern (plan_execute_agent/chemistry_parser.py:110):
pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
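
One way to see how a token pattern like this can localize unknown characters is to mark every character that some token covers and leave the rest at 0. The sketch below reuses the pattern exactly as quoted above. token_coverage is a hypothetical helper, and it checks tokenization only, so it will not catch problems inside bracket atoms the way the library's checks may.

```python
import re

# Pattern copied verbatim from the line above.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def token_coverage(smiles: str):
    """Return (tokens, coverage) where coverage has 1 for each tokenized character."""
    tokens = []
    covered = [0] * len(smiles)
    for match in TOKEN_PATTERN.finditer(smiles):
        tokens.append(match.group())
        for i in range(match.start(), match.end()):
            covered[i] = 1
    return tokens, "".join(map(str, covered))


print(token_coverage("C1CCQQ1"))
# (['C', '1', 'C', 'C', '1'], '1111001')
```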

3. Invalid Parentheses

Detect mismatched or extra parentheses:
from plan_execute_agent.chemistry_parser import detect_invalid_parentheses

# Valid
print(detect_invalid_parentheses("C(C)C"))
# Output: '11111'

# Extra closing
print(detect_invalid_parentheses("C1(C2)C3)"))
# Output: '111111110' (last parenthesis unmatched)

# Extra opening
print(detect_invalid_parentheses("C(C(C"))
# Output: '10101' (unclosed parentheses)
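
Parenthesis matching of this kind is usually done with a stack. The following is a minimal standalone sketch of that technique, not the code behind detect_invalid_parentheses; parentheses_vector is a hypothetical name.

```python
def parentheses_vector(smiles: str) -> str:
    """Return a 0/1 string with 0 on every unmatched '(' or ')'."""
    valid = [1] * len(smiles)
    open_stack = []                  # indices of '(' still waiting for a ')'
    for i, ch in enumerate(smiles):
        if ch == "(":
            open_stack.append(i)
        elif ch == ")":
            if open_stack:
                open_stack.pop()     # matches the most recent open parenthesis
            else:
                valid[i] = 0         # closing parenthesis with nothing to close
    for i in open_stack:             # opening parentheses that were never closed
        valid[i] = 0
    return "".join(map(str, valid))


print(parentheses_vector("C(C)C"))      # 11111
print(parentheses_vector("C1(C2)C3)"))  # 111111110
```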

4. Semantic Issues

Detect chemistry problems even in syntactically valid SMILES:
from plan_execute_agent.chemistry_parser import parse_smiles

# Valence error
result = parse_smiles("F[Cl](=O)=O")
print(result['validity_vector'])
# Indicates atoms with valence problems

# Valid molecule
result = parse_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin
print(result)
# Output: {'valid': True, 'details': 'No Error', 'validity_vector': '[1, 1, 1, ...]'}
Semantic errors detected by RDKit’s DetectChemistryProblems (plan_execute_agent/chemistry_parser.py:180):
  • AtomValenceException
  • KekulizeException
  • AtomKekulizeException
  • Other chemistry violations

Complete Parsing Workflow

The parse_smiles() function ties these checks together (plan_execute_agent/chemistry_parser.py:258). The excerpt below is abridged: the assignments of error_message and problem_message are not shown.
def parse_smiles(smiles_string):
    smiles_string = str(smiles_string).strip()
    detected_errors = {}
    
    # Try to create molecule
    mol = Chem.MolFromSmiles(smiles_string, sanitize=False)
    
    if mol is None:  # Syntax Issues
        error_vector = classify_error(error_message)
        if error_vector[0] == 0:
            detected_errors["Unclosed Ring"] = detect_unclosed_ring(smiles_string)
        if error_vector[1] == 0:
            detected_errors["Invalid Character"] = detect_invalid_characters(smiles_string)
        if error_vector[2] == 0:
            detected_errors["Invalid Parentheses"] = detect_invalid_parentheses(smiles_string)
        
        return {
            'valid': False,
            'details': 'Syntax',
            'error_message': error_message,
            'validity_vector': problem_message
        }
    
    # Check for semantic issues
    problem_list = detect_semantic_issues(mol, smiles_string)
    if len(problem_list) == 0:
        return {
            'valid': True,
            'details': 'No Error',
            'validity_vector': str([1] * len(smiles_string))
        }
    else:
        return {
            'valid': False,
            'details': 'Semantics',
            'validity_vector': problem_list['problem_message']
        }
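
The workflow above returns three result shapes: valid, syntax error (with error_message), and semantic error (without it). A caller can branch on the details field, as in this sketch. describe_result is a hypothetical helper, and it assumes only the keys shown in the excerpt.

```python
def describe_result(result: dict) -> str:
    """Turn a parse_smiles-style result dict into a one-line summary."""
    if result.get("valid"):
        return "valid"
    details = result.get("details")
    vector = result.get("validity_vector", "")
    if details == "Syntax":
        # Only syntax errors carry an 'error_message' key in the excerpt above.
        return f"syntax error: {vector} ({result.get('error_message', '')})"
    if details == "Semantics":
        return f"semantic error: {vector}"
    return f"unrecognized result: {details!r}"


example = {
    "valid": False,
    "details": "Syntax",
    "error_message": "RDKit error message",
    "validity_vector": "Unclosed Ring with Validity Vector: 0111111",
}
print(describe_result(example))
# syntax error: Unclosed Ring with Validity Vector: 0111111 (RDKit error message)
```

Using .get() rather than direct indexing avoids a KeyError on the result shapes that omit a key.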

Integration with Agent

The agent automatically validates SMILES outputs (plan_execute_agent/chem_tools.py:180):
@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate a SMILES output string using RDKit, returning validity and error message if invalid."""
    print("SMILES validation tool call!")
    # Using the Chemistry Parser instead of RDKit
    parsing_details = parse_smiles(smiles_string)
    print("Input: ", smiles_string)
    print("Output: ", str(parsing_details))
    # Prepare the result dictionary
    result = {
        "valid": parsing_details["valid"],
        "error_message": parsing_details[
            "validity_vector"
        ],  # Validity Vector added for LLM Agent context
    }
    if not parsing_details["valid"]:
        # Store Error Message in LlaSmol Response, SEPARATED BY $ sign for parsing
        llasmol_response.errors += result["error_message"] + "$"
    return result
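
Since the tool appends each error message followed by a $ sign, the accumulated string can be split back into individual messages later. This is a small sketch of that step: split_error_log is a hypothetical helper, and it assumes the messages themselves never contain a $ character.

```python
from typing import List


def split_error_log(errors: str) -> List[str]:
    """Split a '$'-terminated error log into its individual messages."""
    # The trailing '$' produces one empty entry, which is dropped here.
    return [entry for entry in errors.split("$") if entry]


log = (
    "Unclosed Ring with Validity Vector: 0111111$"
    "Invalid Character with Validity Vector: 1111001$"
)
print(split_error_log(log))
# ['Unclosed Ring with Validity Vector: 0111111', 'Invalid Character with Validity Vector: 1111001']
```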

Batch Validation

Validate multiple SMILES efficiently:
from plan_execute_agent.chemistry_parser import parse_smiles
from typing import List, Dict

def batch_validate(smiles_list: List[str]) -> Dict[str, dict]:
    """
    Validate multiple SMILES and return results
    """
    results = {}
    
    for smiles in smiles_list:
        result = parse_smiles(smiles)
        results[smiles] = {
            'valid': result['valid'],
            'error_type': result['details'],
            'validity_vector': result.get('validity_vector', '')
        }
    
    return results

# Example
smiles_batch = [
    "CCO",           # Valid
    "C1CCCCC",       # Unclosed ring
    "C1CCQQ1",       # Invalid character
    "CC(C)C",        # Valid
    "[C+2-]"         # Invalid charge
]

results = batch_validate(smiles_batch)
for smiles, result in results.items():
    status = "✓" if result['valid'] else "✗"
    print(f"{status} {smiles}: {result['error_type']}")

Validation in Workflows

Before Property Prediction

from LLM4Chem.generation import LlaSMolGeneration
from plan_execute_agent.chemistry_parser import parse_smiles

generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')

smiles = "CC(C)Cl"

# Validate first
validation = parse_smiles(smiles)
if not validation['valid']:
    print(f"Invalid SMILES: {validation['validity_vector']}")
else:
    # Proceed with prediction
    query = f"How soluble is <SMILES> {smiles} </SMILES>?"
    result = generator.generate(query)
    print(result[0]['output'][0])

After Molecule Generation

import re
from LLM4Chem.generation import LlaSMolGeneration
from plan_execute_agent.chemistry_parser import parse_smiles

generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')

# Generate molecule
query = "Generate a simple aromatic compound"
result = generator.generate(query)

# Extract SMILES
match = re.search(r'<SMILES>\s*(.+?)\s*</SMILES>', result[0]['output'][0])
if match:
    generated_smiles = match.group(1)
    
    # Validate
    validation = parse_smiles(generated_smiles)
    if validation['valid']:
        print(f"Valid molecule generated: {generated_smiles}")
    else:
        print(f"Invalid generation: {validation['details']}")
        print(f"Validity vector: {validation['validity_vector']}")

Error Visualization

Create visual feedback for errors:
def visualize_errors(smiles: str, validity_vector: str):
    """
    Display SMILES with error highlighting
    """
    # Extract just the binary string if it contains description
    if ':' in validity_vector:
        validity_vector = validity_vector.split(':')[-1].strip()
    
    if len(validity_vector) != len(smiles):
        print(f"SMILES: {smiles}")
        print(f"Error: {validity_vector}")
        return
    
    print(f"SMILES: {smiles}")
    print(f"Valid:  {validity_vector}")
    print(f"        {''.join(['^' if c == '0' else ' ' for c in validity_vector])}")

# Example
from plan_execute_agent.chemistry_parser import parse_smiles

smiles = "C1CCQQ1"
result = parse_smiles(smiles)
if not result['valid']:
    visualize_errors(smiles, result['validity_vector'])

# Output:
# SMILES: C1CCQQ1
# Valid:  1111001
#             ^^

Advanced Usage

Custom Validation Rules

from plan_execute_agent.chemistry_parser import parse_smiles
from rdkit import Chem

def validate_with_rules(smiles: str, rules: dict) -> dict:
    """
    Validate with additional custom rules
    
    Args:
        smiles: SMILES string
        rules: dict with validation criteria
            - max_atoms: maximum number of atoms
            - allowed_elements: set of allowed element symbols
            - min_rings: minimum number of rings
    """
    # Basic validation
    result = parse_smiles(smiles)
    
    if not result['valid']:
        return result
    
    # Additional checks
    mol = Chem.MolFromSmiles(smiles)
    errors = []
    
    if 'max_atoms' in rules:
        num_atoms = mol.GetNumAtoms()
        if num_atoms > rules['max_atoms']:
            errors.append(f"Too many atoms: {num_atoms} > {rules['max_atoms']}")
    
    if 'allowed_elements' in rules:
        elements = {atom.GetSymbol() for atom in mol.GetAtoms()}
        invalid = elements - rules['allowed_elements']
        if invalid:
            errors.append(f"Invalid elements: {invalid}")
    
    if 'min_rings' in rules:
        num_rings = mol.GetRingInfo().NumRings()
        if num_rings < rules['min_rings']:
            errors.append(f"Not enough rings: {num_rings} < {rules['min_rings']}")
    
    if errors:
        return {
            'valid': False,
            'details': 'Custom Rules',
            'error_message': '; '.join(errors)
        }
    
    return result

# Example
rules = {
    'max_atoms': 20,
    'allowed_elements': {'C', 'H', 'O', 'N'},
    'min_rings': 1
}

result = validate_with_rules("C1=CC=CC=C1", rules)
print(result)  # Valid benzene

result = validate_with_rules("CCO", rules)
print(result)  # Fails: no rings

Best Practices

  • Validate all user input before processing it
  • Validate every SMILES string generated by a model
  • Validate before expensive computations or database queries
  • Re-validate after any string manipulation of a SMILES string
  • Check the valid field of the result before using a SMILES string
  • Log validity vectors for debugging
  • Give users clear feedback about what failed and where
  • Consider auto-correction for simple errors
  • Cache validation results for repeated SMILES
  • Use batch validation for multiple SMILES
  • Validate early to fail fast
  • Consider parallel validation for large datasets
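
The caching suggestion in the list above can be sketched with functools.lru_cache from the standard library. cached_validator is a hypothetical wrapper, and fake_validate is a stand-in used only so the example runs on its own; in practice parse_smiles would be passed instead.

```python
from functools import lru_cache
from typing import Callable


def cached_validator(validate: Callable[[str], dict], maxsize: int = 4096) -> Callable[[str], dict]:
    """Wrap a validator so repeated SMILES strings are validated only once."""
    @lru_cache(maxsize=maxsize)
    def _cached(smiles: str) -> dict:
        return validate(smiles)
    return _cached


calls = []  # records which strings actually reached the validator


def fake_validate(smiles: str) -> dict:
    """Stand-in validator for demonstration; not a real chemistry check."""
    calls.append(smiles)
    return {"valid": bool(smiles)}


validate = cached_validator(fake_validate)
for s in ["CCO", "CCO", "CC(C)C", "CCO"]:
    validate(s)

print(len(calls))  # 2
```

The cached dict is shared between callers, so it should be treated as read-only.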

Common Issues

# Problem
smiles = "C1CCCCC"  # Missing closing 1

# Fix
smiles = "C1CCCCC1"  # Properly closed
