ChemAgent provides comprehensive SMILES validation using RDKit and a custom chemistry parser that identifies syntax and semantic errors with character-level precision.
Overview
The validation system detects:
Syntax Errors: invalid characters, unclosed rings, mismatched parentheses
Semantic Errors: valence issues, stereochemistry problems, aromaticity errors
Basic Usage
from plan_execute_agent.chem_tools import validate_smiles_rdkit

# Valid SMILES
result = validate_smiles_rdkit.invoke({"smiles_string": "CCO"})
print(result)
# Output: {'valid': True, 'error_message': '[1, 1, 1]'}

# Invalid SMILES (7 characters, so the validity vector has 7 entries)
result = validate_smiles_rdkit.invoke({"smiles_string": "C1CCQQ1"})
print(result)
# Output: {'valid': False, 'error_message': 'Invalid Character with Validity Vector: 1111001'}
Direct Parser Usage
from plan_execute_agent.chemistry_parser import parse_smiles

# Parse and get detailed results
result = parse_smiles("C1CCCCC")
print(result)
# Output:
# {
#     'valid': False,
#     'details': 'Syntax',
#     'error_message': 'RDKit error message',
#     'validity_vector': 'Unclosed Ring with Validity Vector: 0111111'
# }
Validity Vectors
The parser returns binary validity vectors where:
1 = valid character
0 = invalid/problematic character
This provides character-level error localization:
from plan_execute_agent.chemistry_parser import parse_smiles

# Example: Invalid character
result = parse_smiles("C1CCQQ1")
print("SMILES: C1CCQQ1")
print(f"Valid: {result['validity_vector'].split(': ')[1]}")
# Output:
# SMILES: C1CCQQ1
# Valid: 1111001
#            ^^  positions 5-6 (the Q characters) are invalid
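To turn a validity vector into concrete positions, it can help to pair each bit with the character it describes. The helper below is a minimal sketch and is not part of ChemAgent: `invalid_positions` is a hypothetical name, and it assumes the vector is a plain string of `0`s and `1`s with the same length as the SMILES string.

```python
from typing import List, Tuple


def invalid_positions(smiles: str, vector: str) -> List[Tuple[int, str]]:
    """Return (0-based index, character) pairs for every position flagged 0."""
    if len(smiles) != len(vector):
        raise ValueError("validity vector and SMILES must have the same length")
    return [
        (index, char)
        for index, (char, bit) in enumerate(zip(smiles, vector))
        if bit == "0"
    ]


print(invalid_positions("C1CCQQ1", "1111001"))
# [(4, 'Q'), (5, 'Q')]
```

The indices are 0-based, so 4 and 5 correspond to the fifth and sixth characters.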
Error Categories
The validation system categorizes errors into four types (plan_execute_agent/chemistry_parser.py:229):
error_categories = [
    "Unclosed Ring",        # Unclosed Ring
    "Invalid Character",    # Invalid Characters, Syntax Errors, Atom Not Recognized
    "Invalid Parentheses",  # Invalid Closure of Parentheses
    "Semantic Issues",      # Flagged By DetectChemistryProblems
]
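The syntax-error messages shown on this page follow the shape `<category> with Validity Vector: <bits>`. A small helper can separate the two parts. This is a sketch under that assumption only: `split_validity_message` is a hypothetical name, and messages that carry several categories or a different layout are returned unparsed.

```python
from typing import Optional, Tuple

ERROR_CATEGORIES = (
    "Unclosed Ring",
    "Invalid Character",
    "Invalid Parentheses",
    "Semantic Issues",
)
SEPARATOR = " with Validity Vector: "  # assumed from the example outputs


def split_validity_message(message: str) -> Tuple[Optional[str], str]:
    """Split '<category> with Validity Vector: <bits>' into (category, bits).

    Returns (None, message) when the message does not match that shape.
    """
    category, separator, bits = message.partition(SEPARATOR)
    if separator and category in ERROR_CATEGORIES:
        return category, bits.strip()
    return None, message


print(split_validity_message("Unclosed Ring with Validity Vector: 0111111"))
# ('Unclosed Ring', '0111111')
```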
1. Unclosed Rings
Detect rings that aren’t properly closed:
from plan_execute_agent.chemistry_parser import detect_unclosed_ring

# Valid ring
print(detect_unclosed_ring("C1CCCCC1"))
# Output: '11111111' (all valid)

# Unclosed ring
print(detect_unclosed_ring("C1CCCCC"))
# Output: '0111111' (the ring opened with label 1 is never closed)

# Multiple rings
print(detect_unclosed_ring("C1CCCC1C2CCCCC2"))
# Output: '111111111111111' (both rings closed)
Ring-closure labels can be a single digit (0-9) or, for larger numbers, % followed by two digits (%10-%99):
C1CCCCC1 - valid
C%10CCCCCC%10 - valid
C1CCCCC - invalid (unclosed)
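The pairing rule behind these examples can be sketched in plain Python: each ring-closure label must appear an even number of times, with one occurrence opening the ring and the next closing it. The function below is illustrative only and is not ChemAgent's `detect_unclosed_ring`. `unclosed_ring_labels` is a hypothetical name, and the sketch assumes two-digit labels are written as `%` followed by two digits and that digits inside square brackets (isotopes, hydrogen counts, charges) are not ring labels.

```python
from typing import List


def unclosed_ring_labels(smiles: str) -> List[str]:
    """Return ring-closure labels that are opened but never closed."""
    open_labels = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        char = smiles[i]
        if char == "[":
            in_bracket = True
        elif char == "]":
            in_bracket = False
        elif not in_bracket:
            label = None
            two_digits = smiles[i + 1:i + 3]
            if char == "%" and len(two_digits) == 2 and two_digits.isdigit():
                label = two_digits
                i += 2  # skip the two digits that belong to this label
            elif char in "0123456789":
                label = char
            if label is not None:
                # First occurrence opens the ring, the next one closes it.
                if label in open_labels:
                    open_labels.remove(label)
                else:
                    open_labels.add(label)
        i += 1
    return sorted(open_labels)


print(unclosed_ring_labels("C1CCCCC1"))  # []
print(unclosed_ring_labels("C1CCCCC"))   # ['1']
```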
2. Invalid Characters
Identify characters not part of SMILES syntax:
from plan_execute_agent.chemistry_parser import detect_invalid_characters

# Valid SMILES
print(detect_invalid_characters("CCO"))
# Output: '111'

# Invalid characters
print(detect_invalid_characters("C1CCQQ1"))
# Output: '1111001' (Q is not a valid atom symbol)

# Invalid charge specification
print(detect_invalid_characters("[C+2-]"))
# Output: '000000' (contradictory charge)
Valid SMILES tokens are defined by the pattern (plan_execute_agent/chemistry_parser.py:110):
pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
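One way to see what this pattern accepts is to mark every character that falls inside a matched token. The sketch below does that with the standard `re` module. It is not ChemAgent's `detect_invalid_characters`: `token_coverage` is a hypothetical name, and the sketch only checks token coverage, so it does not inspect the contents of bracket atoms such as charge specifications.

```python
import re

# Token pattern copied from the documentation above.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def token_coverage(smiles: str) -> str:
    """Return '1' for characters inside a recognized token, '0' otherwise."""
    covered = ["0"] * len(smiles)
    for match in SMILES_TOKEN.finditer(smiles):
        for index in range(match.start(), match.end()):
            covered[index] = "1"
    return "".join(covered)


print(SMILES_TOKEN.findall("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(token_coverage("C1CCQQ1"))         # 1111001
```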
3. Invalid Parentheses
Detect mismatched or extra parentheses:
from plan_execute_agent.chemistry_parser import detect_invalid_parentheses

# Valid
print(detect_invalid_parentheses("C(C)C"))
# Output: '11111'

# Extra closing
print(detect_invalid_parentheses("C1(C2)C3)"))
# Output: '111111110' (last parenthesis unmatched)

# Extra opening
print(detect_invalid_parentheses("C(C(C"))
# Output: '10101' (unclosed parentheses)
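Parenthesis matching of this kind is commonly done with a stack: push the index of each opening parenthesis, pop on each closing one, and flag whatever cannot be paired. The function below is a minimal sketch of that technique, not ChemAgent's implementation, and `paren_validity` is a hypothetical name.

```python
def paren_validity(smiles: str) -> str:
    """Flag unmatched parentheses with '0'; every other character gets '1'."""
    flags = ["1"] * len(smiles)
    open_indices = []  # stack of indices of '(' still waiting for a ')'
    for index, char in enumerate(smiles):
        if char == "(":
            open_indices.append(index)
        elif char == ")":
            if open_indices:
                open_indices.pop()
            else:
                flags[index] = "0"  # closing parenthesis with no opener
    for index in open_indices:
        flags[index] = "0"  # opening parenthesis that was never closed
    return "".join(flags)


print(paren_validity("C(C)C"))      # 11111
print(paren_validity("C1(C2)C3)"))  # 111111110
print(paren_validity("C(C(C"))      # 10101
```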
4. Semantic Issues
Detect chemistry problems even in syntactically valid SMILES:
from plan_execute_agent.chemistry_parser import parse_smiles

# Valence error
result = parse_smiles("F[Cl](=O)=O")
print(result['validity_vector'])
# Indicates atoms with valence problems

# Valid molecule
result = parse_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin
print(result)
# Output: {'valid': True, 'details': 'No Error', 'validity_vector': '[1, 1, 1, ...]'}
RDKit’s DetectChemistryProblems (plan_execute_agent/chemistry_parser.py:180) reports semantic errors such as:
AtomValenceException
KekulizeException
AtomKekulizeException
Other chemistry violations
Complete Parsing Workflow
The parse_smiles() function provides comprehensive validation (plan_execute_agent/chemistry_parser.py:258):
def parse_smiles(smiles_string):
    smiles_string = str(smiles_string).strip()
    detected_errors = {}
    # Try to create molecule
    mol = Chem.MolFromSmiles(smiles_string, sanitize=False)
    if mol is None:  # Syntax Issues
        # error_message and problem_message are built by code
        # that is not shown in this excerpt
        error_vector = classify_error(error_message)
        if error_vector[0] == 0:
            detected_errors["Unclosed Ring"] = detect_unclosed_ring(smiles_string)
        if error_vector[1] == 0:
            detected_errors["Invalid Character"] = detect_invalid_characters(smiles_string)
        if error_vector[2] == 0:
            detected_errors["Invalid Parentheses"] = detect_invalid_parentheses(smiles_string)
        return {
            'valid': False,
            'details': 'Syntax',
            'error_message': error_message,
            'validity_vector': problem_message
        }
    # Check for semantic issues
    problem_list = detect_semantic_issues(mol, smiles_string)
    if len(problem_list) == 0:
        return {
            'valid': True,
            'details': 'No Error',
            'validity_vector': str([1] * len(smiles_string))
        }
    else:
        return {
            'valid': False,
            'details': 'Semantics',
            'validity_vector': problem_list['problem_message']
        }
Integration with Agent
The agent automatically validates SMILES outputs (plan_execute_agent/chem_tools.py:180):
@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate a SMILES output string using RDKit, returning validity and error message if invalid."""
    print("SMILES validation tool call!")
    # Using the Chemistry Parser instead of RDKit
    parsing_details = parse_smiles(smiles_string)
    print("Input: ", smiles_string)
    print("Output: ", str(parsing_details))
    # Prepare the result dictionary
    result = {
        "valid": parsing_details["valid"],
        "error_message": parsing_details[
            "validity_vector"
        ],  # Validity Vector added for LLM Agent context
    }
    if not parsing_details["valid"]:
        # Store Error Message in LlaSmol Response, SEPARATED BY $ sign for parsing
        llasmol_response.errors += result["error_message"] + "$"
    return result
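Because the tool appends each error message followed by a `$` separator, the accumulated string can be split back into individual messages. The helper below is a sketch of that step. `split_errors` is a hypothetical name, and it assumes no individual message contains a `$` character.

```python
from typing import List


def split_errors(errors: str) -> List[str]:
    """Split a '$'-separated error string, dropping the empty trailing entry."""
    return [entry for entry in errors.split("$") if entry]


accumulated = (
    "Unclosed Ring with Validity Vector: 0111111$"
    "Invalid Character with Validity Vector: 1111001$"
)
for message in split_errors(accumulated):
    print(message)
# Unclosed Ring with Validity Vector: 0111111
# Invalid Character with Validity Vector: 1111001
```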
Batch Validation
Validate multiple SMILES efficiently:
from plan_execute_agent.chemistry_parser import parse_smiles
from typing import List, Dict

def batch_validate(smiles_list: List[str]) -> Dict[str, dict]:
    """
    Validate multiple SMILES and return results
    """
    results = {}
    for smiles in smiles_list:
        result = parse_smiles(smiles)
        results[smiles] = {
            'valid': result['valid'],
            'error_type': result['details'],
            'validity_vector': result.get('validity_vector', '')
        }
    return results

# Example
smiles_batch = [
    "CCO",      # Valid
    "C1CCCCC",  # Unclosed ring
    "C1CCQQ1",  # Invalid character
    "CC(C)C",   # Valid
    "[C+2-]"    # Invalid charge
]
results = batch_validate(smiles_batch)
for smiles, result in results.items():
    status = "✓" if result['valid'] else "✗"
    print(f"{status} {smiles}: {result['error_type']}")
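For larger batches, a count per error type is often more useful than a line per molecule. The sketch below tallies a results dictionary shaped like the one `batch_validate` returns. `summarize` is a hypothetical helper, and the sample data is hand-written from the example outputs on this page rather than produced by running the parser.

```python
from collections import Counter
from typing import Dict


def summarize(results: Dict[str, dict]) -> Counter:
    """Count how many SMILES strings fall under each error type."""
    return Counter(entry["error_type"] for entry in results.values())


# Hand-written sample in the shape returned by batch_validate.
sample_results = {
    "CCO": {"valid": True, "error_type": "No Error", "validity_vector": "[1, 1, 1]"},
    "C1CCCCC": {
        "valid": False,
        "error_type": "Syntax",
        "validity_vector": "Unclosed Ring with Validity Vector: 0111111",
    },
    "C1CCQQ1": {
        "valid": False,
        "error_type": "Syntax",
        "validity_vector": "Invalid Character with Validity Vector: 1111001",
    },
}

summary = summarize(sample_results)
print(summary["Syntax"], summary["No Error"])  # 2 1
```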
Validation in Workflows
Before Property Prediction
from LLM4Chem.generation import LlaSMolGeneration
from plan_execute_agent.chemistry_parser import parse_smiles

generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')
smiles = "CC(C)Cl"

# Validate first
validation = parse_smiles(smiles)
if not validation['valid']:
    print(f"Invalid SMILES: {validation['validity_vector']}")
else:
    # Proceed with prediction
    query = f"How soluble is <SMILES> {smiles} </SMILES>?"
    result = generator.generate(query)
    print(result[0]['output'][0])
After Molecule Generation
import re
from LLM4Chem.generation import LlaSMolGeneration
from plan_execute_agent.chemistry_parser import parse_smiles

generator = LlaSMolGeneration('osunlp/LlaSMol-Mistral-7B')

# Generate molecule
query = "Generate a simple aromatic compound"
result = generator.generate(query)

# Extract SMILES
match = re.search(r'<SMILES>\s*(.+?)\s*</SMILES>', result[0]['output'][0])
if match:
    generated_smiles = match.group(1)
    # Validate
    validation = parse_smiles(generated_smiles)
    if validation['valid']:
        print(f"Valid molecule generated: {generated_smiles}")
    else:
        # parse_smiles reports the error type under the 'details' key
        print(f"Invalid generation: {validation['details']}")
        print(f"Validity vector: {validation['validity_vector']}")
Error Visualization
Create visual feedback for errors:
def visualize_errors(smiles: str, validity_vector: str):
    """
    Display SMILES with error highlighting
    """
    # Extract just the binary string if it contains description
    if ':' in validity_vector:
        validity_vector = validity_vector.split(':')[-1].strip()
    if len(validity_vector) != len(smiles):
        print(f"SMILES: {smiles}")
        print(f"Error: {validity_vector}")
        return
    # Pad the labels to the same width so the markers line up
    print(f"SMILES: {smiles}")
    print(f"Valid:  {validity_vector}")
    print(f"        {''.join(['^' if c == '0' else ' ' for c in validity_vector])}")

# Example
from plan_execute_agent.chemistry_parser import parse_smiles

smiles = "C1CCQQ1"
result = parse_smiles(smiles)
if not result['valid']:
    visualize_errors(smiles, result['validity_vector'])
# Output:
# SMILES: C1CCQQ1
# Valid:  1111001
#             ^^
Advanced Usage
Custom Validation Rules
from plan_execute_agent.chemistry_parser import parse_smiles
from rdkit import Chem

def validate_with_rules(smiles: str, rules: dict) -> dict:
    """
    Validate with additional custom rules

    Args:
        smiles: SMILES string
        rules: dict with validation criteria
            - max_atoms: maximum number of atoms
            - allowed_elements: set of allowed element symbols
            - min_rings: minimum number of rings
    """
    # Basic validation
    result = parse_smiles(smiles)
    if not result['valid']:
        return result

    # Additional checks
    mol = Chem.MolFromSmiles(smiles)
    errors = []
    if 'max_atoms' in rules:
        num_atoms = mol.GetNumAtoms()
        if num_atoms > rules['max_atoms']:
            errors.append(f"Too many atoms: {num_atoms} > {rules['max_atoms']}")
    if 'allowed_elements' in rules:
        elements = {atom.GetSymbol() for atom in mol.GetAtoms()}
        invalid = elements - rules['allowed_elements']
        if invalid:
            errors.append(f"Invalid elements: {invalid}")
    if 'min_rings' in rules:
        num_rings = mol.GetRingInfo().NumRings()
        if num_rings < rules['min_rings']:
            errors.append(f"Not enough rings: {num_rings} < {rules['min_rings']}")
    if errors:
        return {
            'valid': False,
            'details': 'Custom Rules',
            'error_message': '; '.join(errors)
        }
    return result

# Example
rules = {
    'max_atoms': 20,
    'allowed_elements': {'C', 'H', 'O', 'N'},
    'min_rings': 1
}
result = validate_with_rules("C1=CC=CC=C1", rules)
print(result)  # Valid benzene
result = validate_with_rules("CCO", rules)
print(result)  # Fails: no rings
Best Practices
Always validate user input before processing
Always validate SMILES generated by models
Validate before expensive computations or database queries
Validate after any string manipulation of SMILES
Check the valid field before using a SMILES string
Log validity vectors for debugging
Provide clear feedback to users
Consider auto-correction for simple errors
Common Issues
Unclosed Rings
# Problem
smiles = "C1CCCCC"   # Missing closing 1
# Fix
smiles = "C1CCCCC1"  # Properly closed
Invalid Atoms
# Problem
smiles = "CXC"  # X is not a valid atom
# Fix
smiles = "CBC"  # Use valid atom symbol
Parentheses
# Problem
smiles = "C(CC)C)"  # Extra closing parenthesis
# Fix
smiles = "C(CC)C"   # Balanced parentheses
Valence
# Problem
smiles = "[CH5]"  # Carbon can't have 5 hydrogens
# Fix
smiles = "C"      # Implicit hydrogens added correctly