Overview

The validate_smiles_rdkit tool validates SMILES (Simplified Molecular Input Line Entry System) strings using RDKit and a custom chemistry parser. It detects both syntax and semantic issues, providing detailed validity vectors that pinpoint exact error locations.

Function Signature

@tool
def validate_smiles_rdkit(smiles_string: str) -> dict:
    """Validate a SMILES output string using RDKit, returning validity and error message if invalid."""

Parameters

smiles_string
str
required
The SMILES notation string to validate. Can contain syntax errors, semantic issues, or be completely valid.

Response

valid
bool
True if the SMILES string is valid, False if any errors are detected.
error_message
str
Contains the validity vector information. For valid SMILES, this is the text form of a list of 1s, one per character (for example "[1, 1, 1]"). For invalid SMILES, this contains one or more error descriptions, each with its own validity vector.

Validity Vector Format

The validity vector is a binary string where:
  • 1 = valid character at this position
  • 0 = invalid character at this position
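As a minimal sketch of how a caller might read such a vector, the helpers below list the zero-based positions flagged with 0 and build a caret line that can be printed under the SMILES string. The names zero_positions and caret_line are hypothetical and are not part of the tool.

```python
def zero_positions(vector: str) -> list[int]:
    """Return the zero-based indices at which the validity vector holds a 0."""
    return [i for i, flag in enumerate(vector) if flag == "0"]


def caret_line(vector: str) -> str:
    """Build a marker line with '^' under every flagged position."""
    return "".join("^" if flag == "0" else " " for flag in vector)


# Hypothetical usage with a string and vector taken from this page.
smiles = "C1(C2)C3)"
vector = "111111110"
print(zero_positions(vector))
print(smiles)
print(caret_line(vector))
```

Printing the caret line directly under the SMILES string makes the flagged characters easy to spot in logs.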

Format Structure

For invalid SMILES, the error message follows this pattern:
[Error Type] with Validity Vector: [binary string]
Multiple errors are comma-separated:
[Error Type 1] with Validity Vector: [binary string],[Error Type 2] with Validity Vector: [binary string]
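The pattern above can be split back into (error type, vector) pairs. The following is a sketch only, assuming the separator text is exactly " with Validity Vector: " and that error types never contain a comma; parse_error_message is a hypothetical helper, not part of the tool.

```python
def parse_error_message(error_message: str) -> list[tuple[str, str]]:
    """Split a comma-separated error message into (error type, vector) pairs."""
    separator = " with Validity Vector: "  # assumed to match the documented pattern
    parsed = []
    for entry in error_message.split(","):
        error_type, found, vector = entry.partition(separator)
        if found:  # skip fragments that do not follow the pattern
            parsed.append((error_type.strip(), vector.strip()))
    return parsed


message = (
    "Unclosed Ring with Validity Vector: 010010111,"
    "Invalid Parentheses with Validity Vector: 111111110"
)
for error_type, vector in parse_error_message(message):
    print(error_type, vector)
```

Fragments without the separator, such as the list of 1s returned for a valid string, are skipped, so the helper returns an empty list in that case.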

Error Categories

The parser detects four main categories of errors:

1. Unclosed Ring

Detects rings that are opened but never closed. Example:
validate_smiles_rdkit("C1CCCCC")
Response:
{
  "valid": false,
  "error_message": "Unclosed Ring with Validity Vector: 0111111"
}
The 0 at position 0 flags the ring opened by marker 1 (on the first atom, C), which is never closed.

2. Invalid Character

Detects unrecognized characters or syntax errors in SMILES notation. Example:
validate_smiles_rdkit("C1CCQQ1")
Response:
{
  "valid": false,
  "error_message": "Invalid Character with Validity Vector: 111100"
}
The 0s at positions 4-5 indicate the invalid characters Q.

3. Invalid Parentheses

Detects mismatched or unclosed parentheses. Example:
validate_smiles_rdkit("C1(C2)C3)")
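Response:
{
  "valid": false,
  "error_message": "Unclosed Ring with Validity Vector: 010010111,Invalid Parentheses with Validity Vector: 111111110"
}
The 0 at position 8 of the Invalid Parentheses vector indicates the extra closing parenthesis. Because ring markers 1, 2, and 3 are also never closed, the same input reports an Unclosed Ring error as well.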

4. Semantic Issues

Detects chemistry problems flagged by RDKit’s DetectChemistryProblems, such as:
  • Explicit valence errors
  • Kekulization failures
  • Aromaticity issues
  • Radical electrons
Example:
validate_smiles_rdkit("F[Cl](=O)=O")
Response:
{
  "valid": false,
  "error_message": "AtomValenceException with Validity Vector: 1011111111"
}

Validation Process

The tool performs validation in three stages:

Stage 1: Syntax Validation

Uses the PartialSMILES parser to detect:
  1. Invalid characters using SMILES tokenizer pattern
  2. Unclosed or mismatched parentheses
  3. Unclosed ring markers (including %(N) notation)
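To illustrate the second check in the list above, here is a minimal stack-based sketch that produces a validity vector for parentheses only. It is not the PartialSMILES parser; the function name parenthesis_vector is hypothetical, and marking an unclosed opening parenthesis with 0 is an assumption about how such a case would be reported.

```python
def parenthesis_vector(smiles: str) -> str:
    """Return a validity vector that flags unmatched parentheses with 0."""
    flags = ["1"] * len(smiles)
    open_positions = []  # indices of '(' still waiting for a ')'
    for i, ch in enumerate(smiles):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if open_positions:
                open_positions.pop()
            else:
                flags[i] = "0"  # closing parenthesis with no matching opener
    for i in open_positions:
        flags[i] = "0"  # assumption: an opener that is never closed is flagged
    return "".join(flags)


print(parenthesis_vector("C1(C2)C3)"))
```

For the input "C1(C2)C3)" this sketch flags only the final character, which agrees with the Invalid Parentheses vector shown in the examples on this page.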

Stage 2: Molecule Creation

mol = Chem.MolFromSmiles(smiles_string, sanitize=False)
Attempts to create an RDKit molecule object without sanitization.

Stage 3: Semantic Analysis

For successfully created molecules:
problems = Chem.DetectChemistryProblems(mol)
Detects chemistry-specific issues like valence errors and aromaticity problems.

Usage Examples

Example 1: Valid SMILES

Input:
validate_smiles_rdkit("CC(=O)OCC1=CC=CC=C1C(=O)O")
Output:
{
  "valid": true,
  "error_message": "[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
}
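For a valid string, the error_message shown above is the text form of a list rather than a compact binary string. A small sketch, assuming the value is always a well-formed list literal of 0s and 1s, converts it to the compact form; vector_from_valid_message is a hypothetical helper name.

```python
import ast


def vector_from_valid_message(error_message: str) -> str:
    """Convert a list-style message such as "[1, 1, 1]" into "111"."""
    flags = ast.literal_eval(error_message)  # safely parses the list literal
    return "".join(str(flag) for flag in flags)


print(vector_from_valid_message("[1, 1, 1]"))
```

Using ast.literal_eval avoids evaluating arbitrary code, which matters if the message originates from an untrusted source.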

Example 2: Invalid Character

Input:
validate_smiles_rdkit("Oc1cc(.NCCO)ccc1")
Output:
{
  "valid": false,
  "error_message": "Invalid Character with Validity Vector: 111111011111111"
}
The . at position 6 is invalid in this context.

Example 3: Unclosed Ring

Input:
validate_smiles_rdkit("c1cccc")
Output:
{
  "valid": false,
  "error_message": "Unclosed Ring with Validity Vector: 011111"
}

Example 4: Multiple Errors

Input:
validate_smiles_rdkit("C1(C2)C3)")
Output:
{
  "valid": false,
  "error_message": "Unclosed Ring with Validity Vector: 010010111,Invalid Parentheses with Validity Vector: 111111110"
}

Example 5: Semantic Error (Valence)

Input:
validate_smiles_rdkit("F[Cl](=O)=O")
Output:
{
  "valid": false,
  "error_message": "AtomValenceException with Validity Vector: 1011111111"
}
The 0 at position 1 marks the start of the bracketed chlorine atom [Cl], which has an invalid valence configuration.

Example 6: Ring Notation with %(N)

Input (Valid):
validate_smiles_rdkit("C%(1000)OC%(1000)")
Output:
{
  "valid": true,
  "error_message": "[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
}

Integration with Chemistry Parser

The tool uses the custom parse_smiles() function from chemistry_parser.py:
parsing_details = parse_smiles(smiles_string)
This function returns:
{
    "valid": bool,
    "details": "Syntax" | "Semantics" | "No Error",
    "error_message": str,  # RDKit error message
    "validity_vector": str  # Formatted error description
}

Error Logging

When invalid SMILES are detected, errors are logged to llasmol_response.errors:
if not parsing_details["valid"]:
    llasmol_response.errors += result["error_message"] + "$"
Errors are separated by $ for parsing by other components.
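A consumer of that log could recover the individual entries by splitting on the delimiter. This is a sketch under the assumption that the log is a plain string in which every entry ends with "$"; split_error_log is a hypothetical helper and not part of the project.

```python
def split_error_log(errors: str) -> list[str]:
    """Split a '$'-delimited error log into its non-empty entries."""
    return [entry for entry in errors.split("$") if entry]


# Hypothetical log content built from messages shown on this page.
log = (
    "Unclosed Ring with Validity Vector: 011111$"
    "Invalid Parentheses with Validity Vector: 111111110$"
)
for entry in split_error_log(log):
    print(entry)
```

Filtering out empty strings drops the empty fragment that the trailing "$" would otherwise produce.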

SMILES Tokenizer Pattern

The validator uses the SMILES tokenizer from the Molecular Transformer:
pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
This pattern recognizes:
  • Bracketed atoms: [NH3+], [C@@H]
  • Elements: B, Br, C, Cl, N, O, S, P, F, I
  • Aromatic: b, c, n, o, s, p
  • Bonds: =, #, -, \, /
  • Branches: (, )
  • Rings: single digits 0-9 and two-digit closures %00-%99
  • Special: ., +, @, etc.
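The sketch below shows how a pattern like this can be used to tokenize a string and to flag characters that no token covers. It is an illustration only: tokenize and coverage_vector are hypothetical names, and the tool's real vectors come from its own parser and may differ from what this coverage check produces.

```python
import re

# Tokenizer pattern as given above, compiled once.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> list[str]:
    """Return the tokens the pattern finds, skipping unmatched characters."""
    return SMILES_TOKEN.findall(smiles)


def coverage_vector(smiles: str) -> str:
    """Mark each character 1 if some token covers it, otherwise 0."""
    flags = ["0"] * len(smiles)
    for match in SMILES_TOKEN.finditer(smiles):
        for i in range(match.start(), match.end()):
            flags[i] = "1"
    return "".join(flags)


print(tokenize("CC(=O)Cl"))
print(tokenize("[NH3+]CBr"))
print(coverage_vector("CQC"))
```

Multi-character tokens such as Cl, Br, and bracketed atoms are matched as single units, so every character inside them is marked as covered.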

Best Practices

  1. Validate before processing: Always validate SMILES before passing to answer_chemistry_query
  2. Parse validity vectors: Extract the position of errors from the binary string
  3. Handle multiple errors: Split comma-separated error messages
  4. Log errors: Store validation failures for debugging and analysis

Workflow Integration

Typical validation workflow:
# Step 1: Validate the SMILES
smiles = "C1CCOC1"
validation = validate_smiles_rdkit(smiles)

if validation["valid"]:
    # Step 2: Structure the query
    structured = structure_chem_prompt(f"What is the IUPAC name of {smiles}?")

    # Step 3: Get the answer
    result = answer_chemistry_query(structured["new_prompt"])
else:
    # Handle invalid SMILES
    print(f"Invalid SMILES: {validation['error_message']}")

RDKit Configuration

The tool configures RDKit logging:
from rdkit import RDLogger
from rdkit.rdBase import LogToPythonStderr

LogToPythonStderr()  # Capture RDKit errors to stderr
This ensures error messages are captured and processed by the validity vector system.
