The PubChem RAG (Retrieval-Augmented Generation) module enriches chemistry queries by automatically fetching relevant compound information from the PubChem database.

Overview

The RAG pipeline extracts chemistry terms from your query, retrieves compound data from PubChem, and uses this context to provide more informed responses.

1. Term Extraction: Extract chemistry-related terms using SpaCy NLP
2. PubChem Query: Fetch compound descriptions from PubChem API
3. Context Augmentation: Combine retrieved data with original query
4. LLM Response: Generate informed answer using augmented context

Basic Usage

Using the --use_rag Flag

python plan_execute_agent/rdkit_agent.py \
  --query "What are the properties of aspirin?" \
  --use_rag

Direct Module Usage

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

query = "Tell me about caffeine and its effects"
response = query_chemistry_related(query)

print(response)

How It Works

1. Term Extraction

The extract_terms.py module uses SpaCy to identify chemistry-related terms:
# From plan_execute_agent/pubchem_rag/extract_terms.py:7
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_chemistry_terms(query_text: str):
    """
    Extract chemistry-related terms from the query using SpaCy.
    """
    doc = nlp(query_text)
    # Initial filter for nouns and proper nouns through spacy
    terms = [
        token.text
        for token in doc
        if token.pos_ in {"NOUN", "PROPN"} and len(token.text) > 2
    ]
    return terms
Example:
from plan_execute_agent.pubchem_rag.extract_terms import extract_chemistry_terms

query = "What is the molecular weight of caffeine and theophylline?"
terms = extract_chemistry_terms(query)
print(terms)
# Expected output (may vary with the SpaCy model version): ['weight', 'caffeine', 'theophylline']
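
The noun filter also returns generic words such as "weight", and each one triggers PubChem lookups that are unlikely to match a compound. One way to reduce this is a post-filter over the extracted terms. The sketch below assumes a hand-maintained stoplist; `GENERIC_TERMS` and `filter_generic_terms` are hypothetical names and are not part of `extract_terms.py`.

```python
# Hypothetical stoplist of generic nouns; not part of extract_terms.py.
# Extend it for your own domain.
GENERIC_TERMS = {
    "weight", "property", "properties", "structure", "effects",
    "molecule", "compound", "solubility", "interaction",
}

def filter_generic_terms(terms):
    """Drop generic nouns (case-insensitive) while keeping the original order."""
    return [term for term in terms if term.lower() not in GENERIC_TERMS]

print(filter_generic_terms(["weight", "caffeine", "theophylline"]))
# ['caffeine', 'theophylline']
```

A stoplist is crude but transparent. A chemistry-aware NER model would be a more robust alternative if one is available.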

2. PubChem Data Fetching

The pubchem_fetcher.py module queries multiple PubChem API endpoints:
# From plan_execute_agent/pubchem_rag/pubchem_fetcher.py:6
import requests

PUBCHEM_API_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def fetch_pubchem_data(terms):
    """
    Query PubChem endpoints:
    - /compound/cid/<term>/description/JSON
    - /compound/formula/<term>/description/JSON
    - /compound/name/<term>/description/JSON
    """
    context = []
    for term in terms:
        endpoints = [
            f"{PUBCHEM_API_BASE}/compound/cid/{term}/description/JSON",
            f"{PUBCHEM_API_BASE}/compound/formula/{term}/description/JSON",
            f"{PUBCHEM_API_BASE}/compound/name/{term}/description/JSON",
        ]

        for url in endpoints:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                descriptions = extract_pubchem_descriptions(response.json())
                if descriptions:
                    label = url.replace(PUBCHEM_API_BASE, "")
                    context.append(
                        f"**Term**: {term}\n**Endpoint**: {label}\n{descriptions}"
                    )
            except requests.exceptions.RequestException:
                pass

    return "\n\n".join(context)
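
The excerpt above calls `extract_pubchem_descriptions`, which is not shown on this page. The sketch below illustrates what such a helper might do. It assumes the description endpoint returns an `InformationList` object containing an `Information` array whose entries may carry a `Description` string. The function name `extract_descriptions_sketch` and the sample payload are hypothetical and do not reproduce the module's actual implementation.

```python
def extract_descriptions_sketch(payload: dict) -> str:
    """Collect unique, non-empty Description strings from a payload.

    Assumed shape: {"InformationList": {"Information": [{...}, ...]}}
    """
    entries = payload.get("InformationList", {}).get("Information", [])
    descriptions = []
    for entry in entries:
        text = (entry.get("Description") or "").strip()
        if text and text not in descriptions:
            descriptions.append(text)
    return "\n".join(descriptions)

# Hypothetical payload with placeholder text, for illustration only
sample = {
    "InformationList": {
        "Information": [
            {"CID": 2244, "Title": "Aspirin"},
            {"CID": 2244, "Description": "Placeholder description text."},
        ]
    }
}
print(extract_descriptions_sketch(sample))
# Placeholder description text.
```

Returning an empty string when nothing is found matches the `if descriptions:` check in the fetcher, so terms with no description add nothing to the context.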

3. Complete Pipeline

# From plan_execute_agent/pubchem_rag/query_chemistry.py:13
from .extract_terms import extract_chemistry_terms
from .pubchem_fetcher import fetch_pubchem_data
from .llm_response import generate_llm_response

def query_chemistry_related(query_text: str):
    """
    Parse the question, query PubChem for related terms, and generate an LLM response.
    """
    # Step 1: Parse chemistry-related terms from the question
    chemistry_terms = extract_chemistry_terms(query_text)
    if not chemistry_terms:
        return "No chemistry-related terms found in the query."

    print(f"Extracted Chemistry Terms: {chemistry_terms}")

    # Step 2: Query PubChem for information on the terms
    pubchem_context = fetch_pubchem_data(chemistry_terms)
    if not pubchem_context:
        return "No relevant information found on PubChem."

    # Step 3: Use the fetched context in the LLM
    response = generate_llm_response(pubchem_context, query_text)
    return response
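
To exercise this control flow without network or LLM access, the three stages can be passed in as callables. This is a sketch only: `run_rag_pipeline` and the stub stages are hypothetical and are not how `query_chemistry.py` is structured.

```python
from typing import Callable, List

def run_rag_pipeline(
    query_text: str,
    extract: Callable[[str], List[str]],
    fetch: Callable[[List[str]], str],
    generate: Callable[[str, str], str],
) -> str:
    """Same three steps as query_chemistry_related, with injectable stages."""
    terms = extract(query_text)
    if not terms:
        return "No chemistry-related terms found in the query."
    context = fetch(terms)
    if not context:
        return "No relevant information found on PubChem."
    return generate(context, query_text)

# Stub stages, for illustration only
answer = run_rag_pipeline(
    "Tell me about caffeine",
    extract=lambda query: ["caffeine"],
    fetch=lambda terms: "stub context for " + ", ".join(terms),
    generate=lambda context, query: f"[{query}] answered using: {context}",
)
print(answer)
# [Tell me about caffeine] answered using: stub context for caffeine
```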

Integration with Agent

When process_input() is called with use_rag=True, it runs the RAG query on the input and appends the result to the agent prompt (plan_execute_agent/rdkit_agent.py:334):
async def process_input(
    input_prompt: str, image_path: str = None, use_rag: bool = False
) -> tuple:
    # ...
    
    # Perform PUBCHEM_RAG on the original input query
    from plan_execute_agent.pubchem_rag.query_chemistry import (
        query_chemistry_related,
    )
    
    additional_info = ""
    if use_rag:
        additional_info = await asyncio.to_thread(
            query_chemistry_related,
            input_prompt + "\n" + extracted_text,
        )
        try:
            additional_info = str(additional_info["text"])
        except:
            additional_info = str(additional_info)

    print("Additional Info from PubChem RAG: ", additional_info)
    
    # The additional_info is then included in the agent prompt
    edited_prompt = (
        # ...
        + "\nHere is the additional information from PubChem regarding the original query:\n"
        + additional_info
        # ...
    )
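
The `try`/`except` in the excerpt converts the RAG result to a string whether it arrives as a mapping with a `"text"` key or as a plain value. The same intent can be written without a bare `except`. The helper below is a sketch: `normalize_rag_output` is a hypothetical name, and unlike the excerpt it maps `None` to an empty string instead of the string `"None"`.

```python
from collections.abc import Mapping

def normalize_rag_output(raw) -> str:
    """Return raw["text"] for mappings that have it, otherwise str(raw)."""
    if isinstance(raw, Mapping) and "text" in raw:
        return str(raw["text"])
    # Assumption: an absent result should contribute nothing to the prompt
    if raw is None:
        return ""
    return str(raw)

print(normalize_rag_output({"text": "Caffeine is a stimulant."}))
print(normalize_rag_output("No relevant information found on PubChem."))
```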

Use Cases

Compound Identification

Get detailed information about named compounds

Property Lookup

Retrieve physical and chemical properties

Bioactivity Data

Access biological activity information

Safety Information

Fetch toxicity and hazard data

Examples

Compound Information

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

query = "What is the molecular weight and solubility of ibuprofen?"
response = query_chemistry_related(query)
print(response)

Drug Interactions

query = "Tell me about the interaction between aspirin and warfarin"
response = query_chemistry_related(query)
print(response)

Chemical Classes

query = "What are the properties of beta-lactam antibiotics like penicillin?"
response = query_chemistry_related(query)
print(response)

With Agent for Complete Workflow

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "Convert aspirin to SMILES and tell me about its properties"

result, completed, attempts, _, errors, _ = \
    asyncio.run(process_input(query, use_rag=True))

if completed:
    print(f"Result: {result}")
    print(f"Completed in {attempts} attempts")

Combining with Image Processing

Use both RAG and image extraction for comprehensive analysis:
import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "What are the properties of this molecule?"
image_path = "unknown_compound.png"

result, completed, attempts, _, errors, _ = \
    asyncio.run(process_input(query, image_path=image_path, use_rag=True))

print(result)
# Workflow:
# 1. Extract structure from image (GPT-4o)
# 2. Convert to chemical name
# 3. Fetch PubChem data about the compound
# 4. Generate comprehensive response

PubChem API Endpoints

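The module queries three endpoint types. Each extracted term is tried against all three, in this order:
# Compound ID (CID) lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/cid/{cid}/description/JSON"
# Example: /compound/cid/2244/description/JSON (aspirin)

# Molecular formula lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/formula/{formula}/description/JSON"
# Example: /compound/formula/C9H8O4/description/JSON

# Compound name lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/name/{name}/description/JSON"
# Example: /compound/name/aspirin/description/JSON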
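
Extracted terms can contain spaces or other characters that are not safe in a URL path, so it may help to percent-encode them before building the endpoint. The sketch below uses the standard library's `urllib.parse.quote`. The helper `build_description_url` is a hypothetical name; the fetcher shown earlier interpolates terms directly.

```python
from urllib.parse import quote

PUBCHEM_API_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
ALLOWED_NAMESPACES = {"cid", "formula", "name"}

def build_description_url(namespace: str, identifier: str) -> str:
    """Build a description URL with the identifier percent-encoded."""
    if namespace not in ALLOWED_NAMESPACES:
        raise ValueError(f"Unsupported namespace: {namespace!r}")
    # safe="" also encodes "/" so an identifier cannot add path segments
    encoded = quote(str(identifier), safe="")
    return f"{PUBCHEM_API_BASE}/compound/{namespace}/{encoded}/description/JSON"

print(build_description_url("name", "acetylsalicylic acid"))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/acetylsalicylic%20acid/description/JSON
```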

Custom Term Extraction

You can customize term extraction for domain-specific needs:
import spacy
from typing import List, Optional

nlp = spacy.load("en_core_web_sm")

def extract_custom_terms(query_text: str, entity_types: Optional[List[str]] = None) -> List[str]:
    """
    Extract terms with custom entity type filtering
    
    Args:
        query_text: Input query
        entity_types: SpaCy entity types to include (default: all nouns/proper nouns)
    """
    doc = nlp(query_text)
    
    terms = set()
    
    # Add nouns and proper nouns
    for token in doc:
        if token.pos_ in {"NOUN", "PROPN"} and len(token.text) > 2:
            terms.add(token.text)
    
    # Add named entities if specified
    if entity_types:
        for ent in doc.ents:
            if ent.label_ in entity_types:
                terms.add(ent.text)
    
    return list(terms)

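# Example: also collect named entities with the given labels.
# Note: en_core_web_sm does not emit CHEMICAL or DRUG labels, so this filter
# only takes effect with a model whose NER produces them (for example,
# scispaCy's en_ner_bc5cdr_md emits CHEMICAL). Nouns and proper nouns are
# always included regardless of entity_types.
query = "Compare the efficacy of aspirin and ibuprofen for pain relief"
terms = extract_custom_terms(query, entity_types=["CHEMICAL", "DRUG"])
print(terms)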
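
Because `extract_custom_terms` collects terms in a `set`, the order of the returned list is not stable between runs, and case variants such as "Aspirin" and "aspirin" are kept as separate terms. If stable order matters, for example for cache keys, an order-preserving de-duplication is one option. `dedupe_preserving_order` below is a hypothetical helper, shown as a sketch.

```python
def dedupe_preserving_order(terms):
    """Remove case-insensitive duplicates, keeping the first spelling seen."""
    seen = set()
    unique = []
    for term in terms:
        key = term.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique

print(dedupe_preserving_order(["aspirin", "Ibuprofen", "Aspirin", "ibuprofen", "pain"]))
# ['aspirin', 'Ibuprofen', 'pain']
```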

Error Handling

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

def safe_rag_query(query_text: str, fallback: str = None):
    """
    Query with error handling and fallback
    """
    try:
        response = query_chemistry_related(query_text)
        
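        # Treat both "nothing found" messages returned by the pipeline as empty
        no_result_messages = (
            "No chemistry-related terms found in the query.",
            "No relevant information found on PubChem.",
        )
        if not response or response in no_result_messages:
            return fallback or "No information found"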
        
        return response
    
    except Exception as e:
        print(f"RAG query failed: {e}")
        return fallback or "Query failed"

# Usage
result = safe_rag_query(
    "What is the structure of XYZ123?",
    fallback="Compound not found in PubChem"
)
print(result)
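
For transient failures, retrying with exponential backoff is a common complement to a fallback. The sketch below is generic: `retry_with_backoff` is a hypothetical helper, not part of the module. Note that `fetch_pubchem_data` swallows request errors itself, so a retry around `query_chemistry_related` mainly helps with failures raised by other steps, such as the LLM call.

```python
import time

def retry_with_backoff(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after each failure wait base_delay * 2**i, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for i in range(attempts):
        try:
            return func(*args)
        except Exception as exc:
            last_error = exc
            # No wait after the final attempt
            if i < attempts - 1:
                sleep(base_delay * 2 ** i)
    raise last_error
```

A call such as `retry_with_backoff(query_chemistry_related, "What is caffeine?")` would make up to three attempts, waiting 1 and then 2 seconds between them. The `sleep` parameter exists so tests can substitute a no-op.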

Performance Optimization

Caching Results

import json
import hashlib
from pathlib import Path

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

CACHE_DIR = Path("pubchem_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_rag_query(query_text: str):
    """
    Query with caching to avoid redundant API calls
    """
    # Generate cache key
    cache_key = hashlib.md5(query_text.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    
    # Check cache
    if cache_file.exists():
        with open(cache_file, 'r') as f:
            return json.load(f)
    
    # Query and cache
    response = query_chemistry_related(query_text)
    
    with open(cache_file, 'w') as f:
        json.dump(response, f)
    
    return response
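
The file cache above never expires, so cached answers can outlive updates to PubChem. A time-to-live (TTL) is one way to bound staleness. The sketch below is an in-memory variant with an injectable clock; `TTLCache` is a hypothetical class, not part of the module.

```python
import time

class TTLCache:
    """Small in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, or call compute(key) and store it."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute(key)
        self._store[key] = (now, value)
        return value
```

With the module imported, `cache.get_or_compute(query, query_chemistry_related)` would reuse a response until its TTL elapses. The cache is per-process, so it complements the file cache but does not replace it.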

Batch Processing

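from typing import List

from plan_execute_agent.pubchem_rag.extract_terms import extract_chemistry_terms
from plan_execute_agent.pubchem_rag.llm_response import generate_llm_response
from plan_execute_agent.pubchem_rag.pubchem_fetcher import fetch_pubchem_data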

def batch_rag_queries(queries: List[str]):
    """
    Process multiple queries efficiently
    """
    # Collect all unique terms
    all_terms = set()
    for query in queries:
        terms = extract_chemistry_terms(query)
        all_terms.update(terms)
    
    # Single batch fetch
    context = fetch_pubchem_data(list(all_terms))
    
    # Generate responses
    results = []
    for query in queries:
        # Reuse the shared context (covering all queries' terms) for each query
        response = generate_llm_response(context, query)
        results.append(response)
    
    return results
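
The batch function above passes the combined context for all queries to every LLM call, which can add unrelated compounds to a prompt. An alternative is to fetch each unique term once and then give each query only the context for its own terms. The sketch below takes the stages as parameters; `batch_with_per_query_context` and `fetch_one` are hypothetical names.

```python
def batch_with_per_query_context(queries, extract, fetch_one, generate):
    """Fetch each unique term once, then build a per-query context."""
    terms_by_query = [extract(query) for query in queries]

    # Fetch every distinct term exactly once
    context_by_term = {}
    for terms in terms_by_query:
        for term in terms:
            if term not in context_by_term:
                context_by_term[term] = fetch_one(term)

    results = []
    for query, terms in zip(queries, terms_by_query):
        # dict.fromkeys drops repeated terms while preserving order
        parts = [context_by_term[t] for t in dict.fromkeys(terms) if context_by_term[t]]
        results.append(generate("\n\n".join(parts), query))
    return results
```

With the module's functions, `fetch_one` could be `lambda term: fetch_pubchem_data([term])`, with `extract_chemistry_terms` and `generate_llm_response` as the other two stages.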

Best Practices

  • Use specific chemical names when possible
  • Include context about what information you need
  • Mention related compounds for comparative queries
  • Be explicit about desired properties
  • Cache frequent queries to avoid API rate limits
  • Batch similar queries together
  • Use specific terms to reduce irrelevant results
  • Consider local database for high-volume usage
  • Verify critical information from multiple sources
  • Cross-reference with original PubChem entries
  • Be aware of data currency (last update dates)
  • Validate chemical structures independently
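
The rate-limit advice above can be sketched as a small client-side throttle. PubChem's usage policy has asked clients to stay at or below 5 requests per second; treat that figure as an assumption and check the current policy. `Throttle` is a hypothetical class, not part of the module.

```python
import time

class Throttle:
    """Space out calls so that at most max_per_second start each second."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
```

Calling `throttle.wait()` immediately before each `requests.get(...)` keeps a single process under the configured rate. It does not coordinate across multiple processes.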

Limitations

  • Requires internet connection to PubChem API
  • Subject to PubChem API rate limits
  • Term extraction may miss domain-specific terminology
  • Quality depends on PubChem data completeness
  • Not all compounds are in PubChem database
