PubChem RAG

The PubChem RAG (Retrieval-Augmented Generation) module enriches chemistry queries by automatically fetching relevant compound information from the PubChem database.

Overview

The RAG pipeline extracts chemistry terms from your query, retrieves compound data from PubChem, and uses this context to provide more informed responses.

Term Extraction

Extract chemistry-related terms using SpaCy NLP

PubChem Query

Fetch compound descriptions from PubChem API

Context Augmentation

Combine retrieved data with original query

LLM Response

Generate informed answer using augmented context

Basic Usage

Using the --use_rag Flag

python plan_execute_agent/rdkit_agent.py \
  --query "What are the properties of aspirin?" \
  --use_rag

Direct Module Usage

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

query = "Tell me about caffeine and its effects"
response = query_chemistry_related(query)

print(response)

How It Works

1. Term Extraction

The extract_terms.py module uses SpaCy to identify chemistry-related terms:

# From plan_execute_agent/pubchem_rag/extract_terms.py:7
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_chemistry_terms(query_text: str):
    """
    Extract chemistry-related terms from the query using SpaCy.
    """
    doc = nlp(query_text)
    # Initial filter for nouns and proper nouns through spacy
    terms = [
        token.text
        for token in doc
        if token.pos_ in {"NOUN", "PROPN"} and len(token.text) > 2
    ]
    return terms

Example:

from plan_execute_agent.pubchem_rag.extract_terms import extract_chemistry_terms

query = "What is the molecular weight of caffeine and theophylline?"
terms = extract_chemistry_terms(query)
print(terms)
# Output: ['weight', 'caffeine', 'theophylline']
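
The example output includes `weight`, a generic noun rather than a compound name, and every extracted term is later sent to PubChem. One way to trim such terms before fetching is a hand-maintained stoplist. The sketch below is only an illustration: `GENERIC_TERMS` and `filter_generic_terms` are hypothetical names, not part of the module.

```python
from typing import Iterable, List

# Hypothetical stoplist of generic nouns that are unlikely to be compound names.
GENERIC_TERMS = {"weight", "properties", "property", "effects", "structure", "solubility"}


def filter_generic_terms(terms: Iterable[str], stoplist=GENERIC_TERMS) -> List[str]:
    """Drop stoplisted terms (case-insensitive), keeping the original order."""
    return [t for t in terms if t.lower() not in stoplist]


print(filter_generic_terms(["weight", "caffeine", "theophylline"]))
# ['caffeine', 'theophylline']
```

A stoplist is crude and needs tuning per domain, but it avoids spending API calls on words that cannot match a compound.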

2. PubChem Data Fetching

The pubchem_fetcher.py module queries multiple PubChem API endpoints:

# From plan_execute_agent/pubchem_rag/pubchem_fetcher.py:6
import requests

PUBCHEM_API_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Note: extract_pubchem_descriptions(), used below, is not shown in this excerpt.

def fetch_pubchem_data(terms):
    """
    Query PubChem endpoints:
    - /compound/cid/<term>/description/JSON
    - /compound/formula/<term>/description/JSON
    - /compound/name/<term>/description/JSON
    """
    context = []
    for term in terms:
        endpoints = [
            f"{PUBCHEM_API_BASE}/compound/cid/{term}/description/JSON",
            f"{PUBCHEM_API_BASE}/compound/formula/{term}/description/JSON",
            f"{PUBCHEM_API_BASE}/compound/name/{term}/description/JSON",
        ]

        for url in endpoints:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                descriptions = extract_pubchem_descriptions(response.json())
                if descriptions:
                    label = url.replace(PUBCHEM_API_BASE, "")
                    context.append(
                        f"**Term**: {term}\n**Endpoint**: {label}\n{descriptions}"
                    )
            except requests.exceptions.RequestException:
                pass

    return "\n\n".join(context)
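
The excerpt interpolates each term directly into the URL path, so a term containing a space or a slash would produce a malformed or unintended path. A minimal sketch of building the same three URLs with percent-encoding follows. `build_description_urls` is a hypothetical helper, not a function in the module.

```python
from typing import List
from urllib.parse import quote

PUBCHEM_API_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_description_urls(term: str) -> List[str]:
    """Build the three description URLs for a term, percent-encoding the term."""
    # safe="" also encodes "/", so a term cannot add extra path segments.
    encoded = quote(term.strip(), safe="")
    return [
        f"{PUBCHEM_API_BASE}/compound/{namespace}/{encoded}/description/JSON"
        for namespace in ("cid", "formula", "name")
    ]


for url in build_description_urls("acetylsalicylic acid"):
    print(url)
```

The namespace order (cid, formula, name) matches the order used in the excerpt.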

3. Complete Pipeline

# From plan_execute_agent/pubchem_rag/query_chemistry.py:13
from .extract_terms import extract_chemistry_terms
from .pubchem_fetcher import fetch_pubchem_data
from .llm_response import generate_llm_response

def query_chemistry_related(query_text: str):
    """
    Parse the question, query PubChem for related terms, and generate an LLM response.
    """
    # Step 1: Parse chemistry-related terms from the question
    chemistry_terms = extract_chemistry_terms(query_text)
    if not chemistry_terms:
        return "No chemistry-related terms found in the query."

    print(f"Extracted Chemistry Terms: {chemistry_terms}")

    # Step 2: Query PubChem for information on the terms
    pubchem_context = fetch_pubchem_data(chemistry_terms)
    if not pubchem_context:
        return "No relevant information found on PubChem."

    # Step 3: Use the fetched context in the LLM
    response = generate_llm_response(pubchem_context, query_text)
    return response
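
The pipeline has two early exits: one when no terms are extracted and one when PubChem returns no context. To exercise those paths without SpaCy, network access, or an LLM, the same control flow can be written with the three stages passed in as arguments. This is a sketch for illustration: `run_pipeline` and the stub functions are hypothetical and not part of the module.

```python
from typing import Callable, List


def run_pipeline(
    query_text: str,
    extract: Callable[[str], List[str]],
    fetch: Callable[[List[str]], str],
    generate: Callable[[str, str], str],
) -> str:
    """Same control flow as query_chemistry_related, with the stages passed in."""
    terms = extract(query_text)
    if not terms:
        return "No chemistry-related terms found in the query."
    context = fetch(terms)
    if not context:
        return "No relevant information found on PubChem."
    return generate(context, query_text)


# Stub stages so the example runs offline; the strings are placeholders.
def fake_extract(text: str) -> List[str]:
    return [w for w in text.split() if w == "caffeine"]


def fake_fetch(terms: List[str]) -> str:
    return "stub context for " + ", ".join(terms)


def fake_generate(context: str, query: str) -> str:
    return f"[{context}] {query}"


print(run_pipeline("hello there", fake_extract, fake_fetch, fake_generate))
print(run_pipeline("about caffeine", fake_extract, fake_fetch, fake_generate))
```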

Integration with Agent

When using process_input() with RAG enabled (plan_execute_agent/rdkit_agent.py:334):

async def process_input(
    input_prompt: str, image_path: str = None, use_rag: bool = False
) -> tuple:
    # ...
    
    # Perform PUBCHEM_RAG on the original input query
    from plan_execute_agent.pubchem_rag.query_chemistry import (
        query_chemistry_related,
    )
    
    additional_info = ""
    if use_rag:
        additional_info = await asyncio.to_thread(
            query_chemistry_related,
            input_prompt + "\n" + extracted_text,
        )
        try:
            additional_info = str(additional_info["text"])
        except:
            additional_info = str(additional_info)

    print("Additional Info from PubChem RAG: ", additional_info)
    
    # The additional_info is then included in the agent prompt
    edited_prompt = (
        # ...
        + "\nHere is the additional information from PubChem regarding the original query:\n"
        + additional_info
        # ...
    )
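
The try/except in the excerpt accepts either a mapping with a "text" key or a plain string. The same normalization can be sketched without a bare `except`, under the assumption that the non-string case is a plain `dict`. `normalize_rag_output` is a hypothetical name and does not appear in rdkit_agent.py.

```python
from typing import Any


def normalize_rag_output(value: Any) -> str:
    """Return the "text" field of a dict result, otherwise str(value)."""
    if isinstance(value, dict) and "text" in value:
        return str(value["text"])
    return str(value)


print(normalize_rag_output({"text": "from a dict"}))  # from a dict
print(normalize_rag_output("already a string"))       # already a string
```

An explicit type check keeps unrelated errors, such as a `KeyboardInterrupt`, from being silently absorbed.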

Use Cases

Compound Identification

Get detailed information about named compounds

Property Lookup

Retrieve physical and chemical properties

Bioactivity Data

Access biological activity information

Safety Information

Fetch toxicity and hazard data

Examples

Compound Information

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

query = "What is the molecular weight and solubility of ibuprofen?"
response = query_chemistry_related(query)
print(response)

Drug Interactions

query = "Tell me about the interaction between aspirin and warfarin"
response = query_chemistry_related(query)
print(response)

Chemical Classes

query = "What are the properties of beta-lactam antibiotics like penicillin?"
response = query_chemistry_related(query)
print(response)

With Agent for Complete Workflow

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "Convert aspirin to SMILES and tell me about its properties"

result, completed, attempts, _, errors, _ = \
    asyncio.run(process_input(query, use_rag=True))

if completed:
    print(f"Result: {result}")
    print(f"Completed in {attempts} attempts")

Combining with Image Processing

Use both RAG and image extraction for comprehensive analysis:

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "What are the properties of this molecule?"
image_path = "unknown_compound.png"

result, completed, attempts, _, errors, _ = \
    asyncio.run(process_input(query, image_path=image_path, use_rag=True))

print(result)
# Workflow:
# 1. Extract structure from image (GPT-4o)
# 2. Convert to chemical name
# 3. Fetch PubChem data about the compound
# 4. Generate comprehensive response

PubChem API Endpoints

The module queries three endpoint types:

By CID
By Name
By Formula

# Compound ID lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/cid/{cid}/description/JSON"
# Example: /compound/cid/2244/description/JSON (aspirin)

# Chemical name lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/name/{name}/description/JSON"
# Example: /compound/name/caffeine/description/JSON

# Molecular formula lookup
endpoint = f"{PUBCHEM_API_BASE}/compound/formula/{formula}/description/JSON"
# Example: /compound/formula/C9H8O4/description/JSON
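
Each of these endpoints returns JSON from which the description text has to be pulled out. The sketch below shows one way to do that, assuming the response has the shape `{"InformationList": {"Information": [...]}}` with an optional `"Description"` field per entry. That shape is an assumption to verify against a live response. `collect_descriptions` is a hypothetical helper and is not the module's `extract_pubchem_descriptions`.

```python
from typing import Any, Dict, List


def collect_descriptions(payload: Dict[str, Any]) -> List[str]:
    """Collect non-empty "Description" strings from a description response.

    Assumes the shape {"InformationList": {"Information": [{...}, ...]}}.
    """
    entries = payload.get("InformationList", {}).get("Information", [])
    return [e["Description"] for e in entries if e.get("Description")]


# Illustrative payload with placeholder text, not real PubChem output.
sample = {
    "InformationList": {
        "Information": [
            {"CID": 2244, "Title": "Aspirin"},
            {"CID": 2244, "Description": "placeholder description text"},
        ]
    }
}
print(collect_descriptions(sample))
# ['placeholder description text']
```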

Custom Term Extraction

You can customize term extraction for domain-specific needs:

import spacy
from typing import List, Optional

nlp = spacy.load("en_core_web_sm")

def extract_custom_terms(
    query_text: str, entity_types: Optional[List[str]] = None
) -> List[str]:
    """
    Extract nouns and proper nouns, plus named entities of the given types.

    Args:
        query_text: Input query
        entity_types: SpaCy entity labels to add on top of the nouns and
            proper nouns (default: None, which adds no entities)
    """
    doc = nlp(query_text)
    
    terms = set()
    
    # Add nouns and proper nouns
    for token in doc:
        if token.pos_ in {"NOUN", "PROPN"} and len(token.text) > 2:
            terms.add(token.text)
    
    # Add named entities if specified
    if entity_types:
        for ent in doc.ents:
            if ent.label_ in entity_types:
                terms.add(ent.text)
    
    return list(terms)

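# Example: add CHEMICAL/DRUG entities on top of the nouns and proper nouns.
# en_core_web_sm does not emit these labels, so with this model the result is
# just the nouns; a model trained on chemical or biomedical text is needed.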
query = "Compare the efficacy of aspirin and ibuprofen for pain relief"
terms = extract_custom_terms(query, entity_types=["CHEMICAL", "DRUG"])
print(terms)
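
Because `extract_custom_terms` collects terms in a `set`, the order of the returned list is not guaranteed, and case variants such as "Aspirin" and "aspirin" are kept as separate terms. A small sketch of order-preserving, case-insensitive de-duplication follows. `dedupe_terms` is a hypothetical helper, not part of the module.

```python
from typing import Iterable, List


def dedupe_terms(terms: Iterable[str]) -> List[str]:
    """Remove case-insensitive duplicates, keeping the first spelling seen."""
    seen = set()
    unique = []
    for term in terms:
        key = term.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique


print(dedupe_terms(["aspirin", "ibuprofen", "Aspirin", "pain"]))
# ['aspirin', 'ibuprofen', 'pain']
```

Case folding is reasonable for compound names but not for molecular formulas, where case distinguishes elements, so formulas would need to bypass a step like this.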

Error Handling

from typing import Optional

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

# Messages that query_chemistry_related returns when it finds nothing
NO_RESULT_MESSAGES = (
    "No chemistry-related terms found in the query.",
    "No relevant information found on PubChem.",
)

def safe_rag_query(query_text: str, fallback: Optional[str] = None):
    """
    Query with error handling and fallback
    """
    try:
        response = query_chemistry_related(query_text)

        # Treat an empty response and both "nothing found" messages as misses
        if not response or response in NO_RESULT_MESSAGES:
            return fallback or "No information found"

        return response

    except Exception as e:
        print(f"RAG query failed: {e}")
        return fallback or "Query failed"

# Usage
result = safe_rag_query(
    "What is the structure of XYZ123?",
    fallback="Compound not found in PubChem"
)
print(result)
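
For failures that are transient rather than permanent, retrying with exponential backoff is a common complement to a fallback. The sketch below is generic and makes no claim about how the module behaves. `with_retries` is a hypothetical helper, and the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Demo: a function that fails twice and then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: List[float] = []
print(with_retries(flaky, sleep=delays.append), delays)
# ok [0.5, 1.0]
```

Since the `fetch_pubchem_data` excerpt swallows request errors, a wrapper like this only helps with exceptions that actually propagate out of `query_chemistry_related`, such as a failure in the LLM call.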

Performance Optimization

Caching Results

import json
import hashlib
from pathlib import Path

from plan_execute_agent.pubchem_rag.query_chemistry import query_chemistry_related

CACHE_DIR = Path("pubchem_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_rag_query(query_text: str):
    """
    Query with caching to avoid redundant API calls
    """
    # Generate cache key
    cache_key = hashlib.md5(query_text.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"
    
    # Check cache
    if cache_file.exists():
        with open(cache_file, 'r') as f:
            return json.load(f)
    
    # Query and cache
    response = query_chemistry_related(query_text)
    
    with open(cache_file, 'w') as f:
        json.dump(response, f)
    
    return response
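
`cached_rag_query` keys the cache on the exact query text, so two queries that differ only in letter case or spacing are stored and fetched separately. One possible refinement is to canonicalize the text before hashing. `normalized_cache_key` is a hypothetical helper, and the use of SHA-256 here is a choice of this sketch, not of the module.

```python
import hashlib


def normalized_cache_key(query_text: str) -> str:
    """Hash the query after lowercasing and collapsing runs of whitespace."""
    canonical = " ".join(query_text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = normalized_cache_key("What is  caffeine?")
b = normalized_cache_key("  what is caffeine? ")
print(a == b)  # True
```

Lowercasing has a cost in chemistry: formulas such as CO and Co would collide, so a query set that relies on formulas may prefer to collapse whitespace only.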

Batch Processing

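from typing import List

from plan_execute_agent.pubchem_rag.extract_terms import extract_chemistry_terms
from plan_execute_agent.pubchem_rag.llm_response import generate_llm_response
from plan_execute_agent.pubchem_rag.pubchem_fetcher import fetch_pubchem_data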

def batch_rag_queries(queries: List[str]):
    """
    Process multiple queries efficiently
    """
    # Collect all unique terms
    all_terms = set()
    for query in queries:
        terms = extract_chemistry_terms(query)
        all_terms.update(terms)
    
    # Single batch fetch
    context = fetch_pubchem_data(list(all_terms))
    
    # Generate responses
    results = []
    for query in queries:
        # Reuse the shared context, which covers the terms of all queries
        response = generate_llm_response(context, query)
        results.append(response)
    
    return results
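
`batch_rag_queries` passes one combined context to every query, so each prompt also carries material about terms from the other queries. An alternative is to fetch each unique term once and then assemble a context per query from only that query's terms. The sketch below takes the extractor and a single-term fetcher as arguments so it runs offline. `build_per_query_context` and the stubs are hypothetical names.

```python
from typing import Callable, Dict, List


def build_per_query_context(
    queries: List[str],
    extract: Callable[[str], List[str]],
    fetch_one: Callable[[str], str],
) -> List[str]:
    """Fetch each unique term once, then assemble a context per query."""
    cache: Dict[str, str] = {}
    contexts = []
    for query in queries:
        parts = []
        for term in extract(query):
            if term not in cache:
                cache[term] = fetch_one(term)  # one fetch per unique term
            if cache[term]:
                parts.append(cache[term])
        contexts.append("\n\n".join(parts))
    return contexts


# Stubs so the example runs without SpaCy or network access.
fetched: List[str] = []


def stub_fetch(term: str) -> str:
    fetched.append(term)
    return f"<context for {term}>"


def stub_extract(query: str) -> List[str]:
    return [w for w in query.split() if w in {"aspirin", "caffeine"}]


contexts = build_per_query_context(
    ["aspirin dose", "aspirin and caffeine"], stub_extract, stub_fetch
)
print(contexts)
print(fetched)  # ['aspirin', 'caffeine']
```

With the real module, `fetch_one` could be something like `lambda term: fetch_pubchem_data([term])`, and each resulting context would then be passed to `generate_llm_response` alongside its query.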

Best Practices

Query Formulation

Use specific chemical names when possible
Include context about what information you need
Mention related compounds for comparative queries
Be explicit about desired properties

Performance

Cache frequent queries to avoid API rate limits
Batch similar queries together
Use specific terms to reduce irrelevant results
Consider local database for high-volume usage
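
Caching and batching reduce the number of requests, and client-side throttling bounds their rate. PubChem's usage policy asks for no more than five requests per second, a figure worth confirming against the current policy. The sketch below enforces a minimum interval between calls. `Throttle` is a hypothetical class, and the `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=5)
for term in ["aspirin", "caffeine"]:
    throttle.wait()  # call this immediately before each requests.get(...)
    print("would fetch", term)
```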

Data Quality

Verify critical information from multiple sources
Cross-reference with original PubChem entries
Be aware of data currency (last update dates)
Validate chemical structures independently

Limitations

Requires an internet connection to reach the PubChem API
Subject to PubChem API rate limits
Term extraction may miss domain-specific terminology
Quality depends on the completeness of PubChem data
Not all compounds are in the PubChem database
