Overview

The extract_terms.py module uses SpaCy’s part-of-speech tagging to pull candidate chemistry terms out of user queries. It is the first step in the retrieval-augmented generation (RAG) pipeline: the extracted terms become the search terms used to query PubChem.

Dependencies

import spacy

# Requires the SpaCy English language model
nlp = spacy.load("en_core_web_sm")

The en_core_web_sm SpaCy model must be installed:

python -m spacy download en_core_web_sm

Functions

extract_chemistry_terms

Extracts chemistry-related terms from natural language text using part-of-speech tagging.
Parameters
  • query_text (str, required): The user’s chemistry question or query text

Returns
  • terms (list[str]): List of extracted chemistry-related terms (nouns and proper nouns with length > 2)

Extraction Logic

The function uses SpaCy’s part-of-speech (POS) tagging to identify relevant terms:
  • NOUN: Common nouns (e.g., “molecule”, “compound”, “reaction”)
  • PROPN: Proper nouns (e.g., “Benzene”, “Aspirin”)
  • Length filter: Only terms longer than 2 characters are kept, which drops short, uninformative tokens
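
The filter described above can be sketched as follows. This is a minimal illustration, not the module’s actual source: the helper name filter_terms and the stand-in Tok class are hypothetical, and the sketch assumes each matching token’s text is returned unchanged. SpaCy tokens expose .text and .pos_, so a parsed nlp(query_text) document could be passed in place of the hand-tagged list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Tok:
    """Hypothetical stand-in exposing the two token attributes the filter reads."""
    text: str
    pos_: str


def filter_terms(tokens: Iterable) -> List[str]:
    """Keep nouns and proper nouns whose text is longer than 2 characters."""
    return [
        tok.text
        for tok in tokens
        if tok.pos_ in {"NOUN", "PROPN"} and len(tok.text) > 2
    ]


# Hand-tagged tokens for "What is the structure of Aspirin?" (tags assumed).
tokens = [
    Tok("What", "PRON"), Tok("is", "AUX"), Tok("the", "DET"),
    Tok("structure", "NOUN"), Tok("of", "ADP"), Tok("Aspirin", "PROPN"),
]
print(filter_terms(tokens))  # ['structure', 'Aspirin']
```

Note that "the" is long enough to pass the length filter but is rejected by the POS check, so both conditions are doing work.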

Example Usage

from plan_execute_agent.pubchem_rag.extract_terms import extract_chemistry_terms

# Extract terms from a chemistry question
query = "What is the molecular structure of aspirin and how does it relate to benzene?"
terms = extract_chemistry_terms(query)
print(terms)
# Example output (exact terms depend on the model's POS tags): ['structure', 'aspirin', 'benzene']

Example Extractions

query = "Tell me about caffeine"
terms = extract_chemistry_terms(query)
# ['caffeine']

NLP Approach

Why SpaCy?

SpaCy provides:
  • Fast, production-ready NLP processing
  • POS tagging that is generally reliable on English text, though domain-specific terms can be misclassified
  • Language-agnostic architecture (currently using English model)

Limitations

  • Generic extraction: Filters for nouns but doesn’t validate if terms are actually chemistry-related
  • No chemical entity recognition: Doesn’t specifically identify IUPAC names or chemical formulas
  • Dependency on POS accuracy: Relies on SpaCy’s tagging, which may misclassify domain-specific terms
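
Two of these limitations can be seen with plain strings, without running SpaCy at all. The candidate list below is made up for illustration and is assumed to have already passed the NOUN/PROPN check: two-character names such as CO or O2 are discarded by the length filter, while an unrelated noun passes.

```python
# Candidate terms, assumed here to have already passed the NOUN/PROPN check.
candidates = ["CO", "O2", "pH", "H2O", "NaCl", "weather"]

# The length rule from the extraction logic: keep terms longer than 2 characters.
kept = [term for term in candidates if len(term) > 2]
dropped = [term for term in candidates if len(term) <= 2]

print(kept)     # ['H2O', 'NaCl', 'weather']
print(dropped)  # ['CO', 'O2', 'pH']
```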

Future Enhancements

Potential improvements could include:
  • Chemical named entity recognition (ChemNER)
  • IUPAC name validation
  • Chemical formula parsing
  • Domain-specific term filtering using chemistry lexicons
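
As one possible starting point for chemical formula parsing, a regular expression can check whether a term has the shape of a simple molecular formula. This is a rough sketch under stated assumptions: the name looks_like_formula is hypothetical, the pattern ignores parentheses, charges, and hydrates, and symbols are not checked against the periodic table.

```python
import re

# One or more element-like symbols (capital letter, optional lowercase letter),
# each optionally followed by a count. This is a shape check only.
FORMULA_SHAPE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def looks_like_formula(term: str) -> bool:
    """Return True if the whole term has the shape of a simple molecular formula."""
    return FORMULA_SHAPE.fullmatch(term) is not None


for term in ["H2O", "C6H12O6", "NaCl", "Aspirin", "caffeine"]:
    print(term, looks_like_formula(term))
```

Because the symbols are not validated, a capitalized word such as "Is" also matches, so a fuller version would likely compare each symbol against a list of element symbols.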

Integration in RAG Pipeline

This function is the first step called by query_chemistry_related():
  1. Extract terms (this module) → Extract candidate chemistry terms
  2. Fetch PubChem data → Use terms to query chemical databases
  3. Generate response → Synthesize answer using retrieved context
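
The three steps can be sketched as a small orchestration function. How query_chemistry_related() is actually implemented is not shown here, so everything below is an assumption: run_pipeline and the stub_* callables are hypothetical names, and the stubs stand in for SpaCy, PubChem, and the response generator so the sketch runs offline.

```python
from typing import Callable, Dict, List


def run_pipeline(
    query_text: str,
    extract: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    generate: Callable[[str, Dict[str, str]], str],
) -> str:
    """Chain the three steps: extract terms, fetch context per term, generate."""
    terms = extract(query_text)                       # step 1: extract terms
    context = {term: fetch(term) for term in terms}   # step 2: fetch data per term
    return generate(query_text, context)              # step 3: generate response


# Stub callables (hypothetical) so the sketch needs no model or network access.
def stub_extract(text: str) -> List[str]:
    return ["caffeine"]


def stub_fetch(term: str) -> str:
    return f"<record for {term}>"


def stub_generate(query: str, context: Dict[str, str]) -> str:
    return f"{query} | context: {context}"


answer = run_pipeline("Tell me about caffeine", stub_extract, stub_fetch, stub_generate)
print(answer)
```

Passing the steps in as callables keeps the term extractor swappable, which would make it easier to try the enhancements listed earlier without touching the rest of the pipeline.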

Source Location

plan_execute_agent/pubchem_rag/extract_terms.py:7
