Overview
The extract_terms.py module uses spaCy’s natural language processing capabilities to identify candidate chemistry terms in user queries. It is the first step in the RAG pipeline: the extracted terms become the search terms used to query PubChem.
Dependencies
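Going by the overview, the module depends on spaCy and an English pipeline. The snippet below is a minimal sketch for checking that both are importable. The model name en_core_web_sm is an assumption (the source only says an English model is used), and missing_dependencies is a hypothetical helper, not part of the module.

```python
import importlib.util
from typing import Iterable, List

# Assumed package names: "spacy" is the library itself; "en_core_web_sm" is a
# guess at the English model, which installs as an importable package.
REQUIRED_PACKAGES = ("spacy", "en_core_web_sm")


def missing_dependencies(packages: Iterable[str] = REQUIRED_PACKAGES) -> List[str]:
    """Return the names in `packages` that cannot be found on the import path."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("spaCy and the English model are available.")
```

spaCy is typically installed with pip, and its models with python -m spacy download followed by the model name.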
Functions
extract_chemistry_terms
Extracts chemistry-related terms from natural language text using part-of-speech tagging.
- Parameter: the user’s chemistry question or query text
- Returns: a list of extracted chemistry-related terms (nouns and proper nouns longer than 2 characters)
Extraction Logic
The function uses spaCy’s part-of-speech (POS) tagging to identify relevant terms:
- NOUN: Common nouns (e.g., “molecule”, “compound”, “reaction”)
- PROPN: Proper nouns (e.g., “Benzene”, “Aspirin”)
- Length filter: Only terms with more than 2 characters to avoid uninformative words
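The rules above can be sketched as a small filter over (text, POS tag) pairs. This is an illustration under stated assumptions, not the module's actual code: filter_tagged_tokens and extract_terms_sketch are hypothetical names, the model name en_core_web_sm is a guess, and using the raw token text (rather than, say, the lemma) is also an assumption.

```python
from typing import Iterable, List, Tuple

KEEP_POS = {"NOUN", "PROPN"}  # the coarse POS tags described above
MIN_LENGTH = 3                # "more than 2 characters"


def filter_tagged_tokens(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep tokens tagged NOUN or PROPN whose text has at least MIN_LENGTH characters."""
    return [text for text, pos in tagged
            if pos in KEEP_POS and len(text) >= MIN_LENGTH]


def extract_terms_sketch(query: str, model_name: str = "en_core_web_sm") -> List[str]:
    """Hypothetical spaCy wrapper around the filter; the model name is an assumption."""
    import spacy  # imported lazily so the filter above works without spaCy installed

    nlp = spacy.load(model_name)
    return filter_tagged_tokens((token.text, token.pos_) for token in nlp(query))


# Hand-assigned tags for illustration only (not real spaCy output).
hand_tagged = [
    ("What", "PRON"), ("is", "AUX"), ("the", "DET"), ("pH", "NOUN"),
    ("of", "ADP"), ("Aspirin", "PROPN"), ("solution", "NOUN"), ("?", "PUNCT"),
]
print(filter_tagged_tokens(hand_tagged))  # ['Aspirin', 'solution']
```

Note that "pH" is dropped even though it is tagged as a noun, because it is only two characters long.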
Example Usage
Example Extractions
NLP Approach
Why spaCy?
spaCy provides:
- Fast, production-ready NLP processing
- Accurate POS tagging for technical and scientific text
- Language-agnostic architecture (currently using English model)
Limitations
- Generic extraction: Filters for nouns but doesn’t validate if terms are actually chemistry-related
- No chemical entity recognition: Doesn’t specifically identify IUPAC names or chemical formulas
- Dependency on POS accuracy: Relies on spaCy’s tagging, which may misclassify domain-specific terms
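A short sketch makes the first two limitations concrete. It applies the documented rule (NOUN or PROPN, more than 2 characters) to hand-tagged tokens; the tags are assigned by hand for illustration and are not real tagger output, and passes_filter is a hypothetical helper.

```python
from typing import List, Tuple


def passes_filter(text: str, pos: str) -> bool:
    """The documented rule: noun or proper noun, longer than 2 characters."""
    return pos in {"NOUN", "PROPN"} and len(text) > 2


# Hand-assigned tags for illustration only.
hand_tagged: List[Tuple[str, str]] = [
    ("weather", "NOUN"),   # a noun, but not chemistry-related
    ("benzene", "NOUN"),
    ("O2", "PROPN"),       # a formula, but only 2 characters
    ("CO", "PROPN"),       # likewise
    ("oxidize", "VERB"),   # chemistry-related, but not a noun
]

kept = [text for text, pos in hand_tagged if passes_filter(text, pos)]
dropped = [text for text, pos in hand_tagged if not passes_filter(text, pos)]

print(kept)     # ['weather', 'benzene']
print(dropped)  # ['O2', 'CO', 'oxidize']
```

The unrelated noun "weather" passes, while the two-character formulas "O2" and "CO" are removed by the length filter regardless of how they are tagged.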
Future Enhancements
Potential improvements could include:
- Chemical named entity recognition (ChemNER)
- IUPAC name validation
- Chemical formula parsing
- Domain-specific term filtering using chemistry lexicons
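As a rough illustration of the last two ideas, the sketch below post-filters extracted terms with a tiny lexicon and a formula-shaped regular expression. Everything here is hypothetical: the lexicon contents, the pattern, and the function names are placeholders, and the pattern checks only the shape of a formula, not whether the element symbols are real.

```python
import re
from typing import Iterable, List, Set

# Hypothetical mini-lexicon; a real one would come from a chemistry vocabulary.
CHEM_LEXICON: Set[str] = {"benzene", "aspirin", "molecule", "compound", "reaction"}

# Shape-only pattern: repeated "capital letter, optional lowercase letter,
# optional digits" groups, e.g. H2O, NaCl, C6H6.
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def looks_like_formula(term: str) -> bool:
    """True if the whole term has the shape of a chemical formula."""
    return FORMULA_RE.fullmatch(term) is not None


def refine_terms(terms: Iterable[str]) -> List[str]:
    """Keep terms found in the lexicon (case-insensitive) or shaped like a formula."""
    return [t for t in terms if t.lower() in CHEM_LEXICON or looks_like_formula(t)]


print(refine_terms(["weather", "Benzene", "H2O", "question", "NaCl"]))
# ['Benzene', 'H2O', 'NaCl']
```

Short capitalized words such as "He" also match the pattern, so a practical version would need to validate symbols against a table of elements.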
Integration in RAG Pipeline
This function is the first step called by query_chemistry_related():
- Extract terms (this module) → Extract candidate chemistry terms
- Fetch PubChem data → Use terms to query chemical databases
- Generate response → Synthesize answer using retrieved context
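The three steps above can be sketched as a simple composition of callables. The actual signature and internals of query_chemistry_related() are not shown in this page, so answer_query_sketch and the stub functions below are hypothetical stand-ins that only illustrate the order of the steps.

```python
from typing import Callable, List


def answer_query_sketch(
    query: str,
    extract_terms: Callable[[str], List[str]],
    fetch_context: Callable[[List[str]], List[str]],
    generate_answer: Callable[[str, List[str]], str],
) -> str:
    """Run the three pipeline stages in order (hypothetical structure)."""
    terms = extract_terms(query)            # 1. extract candidate chemistry terms
    context = fetch_context(terms)          # 2. look the terms up (PubChem in the real pipeline)
    return generate_answer(query, context)  # 3. synthesize an answer from the context


# Stubs standing in for the real stages, so the sketch runs without network access.
def stub_extract(query: str) -> List[str]:
    return ["benzene"]


def stub_fetch(terms: List[str]) -> List[str]:
    return [f"placeholder record for {term}" for term in terms]


def stub_generate(query: str, context: List[str]) -> str:
    return f"{query} (answered from {len(context)} record(s))"


result = answer_query_sketch("What is benzene?", stub_extract, stub_fetch, stub_generate)
print(result)  # What is benzene? (answered from 1 record(s))
```

Passing the stages in as callables keeps each step testable in isolation, which is a design choice of this sketch rather than a description of the module.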
Source Location
plan_execute_agent/pubchem_rag/extract_terms.py:7