Image Processing

ChemAgent can extract chemical names, formulas, structures, and other chemistry-related text from images using the gpt4o_chem_extract function.

Overview

The image processing pipeline uses GPT-4o’s vision capabilities to:

Extract chemical names and formulas from diagrams
Read SMILES strings from images
Identify structural features from molecular drawings
Process scanned documents and handwritten notes

Basic Usage

Standalone Image Extraction

import asyncio
from plan_execute_agent.rdkit_agent import gpt4o_chem_extract

# Extract chemistry text from an image
image_path = "molecule_diagram.png"
input_prompt = "Extract all chemical information from this image"

result = asyncio.run(gpt4o_chem_extract(input_prompt, image_path))
print(result)

With Agent Integration

The --image flag automatically integrates image extraction into the agent workflow:

python plan_execute_agent/rdkit_agent.py \
  --query "What is the IUPAC name of this molecule?" \
  --image "path/to/molecule.png"

How It Works

The image processing workflow (plan_execute_agent/rdkit_agent.py:257):

# Module-level imports this function relies on
import base64

import openai


async def gpt4o_chem_extract(input_prompt: str, image_path: str = None) -> str:
    """
    Calls GPT-4o with optional image input to extract relevant chemistry-related text.

    Args:
        input_prompt (str): The input query.
        image_path (str, optional): Path to the image file containing chemistry text.

    Returns:
        str: Extracted relevant chemistry-related text.
    """
    client = openai.AsyncOpenAI()

    messages = [
        {
            "role": "system",
            "content": "You are an expert chemistry assistant. Extract only chemical names, formulas, or related text from the given prompt or image.",
        },
        {"role": "user", "content": input_prompt},
    ]

    if image_path:
        with open(image_path, "rb") as image_file:
            image_data = base64.b64encode(image_file.read()).decode("utf-8")

        messages.append(
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Please extract relevant chemistry-related text from this image as well.",
                    },
                    {
                        "type": "image_url",
                        # The data URL is always labelled image/png, whatever the file's real format
                        "image_url": {"url": f"data:image/png;base64,{image_data}"},
                    },
                ],
            }
        )

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=500,
        temperature=0.2,
    )

    chem_extracted = response.choices[0].message.content.strip()
    return chem_extracted
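gpt4o_chem_extract labels every upload as image/png, even when the file is a JPEG or WebP. Below is a minimal sketch of building the data URL with a MIME type inferred from the file extension. It uses only the standard library. image_to_data_url is a hypothetical helper and is not part of ChemAgent.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Encode an image file as a data URL whose MIME type matches its extension.

    Hypothetical helper: gpt4o_chem_extract itself always uses image/png.
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {image_path!r}")
    # Read the raw bytes and base64-encode them into an ASCII string
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string would take the place of the f"data:image/png;base64,..." value in the "image_url" content part.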

Integration with Queries

When using process_input() with an image, the extracted text is automatically combined with your query:

async def process_input(
    input_prompt: str, image_path: str = None, use_rag: bool = False
) -> tuple:
    # ... (from rdkit_agent.py:318)
    
    # GPT-4o extraction (optional)
    extracted_text = ""
    if image_path:
        print("Extracting Chemistry text from image...")
        extracted_text = await gpt4o_chem_extract(input_prompt, image_path)
        print("Chemistry text extracted from image:", extracted_text)
    
    # The extracted text is combined with the original query
    edited_prompt = (
        "'" + input_prompt +
        (f"\nExtracted Chemistry Text from image: {extracted_text}\n"
         if extracted_text else "\n")
        + "..."
    )
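The visible part of the prompt assembly appends the extracted text only when extraction returned something. Here is a simplified sketch of that branch as a pure function, which makes the two cases easy to test. combine_prompt is hypothetical. It omits both the leading quote character and the instructions elided as "..." in the excerpt.

```python
def combine_prompt(input_prompt: str, extracted_text: str) -> str:
    """Simplified, hypothetical version of the visible part of edited_prompt.

    The real code also prefixes a quote character and appends further
    instructions (shown only as "..." in the excerpt); both are omitted here.
    """
    if extracted_text:
        # Image text is placed on its own line after the user's query
        return f"{input_prompt}\nExtracted Chemistry Text from image: {extracted_text}\n"
    # No image or nothing extracted: the query is followed by a bare newline
    return f"{input_prompt}\n"
```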

Supported Image Formats

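gpt4o_chem_extract sends whatever file it is given with an image/png label, so format support ultimately depends on the OpenAI vision API. OpenAI documents these input types:

PNG (.png)
JPEG (.jpg, .jpeg)
GIF (.gif, non-animated only)
WebP (.webp)

Other formats, such as BMP (.bmp), should be converted to PNG or JPEG before extraction.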
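A file's extension does not guarantee its format, and gpt4o_chem_extract labels every upload as image/png. Below is a minimal sketch of checking the actual format from the file's leading bytes ("magic numbers") before uploading. sniff_image_type and sniff_image_file are hypothetical helpers and are not part of ChemAgent. The signature table covers only the formats named in this section.

```python
from typing import Optional

# Leading byte signatures for the formats discussed above.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BM", "image/bmp"),  # short signature; may match non-image text starting with "BM"
]


def sniff_image_type(header: bytes) -> Optional[str]:
    """Return the MIME type implied by a file's first bytes, or None if unknown."""
    # WebP is a RIFF container: "RIFF" + 4-byte size + "WEBP"
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image/webp"
    for signature, mime_type in _SIGNATURES:
        if header.startswith(signature):
            return mime_type
    return None


def sniff_image_file(image_path: str) -> Optional[str]:
    """Read the first 12 bytes of a file and identify its image format."""
    with open(image_path, "rb") as f:
        return sniff_image_type(f.read(12))
```

A None or image/bmp result would be a cue to convert or reject the file before calling the extractor.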

Use Cases

Document OCR

Extract chemical data from scanned papers and patents

Structure Recognition

Read molecular structures from diagrams

Lab Notes

Process handwritten chemical formulas

Whiteboard Capture

Digitize reactions from classroom photos

Examples

Extract SMILES from Structural Diagram

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "Convert the molecule in this image to SMILES notation"
image_path = "benzene_structure.png"

result, completed, attempts, _, errors, _ = \
    asyncio.run(process_input(query, image_path=image_path))

if completed:
    print(f"SMILES: {result}")

Identify Compound from Image

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "What is the IUPAC name of this compound?"
image_path = "unknown_molecule.jpg"

result, completed, _, _, _, _ = \
    asyncio.run(process_input(query, image_path=image_path))

if completed:
    print(f"IUPAC name: {result}")

Extract Reaction from Scheme

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "What is the reaction shown in this scheme?"
image_path = "reaction_scheme.png"

result, completed, _, _, _, _ = \
    asyncio.run(process_input(query, image_path=image_path))

if completed:
    print(f"Reaction: {result}")

Batch Processing

import asyncio
import os
from plan_execute_agent.rdkit_agent import gpt4o_chem_extract

async def process_images(image_dir):
    results = {}
    
    for filename in os.listdir(image_dir):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(image_dir, filename)
            extracted = await gpt4o_chem_extract(
                "Extract all chemical information",
                image_path
            )
            results[filename] = extracted
    
    return results

# Usage
results = asyncio.run(process_images("molecule_images/"))
for filename, extracted in results.items():
    print(f"\n{filename}:")
    print(extracted)
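The loop above awaits each extraction before starting the next one, so images are processed one at a time. Below is a minimal sketch of running them concurrently with asyncio.gather, with an asyncio.Semaphore that caps how many API calls are in flight. The extractor is passed in as a parameter (for example gpt4o_chem_extract), and process_images_concurrently and max_concurrency are hypothetical names. A suitable limit depends on your OpenAI rate limits.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, Tuple

# Any coroutine function with the (prompt, image_path) -> str shape of gpt4o_chem_extract
Extractor = Callable[[str, str], Awaitable[str]]

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


async def process_images_concurrently(
    image_dir: str,
    extract: Extractor,
    prompt: str = "Extract all chemical information",
    max_concurrency: int = 4,
) -> Dict[str, str]:
    """Run extract(prompt, path) on every image, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(filename: str) -> Tuple[str, str]:
        async with semaphore:  # bound the number of in-flight API calls
            path = os.path.join(image_dir, filename)
            return filename, await extract(prompt, path)

    filenames = sorted(
        f for f in os.listdir(image_dir) if f.lower().endswith(IMAGE_EXTENSIONS)
    )
    # gather preserves input order, so pairs line up with filenames
    pairs = await asyncio.gather(*(run_one(f) for f in filenames))
    return dict(pairs)
```

Usage would mirror the sequential version, e.g. asyncio.run(process_images_concurrently("molecule_images/", gpt4o_chem_extract)).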

Combining with RAG

Use both image extraction and PubChem RAG for comprehensive analysis:

python plan_execute_agent/rdkit_agent.py \
  --query "What are the properties of this molecule?" \
  --image "molecule.png" \
  --use_rag

The same analysis from Python:

import asyncio
from plan_execute_agent.rdkit_agent import process_input

query = "Describe the biological activity of this compound"
image_path = "drug_molecule.png"

result, completed, attempts, _, _, _ = \
    asyncio.run(process_input(query, image_path=image_path, use_rag=True))

print(result)
# Combines:
# 1. Extracted chemical name/structure from image
# 2. PubChem data about the compound
# 3. LlaSMol analysis

Best Practices

Image Quality

Use high-resolution images (at least 300 DPI for scans)
Ensure good contrast between text/structures and background
Avoid blurry or distorted images
Crop to relevant content when possible

Query Formulation

Be specific about what to extract
Mention if image contains multiple compounds
Indicate expected format (SMILES, IUPAC, formula)
Provide context when ambiguity is possible

Validation

Always validate extracted SMILES with validate_smiles_rdkit
Cross-reference extracted names with databases
Review complex structures manually
Use multiple angles/views for 3D structures
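One inexpensive cross-reference check is to compare the molecular formula GPT-4o read from the image with the formula a database record reports. Below is a minimal regex-based sketch that counts atoms so that element order does not matter. parse_formula and formulas_match are hypothetical helpers. They handle only flat formulas, with no parentheses, hydrates, charges, or isotopes.

```python
import re
from collections import Counter

# An element symbol followed by an optional count, e.g. "C9", "H", "Cl2"
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C9H8O4'."""
    compact = formula.replace(" ", "")
    counts = Counter()
    position = 0
    for match in _ELEMENT.finditer(compact):
        if match.start() != position:  # skipped characters mean the input is malformed
            raise ValueError(f"Unparseable formula: {formula!r}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        position = match.end()
    if position != len(compact):
        raise ValueError(f"Unparseable formula: {formula!r}")
    return counts


def formulas_match(extracted: str, reference: str) -> bool:
    """True when two formulas contain the same atoms, regardless of order."""
    return parse_formula(extracted) == parse_formula(reference)
```

For SMILES strings, validate_smiles_rdkit remains the appropriate check. This sketch only covers formulas.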

Performance

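Run independent extractions concurrently (for example with asyncio.gather); awaiting them one by one in a loop, as in the batch example, still processes images sequentially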
Cache results for repeated queries
Batch similar images together
Consider preprocessing (crop, enhance) before extraction
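For repeated queries, a small in-memory cache avoids paying for the same GPT-4o call twice. Below is a minimal sketch that keys results on the prompt plus a SHA-256 hash of the image bytes, so a renamed copy of an image hits the cache but an edited image does not. CachedExtractor is a hypothetical wrapper. It does not persist across runs, and it does not de-duplicate two concurrent calls for the same key.

```python
import hashlib
from pathlib import Path
from typing import Awaitable, Callable, Dict, Tuple

Extractor = Callable[[str, str], Awaitable[str]]


class CachedExtractor:
    """Memoize extraction results by (prompt, SHA-256 of the image bytes)."""

    def __init__(self, extract: Extractor) -> None:
        self._extract = extract
        self._cache: Dict[Tuple[str, str], str] = {}

    async def __call__(self, prompt: str, image_path: str) -> str:
        # Hash file contents rather than the path so copies share one entry
        digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
        key = (prompt, digest)
        if key not in self._cache:
            self._cache[key] = await self._extract(prompt, image_path)
        return self._cache[key]
```

Usage: extract = CachedExtractor(gpt4o_chem_extract), then await extract("Extract all chemical information", "molecule.png") wherever gpt4o_chem_extract was called directly.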

Error Handling

import asyncio
import os
from plan_execute_agent.rdkit_agent import process_input

async def safe_image_processing(query, image_path):
    # Validate image path
    if not os.path.isfile(image_path):
        return {"error": f"Image not found: {image_path}"}
    
    # Reject very large files up front (OpenAI's vision guide has listed a 20 MB per-image limit)
    if os.path.getsize(image_path) > 20 * 1024 * 1024:  # 20MB
        return {"error": "Image too large (max 20MB)"}
    
    try:
        result, completed, attempts, _, errors, _ = \
            await process_input(query, image_path=image_path)
        
        if completed:
            return {"result": result, "attempts": attempts}
        else:
            return {"error": f"Processing failed: {errors}"}
    
    except Exception as e:
        return {"error": f"Exception: {str(e)}"}

# Usage
result = asyncio.run(safe_image_processing(
    "What is this molecule?",
    "molecule.png"
))

if "error" in result:
    print(f"Error: {result['error']}")
else:
    print(f"Success: {result['result']}")
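process_input tracks its own attempts, but a direct call to gpt4o_chem_extract fails on the first transient error, such as a rate limit or a dropped connection. Below is a minimal sketch of retrying with exponential backoff and a little random jitter. extract_with_retry is hypothetical, and retry_on defaults to every Exception for simplicity. With openai-python 1.x you would more likely pass retry_on=(openai.RateLimitError, openai.APIConnectionError) so that errors such as a missing file are not retried.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type

Extractor = Callable[[str, str], Awaitable[str]]


async def extract_with_retry(
    extract: Extractor,
    prompt: str,
    image_path: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Call extract(prompt, image_path), retrying failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await extract(prompt, image_path)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus up to 10% jitter
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")
```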

Limitations

Complex 3D structures may not be accurately interpreted
Handwritten formulas might be misread
Very low quality or heavily annotated images can cause issues
Stereochemistry in 2D projections may be ambiguous
Non-standard notation might not be recognized

Advanced Usage

Custom Extraction Prompts

from plan_execute_agent.rdkit_agent import gpt4o_chem_extract
import asyncio

async def extract_with_context(image_path, context):
    prompt = f"""
    Context: {context}
    
    Extract all chemical information from the image that relates to this context.
    Focus on:
    - SMILES representations
    - IUPAC names
    - Molecular formulas
    - Reaction conditions
    """
    
    result = await gpt4o_chem_extract(prompt, image_path)
    return result

# Usage
result = asyncio.run(extract_with_context(
    "synthesis_scheme.png",
    "Aspirin synthesis from salicylic acid"
))
print(result)
