Method Signature

def website_crawler(self, url, my_bar):

Overview

The website_crawler method is the main entry point for crawling websites. It retrieves all content from a website, checks the Redis cache for existing uploads, and uploads new content to the configured OpenAI vector store.

Parameters

url
str
required
The base URL of the website to crawl. Must be a valid HTTP/HTTPS URL.
my_bar
streamlit.delta_generator.DeltaGenerator
required
Streamlit progress bar component for displaying crawl progress. Should support .progress(value, text) method.

Return Value

This method does not return a value. It performs side effects:
  • Updates Redis cache with URL-to-file-ID mappings
  • Creates files in the data/ directory
  • Uploads content to OpenAI vector stores
  • Updates the progress bar

Behavior

Cache Check

The method first checks if the URL has been previously crawled:
if file_id := self.r.get(url):
    self.r.zadd("vs_files", {file_id: int(time.time())})
    return
If found in cache, it updates the timestamp in the vs_files sorted set and returns early without re-crawling.

Crawling Process

  1. Fetch Content: Calls get_website_data() to crawl the entire website
  2. Save Locally: Creates a text file in data/ directory named after the domain
  3. Upload to OpenAI: Creates a file object with purpose “assistants”
  4. Cache Mapping: Stores bidirectional URL ↔ file ID mapping in Redis
  5. Vector Store: Adds the file to the configured vector store
  6. Track File: Adds vector store file to vs_files sorted set with timestamp

Implementation Details

def website_crawler(self, url, my_bar):
    base_url = url

    # Cache check first: skip re-crawling URLs that were already uploaded
    if file_id := self.r.get(url):
        self.r.zadd("vs_files", {file_id: int(time.time())})
        return

    data = self.get_website_data(base_url, my_bar)

    # Save crawled content locally before uploading to the vector store
    file_name = urlparse(base_url).netloc + ".txt"
    os.makedirs('data', exist_ok=True)
    with open('data/' + file_name, "w") as text_file:
        text_file.write(data)

    with open('data/' + file_name, "rb") as upload_file:
        file_ = self.client.files.create(
            file=upload_file, purpose="assistants"
        )

    # Map URL to file ID and file ID to URL
    self.r.set(url, file_.id)
    self.r.set(file_.id, url)

    vector_store_file = self.client.beta.vector_stores.files.create(
        vector_store_id=self.vector_storage_id, file_id=file_.id
    )

    self.r.zadd(
        "vs_files", {vector_store_file.id: int(vector_store_file.created_at)}
    )

Redis Data Structure

Key-Value Mappings

  • url → file_id: Maps website URLs to OpenAI file IDs
  • file_id → url: Reverse mapping for lookups
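
The bidirectional mapping can be sketched as follows. A plain dict stands in for the Redis client (the real method calls `self.r.set()` twice); `map_url_and_file` and `lookup` are hypothetical helper names used only for illustration:

```python
from typing import Dict, Optional

# Plain dict simulating the Redis key-value store (self.r in the real code)
cache: Dict[str, str] = {}

def map_url_and_file(url: str, file_id: str) -> None:
    """Store both directions, mirroring the two r.set() calls."""
    cache[url] = file_id   # url -> file_id: lets the cache check skip re-crawls
    cache[file_id] = url   # file_id -> url: resolves file IDs back to sources

def lookup(key: str) -> Optional[str]:
    return cache.get(key)

map_url_and_file("https://example.com", "file-abc123")
```

Because both directions are stored as ordinary string keys, a single `get` answers either question: "has this URL been uploaded?" or "which URL produced this file?"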

Sorted Set

  • Name: vs_files
  • Members: Vector store file IDs
  • Scores: Unix timestamps (creation or access time)
  • Purpose: Track files for cleanup and manage vector store lifecycle
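
The lifecycle tracking above can be sketched with a dict of member → score standing in for the Redis sorted set (the real code uses `r.zadd`). The helper names `track_file` and `oldest_files` are hypothetical, chosen to show the cleanup pattern the score enables:

```python
from typing import Dict, List

# member -> score, simulating the "vs_files" Redis sorted set
vs_files: Dict[str, int] = {}

def track_file(file_id: str, ts: int) -> None:
    """Equivalent of r.zadd("vs_files", {file_id: ts}); a re-add refreshes the score."""
    vs_files[file_id] = ts

def oldest_files(limit: int) -> List[str]:
    """Members with the lowest scores: the stalest files, i.e. cleanup candidates."""
    return sorted(vs_files, key=vs_files.get)[:limit]

track_file("vsf_1", 100)
track_file("vsf_2", 200)
track_file("vsf_1", 300)  # cache hit refreshes the timestamp, as in the cache check
```

Refreshing the score on every cache hit means a cleanup job that evicts the lowest-scored members removes only files that have not been accessed recently.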

File Storage

Crawled content is saved locally before upload:
  • Directory: data/
  • Filename: {domain}.txt (e.g., example.com.txt)
  • Format: Prettified HTML using BeautifulSoup
  • Purpose: Intermediate storage for OpenAI file upload
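
The filename derivation follows directly from the implementation (`urlparse(base_url).netloc + ".txt"`); the wrapper function below is a hypothetical helper added only to make the behavior testable in isolation:

```python
from urllib.parse import urlparse

def local_filename(base_url: str) -> str:
    """Derive the local dump path: data/{domain}.txt, as in website_crawler."""
    return "data/" + urlparse(base_url).netloc + ".txt"
```

Note that `netloc` ignores the path, so every URL on the same domain maps to the same file.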

Example Usage

import streamlit as st
from openai import OpenAI
from crawl_util import CrawlUtil

# Initialize components
client = OpenAI(api_key="sk-...")
vector_store_id = "vs_abc123"
progress_bar = st.progress(0)

# Create crawler instance
crawler = CrawlUtil(
    client=client,
    vector_storage_id=vector_store_id,
    progress_text="Analyzing company website..."
)

# Crawl a company website
try:
    crawler.website_crawler("https://example.com", progress_bar)
    st.success("Website crawled successfully!")
except Exception as e:
    st.error(f"Crawl failed: {e}")

Progress Tracking

The progress bar is updated during the crawling process:
  • Progress starts at 0 when crawling begins
  • Increments proportionally as each page is crawled
  • Reaches 100% when all pages in the base domain are processed
  • Progress is calculated based on links found on the homepage
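
The proportional calculation described above can be sketched as below. The exact increment logic lives inside `get_website_data()`, so this is an assumed simplification; the clamping matters because `st.progress()` rejects values outside [0, 1]:

```python
def progress_value(pages_done: int, total_links: int) -> float:
    """Fraction of homepage links processed, clamped to [0.0, 1.0]."""
    if total_links <= 0:
        return 1.0  # nothing to crawl: report completion immediately
    return min(pages_done / total_links, 1.0)
```

In the crawler this value would be passed to `my_bar.progress(value, text)` after each page.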

Error Handling

The method relies on underlying methods for error handling:
  • fetch_html() handles HTTP request failures and returns None
  • Pages that fail to fetch are skipped rather than aborting the crawl
  • File system operations may raise IOError or OSError
  • OpenAI API calls may raise openai.APIError exceptions
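
The skip-on-None pattern can be sketched as follows. The `fetch_html` stub below is a stand-in (the real one performs HTTP requests); `crawl_pages` is a hypothetical helper showing how None results are dropped without stopping the crawl:

```python
from typing import List, Optional

def fetch_html(url: str) -> Optional[str]:
    """Stand-in for the real fetch_html(): None signals a failed request."""
    fake_responses = {"https://example.com": "<html>ok</html>"}
    return fake_responses.get(url)

def crawl_pages(urls: List[str]) -> List[str]:
    """Collect successful fetches only; failures are silently skipped."""
    pages = []
    for url in urls:
        html = fetch_html(url)
        if html is None:
            continue  # failed request -> skip, matching the documented behavior
        pages.append(html)
    return pages
```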

Performance Considerations

  • Caching: Prevents redundant crawls and API calls
  • Breadth-first search: Ensures systematic coverage
  • Same-domain only: Limits scope to relevant content
  • Sequential crawling: Processes one page at a time to avoid overwhelming servers
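
The breadth-first, same-domain strategy can be sketched over an in-memory link graph (the `LINKS` dict below is a hypothetical stand-in for fetching and parsing each page; the real crawler discovers links via BeautifulSoup):

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical link graph standing in for live HTTP fetches
LINKS: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
}

def bfs_same_domain(start: str) -> List[str]:
    """Breadth-first crawl restricted to the start URL's domain, one page at a time."""
    domain = urlparse(start).netloc
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()   # FIFO queue -> breadth-first order
        order.append(url)
        for link in LINKS.get(url, []):
            # Same-domain filter keeps the crawl scoped to relevant content
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Processing one URL per loop iteration gives the sequential, server-friendly behavior noted above; a `seen` set prevents cycles such as the back-link from /a to the homepage.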