Method Signature
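The signature itself is not shown in this document; the following is a minimal sketch based on the two parameters documented below. The parameter names `base_url` and `progress_bar` are assumptions, not the actual names.

```python
def website_crawler(base_url: str, progress_bar) -> None:
    """Crawl base_url, cache results in Redis, and upload new content
    to the configured OpenAI vector store.

    Parameter names here are illustrative; only their roles (a base URL
    and a Streamlit-style progress bar) are documented.
    """
    ...
```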
Overview
The `website_crawler` method is the main entry point for crawling websites. It retrieves all content from a website, checks the Redis cache for existing uploads, and uploads new content to the configured OpenAI vector store.
Parameters
- The base URL of the website to crawl. Must be a valid HTTP/HTTPS URL.
- Streamlit progress bar component for displaying crawl progress. Should support a `.progress(value, text)` method.

Return Value
This method does not return a value. It performs side effects:
- Updates Redis cache with URL-to-file-ID mappings
- Creates files in the `data/` directory
- Uploads content to OpenAI vector stores
- Updates the progress bar
Behavior
Cache Check
The method first checks whether the URL has been previously crawled. If a cached file ID is found, the method refreshes that file's timestamp in the `vs_files` sorted set and returns early without re-crawling.
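The cache check can be sketched as follows. The helper name `is_cached` is hypothetical, and a redis-py-style client (`get`/`zadd`) is assumed:

```python
import time

def is_cached(r, url: str) -> bool:
    """Return True if `url` was already crawled; refresh its timestamp.

    `r` is any object with get/zadd methods (e.g. a redis.Redis client).
    """
    file_id = r.get(url)
    if file_id is None:
        return False
    # Refresh the access-time score in the vs_files sorted set.
    r.zadd("vs_files", {file_id: time.time()})
    return True
```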
Crawling Process
- Fetch Content: Calls `get_website_data()` to crawl the entire website
- Save Locally: Creates a text file in the `data/` directory named after the domain
- Upload to OpenAI: Creates a file object with purpose “assistants”
- Cache Mapping: Stores a bidirectional URL ↔ file ID mapping in Redis
- Vector Store: Adds the file to the configured vector store
- Track File: Adds the vector store file to the `vs_files` sorted set with a timestamp
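The upload and bookkeeping steps might look like the sketch below. The helper name `upload_and_register` is hypothetical, and the `client.files.create` / `client.vector_stores.files.create` calls assume the current openai-python SDK shape:

```python
import time

def upload_and_register(client, r, path: str, url: str, vector_store_id: str) -> str:
    """Upload a crawled file to OpenAI and record the mappings in Redis.

    `client` is an openai.OpenAI-style client; `r` is a redis.Redis-style
    client. Both are passed in rather than constructed here.
    """
    with open(path, "rb") as fh:
        file_obj = client.files.create(file=fh, purpose="assistants")
    # Bidirectional URL <-> file ID mapping.
    r.set(url, file_obj.id)
    r.set(file_obj.id, url)
    # Attach the file to the vector store and track it for cleanup.
    vs_file = client.vector_stores.files.create(
        vector_store_id=vector_store_id, file_id=file_obj.id
    )
    r.zadd("vs_files", {vs_file.id: time.time()})
    return file_obj.id
```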
Implementation Details
Redis Data Structure
Key-Value Mappings
- `url` → `file_id`: Maps website URLs to OpenAI file IDs
- `file_id` → `url`: Reverse mapping for lookups
Sorted Set
- Name: `vs_files`
- Members: Vector store file IDs
- Scores: Unix timestamps (creation or access time)
- Purpose: Track files for cleanup and manage vector store lifecycle
File Storage
Crawled content is saved locally before upload:
- Directory: `data/`
- Filename: `{domain}.txt` (e.g., `example.com.txt`)
- Format: Prettified HTML using BeautifulSoup
- Purpose: Intermediate storage for OpenAI file upload
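The local save step can be sketched as follows. The helper name `save_crawled_html` is hypothetical; BeautifulSoup's `prettify()` and the `{domain}.txt` naming come from the description above:

```python
from pathlib import Path
from urllib.parse import urlparse

from bs4 import BeautifulSoup  # third-party: beautifulsoup4

def save_crawled_html(base_url: str, html: str, data_dir: str = "data") -> Path:
    """Write prettified HTML to {data_dir}/{domain}.txt and return the path."""
    domain = urlparse(base_url).netloc
    path = Path(data_dir) / f"{domain}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    pretty = BeautifulSoup(html, "html.parser").prettify()
    path.write_text(pretty, encoding="utf-8")
    return path
```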
Example Usage
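No example is given in the source, so the following is a hedged sketch. The `crawler` object is an assumption; the console progress bar is a stand-in that exposes the documented `.progress(value, text)` interface (in the real app this would come from Streamlit's `st.progress`):

```python
# Stand-in for a Streamlit progress bar, exposing the documented
# .progress(value, text) interface.
class ConsoleProgress:
    def progress(self, value, text=""):
        print(f"[{int(value * 100):3d}%] {text}")

bar = ConsoleProgress()
bar.progress(0.0, text="Starting crawl...")

# Hypothetical call -- `crawler` is an assumed object exposing the method:
# crawler.website_crawler("https://example.com", bar)
```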
Progress Tracking
The progress bar is updated during the crawling process:
- Progress starts at 0 when crawling begins
- Increments proportionally as each page is crawled
- Reaches 100% when all pages in the base domain are processed
- Progress is calculated based on links found on the homepage
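The proportional update described above can be sketched as follows; the helper name `report_progress` and the exact formula are assumptions:

```python
def report_progress(progress_bar, pages_done: int, total_links: int) -> None:
    """Update the bar proportionally.

    `total_links` is the number of links found on the homepage, per the
    description above; guard against division by zero and overshoot.
    """
    fraction = pages_done / total_links if total_links else 1.0
    progress_bar.progress(
        min(fraction, 1.0),
        text=f"Crawled {pages_done}/{total_links} pages",
    )
```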
Error Handling
The method relies on underlying methods for error handling:
- `fetch_html()` handles HTTP request failures and returns `None` for failed requests, which are then skipped
- File system operations may raise `IOError` or `OSError`
- OpenAI API calls may raise `openai.APIError` exceptions
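The skip-on-failure behavior can be sketched as below. The function name `crawl_with_handling` is hypothetical; only the `fetch_html()`-returns-`None` convention comes from the text above, while `IOError`/`OSError` and `openai.APIError` are left to propagate to the caller:

```python
def crawl_with_handling(fetch_html, urls):
    """Fetch each URL, skipping pages whose fetch returned None.

    fetch_html(url) is assumed to return the page HTML, or None for a
    failed request (which is then skipped, per the behavior above).
    """
    results = {}
    for url in urls:
        html = fetch_html(url)
        if html is None:  # failed request -> skip this page
            continue
        results[url] = html
    return results
```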
Performance Considerations
- Caching: Prevents redundant crawls and API calls
- Breadth-first search: Ensures systematic coverage
- Same-domain only: Limits scope to relevant content
- Sequential crawling: Processes one page at a time to avoid overwhelming servers
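The breadth-first, same-domain, sequential traversal described above can be sketched as follows; the function name `bfs_crawl` and the `fetch_links` callback are assumptions about how `get_website_data()` is structured:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_crawl(base_url, fetch_links, max_pages=100):
    """Breadth-first, same-domain crawl sketch.

    fetch_links(url) is assumed to return the hrefs found on a page.
    Pages are processed one at a time (sequentially), and links outside
    the base domain are ignored.
    """
    domain = urlparse(base_url).netloc
    seen = {base_url}
    order = []
    queue = deque([base_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # sequential: one page at a time
        for href in fetch_links(url):
            link = urljoin(url, href)
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```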