## Overview

The `CrawlUtil` class provides functionality for crawling websites, extracting content, and storing it in OpenAI vector stores with Redis-based caching. It handles breadth-first website traversal, HTML parsing, and progress tracking.
## Class Definition

### Initialization

The constructor takes:

- An OpenAI client instance for interacting with the OpenAI API and vector stores
- The ID of the OpenAI vector store where crawled content will be uploaded
- The text to display in the progress bar during crawling operations
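A minimal sketch of what the constructor might look like; the parameter and attribute names here are assumptions for illustration, not taken from the source:

```python
class CrawlUtil:
    # Names below (client, vector_store_id, progress_text) are assumed
    # for illustration; the real class may differ.
    def __init__(self, client, vector_store_id: str, progress_text: str):
        self.client = client                    # OpenAI client instance
        self.vector_store_id = vector_store_id  # target vector store ID
        self.progress_text = progress_text      # label for the progress bar
```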
## Redis Integration

The class uses a shared Redis connection for caching:

- Host: `localhost`
- Port: `6379`
- Database: `0`
## Caching Strategy

- URLs are mapped to file IDs to avoid re-uploading duplicate content
- File IDs are mapped back to URLs for reverse lookups
- A sorted set `vs_files` tracks vector store files with timestamps for cleanup
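The three cache structures above could be maintained as sketched here. The key prefixes (`url:`, `file:`) are assumptions for illustration; only the `vs_files` sorted set name comes from the source. `r` is any Redis-compatible client:

```python
import time

def cache_upload(r, url: str, file_id: str) -> None:
    """Record that `url` was uploaded as `file_id` (key names assumed)."""
    r.set(f"url:{url}", file_id)    # URL -> file ID, to skip re-uploads
    r.set(f"file:{file_id}", url)   # file ID -> URL, for reverse lookups
    # Timestamped entry in the vs_files sorted set, enabling age-based cleanup.
    r.zadd("vs_files", {file_id: time.time()})

def is_cached(r, url: str) -> bool:
    """True if this URL's content was already uploaded."""
    return r.exists(f"url:{url}") > 0
```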
## Key Features
- Breadth-first crawling: Systematically explores website structure
- Same-domain filtering: Only crawls links within the base domain
- Progress tracking: Integrates with Streamlit progress bars
- Redis caching: Prevents duplicate uploads of previously crawled sites
- Vector storage: Automatically uploads content to OpenAI vector stores
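The first two features, breadth-first traversal with same-domain filtering, can be sketched as follows. The function and parameter names are illustrative, not the class's actual API; link fetching is injected via `get_links` so the sketch stays network-free:

```python
from collections import deque
from urllib.parse import urlparse

def crawl_bfs(base_url: str, get_links) -> list[str]:
    """Visit pages breadth-first, restricted to the base URL's domain.

    get_links(url) -> list[str] is supplied by the caller (e.g. a
    fetch-and-parse step); injected here to keep the sketch self-contained.
    """
    base_domain = urlparse(base_url).netloc
    seen = {base_url}
    queue = deque([base_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Same-domain filter: ignore links that leave the base domain.
            if urlparse(link).netloc == base_domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```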
## Methods
- `website_crawler(url, my_bar)` - Main crawling method
- `extract_company_from_url(url)` - Static method for domain extraction
- `fetch_html(url)` - Fetches HTML content from a URL
- `parse_html_for_links(base_url, html_content)` - Extracts links from HTML
- `crawl_website(base_url, my_bar)` - Performs the crawling process
- `get_website_data(base_url, my_bar)` - Retrieves and formats crawled data
## Example Usage
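The body of this section appears to be missing. A hypothetical sketch of how the class might be driven from a Streamlit app, assuming a constructor that takes a client, a vector store ID, and progress-bar text (the vector store ID and URL are placeholders; exact signatures may differ):

```python
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Constructor arguments are assumptions based on the Initialization section.
crawler = CrawlUtil(client, "vs_abc123", "Crawling website...")

my_bar = st.progress(0, text="Crawling website...")
crawler.website_crawler("https://example.com", my_bar)
data = crawler.get_website_data("https://example.com", my_bar)
```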
## Dependencies
- `openai` - OpenAI Python SDK
- `redis` - Redis client for Python
- `requests` - HTTP library for fetching web content
- `beautifulsoup4` - HTML parsing library
- `urllib.parse` - URL manipulation utilities (standard library)
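The third-party packages can be installed with pip (`urllib.parse` ships with the Python standard library and needs no installation):

```shell
pip install openai redis requests beautifulsoup4
```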