The XRP Transaction Risk AI platform uses automated web crawling to gather business information from domains associated with wallet addresses. This intelligence feeds into the AI compliance analysis.

How web crawling works

The CrawlUtil class implements a breadth-first crawling algorithm that systematically explores websites:
web_crawler = CrawlUtil(
    client=client,
    vector_storage_id=vector_storage_id,
    progress_text="Operation in progress. Please wait."
)

web_crawler.crawl_website(f"https://{domain}", my_bar=my_bar)

Crawling workflow

1. Domain extraction

After wallet verification succeeds, the domain is extracted from the XRPScan account data:
verified, domain, twitter, balance, initial_balance = get_xrp_info(wallet_address)

2. Cache check

The system checks Redis cache to avoid re-crawling recently visited sites:
if file_id := self.r.get(url):
    self.r.zadd("vs_files", {file_id: int(time.time())})
    return  # Use cached version
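The same hit path can be sketched without a live Redis server, substituting a plain dict for each keyspace (the URL and file ID below are hypothetical):

```python
import time

cache = {"https://example.com": "file-abc"}  # URL -> uploaded file ID
touched = {}  # stand-in for the vs_files sorted set (file ID -> timestamp)

def lookup(url):
    """Return a cached file ID and refresh its timestamp, or None on a miss."""
    if file_id := cache.get(url):
        touched[file_id] = int(time.time())  # ZADD-style timestamp refresh
        return file_id
    return None

print(lookup("https://example.com"))     # hit: reuses the uploaded file
print(lookup("https://unseen.example"))  # miss: falls through to a fresh crawl
```

On a hit, the real implementation returns early and reuses the uploaded file; on a miss, it proceeds to the crawl below.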

3. Breadth-first crawl

The crawler visits the homepage and discovers all internal links:
def crawl_website(self, base_url, my_bar):
    visited = set()
    to_visit = [base_url]
    all_pages_content = []
    
    while to_visit:
        current_url = to_visit.pop(0)
        if current_url not in visited:
            html_content = self.fetch_html(current_url)
            if html_content:
                all_pages_content.append((current_url, html_content))
                links = self.parse_html_for_links(base_url, html_content)
                to_visit.extend(links - visited)
            visited.add(current_url)

4. Content extraction

Each page's HTML is parsed with BeautifulSoup, and the re-serialized markup is appended to a single aggregate string:
soup = BeautifulSoup(content, 'html.parser')
all_data += soup.prettify()

5. File creation and upload

The aggregated content is saved to a text file and uploaded to OpenAI:
file_name = urlparse(base_url).netloc + ".txt"
with open('data/' + file_name, "w") as text_file:
    text_file.write(data)

file_ = self.client.files.create(
    file=open('data/' + file_name, "rb"),
    purpose="assistants"
)

6. Vector storage

The file is added to an OpenAI vector store so the assistants can run semantic search over it:
vector_store_file = self.client.beta.vector_stores.files.create(
    vector_store_id=self.vector_storage_id,
    file_id=file_.id
)

Link discovery and filtering

The crawler discovers links on each page and keeps only those that point back into the same site:
def parse_html_for_links(self, base_url, html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    links = set()
    
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        
        # Handle relative URLs
        if href.startswith('/'):
            href = urljoin(base_url, href)
        elif not urlparse(href).netloc:
            href = urljoin(base_url, href)
        
        # Only keep internal links
        if base_url in href:
            links.add(href)
    
    return links
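The filtering rules can be exercised in isolation with a small stand-alone sketch. It uses the standard library's HTMLParser rather than BeautifulSoup so it carries no third-party dependency, but applies the same resolve-then-filter logic; the URLs are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects internal links, mirroring parse_html_for_links."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative URLs against the base, as the crawler does
        if href.startswith("/") or not urlparse(href).netloc:
            href = urljoin(self.base_url, href)
        # Only keep internal links
        if self.base_url in href:
            self.links.add(href)

base = "https://example.com"
html = (
    '<a href="/about">About</a>'
    '<a href="https://example.com/contact">Contact</a>'
    '<a href="https://other.org/page">External</a>'
)
collector = LinkCollector(base)
collector.feed(html)
print(sorted(collector.links))
```

The relative /about link is resolved against the base URL, while the link to other.org is dropped as external.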

Progress tracking

The crawler provides real-time progress updates in the Streamlit UI:
my_bar = st.progress(0, text=progress_text)

# Calculate progress per link
if current_url == base_url:
    to_links = set(links)
    if to_links:
        progress_each = 1 / len(to_links)

# Update progress bar
progress_total += progress_each
if progress_total <= 1.0:
    my_bar.progress(progress_total, text=self.progress_text)
else:
    my_bar.progress(1.0, text=self.progress_text)
Progress is computed from the number of links discovered on the homepage; each crawled link advances the bar by an equal share.
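The underlying arithmetic can be checked without Streamlit. A minimal sketch, assuming the homepage exposes five links:

```python
# Hypothetical set of links discovered on the homepage
links = ["/about", "/services", "/contact", "/pricing", "/blog"]

progress_each = 1 / len(links) if links else 0.01  # fallback for single-page sites
progress_total = 0.0
updates = []
for _ in links:
    progress_total += progress_each
    # Clamp at 1.0 so float accumulation never overflows the bar
    updates.append(min(progress_total, 1.0))
print(updates)
```

Each value in updates corresponds to one my_bar.progress() call in the real UI, ending at a full bar.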

Caching strategy

The crawler uses Redis to cache crawled content and avoid redundant work:

Cache structure

self.r = Redis(host='localhost', port=6379, db=0)

# Map URL to file ID
self.r.set(url, file_.id)

# Map file ID to URL (reverse lookup)
self.r.set(file_.id, url)

# Track file timestamps in sorted set
self.r.zadd("vs_files", {vector_store_file.id: int(vector_store_file.created_at)})

Benefits of caching

  • Faster assessments: Instant retrieval for previously crawled domains
  • Reduced API calls: Fewer requests to OpenAI file upload API
  • Cost savings: Avoid re-uploading duplicate content
  • Rate limit protection: Prevents hitting API throttles
Cache entries are timestamped using a sorted set (vs_files), allowing for cache expiration policies based on age.
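An age-based eviction pass over that sorted set could look like the following sketch. It emulates vs_files with a plain dict so it runs without Redis; the file IDs and the 30-day retention window are hypothetical. Against a live server, the cutoff query would map to ZRANGEBYSCORE / ZREMRANGEBYSCORE.

```python
import time

# Emulated vs_files sorted set: file ID -> unix-timestamp score
vs_files = {
    "file-abc": int(time.time()) - 90 * 86400,  # uploaded ~90 days ago
    "file-def": int(time.time()) - 1 * 86400,   # uploaded yesterday
}

MAX_AGE_SECONDS = 30 * 86400  # hypothetical 30-day retention policy

def expired_file_ids(entries, max_age):
    """Return file IDs whose timestamp score falls below the age cutoff."""
    cutoff = int(time.time()) - max_age
    return [fid for fid, ts in entries.items() if ts < cutoff]

stale = expired_file_ids(vs_files, MAX_AGE_SECONDS)
print(stale)  # only the 90-day-old file is eligible for deletion
```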

Company name extraction

The crawler extracts a clean company name from the domain:
company_name = web_crawler.extract_company_from_url(f"https://{domain}")

@staticmethod
def extract_company_from_url(url):
    parsed_url = urlparse(
        url if url.startswith(('http://', 'https://')) else 'http://' + url
    )
    hostname = parsed_url.hostname or url
    parts = hostname.split('.')
    
    # Handle different TLD structures
    if len(parts) > 2:
        if parts[-2] in ['co', 'com', 'net', 'org', 'gov']:
            domain = parts[-3]  # For co.uk, com.au, etc.
        else:
            domain = parts[-2]  # For regular subdomains
    else:
        domain = parts[0]  # For simple domains
    
    return domain

Examples

Input URL                    Extracted Name
https://example.com          example
https://www.example.com      example
https://example.co.uk        example
https://app.example.com      example
https://example.com.au       example
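These examples can be verified by exercising the method directly; the function below reproduces it as a stand-alone helper:

```python
from urllib.parse import urlparse

def extract_company_from_url(url):
    """Strip scheme, subdomains, and TLD to recover a bare company name."""
    parsed_url = urlparse(
        url if url.startswith(('http://', 'https://')) else 'http://' + url
    )
    hostname = parsed_url.hostname or url
    parts = hostname.split('.')
    if len(parts) > 2:
        if parts[-2] in ['co', 'com', 'net', 'org', 'gov']:
            domain = parts[-3]  # e.g. example.co.uk -> example
        else:
            domain = parts[-2]  # e.g. app.example.com -> example
    else:
        domain = parts[0]       # e.g. example.com -> example
    return domain

for url in ("https://example.com", "https://www.example.com",
            "https://example.co.uk", "https://app.example.com",
            "https://example.com.au"):
    print(url, "->", extract_company_from_url(url))
```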

Integration with AI assistants

The crawled content is used by all three AI assistants:
company_name = web_crawler.extract_company_from_url(f"https://{domain}")

# Summary prompt
summary_prompt = f"Provide a brief summary of the financial regulations relevant to the company: {company_name}"

# Report prompt
report_prompt = f"Identify any financial compliance red flags in the company data: {company_name} that might affect their business compliance."

# Resource prompt
resource_prompt = f"List the relevant financial regulatory documents for the company: {company_name}"
The assistants use the vector storage to perform semantic search over the crawled content, finding relevant information about:
  • Business operations and services
  • Regulatory disclosures
  • Licensing information
  • Terms of service and policies
  • Company structure and jurisdiction

Error handling

The crawler gracefully handles various error conditions:
def fetch_html(self, url):
    try:
        res = requests.get(url, timeout=10)  # bound the wait on unresponsive hosts
        if res.status_code == 200:
            return res.text
        else:
            return None
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
Failed requests return None and the crawler continues with other pages.
If the wallet verification doesn’t return a domain:
if not domain:
    st.error("No info")
    return
The crawling step is skipped entirely.
If the homepage has no internal links:
if to_links:
    progress_each = 1 / len(to_links)
else:
    progress_each = 0.01  # Small value for single-page sites
Progress still completes successfully.

Performance considerations

Crawl depth limitation

The crawler only visits links discovered on the homepage:
while to_visit:
    current_url = to_visit.pop(0)
    if current_url not in to_links and current_url != base_url:
        break  # Stop after homepage links are processed
This prevents excessive crawling while capturing the most important pages (About, Services, Contact, etc.).
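The effect of the limit can be sketched against a toy link graph (the URLs and page structure here are invented for illustration):

```python
# Toy site graph: each page maps to the links it contains
site = {
    "https://example.com": ["https://example.com/about",
                            "https://example.com/services"],
    "https://example.com/about": ["https://example.com/team"],        # depth 2
    "https://example.com/services": ["https://example.com/pricing"],  # depth 2
}

def crawl_one_level(base_url):
    """Visit the homepage plus the links it contains, and nothing deeper."""
    to_links = set(site.get(base_url, []))
    visited = []
    to_visit = [base_url]
    while to_visit:
        current_url = to_visit.pop(0)
        # Stop condition from the crawler: only the homepage and its links
        if current_url not in to_links and current_url != base_url:
            break
        visited.append(current_url)
        to_visit.extend(site.get(current_url, []))
    return visited

print(crawl_one_level("https://example.com"))
```

The /team and /pricing pages sit one level deeper than the homepage links, so the loop breaks before reaching them.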

Why limit depth?

  • Speed: Faster assessments (typically completes in under 30 seconds)
  • Relevance: Homepage links usually point to key business information
  • Cost: Fewer pages = smaller files = lower OpenAI API costs
  • Politeness: Reduces server load on target websites
The crawler respects the one-level depth limit. It will not recursively follow links beyond what’s directly linked from the homepage.

Best practices

For optimal crawling results:
  • Ensure Redis is running before starting assessments
  • Use domains with well-structured navigation
  • Verify the domain is accessible publicly (not behind auth)
  • Allow 15-30 seconds for crawling to complete
  • Check the progress bar for crawl status

Dependencies

The web crawler requires these Python packages:
import requests          # HTTP client
from bs4 import BeautifulSoup  # HTML parsing
from urllib.parse import urljoin, urlparse  # URL handling
from redis import Redis  # Caching
from openai import OpenAI  # File upload and vector storage
All of these are installed as part of the platform's standard installation steps.
