Method Signature

@staticmethod
def extract_company_from_url(url):

Overview

The extract_company_from_url static method extracts the primary domain name from a URL, handling various URL formats and TLD (Top-Level Domain) structures. It’s useful for identifying company names from website URLs.

Parameters

url
str
required
The URL to extract the company name from. Can be given with or without a protocol (http:// or https://). Examples:
  • "https://www.example.com"
  • "example.co.uk"
  • "subdomain.example.com"

Return Value

The extracted company/domain name, without the TLD or any subdomains. Examples:
  • "example.com" → "example"
  • "www.github.com" → "github"
  • "api.stripe.co.uk" → "stripe"

Implementation

# Requires urlparse from the standard library:
from urllib.parse import urlparse

@staticmethod
def extract_company_from_url(url):
    # Parse the URL
    parsed_url = urlparse(
        url if url.startswith(('http://', 'https://')) else 'http://' + url
    )
    # Get the hostname
    hostname = parsed_url.hostname or url
    # Split the hostname into parts
    parts = hostname.split('.')

    # Handle different TLD structures
    if len(parts) > 2:
        if parts[-2] in ['co', 'com', 'net', 'org', 'gov']:
            # For domains like co.uk, com.au, etc.
            domain = parts[-3]
        else:
            # For regular subdomains
            domain = parts[-2]
    else:
        # For simple domains
        domain = parts[0]

    return domain

TLD Handling

The method handles several common domain structures:

Simple Domains

CrawlUtil.extract_company_from_url("example.com")
# Returns: "example"

Subdomains

CrawlUtil.extract_company_from_url("www.example.com")
# Returns: "example"

CrawlUtil.extract_company_from_url("api.example.com")
# Returns: "example"

Multi-part TLDs

CrawlUtil.extract_company_from_url("example.co.uk")
# Returns: "example"

CrawlUtil.extract_company_from_url("example.com.au")
# Returns: "example"

CrawlUtil.extract_company_from_url("www.example.gov.uk")
# Returns: "example"

URLs Without Protocol

CrawlUtil.extract_company_from_url("example.com")
# Returns: "example"

CrawlUtil.extract_company_from_url("https://example.com")
# Returns: "example"

Logic Breakdown

  1. Protocol Normalization: Adds http:// if no protocol is present
  2. URL Parsing: Uses urlparse to extract the hostname
  3. Domain Splitting: Splits hostname by dots into parts
  4. TLD Detection:
    • For simple domains (two or fewer parts), use the first part
    • Otherwise, if the second-to-last part is in ['co', 'com', 'net', 'org', 'gov'], use the third-to-last part
    • Otherwise, use the second-to-last part
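The intermediate value at each of these steps can be inspected directly with the standard library (this trace mirrors the implementation above but does not import CrawlUtil):

```python
from urllib.parse import urlparse

url = "api.stripe.co.uk"

# Step 1: protocol normalization
normalized = url if url.startswith(('http://', 'https://')) else 'http://' + url
print(normalized)   # http://api.stripe.co.uk

# Step 2: hostname extraction via urlparse
hostname = urlparse(normalized).hostname
print(hostname)     # api.stripe.co.uk

# Step 3: split the hostname into its dot-separated parts
parts = hostname.split('.')
print(parts)        # ['api', 'stripe', 'co', 'uk']

# Step 4: 'co' is in the recognized second-level list,
# so the third-to-last part is taken as the company name
print(parts[-3])    # stripe
```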

Example Usage

from crawl_util import CrawlUtil

# Extract company names from various URL formats
company1 = CrawlUtil.extract_company_from_url("https://www.stripe.com")
print(company1)  # Output: "stripe"

company2 = CrawlUtil.extract_company_from_url("api.github.com")
print(company2)  # Output: "github"

company3 = CrawlUtil.extract_company_from_url("bbc.co.uk")
print(company3)  # Output: "bbc"

company4 = CrawlUtil.extract_company_from_url("shop.example.com.au")
print(company4)  # Output: "example"

Use Cases

File Naming

url = "https://www.acmecorp.com/about"
company = CrawlUtil.extract_company_from_url(url)
filename = f"{company}_data.txt"
# filename = "acmecorp_data.txt"

Display Names

def display_company_info(url):
    company = CrawlUtil.extract_company_from_url(url)
    return f"Analyzing {company.capitalize()}..."

# Usage
message = display_company_info("https://tesla.com")
print(message)  # "Analyzing Tesla..."

Cache Keys

def generate_cache_key(url):
    company = CrawlUtil.extract_company_from_url(url)
    return f"company:{company}:data"

key = generate_cache_key("https://www.openai.com")
# key = "company:openai:data"

Limitations

  • Recognizes only co, com, net, org, and gov as second-level labels, so other multi-part TLDs (e.g. .ac.uk, .ne.jp) are mis-parsed
  • May not correctly identify the company name for complex subdomain structures
  • Falls back to splitting the original URL string when urlparse cannot extract a hostname
  • Does not validate that the URL is well-formed or reachable

Static Method Benefits

  • No instance required: Can be called without creating a CrawlUtil object
  • Pure function: No side effects, same input always produces same output
  • Utility function: Useful independently of crawling functionality
  • Testable: Easy to unit test with various URL inputs
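Because the method is a pure function of its input, unit tests need no fixtures or mocking. A hypothetical test module might look like this (the extraction logic is inlined so the example runs standalone; in a real test suite you would import CrawlUtil instead):

```python
import unittest
from urllib.parse import urlparse

# Inlined copy of the extraction logic, standing in for
# CrawlUtil.extract_company_from_url in this standalone sketch.
def extract_company_from_url(url):
    parsed = urlparse(url if url.startswith(('http://', 'https://')) else 'http://' + url)
    parts = (parsed.hostname or url).split('.')
    if len(parts) > 2:
        return parts[-3] if parts[-2] in ['co', 'com', 'net', 'org', 'gov'] else parts[-2]
    return parts[0]

class TestExtractCompany(unittest.TestCase):
    def test_simple_domain(self):
        self.assertEqual(extract_company_from_url("example.com"), "example")

    def test_subdomain(self):
        self.assertEqual(extract_company_from_url("api.github.com"), "github")

    def test_multipart_tld(self):
        self.assertEqual(extract_company_from_url("bbc.co.uk"), "bbc")

# Run the tests programmatically
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExtractCompany)
unittest.TextTestRunner().run(suite)
```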
