Method Signature
Overview
The `extract_company_from_url` static method extracts the primary domain name from a URL, handling various URL formats and TLD (top-level domain) structures. It is useful for deriving a company name from a website URL.
Parameters
The URL to extract the company name from. Can be with or without the protocol (http:// or https://). Examples:
- "https://www.example.com"
- "example.co.uk"
- "subdomain.example.com"
Return Value
The extracted company/domain name without TLD or subdomains. Examples:
- "example.com" → "example"
- "www.github.com" → "github"
- "api.stripe.co.uk" → "stripe"
Implementation
TLD Handling
The method handles different domain structures:
Simple Domains
Subdomains
Multi-part TLDs
URLs Without Protocol
Logic Breakdown
- Protocol Normalization: Adds `http://` if no protocol is present
- URL Parsing: Uses `urlparse` to extract the hostname
- Domain Splitting: Splits the hostname by dots into parts
- TLD Detection:
  - If the last part before the TLD is in `['co', 'com', 'net', 'org', 'gov']`, use the third-to-last part
  - Otherwise, use the second-to-last part
  - For simple domains (≤2 parts), use the first part
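The steps above can be sketched as follows. This is a minimal reconstruction from the listed rules, not the original source: the parameter name `url`, the constant `SECOND_LEVEL_LABELS`, and the exact fallback behavior are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Labels treated as the second level of a multi-part TLD (taken from the list above).
SECOND_LEVEL_LABELS = ["co", "com", "net", "org", "gov"]


class CrawlUtil:
    @staticmethod
    def extract_company_from_url(url: str) -> str:
        """Sketch: return the primary domain label of `url`."""
        original = url
        try:
            # Protocol normalization: urlparse only exposes a hostname
            # when the URL carries a scheme.
            if not url.startswith(("http://", "https://")):
                url = "http://" + url

            # URL parsing: hostname is lowercased and stripped of any port.
            hostname = urlparse(url).hostname
            if not hostname:
                return original  # assumed fallback when no hostname is found

            # Domain splitting.
            parts = hostname.split(".")

            # Simple domains such as "example.com": use the first part.
            if len(parts) <= 2:
                return parts[0]

            # Multi-part TLDs such as "co.uk": use the third-to-last part.
            if parts[-2] in SECOND_LEVEL_LABELS:
                return parts[-3]

            # Otherwise the second-to-last part is the domain name.
            return parts[-2]
        except ValueError:
            # Assumed fallback: return the input unchanged if parsing fails.
            return original


print(CrawlUtil.extract_company_from_url("example.com"))             # example
print(CrawlUtil.extract_company_from_url("https://www.github.com"))  # github
print(CrawlUtil.extract_company_from_url("api.stripe.co.uk"))        # stripe
```

The four structures listed under TLD Handling (simple domains, subdomains, multi-part TLDs, and URLs without a protocol) each map to one branch of this sketch.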
Example Usage
Use Cases
File Naming
Display Names
Cache Keys
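The three use cases above could look roughly like this. The helper names `report_filename`, `display_name`, and `cache_key`, as well as the file suffix and key layout, are hypothetical and chosen only for illustration. The value "stripe" stands in for the result that `extract_company_from_url` is documented to return for "api.stripe.co.uk".

```python
import re


def report_filename(company: str, extension: str = "json") -> str:
    """Hypothetical helper: build a filesystem-safe file name from a company name."""
    safe = re.sub(r"[^a-z0-9_-]", "_", company.lower())
    return f"{safe}_report.{extension}"


def display_name(company: str) -> str:
    """Hypothetical helper: turn a domain label into a human-readable title."""
    return company.replace("-", " ").title()


def cache_key(company: str, page: str = "home") -> str:
    """Hypothetical helper: build a namespaced cache key."""
    return f"crawl:{company}:{page}"


# Documented return value for "api.stripe.co.uk".
company = "stripe"

print(report_filename(company))  # stripe_report.json
print(display_name(company))     # Stripe
print(cache_key(company))        # crawl:stripe:home
```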
Limitations
- Does not handle all international TLD structures (e.g., `.co.jp`, `.com.br` with different patterns)
- May not correctly identify the company name for complex subdomain structures
- Falls back to returning the original URL if parsing fails
- Does not validate if the URL is actually valid or reachable
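To make the first two limitations concrete, here is a compact stand-in that follows the Logic Breakdown rules literally. The function name `extract_company` is hypothetical, and which inputs actually fail depends on the real implementation, so treat the results as illustrative of this sketch only.

```python
from urllib.parse import urlparse

SECOND_LEVEL_LABELS = ("co", "com", "net", "org", "gov")


def extract_company(url: str) -> str:
    """Compact stand-in for the documented rules (not the original code)."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    parts = (urlparse(url).hostname or "").split(".")
    if len(parts) <= 2:
        return parts[0]
    return parts[-3] if parts[-2] in SECOND_LEVEL_LABELS else parts[-2]


# A second-level label outside the list is mistaken for the domain name.
print(extract_company("example.ac.uk"))       # ac

# On shared hosting domains, the platform is returned instead of the site owner.
print(extract_company("username.github.io"))  # github
```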
Static Method Benefits
- No instance required: Can be called without creating a `CrawlUtil` object
- Pure function: No side effects, same input always produces same output
- Utility function: Useful independently of crawling functionality
- Testable: Easy to unit test with various URL inputs