Overview
Domain debugging is a multi-stage process that cleans, validates, and normalizes domains from public blocklists before DNS validation. This ensures the final BlackWeb list contains only properly formatted, valid domains.
Domains Debugging
The primary debugging step removes overlapping domains and performs homologation to Squid-Cache format.
The Problem: Redundant Domains
Public blocklists often contain redundant entries, such as a domain listed together with its own subdomains.
The Solution: Squid-Cache Format
BlackWeb uses Squid’s efficient domain matching format: .domain.com blocks domain.com AND all subdomains (*.domain.com).
- Removes redundant subdomain entries
- Strips IP address prefixes (0.0.0.0, 127.0.0.1, ::1)
- Removes www. prefixes
- Excludes invalid TLDs like .com alone
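The overlap removal described above can be sketched in Python. This is an illustrative sketch only, not the code in bwupdate.sh or domfilter.py; the function names `to_squid_format` and `remove_overlaps` are hypothetical. It assumes that an entry is redundant whenever one of its parent domains is already in the list.

```python
def to_squid_format(domain):
    """Return the domain in lowercase with exactly one leading dot."""
    return "." + domain.strip().lstrip(".").lower()


def remove_overlaps(domains):
    """Drop entries that are already covered by a parent entry in the list."""
    normalized = {to_squid_format(d) for d in domains}
    kept = []
    for entry in sorted(normalized):
        labels = entry.lstrip(".").split(".")
        # Candidate parents are the shorter suffixes of the name, e.g.
        # '.a.b.example.com' -> '.b.example.com', '.example.com', '.com'.
        parents = ("." + ".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in normalized for parent in parents):
            kept.append(entry)
    return kept


if __name__ == "__main__":
    sample = ["example.com", "sub.example.com", ".other.org", "a.b.other.org"]
    print(remove_overlaps(sample))
```

Because a bare TLD such as .com would count as a parent of every .com entry, a sketch like this only makes sense after standalone TLDs have been excluded.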
Implementation
bwupdate/bwupdate.sh
- Removes allowlisted domains (urls.txt)
- Strips non-printable characters
- Trims whitespace
- Adds leading dot if missing
- Runs Python domain filter (domfilter.py)
- Ensures ASCII-only output
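As a rough illustration of the steps listed above, the following Python sketch cleans individual lines and then subtracts an allowlist. It is not the shell pipeline from bwupdate.sh; `clean_entry`, `clean_list`, and `IP_PREFIXES` are hypothetical names, and the allowlist is passed in as a plain list rather than read from a file.

```python
IP_PREFIXES = {"0.0.0.0", "127.0.0.1", "::1"}


def clean_entry(raw):
    """Normalize one blocklist line; return None if nothing usable remains."""
    # Drop non-printable characters but keep whitespace so fields can be split.
    text = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    fields = text.split()  # also trims leading and trailing whitespace
    # Hosts-file style lines look like '0.0.0.0 example.com'.
    if fields and fields[0] in IP_PREFIXES:
        fields = fields[1:]
    if len(fields) != 1:
        return None
    domain = fields[0].lower().lstrip(".")
    # Require ASCII and at least one dot, so a bare TLD such as 'com' is rejected.
    if not domain.isascii() or "." not in domain:
        return None
    return "." + domain  # add the leading dot


def clean_list(lines, allowlist):
    """Clean every line, then subtract the cleaned allowlist entries."""
    allowed = {clean_entry(item) for item in allowlist}
    cleaned = {clean_entry(line) for line in lines}
    return sorted(cleaned - allowed - {None})
```

In this sketch non-ASCII entries are simply dropped; in the process described in this document they are handled separately by the Punycode stage.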
Allowlist Integration
BlackWeb excludes false positives using debugwl.txt:
- Google services: gmail.com, youtube.com, etc.
- Microsoft services: hotmail.com, outlook.com, etc.
- Yahoo services: yahoo.com, mail.yahoo.com, etc.
- University domains: From world university database
- Essential services: Banking, government, education sites
TLD Validation
Removes domains with invalid Top-Level Domains using a comprehensive list of public and private suffix TLDs.
TLD Types Supported
| TLD Type | Description | Examples |
|---|---|---|
| ccTLD | Country Code TLD | .us, .uk, .de, .jp |
| ccSLD | Country Code Second-Level Domain | .co.uk, .com.au |
| sTLD | Sponsored TLD | .gov, .edu, .mil |
| uTLD | Unsponsored TLD | .int, .arpa |
| gSLD | Generic Second-Level Domain | .com.co, .net.au |
| gTLD | Generic TLD | .com, .net, .org |
| eTLD | Effective TLD | .blogspot.com |
| 4LD | Fourth-Level Domain | .edu.co.uk |
Input Example
Output Example
.exe is not a valid TLD and gets removed.
Implementation
bwupdate/bwupdate.sh
- Loads allowed TLDs from lst/allowtlds.txt
- Creates regex pattern for exclusion
- Filters out government TLDs (.gov, .mil, etc.)
- Removes standalone TLDs (using tlds.txt)
- Produces clean final list
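The filtering logic above can be approximated with set lookups instead of a regex. This is a sketch under stated assumptions: `filter_tlds` is a hypothetical name, and the small suffix sets in the usage example stand in for the real contents of lst/allowtlds.txt, which are not shown here.

```python
def filter_tlds(entries, allowed_suffixes, excluded_suffixes=frozenset()):
    """Keep entries whose suffix is allowed and not explicitly excluded."""
    kept = []
    for entry in entries:
        name = entry.lstrip(".").lower()
        labels = name.split(".")
        # Every proper suffix of the name: 'shop.co.uk' -> {'co.uk', 'uk'}.
        suffixes = {".".join(labels[i:]) for i in range(1, len(labels))}
        if name in allowed_suffixes:
            continue  # standalone TLD or suffix, e.g. '.com'
        if suffixes & excluded_suffixes:
            continue  # e.g. government TLDs
        if suffixes & allowed_suffixes:
            kept.append(entry)
    return kept


if __name__ == "__main__":
    allowed = {"com", "uk", "co.uk", "gov"}   # illustrative subset only
    excluded = {"gov"}                         # illustrative subset only
    sample = [".example.com", ".setup.exe", ".com", ".agency.gov", ".shop.co.uk"]
    print(filter_tlds(sample, allowed, excluded))
```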
Government TLD Exclusion
BlackWeb specifically excludes government-related domains:
Input
Output
Debugging Punycode-IDN
Converts internationalized domain names (IDN) to Punycode/IDNA format for proper handling of non-ASCII characters.
What is Punycode?
Punycode is an encoding system that represents Unicode characters in ASCII, prefixed with xn--.
Why needed?
- DNS only supports ASCII characters
- International domains use Unicode (Cyrillic, Arabic, Chinese, etc.)
- Homograph attacks use lookalike characters (e.g., Cyrillic ‘а’ vs Latin ‘a’)
RFC 1035 Compliance
Removes hostnames exceeding 63 characters:
bwupdate/bwupdate.sh
- Hostnames > 63 characters (RFC 1035 sets the 63-character limit per label; a full name may be up to 255 octets)
- Uppercase letters (should be lowercase)
- Hyphens at start/end of labels
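The three checks above can be expressed as one per-label regular expression. This Python sketch is an assumption about how such a check could look, not the filter used in bwupdate.sh; `LABEL_RE` and `is_wellformed` are hypothetical names, and the 63-character limit is applied to each label, which is how RFC 1035 defines it.

```python
import re

# One label: 1 to 63 characters, lowercase letters, digits and hyphens,
# with no hyphen in the first or last position.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_wellformed(entry):
    """Return True if every label of the entry passes the checks above."""
    name = entry.lstrip(".")
    if not name:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in [".example.com", ".Example.com", ".-bad.com", ".bad-.com"]:
        print(candidate, is_wellformed(candidate))
```

Punycode labels such as xn--caf-dma pass this check, since hyphens are only rejected at the start or end of a label.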
IDN Conversion Process
bwupdate/bwupdate.sh
- Extract printable ASCII domains (no conversion needed)
- Extract non-ASCII domains
- Convert non-ASCII to Punycode using the idn2 tool
- Verify output is ASCII-only
- Ensure leading dot format
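For readers without idn2 at hand, the conversion step can be approximated with Python's built-in idna codec. Note the assumption: the built-in codec implements IDNA 2003, while idn2 implements IDNA 2008, so results can differ for some inputs. The function name `to_punycode` is hypothetical.

```python
def to_punycode(entry):
    """Convert a domain to its ASCII (IDNA) form in leading-dot format."""
    name = entry.strip().lstrip(".")
    # Encode label by label; pure ASCII labels pass through unchanged.
    labels = [label.encode("idna").decode("ascii") for label in name.split(".")]
    return "." + ".".join(labels)


if __name__ == "__main__":
    for domain in [".bücher.com", "café.fr", ".example.com"]:
        converted = to_punycode(domain)
        # Verify the output is ASCII-only, as in the steps above.
        print(domain, "->", converted, converted.isascii())
```

The leading dot is removed before encoding and added back afterwards, so the result stays in the Squid format used elsewhere in this document.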
Input Example
Output Example
президент.рф becomes xn--d1abbgf6aiiy.xn--p1ai
Character Mapping Examples
| Language | Character | Punycode | Full Domain |
|---|---|---|---|
| German | ü | bcher-kva | xn--bcher-kva.com |
| French | é | caf-dma | xn--caf-dma.fr |
| Spanish | ñ | espaa-rta | xn--espaa-rta.com |
| Turkish | ı | sendesk-wfb | xn--sendesk-wfb.com |
| Lithuanian | ų | mslaikas-qzb5f | xn--mslaikas-qzb5f.lt |
| Cyrillic | президент | d1abbgf6aiiy | xn--d1abbgf6aiiy.xn--p1ai |
Debugging non-ASCII Characters
The final cleanup removes corrupted entries and invalid encodings, and ensures strict ASCII compliance.
The Problem: Encoding Corruption
Public blocklists may contain:
- Corrupted UTF-8: Incomplete or broken multi-byte sequences
- CP1252: Windows-1252 encoding mixed with UTF-8
- ISO-8859-1: Latin-1 encoding
- Non-printable characters: Control characters, null bytes
- Invalid homograph attacks: Malformed lookalike characters
Input Example (Corrupted Encoding)
Output Example (Clean)
What Happened?
- Homograph attacks removed: ămăzon.com, googlə.com, etc.
- Corrupted encoding removed: 23andmê.com, òutlook.com, etc.
- Invalid characters removed: M-C$, -$, .$, etc.
- Legitimate domain preserved: google.com
Implementation
bwupdate/bwupdate.sh
- Detect current character encoding with file -bi
- Convert to UTF-8 with iconv, ignoring invalid sequences
- Filter to ASCII-only characters (\x00-\x7F)
- Save as blackweb.txt
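A simplified Python version of this final filter is sketched below. It is not the file and iconv pipeline itself: the encoding is passed in by the caller rather than detected, `keep_ascii_lines` is a hypothetical name, and invalid byte sequences are replaced rather than ignored. That last choice is deliberate in this sketch, so that a damaged line fails the ASCII test instead of being silently shortened into a different domain.

```python
def keep_ascii_lines(raw, encoding="utf-8"):
    """Decode raw bytes and keep only non-empty, ASCII-only lines."""
    # Invalid sequences become U+FFFD, which is not ASCII, so the
    # affected line is dropped by the filter below.
    text = raw.decode(encoding, errors="replace")
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if line and line.isascii():
            kept.append(line)
    return kept


if __name__ == "__main__":
    # The third line contains a lone Latin-1 byte, invalid as UTF-8.
    data = b".google.com\n.\xc4\x83m\xc4\x83zon.com\n.caf\xe9.fr\n\n.example.org\n"
    print(keep_ascii_lines(data))
```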
Earlier Stage Cleaning
bwupdate/bwupdate.sh
- /[^a-zA-Z0-9.-]/d: Remove lines with invalid characters
- /^[^a-zA-Z0-9.]/d: Remove lines starting with invalid chars
- /[^a-zA-Z0-9]$/d: Remove lines ending with invalid chars
- /^[[:space:]]*$/d: Remove empty lines
- /[[:space:]]/d: Remove lines with whitespace
- /^[[:space:]]*#/d: Remove comments
- /\.{2,}/d: Remove consecutive dots
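To see how these delete patterns behave together, they can be mirrored in Python. The sketch below is an illustrative translation, not the sed invocation from bwupdate.sh; `DELETE_PATTERNS` and `early_clean` are hypothetical names, and input lines are assumed to have no trailing newline.

```python
import re

# A line is deleted if any of these patterns matches somewhere in it.
DELETE_PATTERNS = [
    re.compile(r"[^a-zA-Z0-9.-]"),    # contains an invalid character
    re.compile(r"^[^a-zA-Z0-9.]"),    # starts with an invalid character
    re.compile(r"[^a-zA-Z0-9]$"),     # ends with an invalid character
    re.compile(r"^[ \t\r\n\f\v]*$"),  # empty or whitespace-only line
    re.compile(r"[ \t\r\n\f\v]"),     # contains whitespace
    re.compile(r"^[ \t\r\n\f\v]*#"),  # comment line
    re.compile(r"\.{2,}"),            # consecutive dots
]


def early_clean(lines):
    """Return the lines that none of the delete patterns match."""
    return [
        line for line in lines
        if not any(pattern.search(line) for pattern in DELETE_PATTERNS)
    ]


if __name__ == "__main__":
    sample = [".example.com", "exa mple.com", "# comment", "", "bad..dots.com"]
    print(early_clean(sample))
```

Some patterns overlap: the first one already rejects whitespace and the # character, so the later whitespace and comment patterns only act as a safety net here.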
Protocol Prefix Removal
- www.example.com → example.com
- ftp.example.com → example.com
- http.example.com → example.com
- smtp.example.com → example.com
These are subdomain removals, not protocol stripping. The domain www.example.com becomes example.com, then gets prefixed with . for Squid format: .example.com
Security: Homograph Attack Protection
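The prefix removal described above amounts to dropping one leading label when it matches a known service name. A minimal Python sketch, assuming only the four prefixes listed in this section (`SERVICE_PREFIXES` and `strip_service_prefix` are hypothetical names):

```python
SERVICE_PREFIXES = {"www", "ftp", "http", "smtp"}


def strip_service_prefix(domain):
    """Drop a leading service label and return the Squid-format entry."""
    name = domain.strip().lstrip(".").lower()
    first, _, rest = name.partition(".")
    # Only strip when a dotted name remains, so 'www.com' is left alone.
    if first in SERVICE_PREFIXES and "." in rest:
        name = rest
    return "." + name


if __name__ == "__main__":
    for domain in ["www.example.com", "smtp.example.com", "wwwexample.com"]:
        print(domain, "->", strip_service_prefix(domain))
```

Matching on a whole label rather than a string prefix keeps names like wwwexample.com intact.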
Homograph attacks use lookalike characters to mimic legitimate domains:
Examples of Homograph Attacks
| Fake Domain | Target | Attack Character |
|---|---|---|
| .ămăzon.com | amazon.com | Latin Small Letter A with Breve (ă) |
| .googlə.com | google.com | Latin Small Letter Schwa (ə) |
| .òutlook.com | outlook.com | Latin Small Letter O with Grave (ò) |
| .nętflix.com | netflix.com | Latin Small Letter E with Ogonek (ę) |
| .yăhoo.com | yahoo.com | Latin Small Letter A with Breve (ă) |
Visual Similarity
To the human eye:
- ămăzon looks like amazon
- googlə looks like google
- nętflix looks like netflix
To a computer:
- They’re completely different domains
- Used for phishing attacks
- Can bypass simple blocklists
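The difference is easy to make visible by listing the non-ASCII code points in a name. The following Python sketch uses the standard unicodedata module; `non_ascii_report` is a hypothetical helper for illustration and is not part of BlackWeb.

```python
import unicodedata


def non_ascii_report(domain):
    """List (character, code point, Unicode name) for non-ASCII characters."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in domain
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    fake = "\u0103m\u0103zon.com"   # the lookalike of amazon.com shown above
    print(fake == "amazon.com")     # the two strings are not equal
    for item in non_ascii_report(fake):
        print(item)
```

A plain string comparison against amazon.com fails for the lookalike, which is why a blocklist that only contains the ASCII name does not match it.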
BlackWeb Protection
BlackWeb handles homographs two ways:
- Valid IDN: Convert to Punycode
  café.fr → xn--caf-dma.fr (legitimate)
- Corrupted/Attack: Remove entirely
  ămăzon.com → REMOVED (phishing)
Character Set Normalization
Final conversion ensures charset=us-ascii:
bwupdate/bwupdate.sh
- DNS resolution
- Domain comparison
- Squid-Cache integration
- No encoding issues
- No corrupted entries
Summary of Debugging Stages
Why This Matters
Performance
A clean, optimized domain list means faster Squid lookups and lower memory usage
Accuracy
Removes false positives (legitimate sites) and invalid entries (dead domains)
Security
Detects and blocks homograph phishing attacks and IDN spoofing
Compatibility
Ensures domains work correctly with DNS, Squid, and all ASCII-based systems
