Overview

Domain debugging is a multi-stage process that cleans, validates, and normalizes domains from public blocklists before DNS validation. This ensures the final BlackWeb list contains only properly formatted, valid domains.

Domains Debugging

The primary debugging step removes overlapping domains and normalizes the remaining entries to the Squid-Cache format.

The Problem: Redundant Domains

Public blocklists often contain redundant entries:
com
.com
.domain.com
domain.com
0.0.0.0 domain.com
127.0.0.1 domain.com
::1 domain.com
domain.com.co
foo.bar.subdomain.domain.com
.subdomain.domain.com.co
www.domain.com
www.foo.bar.subdomain.domain.com
domain.co.uk
xxx.foo.bar.subdomain.domain.co.uk

The Solution: Squid-Cache Format

BlackWeb uses Squid’s efficient domain matching format:
.domain.com
.domain.com.co
.domain.co.uk
How it works:
  • .domain.com blocks domain.com AND all subdomains (*.domain.com)
  • Removes redundant subdomain entries
  • Strips IP address prefixes (0.0.0.0, 127.0.0.1, ::1)
  • Removes www. prefixes
  • Excludes bare TLDs such as .com on their own
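
A minimal sketch of these rules as a shell filter is shown here. It is an illustration only: the source runs this step through domfilter.py, whose internals are not shown, so the function name to_squid_format and the shortcut of treating every single-label entry as a bare TLD are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: normalize raw blocklist lines to the dotted Squid form
# and drop entries that are already covered by a parent domain.
# Steps: strip IP prefixes, leading dots and "www."; skip single labels and
# add the leading dot; reverse each line so parents sort directly before
# their subdomains; keep a line only if the last kept line is not its prefix.
to_squid_format() {
  sed -E \
      -e 's/^(0\.0\.0\.0|127\.0\.0\.1|::1)[[:space:]]+//' \
      -e 's/^\.+//' \
      -e 's/^www\.//' |
    awk 'index($0, ".") > 0 { print "." $0 }' |
    rev | LC_ALL=C sort -u |
    awk 'NR == 1 || index($0, kept) != 1 { print; kept = $0 }' |
    rev | LC_ALL=C sort
}

printf '%s\n' com .com domain.com '0.0.0.0 domain.com' www.domain.com \
  foo.bar.subdomain.domain.com domain.com.co domain.co.uk | to_squid_format
```

Reversing each line lets a plain byte-wise sort place a parent such as .domain.com immediately before its subdomains, so a single pass that remembers the last kept entry is enough to discard the covered ones.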

Implementation

bwupdate/bwupdate.sh
# DEBUGGING DOMAINS
echo "${bw07[$lang]}"
grep -Fvxf urls.txt capture.txt | sed 's/[^[:print:]\n]//g' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' | awk '{if ($1 !~ /^\./) print "." $1; else print $1}' | sort -u > cleancapture.txt
$wgetd https://raw.githubusercontent.com/maravento/vault/master/dofi/domfilter.py -O domfilter.py >/dev/null 2>&1
python domfilter.py --input cleancapture.txt
grep -Fvxf urls.txt output.txt | grep -P "^[\x00-\x7F]+$" | sort -u > finalclean
echo "OK"
The process:
  1. Removes allowlisted domains (urls.txt)
  2. Strips non-printable characters
  3. Trims whitespace
  4. Adds leading dot if missing
  5. Runs Python domain filter (domfilter.py)
  6. Ensures ASCII-only output
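
To see what steps 1 to 4 do to a few raw lines, the sketch below replays them on throwaway files. The sample data and the clean_capture wrapper are illustrative assumptions, and step 5 (domfilter.py) is left out.

```shell
#!/bin/sh
# Hypothetical wrapper around steps 1-4: $1 = allowlist file, $2 = captured domains.
clean_capture() {
  grep -Fvxf "$1" "$2" |
    sed 's/[^[:print:]]//g' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    awk 'NF { if ($1 !~ /^\./) print "." $1; else print $1 }' |
    LC_ALL=C sort -u
}

tmp=$(mktemp -d)
printf '%s\n' '.gmail.com' > "$tmp/urls.txt"
printf '%s\n' '.gmail.com' '  tracker.example  ' '.ads.example' 'tracker.example' > "$tmp/capture.txt"
clean_capture "$tmp/urls.txt" "$tmp/capture.txt"
rm -rf "$tmp"
```

Because grep is called with -F and -x, an allowlist entry only removes a line that is byte-for-byte identical to it, so the allowlist has to use the same dotted form as the captured list.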

Allowlist Integration

BlackWeb excludes false positives using debugwl.txt:
  • Google services: gmail.com, youtube.com, etc.
  • Microsoft services: hotmail.com, outlook.com, etc.
  • Yahoo services: yahoo.com, mail.yahoo.com, etc.
  • University domains: From world university database
  • Essential services: Banking, government, education sites

TLD Validation

This step removes domains whose suffix is not a valid Top-Level Domain, checking each entry against a comprehensive list of public and private suffixes.

TLD Types Supported

| TLD Type | Description | Examples |
| --- | --- | --- |
| ccTLD | Country Code TLD | .us, .uk, .de, .jp |
| ccSLD | Country Code Second-Level Domain | .co.uk, .com.au |
| sTLD | Sponsored TLD | .gov, .edu, .mil |
| uTLD | Unsponsored TLD | .int, .arpa |
| gSLD | Generic Second-Level Domain | .com.co, .net.au |
| gTLD | Generic TLD | .com, .net, .org |
| eTLD | Effective TLD | .blogspot.com |
| 4LD | Fourth-Level Domain | .edu.co.uk |

Input Example

.domain.exe
.domain.com
.domain.edu.co

Output Example

.domain.com
.domain.edu.co
Note: .exe is not a valid TLD and gets removed.
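
The source does not show how the suffix check itself is performed, so the following is only a sketch of the idea. It assumes a file with one valid suffix per line and compares just the last label of each domain; the name filter_valid_tld and the sample suffix list are hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: $1 = file listing valid suffixes, one per line (".com", ".co", ...).
# Reads dotted domains on stdin and prints those whose last label is listed.
filter_valid_tld() {
  awk 'NR == FNR { known[$0] = 1; next }
       {
         n = split($0, label, ".")
         key = "." label[n]
         if (key in known) print
       }' "$1" -
}

tlds=$(mktemp)
printf '%s\n' .com .co .org > "$tlds"
printf '%s\n' .domain.exe .domain.com .domain.edu.co | filter_valid_tld "$tlds"
rm -f "$tlds"
```

Checking only the last label is the simplest possible rule. Validating multi-label suffixes such as .co.uk would need a longest-suffix match against the full list.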

Implementation

bwupdate/bwupdate.sh
# TLD FINAL FILTER (exclude AllowTLDs .gov, .mil, etc.; delete TLDs and non-ASCII lines)
echo "${bw11[$lang]}"
regex_ext=$(grep -v '^#' lst/allowtlds.txt | sed 's/$/\$/' | tr '\n' '|')
new_regex_ext="${regex_ext%|}"
grep -E -v "$new_regex_ext" blackweb_tmp | sort -u > blackweb_tmp2
comm -23 <(sort blackweb_tmp2) <(sort tlds.txt) > blackweb.txt
This process:
  1. Loads allowed TLDs from lst/allowtlds.txt
  2. Creates regex pattern for exclusion
  3. Filters out government TLDs (.gov, .mil, etc.)
  4. Removes standalone TLDs (using tlds.txt)
  5. Produces clean final list
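
Step 4 relies on comm, which expects both inputs to be sorted in the same collation order. The sketch below wraps that requirement in a hypothetical helper, only_in_first, and uses temporary files instead of process substitution so that it also runs under a plain POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of file $1 that do not appear in file $2.
only_in_first() {
  a=$(mktemp) b=$(mktemp)
  LC_ALL=C sort -u "$1" > "$a"      # comm needs sorted input on both sides
  LC_ALL=C sort -u "$2" > "$b"
  LC_ALL=C comm -23 "$a" "$b"       # -23 hides lines unique to $2 and lines in both
  rm -f "$a" "$b"
}

work=$(mktemp -d)
printf '%s\n' .com .example.com .net > "$work/list"
printf '%s\n' .org .net .com > "$work/tlds"
only_in_first "$work/list" "$work/tlds"
rm -rf "$work"
```

Forcing LC_ALL=C on both sort and comm avoids the "input is not in sorted order" warnings that can appear when the two tools run under different locales.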

Government TLD Exclusion

BlackWeb specifically excludes government-related domains:

Input

.argentina.gob.ar
.mydomain.com
.gob.mx
.gov.uk
.navy.mil

Output

.mydomain.com
Rationale: Government domains should not be blocked to avoid breaking essential services and official websites.
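
The input and output above can be reproduced with a small sketch. The contents of lst/allowtlds.txt are not shown in the source, so the suffix list used here is an assumption, and build_suffix_regex is a hypothetical helper rather than part of bwupdate.sh.

```shell
#!/bin/sh
# Hypothetical helper: build an anchored ERE (for example '\.gov$|\.mil$')
# from a suffix file that may contain "#" comments and blank lines.
build_suffix_regex() {
  grep -Ev '^(#|$)' "$1" | sed 's/\./\\./g; s/$/$/' | paste -sd '|' -
}

allow=$(mktemp)
printf '%s\n' '# assumed entries' .gov .mil .gob.ar .gob.mx .gov.uk > "$allow"
printf '%s\n' .argentina.gob.ar .mydomain.com .gob.mx .gov.uk .navy.mil |
  grep -Ev "$(build_suffix_regex "$allow")"
rm -f "$allow"
```

Unlike the one-liner in bwupdate.sh, this variant escapes the dots, so a pattern such as \.gov$ matches only a literal ".gov" suffix instead of any character followed by "gov".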

Debugging Punycode-IDN

Converts internationalized domain names (IDN) to Punycode/IDNA format for proper handling of non-ASCII characters.

What is Punycode?

Punycode is an encoding that represents Unicode text using only ASCII characters. In DNS, each encoded label carries the xn-- prefix.
Why is it needed?
  • DNS only supports ASCII characters
  • International domains use Unicode (Cyrillic, Arabic, Chinese, etc.)
  • Homograph attacks use lookalike characters (e.g., Cyrillic ‘а’ vs Latin ‘a’)

RFC 1035 Compliance

Removes names that contain a label longer than 63 characters:
bwupdate/bwupdate.sh
# RFC 1035 Partial
sed '/[^.]\{64\}/d' stage1 \
| grep -vP '[A-Z]' \
| grep -vP '(^|\.)-|-($|\.)' \
| sed 's/^\.//g' \
| sort -u > stage2
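Filters:
  • Labels longer than 63 characters (the RFC 1035 per-label limit)
  • Names containing uppercase letters (entries are expected in lowercase)
  • Labels that start or end with a hyphen
  • The leading dot, which is stripped before IDN conversion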
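
The same checks can be written per label, which makes the 63-character rule explicit. This is a sketch that assumes one name per line; rfc1035_partial is a hypothetical name, and like the original pipeline it does not enforce the 255-character limit that RFC 1035 places on the full name.

```shell
#!/bin/sh
# Hypothetical filter: print only the names whose labels pass the partial checks.
rfc1035_partial() {
  awk '{
    name = $0
    sub(/^\./, "", name)                            # drop the leading dot
    if (name ~ /[A-Z]/) next                        # uppercase letters
    n = split(name, label, ".")
    for (i = 1; i <= n; i++) {
      if (length(label[i]) > 63) next               # label longer than 63 characters
      if (label[i] ~ /^-/ || label[i] ~ /-$/) next  # hyphen at label start or end
    }
    print name
  }' | LC_ALL=C sort -u
}

printf '%s\n' .ok.example -bad.example bad-.example Upper.example | rfc1035_partial
```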

IDN Conversion Process

bwupdate/bwupdate.sh
# DEBUGGING IDN
{ 
  LC_ALL=C grep -v '[^[:print:]]' stage2
  grep -P "[^[:ascii:]]" stage2 | idn2 
} | grep -P '^[\x00-\x7F]+$' \
  | awk '{if ($1 !~ /^\./) print "." $1; else print $1}' \
  | sort -u > capture.txt
Process:
  1. Extract printable ASCII domains (no conversion needed)
  2. Extract non-ASCII domains
  3. Convert non-ASCII to Punycode using idn2 tool
  4. Verify output is ASCII-only
  5. Ensure leading dot format
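
The split between steps 1 and 2 can be sketched with two byte-range filters. The helper names below are hypothetical, and the wrapper skips the conversion when the idn2 tool is not installed, which is an assumption added for portability rather than behavior of bwupdate.sh.

```shell
#!/bin/sh
# Hypothetical helpers: split a list by byte range in the C locale.
ascii_only() { LC_ALL=C grep -v '[^ -~]'; }   # lines holding only printable ASCII
needs_idn()  { LC_ALL=C grep '[^ -~]'; }      # lines holding at least one other byte

# Hypothetical wrapper: $1 = input file; conversion is skipped when idn2 is missing.
to_ascii_list() {
  {
    ascii_only < "$1"
    if command -v idn2 >/dev/null 2>&1; then
      needs_idn < "$1" | idn2
    fi
  } | ascii_only |
    awk 'NF { if ($1 !~ /^\./) print "." $1; else print $1 }' |
    LC_ALL=C sort -u
}
```

A call such as to_ascii_list stage2 > capture.txt would mirror the original step. Running the filters under LC_ALL=C makes them compare raw bytes, so every UTF-8 multi-byte character counts as non-ASCII.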

Input Example

bücher.com
café.fr
españa.com
köln-düsseldorfer-rhein-main.de
mañana.com
mūsųlaikas.lt
sendesık.com
президент.рф

Output Example

xn--bcher-kva.com
xn--caf-dma.fr
xn--d1abbgf6aiiy.xn--p1ai
xn--espaa-rta.com
xn--kln-dsseldorfer-rhein-main-cvc6o.de
xn--maana-pta.com
xn--mslaikas-qzb5f.lt
xn--sendesk-wfb.com
Note: The Cyrillic domain президент.рф becomes xn--d1abbgf6aiiy.xn--p1ai

Character Mapping Examples

| Language | Character | Punycode | Full Domain |
| --- | --- | --- | --- |
| German | ü | bcher-kva | xn--bcher-kva.com |
| French | é | caf-dma | xn--caf-dma.fr |
| Spanish | ñ | espaa-rta | xn--espaa-rta.com |
| Turkish | ı | sendesk-wfb | xn--sendesk-wfb.com |
| Lithuanian | ų | mslaikas-qzb5f | xn--mslaikas-qzb5f.lt |
| Cyrillic | президент | d1abbgf6aiiy | xn--d1abbgf6aiiy.xn--p1ai |

Debugging non-ASCII Characters

The final cleanup removes corrupted entries, invalid encoding, and ensures strict ASCII compliance.

The Problem: Encoding Corruption

Public blocklists may contain:
  • Corrupted UTF-8: Incomplete or broken multi-byte sequences
  • CP1252: Windows-1252 encoding mixed with UTF-8
  • ISO-8859-1: Latin-1 encoding
  • Non-printable characters: Control characters, null bytes
  • Invalid homograph attacks: Malformed lookalike characters

Input Example (Corrupted Encoding)

M-C$
-$
.$
0$
1$
23andmê.com
.òutlook.com
.ălibăbă.com
.ămăzon.com
.ăvăst.com
.amùazon.com
.aməzon.com
.avalón.com
.bĺnance.com
.bitdẹfender.com
.blóckchain.site
.blockchaiǹ.com
.cashpluÈ™.com
.dẹll.com
.diócesisdebarinas.org
.disnẹylandparis.com
.ebăy.com
.əməzon.com
.evo-bancó.com
.goglÄ™.com
.gooÄŸle.com
.googļę.com
.googlÉ™.com
.google.com
.ibẹria.com
.imgúr.com
.lloydÅŸbank.com
.mýetherwallet.com
.mrgreęn.com
.myẹthẹrwallet.com
.myẹthernwallet.com
.myethẹrnwallet.com
.myetheá¹™wallet.com
.myethernwallẹt.com
.nętflix.com
.paxfùll.com
.türkiyeisbankasi.com
.třezor.com
.westernúnion.com
.yòutube.com
.yăhoo.com
.yoütübe.co
.yoütübe.com
.yoütu.be

Output Example (Clean)

.google.com
Only valid ASCII domains survive!

What Happened?

  • Homograph attacks removed: ămăzon.com, googlÉ™.com, etc.
  • Corrupted encoding removed: 23andmê.com, òutlook.com, etc.
  • Invalid characters removed: M-C$, -$, .$, etc.
  • Legitimate domain preserved: google.com

Implementation

bwupdate/bwupdate.sh
iconv -f "$(file -bi final.txt | sed 's/.*charset=//')" -t UTF-8//IGNORE final.txt | grep -P '^[\x00-\x7F]+$' > blackweb.txt
Process:
  1. Detect current character encoding with file -bi
  2. Convert to UTF-8 with iconv, ignoring invalid sequences
  3. Filter to ASCII-only characters (\x00-\x7F)
  4. Save as blackweb.txt
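
Step 1 deserves a small guard: file -bi may report values such as charset=binary or charset=unknown-8bit, which iconv does not accept as encoding names. The sketch below shows one way to handle that. The helper name source_charset and the ISO-8859-1 fallback are assumptions, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: choose an iconv source charset from `file -bi` style output.
source_charset() {
  cs=$(printf '%s\n' "$1" | sed 's/.*charset=//')
  case $cs in
    ''|binary|unknown-8bit) echo ISO-8859-1 ;;   # assumed fallback, not from the source
    *) echo "$cs" ;;
  esac
}

source_charset 'text/plain; charset=utf-8'
source_charset 'application/octet-stream; charset=binary'

# Possible use with the original command (requires file and iconv):
#   iconv -f "$(source_charset "$(file -bi final.txt)")" -t UTF-8//IGNORE final.txt
```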

Earlier Stage Cleaning

bwupdate/bwupdate.sh
# CAPTURING DOMAINS
find bwtmp -type f -not -iname "*pdf" \
  -execdir grep -oiE "([a-zA-Z0-9][a-zA-Z0-9-]{1,61}\.){1,}(\.?[a-zA-Z]{2,}){1,}" {} \; \
| sed -r 's:(^\.*.?(www|ftp|ftps|ftpes|sftp|pop|pop3|smtp|imap|http|https)[^.]*?\.|^\.\.?)::gi' \
| sed -r '/[^a-zA-Z0-9.-]/d; /^[^a-zA-Z0-9.]/d; /[^a-zA-Z0-9]$/d; /^[[:space:]]*$/d; /[[:space:]]/d; /^[[:space:]]*#/d; /\.{2,}/d' \
| sort -u > stage1
Regex filters:
  • /[^a-zA-Z0-9.-]/d - Remove lines with invalid characters
  • /^[^a-zA-Z0-9.]/d - Remove lines starting with invalid chars
  • /[^a-zA-Z0-9]$/d - Remove lines ending with invalid chars
  • /^[[:space:]]*$/d - Remove empty lines
  • /[[:space:]]/d - Remove lines with whitespace
  • /^[[:space:]]*#/d - Remove comments
  • /\.{2,}/d - Remove consecutive dots
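
The effect of these rules is easiest to see on a handful of sample lines. The sketch below applies a reduced rule set through a hypothetical drop_malformed function. The whitespace and comment rules are omitted because the first rule already deletes any line containing a space or a "#".

```shell
#!/bin/sh
# Hypothetical filter: delete lines that cannot be plain domain names.
drop_malformed() {
  sed -E '/[^a-zA-Z0-9.-]/d; /^[^a-zA-Z0-9.]/d; /[^a-zA-Z0-9]$/d; /^[[:space:]]*$/d; /\.{2,}/d'
}

printf '%s\n' good.example.com bad_domain.com -lead.com trail.com. a..b.com \
  '# comment' 'has space.com' '' | drop_malformed
```

Only good.example.com is printed: the other lines contain a forbidden character, start with a hyphen, end with a dot, contain consecutive dots, or are empty.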

Protocol Prefix Removal

sed -r 's:(^\.*.?(www|ftp|ftps|ftpes|sftp|pop|pop3|smtp|imap|http|https)[^.]*?\.|^\.\.?)::gi'
Removes common prefixes:
  • www.example.com → example.com
  • ftp.example.com → example.com
  • http.example.com → example.com
  • smtp.example.com → example.com
These are subdomain removals, not protocol stripping. The domain www.example.com becomes example.com, then gets prefixed with . for Squid format: .example.com
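
The expression can be tried in isolation. In the sketch below the sed command is copied from the pipeline above and only the wrapper name strip_service_prefix is made up. It assumes GNU sed, since the -r option and the i flag are used as in the original.

```shell
#!/bin/sh
# Hypothetical wrapper around the original substitution.
strip_service_prefix() {
  sed -r 's:(^\.*.?(www|ftp|ftps|ftpes|sftp|pop|pop3|smtp|imap|http|https)[^.]*?\.|^\.\.?)::gi'
}

printf '%s\n' www.example.com ftp.example.com .example.org popcorn.example.com |
  strip_service_prefix
```

The last sample shows a side effect worth knowing: because [^.]* follows the prefix alternatives, any first label that merely begins with one of them (here "popcorn") is removed as well, leaving example.com.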

Security: Homograph Attack Protection

Homograph attacks use lookalike characters to mimic legitimate domains:

Examples of Homograph Attacks

| Fake Domain | Target | Attack Character |
| --- | --- | --- |
| .ămăzon.com | amazon.com | Latin Small Letter A with Breve (ă) |
| .googlə.com | google.com | Latin Small Letter Schwa (ə) |
| .òutlook.com | outlook.com | Latin Small Letter O with Grave (ò) |
| .nętflix.com | netflix.com | Latin Small Letter E with Ogonek (ę) |
| .yăhoo.com | yahoo.com | Latin Small Letter A with Breve (ă) |

Visual Similarity

To the human eye:
  • ămăzon looks like amazon
  • googlə looks like google
  • nętflix looks like netflix
But in ASCII/DNS:
  • They’re completely different domains
  • Used for phishing attacks
  • Can bypass simple blocklists

BlackWeb Protection

BlackWeb handles homographs two ways:
  1. Valid IDN: Convert to Punycode
    • café.fr → xn--caf-dma.fr (legitimate)
  2. Corrupted/Attack: Remove entirely
    • ămăzon.com → REMOVED (phishing)

Character Set Normalization

Final conversion ensures charset=us-ascii:
bwupdate/bwupdate.sh
iconv -f "$(file -bi final.txt | sed 's/.*charset=//')" -t UTF-8//IGNORE final.txt | grep -P '^[\x00-\x7F]+$' > blackweb.txt
Result: a clean, standardized list with no encoding issues or corrupted entries, ready for:
  • DNS resolution
  • Domain comparison
  • Squid-Cache integration

Summary of Debugging Stages

  1. Domain Debugging: Remove overlapping domains, convert to Squid format, apply allowlist
  2. TLD Validation: Verify against comprehensive TLD database (ccTLD, gTLD, etc.)
  3. RFC 1035 Compliance: Remove names with labels longer than 63 characters or invalid hyphen placement
  4. Punycode Conversion: Convert internationalized domains to ASCII-compatible encoding
  5. Non-ASCII Cleanup: Remove corrupted encoding, homograph attacks, invalid characters
  6. Character Set Normalization: Ensure final output is pure US-ASCII

Why This Matters

  • Performance: A clean, optimized domain list means faster Squid lookups and less memory usage
  • Accuracy: Removes false positives (legitimate sites) and invalid entries (malformed domains)
  • Security: Detects and blocks homograph phishing attacks and IDN spoofing
  • Compatibility: Ensures domains work correctly with DNS, Squid, and all ASCII-based systems
