Phisherman queries five external threat intelligence sources. Each source is fetched on a background refresh cycle and stored in Redis. At scan time, checkers perform fast set-membership lookups against the cached data.
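The scan-time contract can be sketched roughly as follows. This is an illustrative sketch only: the `CheckerResult` shape mirrors the `{ score, reason }` objects returned by the checkers shown later on this page, and an in-memory `Set` stands in for a Redis set.

```typescript
// Illustrative sketch: an in-memory Set stands in for a Redis set,
// and CheckerResult mirrors the { score, reason } returns used by the checkers.
interface CheckerResult {
  score: number;
  reason?: string;
}

function checkAgainstFeed(url: string, feed: Set<string>, feedName: string): CheckerResult {
  // Fast set-membership lookup, analogous to a Redis SISMEMBER call.
  if (feed.has(url)) {
    return { score: 100, reason: `Listed in ${feedName}` };
  }
  return { score: 0 };
}
```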

Feed overview

| Feed | Source | Redis key(s) | Refresh interval | Match type |
| --- | --- | --- | --- | --- |
| URLHaus | urlhaus.abuse.ch | urlhaus_blacklist | 5 minutes | Exact URL |
| OpenPhish | openphish.com | openphish_urls, openphish_hosts | 15 minutes | Exact URL or hostname |
| PhishTank | data.phishtank.com | phishtank_urls | 60 minutes | Exact URL |
| PhishStats | api.phishstats.info | phishstats_urls, phishstats_hosts | 90 minutes | Exact URL or hostname |
| Google Safe Browsing | safebrowsing.googleapis.com | gsb_cache_hash | Per-URL (1-hour TTL) | API lookup |
A Google Web Risk checker (WebRiskChecker) exists in the source at src/checkers/googleWebRisk.ts but is currently disabled. It is commented out in Scanner.ts and does not run.

Feed details

URLHaus

URLHaus publishes a live list of URLs serving active malware. Phisherman streams the CSV feed directly into Redis.
Feed URL: https://urlhaus.abuse.ch/downloads/csv-online/
Redis key: urlhaus_blacklist (Redis Set)
Refresh interval: every 5 minutes
The feed is a quoted CSV with the format:
"id","dateadded","url","url_status","threat","tags","urlhaus_link","reporter"
Phisherman reads column index 2 (the URL) from each line. The feed is consumed as a stream using Node.js readline — no full in-memory buffering is needed. URLs are written to Redis in batches of 1000 to keep request sizes manageable:
// src/checkers/urlHaus.ts
const batchSize = 1000;
// ...
if (urlBatch.length >= batchSize) {
  await (redis as any).sadd(tempKey, ...urlBatch);
  urlBatch.length = 0;
  await new Promise(resolve => setImmediate(resolve));
}
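The column extraction described above can be sketched like this. `extractUrlColumn` is a hypothetical helper name; the actual parsing code in urlHaus.ts may differ.

```typescript
// Sketch of pulling column index 2 (the URL) out of one quoted CSV line.
// extractUrlColumn is a hypothetical name; the real urlHaus.ts code may differ.
function extractUrlColumn(line: string): string | null {
  // Skip the comment lines the URLHaus feed prefixes with '#'.
  if (line.startsWith("#")) return null;
  // Fields are quoted and comma-separated; split on the quoted delimiter,
  // then strip any remaining boundary quotes.
  const cols = line.split('","');
  if (cols.length < 3) return null;
  return cols[2].replace(/^"|"$/g, "");
}
```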
An atomic key swap is used so the live set is never partially populated during a refresh:
await redis.rename(tempKey, REDIS_KEY_BLACKLIST);
At scan time, a single SISMEMBER lookup is performed against urlhaus_blacklist. A match returns a score of 100.
OpenPhish

OpenPhish publishes a plain-text feed of active phishing URLs, one per line.
Feed URL: https://openphish.com/feed.txt
Redis keys: openphish_urls (Set), openphish_hosts (Set)
Refresh interval: every 15 minutes
During refresh, Phisherman populates both a URL set and a hostname set from the feed. This enables two levels of matching at scan time:
// src/checkers/openPhish.ts
// Check exact URL
const urlMatch = await redis.sismember(REDIS_KEY_URLS, url);
if (urlMatch) return { score: 100, reason: "Listed in OpenPhish URL database" };

// Check hostname
const u = new URL(url.startsWith("http") ? url : `http://${url}`);
const hostMatch = await redis.sismember(REDIS_KEY_HOSTS, u.hostname);
if (hostMatch) return { score: 80, reason: "Domain listed in OpenPhish intelligence" };
An exact URL match returns 100; a hostname-only match returns 80. The refresh uses an atomic rename swap on both sets.
PhishTank

PhishTank provides a community-verified list of phishing URLs as a gzip-compressed CSV.
Feed URL (default): https://data.phishtank.com/data/online-valid.csv.gz
Redis key: phishtank_urls (Set)
Refresh interval: every 60 minutes
Failure cooldown: 15 minutes
Phisherman prefers the CSV.GZ format because it can be decompressed and parsed in a streaming, constant-memory way. The JSON dump is intentionally skipped to avoid large in-memory buffering on small instances:
// src/checkers/phishtank.ts
if (isJson) {
  console.warn(`PhishTank endpoint looks like JSON (${url}); skipping to avoid large in-memory buffering.`);
  return false;
}
The gzip stream is decompressed inline:
const gunzip = zlib.createGunzip();
response.data.pipe(gunzip);
stream = gunzip;
If the primary feed fails, Phisherman automatically falls back to the JSON endpoint. If the fallback also fails, a phishtank_last_fail timestamp is written to Redis and no further refresh attempts are made for 15 minutes.
You can override the feed URL with an environment variable:
PHISHTANK_API_URL=https://your-mirror.example.com/phishtank.csv.gz
At scan time, only an exact URL match is checked. A match returns a score of 100.
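The 15-minute failure cooldown can be sketched as a pure time comparison. `shouldAttemptRefresh` and the millisecond bookkeeping are illustrative assumptions, not the actual phishtank.ts code:

```typescript
// Sketch of the failure cooldown: skip refresh attempts for 15 minutes
// after the last recorded failure. All names here are illustrative.
const FAIL_COOLDOWN_MS = 15 * 60 * 1000;

function shouldAttemptRefresh(lastFailMs: number | null, nowMs: number): boolean {
  // No recorded failure (no phishtank_last_fail key): always try.
  if (lastFailMs === null) return true;
  // Otherwise only retry once the cooldown window has elapsed.
  return nowMs - lastFailMs >= FAIL_COOLDOWN_MS;
}
```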
PhishStats

PhishStats provides a JSON API of recently reported phishing URLs.
Feed URL: https://api.phishstats.info/api/phishing?_sort=-id&_size=20000
Redis keys: phishstats_urls (Set), phishstats_hosts (Set)
Refresh interval: every 90 minutes
The feed returns an array of entries in the format { id, url, ip, ... }. Phisherman extracts the url field from each entry and also extracts the hostname to populate a separate set:
// src/checkers/phishStats.ts
for (const entry of entries) {
  if (!entry.url) continue;
  const rawUrl = entry.url.trim();
  urlBatch.push(rawUrl);
  try {
    const u = new URL(rawUrl);
    hostBatch.push(u.hostname);
  } catch { }
}
As with OpenPhish, matching is attempted at two levels: exact URL (score 100) then hostname (score 80).
Google Safe Browsing

Google Safe Browsing is the only feed that does not use a bulk-refresh model. Each URL is checked against the API individually at scan time, and the result is cached per URL.
API endpoint: https://safebrowsing.googleapis.com/v4/threatMatches:find
Redis storage: gsb_cache_hash (HashCache, 1-hour TTL per URL)
Requires: GOOGLE_SAFE_API_KEY environment variable
The request checks for four threat types:
// src/checkers/googleSafeBrowsing.ts
threatTypes: [
  "MALWARE",
  "SOCIAL_ENGINEERING",
  "UNWANTED_SOFTWARE",
  "POTENTIALLY_HARMFUL_APPLICATION",
],
If the GOOGLE_SAFE_API_KEY environment variable is not set, the checker returns { score: 0 } immediately without making a network request.
Results are cached using the HashCache utility (see below). Valid results are cached for 1 hour; error responses (e.g. billing issues) are cached for 15 minutes to prevent hammering the API on a broken key:
const CACHE_TTL = 3600;       // 1 hour for valid results
const ERROR_CACHE_TTL = 900;  // 15 mins for errors
A match returns a score of 50.
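A request body for the threatMatches:find endpoint along these lines could be built as follows; the clientId and clientVersion values are placeholders, and the exact payload constructed in googleSafeBrowsing.ts may differ:

```typescript
// Sketch of a Safe Browsing v4 threatMatches:find request body for one URL.
// The clientId/clientVersion values are placeholders, not Phisherman's actual values.
function buildGsbRequestBody(url: string) {
  return {
    client: { clientId: "phisherman", clientVersion: "1.0.0" },
    threatInfo: {
      threatTypes: [
        "MALWARE",
        "SOCIAL_ENGINEERING",
        "UNWANTED_SOFTWARE",
        "POTENTIALLY_HARMFUL_APPLICATION",
      ],
      platformTypes: ["ANY_PLATFORM"],
      threatEntryTypes: ["URL"],
      threatEntries: [{ url }],
    },
  };
}
```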

The HashCache utility

Several caches (GSB, Google Web Risk, DNS) use a shared HashCache class instead of individual Redis keys. This prevents key explosion when caching a large number of per-URL or per-host results. How it works:
  1. Each entry’s key (e.g. a URL) is hashed with SHA-256 and truncated to 32 hex characters to produce a stable, short field ID.
  2. The field ID is used as a field in a single Redis hash (e.g. gsb_cache_hash).
  3. A companion sorted set (e.g. gsb_cache_expiry) stores each field ID with its expiry timestamp as the score.
  4. On get, if the entry exists but its expiry timestamp has passed, it is deleted opportunistically.
  5. On the CacheManager cleanup cycle, all entries in the sorted set with a score <= now are removed in bulk.
// src/utils/hashCache.ts
export const gsbCache = new HashCache("gsb_cache", 3600);  // Google Safe Browsing — 1 hour
export const gwrCache = new HashCache("gwr_cache", 3600);  // Google Web Risk — 1 hour
export const dnsCache = new HashCache("dns_cache", 3600);  // DNS resolution — 1 hour
This design means a deployment scanning thousands of unique URLs never creates thousands of top-level Redis keys — only two keys per HashCache instance (*_hash and *_expiry) exist regardless of how many entries are stored.
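The field-ID derivation from step 1 above can be sketched directly; `fieldId` is a hypothetical helper name standing in for whatever hashCache.ts calls it:

```typescript
import { createHash } from "node:crypto";

// Sketch of HashCache's field-ID derivation: SHA-256 of the entry key,
// truncated to 32 hex characters. fieldId is a hypothetical helper name.
function fieldId(key: string): string {
  return createHash("sha256").update(key).digest("hex").slice(0, 32);
}
```

The truncation keeps hash fields short while leaving 128 bits of the digest, which is ample to avoid collisions at any realistic cache size.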
