Phisherman processes every URL through a multi-stage pipeline: a cache lookup, parallel checker execution, score aggregation, and a cache write. This page walks through each stage in detail.
## Request lifecycle
```text
HTTP request
      │
      ▼
Rate limiter
      │
      ▼
Redis cache check ──── hit ────▶ return cached ScanResult
      │ miss
      ▼
CheckerRegistry.runAll(url)
   ┌──┴─────────────────────────────────────┐
   │ heuristics   openphish   gsb   urlhaus │  (parallel, 2500 ms timeout each)
   │ phishtank    phishstats                │
   └──┬─────────────────────────────────────┘
      │ []CheckResult
      ▼
Aggregate score → Math.min(100, sum)
      │
      ▼
Determine verdict (safe / suspicious / phishing)
      │
      ▼
Cache result in Redis (if non-safe, or SCAN_CACHE_SAFE_RESULTS=true)
      │
      ▼
Return ScanResult
```
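The aggregation step near the bottom of the diagram can be sketched as follows. The `Math.min(100, sum)` cap comes from the pipeline above; the 40/70 verdict thresholds here are illustrative placeholders, not Phisherman's actual cutoffs:

```typescript
interface CheckResult {
  score: number;
  reason: string;
}

// Sum the per-checker scores, capped at 100, then map the total to a
// verdict. The 40/70 cutoffs are assumed values for illustration only.
function aggregate(checks: CheckResult[]): {
  score: number;
  verdict: "safe" | "suspicious" | "phishing";
} {
  const score = Math.min(100, checks.reduce((sum, c) => sum + c.score, 0));
  const verdict = score >= 70 ? "phishing" : score >= 40 ? "suspicious" : "safe";
  return { score, verdict };
}
```

Because every checker contributes a non-negative score, a single strong signal can push a URL over the threshold even when other feeds return 0.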
## The CheckerRegistry
`CheckerRegistry` is a small registry class that holds a list of `Checker` objects and runs them all in parallel.
```typescript
// src/CheckerRegistry.ts
class CheckerRegistry {
  private checkers: Checker[] = [];

  register(checker: Checker) {
    this.checkers.push(checker);
  }

  async runAll(url: string): Promise<{ checks: CheckResult[]; timing: Record<string, number> }> {
    const timing: Record<string, number> = {};
    const TIMEOUT_MS = 2500; // 2.5 s maximum per checker

    const checks = await Promise.all(
      this.checkers.map(async (checker) => {
        const start = Date.now();
        let timer: NodeJS.Timeout | undefined;
        try {
          const checkPromise = checker.check(url);
          const timeoutPromise = new Promise<CheckResult>((_, reject) => {
            timer = setTimeout(() => reject(new Error("Timeout")), TIMEOUT_MS);
          });
          return await Promise.race([checkPromise, timeoutPromise]);
        } catch (err: any) {
          if (err.message === "Timeout") {
            console.warn(`Checker ${checker.name} timed out for ${url}`);
            return { score: 0, reason: `Checker ${checker.name} timed out` };
          }
          console.error(`Checker ${checker.name} failed:`, err);
          return { score: 0, reason: `Checker ${checker.name} error` };
        } finally {
          // Clear the pending timer so a fast checker doesn't leave a
          // 2.5 s timeout keeping the event loop alive.
          clearTimeout(timer);
          timing[checker.name] = Date.now() - start;
        }
      })
    );

    return { checks, timing };
  }
}
```
Key properties of this design:
- All checkers run concurrently via `Promise.all` — latency is bounded by the slowest checker, not the sum of all checkers.
- Each checker races against a 2500 ms timeout via `Promise.race`. A timed-out checker contributes a score of 0 and does not fail the request.
- Per-checker execution time is recorded in the `timing` map and returned in the `ScanResult` as `executionTimeMs`.
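The race-against-a-timeout pattern can be demonstrated in isolation. This is a standalone sketch, not Phisherman's code: `withTimeout` is a hypothetical helper, and the budget is shortened to 50 ms so a deliberately slow fake checker trips it:

```typescript
interface CheckResult {
  score: number;
  reason: string;
}

// Race a checker promise against a timeout; on timeout (or any error),
// fall back to a zero-score result instead of failing the whole scan.
async function withTimeout(
  name: string,
  check: Promise<CheckResult>,
  timeoutMs: number
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CheckResult>((_, reject) => {
    timer = setTimeout(() => reject(new Error("Timeout")), timeoutMs);
  });
  try {
    return await Promise.race([check, timeout]);
  } catch (err: any) {
    if (err.message === "Timeout") {
      return { score: 0, reason: `Checker ${name} timed out` };
    }
    return { score: 0, reason: `Checker ${name} error` };
  } finally {
    clearTimeout(timer); // don't leave the timer pending after the race settles
  }
}

// A fake checker that resolves after 200 ms — well past a 50 ms budget:
const slowChecker = new Promise<CheckResult>((resolve) =>
  setTimeout(() => resolve({ score: 80, reason: "late" }), 200)
);
```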
## Checker registration
Checkers are registered in `Scanner.ts` at startup:

```typescript
// src/Scanner.ts
registry.register(HeuristicsChecker);
registry.register(OpenPhishChecker);
registry.register(SafeBrowsingChecker);
registry.register(URLHausChecker);
registry.register(PhishTankChecker);
// registry.register(WebRiskChecker); // disabled
registry.register(PhishStatsChecker);
```
Each checker implements the `Checker` interface:

```typescript
// src/types.ts
export interface Checker {
  name: string;
  check: (url: string) => Promise<CheckResult>;
}
```
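A minimal implementation of this interface might look like the following. This is a hypothetical example, not one of Phisherman's real checkers, and its scoring weight of 30 is an assumed value:

```typescript
interface CheckResult {
  score: number;
  reason: string;
}

interface Checker {
  name: string;
  check: (url: string) => Promise<CheckResult>;
}

// Hypothetical checker: flag URLs whose host is a raw IPv4 address,
// a common phishing tell. The score of 30 is illustrative.
const IpHostChecker: Checker = {
  name: "ip-host",
  check: async (url: string) => {
    const host = new URL(url).hostname;
    const isIp = /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
    return isIp
      ? { score: 30, reason: "URL uses a raw IP address as its host" }
      : { score: 0, reason: "hostname is not a raw IP" };
  },
};
```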
## Result caching
Scan results are cached in Redis to avoid re-running the full checker pipeline for recently seen URLs.
```typescript
// src/Scanner.ts
const RESULT_CACHE_TTL_SECONDS = 300; // 5 minutes
const SCAN_CACHE_HASH = "scan_results";
const SCAN_CACHE_EXPIRY_ZSET = "scan_results_expiry";
```
**Cache key** — The URL is hashed with SHA-256:

```typescript
function scanCacheId(url: string) {
  return crypto.createHash("sha256").update(url).digest("hex");
}
```
**Storage structure** — To avoid Redis key explosion, all scan results are stored as fields of a single hash (`scan_results`). A companion sorted set (`scan_results_expiry`) stores each field ID with its expiry timestamp as the score, enabling efficient batch cleanup.
**Cache read** — On a cache hit, Phisherman checks the `exp` field against `Date.now()`. Expired entries are deleted opportunistically before the fresh scan runs.
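A write/read pair over this hash + ZSET layout might look like the following sketch. The helper names are assumptions, and the stub interface only mirrors the Redis commands involved (method names follow the node-redis v4 style); it is not Phisherman's actual code:

```typescript
import crypto from "crypto";

// Minimal shape of the Redis commands used, so an in-memory stub works.
interface RedisLike {
  hSet(key: string, field: string, value: string): Promise<number>;
  hGet(key: string, field: string): Promise<string | null>;
  zAdd(key: string, entry: { score: number; value: string }): Promise<number>;
}

const SCAN_CACHE_HASH = "scan_results";
const SCAN_CACHE_EXPIRY_ZSET = "scan_results_expiry";
const RESULT_CACHE_TTL_SECONDS = 300;

function scanCacheId(url: string) {
  return crypto.createHash("sha256").update(url).digest("hex");
}

// Hypothetical write helper: one hash field per result, plus a ZSET member
// scored by expiry timestamp so cleanup can range-scan by time.
async function writeScanCache(redis: RedisLike, url: string, result: Record<string, unknown>) {
  const id = scanCacheId(url);
  const exp = Date.now() + RESULT_CACHE_TTL_SECONDS * 1000;
  await redis.hSet(SCAN_CACHE_HASH, id, JSON.stringify({ ...result, exp }));
  await redis.zAdd(SCAN_CACHE_EXPIRY_ZSET, { score: exp, value: id });
}

// Hypothetical read helper: entries past their exp timestamp count as misses.
async function readScanCache(redis: RedisLike, url: string) {
  const raw = await redis.hGet(SCAN_CACHE_HASH, scanCacheId(url));
  if (!raw) return null;
  const entry = JSON.parse(raw);
  return entry.exp > Date.now() ? entry : null;
}
```

Embedding `exp` in the stored JSON lets a read detect staleness without a round trip to the ZSET; the ZSET exists purely so the background cleanup can find expired fields in one range query.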
**Caching safe results** — By default, URLs that resolve to a safe verdict are not cached, because they are high-volume and low-value to retain. Set `SCAN_CACHE_SAFE_RESULTS=true` to cache them:

```typescript
const CACHE_SAFE_RESULTS = (process.env.SCAN_CACHE_SAFE_RESULTS || "").toLowerCase() === "true";
// ...
if (CACHE_SAFE_RESULTS || result.verdict !== "safe") {
  // write to cache
}
```
## Background feed refresh

`CacheManager` runs a background loop that keeps all threat feed data current. It is started once at server startup:
```typescript
// src/CacheManager.ts
async start(intervalMs: number = 3600000) { // default: 1 hour
  if (this.interval) return;
  await this.runAll(); // run immediately on startup
  this.interval = setInterval(() => this.runAll(), intervalMs);
}
```
On each cycle, `runAll()` invokes every registered `RefreshTask` in sequence, then runs three cleanup routines:
| Cleanup step | What it removes |
|---|---|
| `cleanupScanResults()` | Expired entries from the `scan_results` hash + ZSET |
| `cleanupWhois()` | Expired WHOIS lookups from the `whois_data` hash + ZSET |
| `cleanupHashCaches()` | Expired entries from GSB, GWR, and DNS `HashCache` instances |
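The scan-result cleanup step can be sketched as follows. This is assumed logic built on the hash + ZSET layout described earlier; the stub interface mirrors the Redis commands involved, not Phisherman's actual code:

```typescript
// Minimal shape of the Redis commands a cleanup pass needs.
interface ExpiryStore {
  zRangeByScore(key: string, min: number, max: number): Promise<string[]>;
  hDel(key: string, fields: string[]): Promise<number>;
  zRemRangeByScore(key: string, min: number, max: number): Promise<number>;
}

// Hypothetical cleanup: expired field IDs are found via the ZSET (score =
// expiry timestamp), then removed from both the hash and the ZSET.
async function cleanupScanResults(redis: ExpiryStore, now = Date.now()): Promise<number> {
  const expired = await redis.zRangeByScore("scan_results_expiry", 0, now);
  if (expired.length === 0) return 0;
  await redis.hDel("scan_results", expired);
  await redis.zRemRangeByScore("scan_results_expiry", 0, now);
  return expired.length;
}
```

Scoring the ZSET by expiry timestamp is what makes this cheap: one range query finds every expired field, with no need to scan the whole hash.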
Each feed source registers its own refresh task with a source-specific interval:
| Source | Refresh interval |
|---|---|
| URLHaus | 5 minutes |
| OpenPhish | 15 minutes |
| PhishTank | 60 minutes |
| PhishStats | 90 minutes |
Feed refresh is checked on every invocation — not on a separate per-source timer. The `CacheManager` loop fires every hour by default, but each source compares `Date.now()` against its own last-update timestamp and only refetches if its individual interval has elapsed.
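That per-source gating might look like the following sketch. The `RefreshTask` shape here is an assumption inferred from the behavior described, not Phisherman's actual type:

```typescript
// Assumed shape of a refresh task: each source carries its own interval
// and last-update timestamp.
interface RefreshTask {
  name: string;
  intervalMs: number;
  lastUpdate: number;
  refresh: () => Promise<void>;
}

// Called on every CacheManager cycle; only refetches when the source's own
// interval has elapsed. Returns whether a fetch actually ran.
async function maybeRefresh(task: RefreshTask, now = Date.now()): Promise<boolean> {
  if (now - task.lastUpdate < task.intervalMs) return false; // not due yet
  await task.refresh();
  task.lastUpdate = now;
  return true;
}
```

With a 1-hour loop, a 5-minute source like URLHaus still refreshes at most once per cycle; shortening the loop interval tightens how closely each source tracks its nominal schedule.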