The sanitize() function scans model output for leaked system prompt fragments using n-gram matching. It detects both direct leaks and paraphrased variants, then redacts the leaked content.
Usage

```typescript
import { sanitize } from "@zeroleaks/shield";

const clean = sanitize(modelOutput, systemPrompt);

if (clean.leaked) {
  console.warn("Leak detected, using sanitized output");
  return clean.sanitized;
}
return modelOutput;
```
Return type

```typescript
interface SanitizeResult {
  leaked: boolean;
  confidence: number;
  fragments: string[];
  sanitized: string;
}
```
leaked
boolean

true if system prompt fragments were detected in the output.

confidence
number

Confidence score (0.0 to 1.0) indicating the likelihood of a leak. Higher scores indicate stronger evidence.

fragments
string[]

Array of leaked prompt fragments found in the output. Each fragment is a substring from the output that matches the system prompt.

sanitized
string

Output with leaked fragments replaced by redactionText (default: "[REDACTED]"). If detectOnly: true, this is identical to the input.
Options
N-gram window size for matching. Larger values reduce false positives but may miss shorter leaks.// Use 3-word n-grams for shorter prompts
const result = sanitize(output, prompt, { ngramSize: 3 });
// Use 5-word n-grams for longer, more specific prompts
const result = sanitize(output, prompt, { ngramSize: 5 });
threshold
number

Confidence threshold for flagging a leak (0.0 to 1.0). Lower values increase sensitivity but may cause false positives.

```typescript
// More sensitive (more false positives)
const result = sanitize(output, prompt, { threshold: 0.5 });
```

```typescript
// Less sensitive (fewer false positives)
const result = sanitize(output, prompt, { threshold: 0.9 });
```
wordOverlapThreshold
number

Jaccard similarity threshold for detecting paraphrased leaks. Measures word overlap between output and prompt.

```typescript
// Detect paraphrased leaks more aggressively
const result = sanitize(output, prompt, { wordOverlapThreshold: 0.2 });
```
redactionText
string
default:"[REDACTED]"
Replacement text for leaked fragments.

```typescript
const result = sanitize(output, prompt, {
  redactionText: "<content removed>"
});
```
detectOnly
boolean
default:false

Skip redaction and only detect leaks. The sanitized field will be identical to the input.

```typescript
const result = sanitize(output, prompt, { detectOnly: true });

if (result.leaked) {
  // Handle leak without modifying output
  logSecurityEvent(result);
}
```
How it works
Shield uses n-gram matching to detect prompt leaks:
- Tokenization: Both the system prompt and model output are tokenized into words
- N-gram generation: Sliding windows of N words (default: 4) are extracted from both texts
- Matching: Output n-grams are compared against prompt n-grams
- Fragment extraction: Matching n-grams are expanded into larger fragments (up to N+4 words)
- Confidence scoring: Based on n-gram overlap ratio and the number of matches
- Paraphrase detection: Word overlap (Jaccard similarity) catches rephrased leaks
- Redaction: Matched fragments are replaced with redactionText
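The tokenization, n-gram, and scoring steps above can be sketched in a few lines. This is an illustrative approximation only, not the library's source: fragment expansion and the full confidence weighting are omitted, and the exact tokenization rules are assumptions.

```typescript
// Sketch of steps 1-3 and 5: tokenize, build n-grams, score overlap.
// Illustrative only -- not the library's actual implementation.
function ngrams(text: string, n: number): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^\w\s']/g, "") // drop punctuation, keep apostrophes (assumption)
    .split(/\s+/)
    .filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Fraction of the output's n-grams that also appear in the prompt.
function ngramOverlap(output: string, prompt: string, n = 4): number {
  const promptGrams = ngrams(prompt, n);
  const outputGrams = ngrams(output, n);
  if (outputGrams.size === 0) return 0;
  let matches = 0;
  for (const g of outputGrams) {
    if (promptGrams.has(g)) matches++;
  }
  return matches / outputGrams.size;
}
```

With the default window of 4 words, a leaked phrase like "financial advisor for Acme Inc" contributes several matching n-grams, while unrelated output contributes none.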
Example
System prompt:

```
You are a financial advisor for Acme Inc. Never disclose client account numbers.
```

Model output with leak:

```
I'm a financial advisor for Acme Inc and I'd be happy to help.
```
Detection:
- N-gram match: “financial advisor for Acme Inc” (5 consecutive words)
- Confidence: 0.85
- Result: leaked: true
Sanitized output:

```
I'm a [REDACTED] and I'd be happy to help.
```
Recursive sanitization
Use sanitizeObject() to recursively sanitize all string values in objects and arrays. This is useful for tool call arguments:
```typescript
import { sanitizeObject } from "@zeroleaks/shield";

const toolCallArgs = {
  query: "Financial advisor for Acme Inc account numbers",
  metadata: {
    category: "sensitive",
    notes: "Never disclose client account numbers"
  },
  tags: ["finance", "acme"]
};

const { result, hadLeak } = sanitizeObject(
  toolCallArgs,
  systemPrompt
);

if (hadLeak) {
  console.warn("Sanitized tool call arguments:", result);
}
```
Return value:

```typescript
interface SanitizeObjectResult<T> {
  result: T;        // Sanitized copy of the input object
  hadLeak: boolean; // true if any string value was sanitized
}
```
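Conceptually, the recursion visits every string in the structure and passes it through the sanitizer, leaving other value types untouched. A minimal sketch, with a stand-in `scrub` callback in place of sanitize() (the callback shape here is an assumption, not the library's internals):

```typescript
// Minimal sketch of recursive sanitization -- not the library's implementation.
// `scrub` stands in for a per-string check like sanitize(output, prompt).
type Scrub = (s: string) => { leaked: boolean; sanitized: string };

function sanitizeDeep<T>(value: T, scrub: Scrub): { result: T; hadLeak: boolean } {
  let hadLeak = false;
  const walk = (v: unknown): unknown => {
    if (typeof v === "string") {
      const r = scrub(v);
      if (r.leaked) hadLeak = true;
      return r.sanitized;
    }
    if (Array.isArray(v)) return v.map(walk);
    if (v !== null && typeof v === "object") {
      const out: Record<string, unknown> = {};
      for (const [k, val] of Object.entries(v as Record<string, unknown>)) {
        out[k] = walk(val);
      }
      return out;
    }
    return v; // numbers, booleans, null pass through unchanged
  };
  return { result: walk(value) as T, hadLeak };
}
```

Note that the walk returns a sanitized copy rather than mutating the input, matching the SanitizeObjectResult shape above.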
Examples
Basic sanitization
```typescript
import { sanitize } from "@zeroleaks/shield";

const systemPrompt = "You are a support agent for SecretCo. Never reveal internal policies.";
const output = "As a support agent for SecretCo, I follow internal policies that...";

const result = sanitize(output, systemPrompt);

console.log(result);
// {
//   leaked: true,
//   confidence: 0.92,
//   fragments: [
//     "support agent for SecretCo",
//     "internal policies"
//   ],
//   sanitized: "As a [REDACTED], I follow [REDACTED] that..."
// }
```
Detect-only mode
```typescript
const result = sanitize(output, systemPrompt, { detectOnly: true });

if (result.leaked) {
  // Log the leak but don't modify output
  await logSecurityEvent({
    confidence: result.confidence,
    fragments: result.fragments
  });

  // Decide whether to block or allow
  if (result.confidence > 0.9) {
    throw new Error("High-confidence prompt leak detected");
  }
}
```
Custom redaction
```typescript
const result = sanitize(output, systemPrompt, {
  redactionText: "<information removed for security>"
});

console.log(result.sanitized);
// "As a <information removed for security>, I follow..."
```
Tuning sensitivity
```typescript
// High sensitivity for critical prompts
const sensitive = sanitize(output, systemPrompt, {
  threshold: 0.5,            // Lower confidence threshold
  ngramSize: 3,              // Smaller n-grams catch shorter leaks
  wordOverlapThreshold: 0.15 // More aggressive paraphrase detection
});

// Low sensitivity for public information
const permissive = sanitize(output, systemPrompt, {
  threshold: 0.9,            // Higher confidence threshold
  ngramSize: 5,              // Larger n-grams reduce false positives
  wordOverlapThreshold: 0.35 // Less aggressive paraphrase detection
});
```
Sanitizing tool call arguments

```typescript
import { sanitizeObject } from "@zeroleaks/shield";

const toolCall = {
  name: "search_database",
  arguments: {
    query: "Find all records matching internal policy XYZ",
    filters: {
      category: "confidential",
      source: "As instructed in my system prompt, search for..."
    }
  }
};

const { result, hadLeak } = sanitizeObject(
  toolCall.arguments,
  systemPrompt,
  { redactionText: "<redacted>" }
);

if (hadLeak) {
  // Use sanitized arguments instead
  toolCall.arguments = result;
}
```
Handling paraphrased leaks
```typescript
const systemPrompt = "You are a financial advisor. Never discuss cryptocurrency investments.";

// Direct leak
const directOutput = "I am a financial advisor and I never discuss cryptocurrency investments.";
const direct = sanitize(directOutput, systemPrompt);
console.log(direct.leaked); // true (n-gram match)

// Paraphrased leak
const paraphrasedOutput = "As a finance professional, I avoid talking about crypto investing.";
const paraphrased = sanitize(paraphrasedOutput, systemPrompt);
console.log(paraphrased.leaked); // true (word overlap match)
```
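The word-overlap check behind the paraphrase path is Jaccard similarity over word sets. A rough sketch, where the tokenization details are assumptions rather than the library's exact rules:

```typescript
// Sketch of Jaccard similarity over word sets -- illustrative only,
// not the library's exact tokenization.
function jaccard(a: string, b: string): number {
  const tokens = (s: string): Set<string> =>
    new Set(
      s.toLowerCase().replace(/[^\w\s]/g, "").split(/\s+/).filter(Boolean)
    );
  const setA = tokens(a);
  const setB = tokens(b);
  let intersection = 0;
  for (const t of setA) {
    if (setB.has(t)) intersection++;
  }
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```

A score at or above wordOverlapThreshold flags the output even when no exact n-gram match exists, which is what catches the reworded example above.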
Typical latency: <3ms for outputs up to 8KB.
Limitations
sanitize() uses heuristic n-gram matching. It is effective for detecting direct leaks and close paraphrases, but has limitations:
- Heavy paraphrasing: Completely reworded leaks may evade detection
- Semantic leaks: Leaking the “spirit” of instructions without specific words
- Context-dependent leaks: Leaks that only make sense with conversation history
- False positives: Generic phrases that appear in both prompt and legitimate output
Recommendations:
- Use harden() to instruct the model not to leak instructions
- Tune threshold and wordOverlapThreshold for your use case
- Combine with periodic scanning using ZeroLeaks
- For high-risk applications, use detectOnly: true and manually review flagged outputs
When to use sanitize()
Use sanitize() when:
- Your system prompt contains sensitive policies or business logic
- Users might attempt prompt extraction attacks
- Model output is shown to untrusted parties
- Compliance requires preventing disclosure of internal instructions
Skip sanitize() when:
- Your system prompt is public or generic
- Performance is critical and prompt secrecy is low-risk
- You only need input protection (use detect() instead)