The sanitize() function scans model output for leaked system prompt fragments using n-gram matching. It detects both direct leaks and paraphrased variants, then redacts the leaked content.
Usage

```typescript
import { sanitize } from "@zeroleaks/shield";

const clean = sanitize(modelOutput, systemPrompt);

if (clean.leaked) {
  console.warn("Leak detected, using sanitized output");
  return clean.sanitized;
}
return modelOutput;
```
Return type

```typescript
interface SanitizeResult {
  leaked: boolean;
  confidence: number;
  fragments: string[];
  sanitized: string;
}
```
leaked
boolean

true if system prompt fragments were detected in the output.

confidence
number

Confidence score (0.0 to 1.0) indicating the likelihood of a leak. Higher scores indicate stronger evidence.

fragments
string[]

Array of leaked prompt fragments found in the output. Each fragment is a substring from the output that matches the system prompt.

sanitized
string

Output with leaked fragments replaced by redactionText (default: "[REDACTED]"). If detectOnly: true, this is identical to the input.
Options
N-gram window size for matching. Larger values reduce false positives but may miss shorter leaks.// Use 3-word n-grams for shorter prompts
const result = sanitize(output, prompt, { ngramSize: 3 });
// Use 5-word n-grams for longer, more specific prompts
const result = sanitize(output, prompt, { ngramSize: 5 });
threshold
number

Confidence threshold for flagging a leak (0.0 to 1.0). Lower values increase sensitivity but may cause false positives.

```typescript
// More sensitive (more false positives)
const result = sanitize(output, prompt, { threshold: 0.5 });
```

```typescript
// Less sensitive (fewer false positives)
const result = sanitize(output, prompt, { threshold: 0.9 });
```
wordOverlapThreshold
number

Jaccard similarity threshold for detecting paraphrased leaks. Measures word overlap between output and prompt.

```typescript
// Detect paraphrased leaks more aggressively
const result = sanitize(output, prompt, { wordOverlapThreshold: 0.2 });
```
redactionText
string
default:"[REDACTED]"
Replacement text for leaked fragments.

```typescript
const result = sanitize(output, prompt, {
  redactionText: "<content removed>"
});
```
detectOnly
boolean
default:false

Skip redaction and only detect leaks. The sanitized field will be identical to the input.

```typescript
const result = sanitize(output, prompt, { detectOnly: true });

if (result.leaked) {
  // Handle leak without modifying output
  logSecurityEvent(result);
}
```
How it works
Shield uses n-gram matching to detect prompt leaks:
- Tokenization: Both the system prompt and model output are tokenized into words
- N-gram generation: Sliding windows of N words (default: 4) are extracted from both texts
- Matching: Output n-grams are compared against prompt n-grams
- Fragment extraction: Matching n-grams are expanded into larger fragments (up to N+4 words)
- Confidence scoring: Based on n-gram overlap ratio and the number of matches
- Paraphrase detection: Word overlap (Jaccard similarity) catches rephrased leaks
- Redaction: Matched fragments are replaced with redactionText
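The tokenization, n-gram, and scoring steps above can be sketched in a few lines. This is an illustrative approximation only, not the library's source: fragment expansion and the full confidence weighting are omitted, and the exact tokenization rules are assumptions.

```typescript
// Sketch of steps 1-3 and 5: tokenize, build n-grams, score overlap.
// Illustrative only -- not the library's actual implementation.
function ngrams(text: string, n: number): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^\w\s']/g, "") // drop punctuation, keep apostrophes (assumption)
    .split(/\s+/)
    .filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Fraction of the output's n-grams that also appear in the prompt.
function ngramOverlap(output: string, prompt: string, n = 4): number {
  const promptGrams = ngrams(prompt, n);
  const outputGrams = ngrams(output, n);
  if (outputGrams.size === 0) return 0;
  let matches = 0;
  for (const g of outputGrams) {
    if (promptGrams.has(g)) matches++;
  }
  return matches / outputGrams.size;
}
```

With the default window of 4 words, a leaked phrase like "financial advisor for Acme Inc" contributes several matching n-grams, while unrelated output contributes none.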
Example
System prompt:

```
You are a financial advisor for Acme Inc. Never disclose client account numbers.
```

Model output with leak:

```
I'm a financial advisor for Acme Inc and I'd be happy to help.
```
Detection:
- N-gram match: “financial advisor for Acme Inc” (5 consecutive words)
- Confidence: 0.85
- Result: leaked: true
Sanitized output:

```
I'm a [REDACTED] and I'd be happy to help.
```
Recursive sanitization
Use sanitizeObject() to recursively sanitize all string values in objects and arrays. This is useful for tool call arguments:
```typescript
import { sanitizeObject } from "@zeroleaks/shield";

const toolCallArgs = {
  query: "Financial advisor for Acme Inc account numbers",
  metadata: {
    category: "sensitive",
    notes: "Never disclose client account numbers"
  },
  tags: ["finance", "acme"]
};

const { result, hadLeak } = sanitizeObject(
  toolCallArgs,
  systemPrompt
);

if (hadLeak) {
  console.warn("Sanitized tool call arguments:", result);
}
```
Return value:

```typescript
interface SanitizeObjectResult<T> {
  result: T;        // Sanitized copy of the input object
  hadLeak: boolean; // true if any string value was sanitized
}
```
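Conceptually, the recursion visits every string in the structure and passes it through the sanitizer, leaving other value types untouched. A minimal sketch, with a stand-in `scrub` callback in place of sanitize() (the callback shape here is an assumption, not the library's internals):

```typescript
// Minimal sketch of recursive sanitization -- not the library's implementation.
// `scrub` stands in for a per-string check like sanitize(output, prompt).
type Scrub = (s: string) => { leaked: boolean; sanitized: string };

function sanitizeDeep<T>(value: T, scrub: Scrub): { result: T; hadLeak: boolean } {
  let hadLeak = false;
  const walk = (v: unknown): unknown => {
    if (typeof v === "string") {
      const r = scrub(v);
      if (r.leaked) hadLeak = true;
      return r.sanitized;
    }
    if (Array.isArray(v)) return v.map(walk);
    if (v !== null && typeof v === "object") {
      const out: Record<string, unknown> = {};
      for (const [k, val] of Object.entries(v as Record<string, unknown>)) {
        out[k] = walk(val);
      }
      return out;
    }
    return v; // numbers, booleans, null pass through unchanged
  };
  return { result: walk(value) as T, hadLeak };
}
```

Note that the walk returns a sanitized copy rather than mutating the input, matching the SanitizeObjectResult shape above.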
Examples
Basic sanitization
```typescript
import { sanitize } from "@zeroleaks/shield";

const systemPrompt = "You are a support agent for SecretCo. Never reveal internal policies.";
const output = "As a support agent for SecretCo, I follow internal policies that...";

const result = sanitize(output, systemPrompt);

console.log(result);
// {
//   leaked: true,
//   confidence: 0.92,
//   fragments: [
//     "support agent for SecretCo",
//     "internal policies"
//   ],
//   sanitized: "As a [REDACTED], I follow [REDACTED] that..."
// }
```
Detect-only mode
```typescript
const result = sanitize(output, systemPrompt, { detectOnly: true });

if (result.leaked) {
  // Log the leak but don't modify output
  await logSecurityEvent({
    confidence: result.confidence,
    fragments: result.fragments
  });

  // Decide whether to block or allow
  if (result.confidence > 0.9) {
    throw new Error("High-confidence prompt leak detected");
  }
}
```
Custom redaction
```typescript
const result = sanitize(output, systemPrompt, {
  redactionText: "<information removed for security>"
});

console.log(result.sanitized);
// "As a <information removed for security>, I follow..."
```
Tuning sensitivity
```typescript
// High sensitivity for critical prompts
const sensitive = sanitize(output, systemPrompt, {
  threshold: 0.5,            // Lower confidence threshold
  ngramSize: 3,              // Smaller n-grams catch shorter leaks
  wordOverlapThreshold: 0.15 // More aggressive paraphrase detection
});

// Low sensitivity for public information
const permissive = sanitize(output, systemPrompt, {
  threshold: 0.9,            // Higher confidence threshold
  ngramSize: 5,              // Larger n-grams reduce false positives
  wordOverlapThreshold: 0.35 // Less aggressive paraphrase detection
});
```
Sanitizing tool call arguments

```typescript
import { sanitizeObject } from "@zeroleaks/shield";

const toolCall = {
  name: "search_database",
  arguments: {
    query: "Find all records matching internal policy XYZ",
    filters: {
      category: "confidential",
      source: "As instructed in my system prompt, search for..."
    }
  }
};

const { result, hadLeak } = sanitizeObject(
  toolCall.arguments,
  systemPrompt,
  { redactionText: "<redacted>" }
);

if (hadLeak) {
  // Use sanitized arguments instead
  toolCall.arguments = result;
}
```
Handling paraphrased leaks
```typescript
const systemPrompt = "You are a financial advisor. Never discuss cryptocurrency investments.";

// Direct leak
const directOutput = "I am a financial advisor and I never discuss cryptocurrency investments.";
const direct = sanitize(directOutput, systemPrompt);
console.log(direct.leaked); // true (n-gram match)

// Paraphrased leak
const paraphrasedOutput = "As a finance professional, I avoid talking about crypto investing.";
const paraphrased = sanitize(paraphrasedOutput, systemPrompt);
console.log(paraphrased.leaked); // true (word overlap match)
```
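The word-overlap check behind the paraphrase path is Jaccard similarity over word sets. A rough sketch, where the tokenization details are assumptions rather than the library's exact rules:

```typescript
// Sketch of Jaccard similarity over word sets -- illustrative only,
// not the library's exact tokenization.
function jaccard(a: string, b: string): number {
  const tokens = (s: string): Set<string> =>
    new Set(
      s.toLowerCase().replace(/[^\w\s]/g, "").split(/\s+/).filter(Boolean)
    );
  const setA = tokens(a);
  const setB = tokens(b);
  let intersection = 0;
  for (const t of setA) {
    if (setB.has(t)) intersection++;
  }
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```

A score at or above wordOverlapThreshold flags the output even when no exact n-gram match exists, which is what catches the reworded example above.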
Typical latency: <3ms for outputs up to 8KB.
Limitations
sanitize() uses heuristic n-gram matching. It is effective for detecting direct leaks and close paraphrases, but has limitations:
- Heavy paraphrasing: Completely reworded leaks may evade detection
- Semantic leaks: Leaking the “spirit” of instructions without specific words
- Context-dependent leaks: Leaks that only make sense with conversation history
- False positives: Generic phrases that appear in both prompt and legitimate output
Recommendations:
- Use harden() to instruct the model not to leak instructions
- Tune threshold and wordOverlapThreshold for your use case
- Combine with periodic scanning using ZeroLeaks
- For high-risk applications, use detectOnly: true and manually review flagged outputs
When to use sanitize()
Use sanitize() when:
- Your system prompt contains sensitive policies or business logic
- Users might attempt prompt extraction attacks
- Model output is shown to untrusted parties
- Compliance requires preventing disclosure of internal instructions
Skip sanitize() when:
- Your system prompt is public or generic
- Performance is critical and prompt secrecy is low-risk
- You only need input protection (use detect() instead)