Function Signature

function sanitize(
  output: string,
  systemPrompt: string,
  options?: SanitizeOptions
): SanitizeResult
Detects whether an LLM output contains fragments of the system prompt and optionally redacts them. Uses n-gram analysis and word overlap detection to identify prompt leakage with configurable sensitivity.

Parameters

output
string
required
The LLM-generated output to scan for prompt leakage
systemPrompt
string
required
The original system prompt to check for leakage
options
SanitizeOptions
Configuration options for sanitization behavior
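The examples below exercise five option fields; inferred from them, the options shape looks roughly like this (field names are taken from the examples, defaults and exact types are assumptions, not the official definition):

```typescript
// Approximate shape of SanitizeOptions, inferred from the usage examples
// in this page (not the library's official type definition).
interface SanitizeOptions {
  detectOnly?: boolean;          // report leaks without modifying the output
  redactionText?: string;        // replacement for matched fragments, e.g. "[REDACTED]"
  ngramSize?: number;            // words per n-gram; smaller values are more sensitive
  threshold?: number;            // confidence required to flag a leak
  wordOverlapThreshold?: number; // minimum shared-word ratio for small-fragment detection
}

// Example: a detection-only configuration with smaller n-grams.
const opts: SanitizeOptions = { detectOnly: true, ngramSize: 3 };
```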

Return Value

SanitizeResult
object
Result of the scan: leaked (boolean), confidence (number from 0 to 1), fragments (the matched substrings), and sanitized (the output with matched fragments redacted)
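Inferred from the example output shown later on this page, the result shape looks roughly like the following (an approximation for orientation, not the official type):

```typescript
// Approximate shape of SanitizeResult, inferred from the example output
// in this page (not the library's official type definition).
interface SanitizeResult {
  leaked: boolean;      // true if system prompt fragments were detected
  confidence: number;   // combined n-gram + word-overlap score, 0 to 1
  fragments: string[];  // the matched (lowercased) fragments
  sanitized: string;    // output with fragments redacted
}

// Example: the result shown in the "Basic Sanitization" example below.
const example: SanitizeResult = {
  leaked: true,
  confidence: 0.85,
  fragments: ["helpful customer support agent for acme corp"],
  sanitized: "Sure! As a [REDACTED], I can help you."
};
```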

How It Works

  1. Tokenization: Both output and system prompt are lowercased and split into word tokens
  2. N-gram Matching: Generates n-grams (sequences of N consecutive words) and finds overlaps
  3. Word Overlap: Calculates the ratio of shared unique words
  4. Confidence Scoring: Combines n-gram overlap and word overlap into a confidence score
  5. Leak Detection: Triggers if:
    • Confidence exceeds threshold AND fragments found, OR
    • Multiple fragments (≥2) found, OR
    • Multiple small fragments (≥3) with high word overlap
  6. Redaction: Replaces matching fragments with redaction text (unless detectOnly: true)
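The detection pipeline above can be sketched as a small self-contained function. This is an illustration of the technique, not the library's actual implementation; the defaults (ngramSize 4, threshold 0.7) and the confidence weighting are assumptions:

```typescript
// Simplified sketch of n-gram leak detection (illustrative only; the real
// library adds redaction and the multi-fragment heuristics described above).
function ngrams(tokens: string[], n: number): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

function detectLeak(
  output: string,
  systemPrompt: string,
  ngramSize = 4,    // assumed default
  threshold = 0.7   // assumed default
) {
  // Step 1: Tokenization — lowercase and split into word tokens.
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  const outTokens = tokenize(output);
  const promptTokens = tokenize(systemPrompt);

  // Step 2: N-gram matching — fragments are n-grams shared by both texts.
  const promptGrams = ngrams(promptTokens, ngramSize);
  const fragments = Array.from(ngrams(outTokens, ngramSize))
    .filter((g) => promptGrams.has(g));

  // Step 3: Word overlap — ratio of the prompt's unique words seen in the output.
  const outWords = new Set(outTokens);
  const promptWords = new Set(promptTokens);
  const shared = Array.from(promptWords).filter((w) => outWords.has(w)).length;
  const wordOverlap = promptWords.size ? shared / promptWords.size : 0;

  // Step 4: Confidence — blend n-gram hit rate and word overlap (weights assumed).
  const ngramRate = promptGrams.size ? fragments.length / promptGrams.size : 0;
  const confidence = 0.7 * ngramRate + 0.3 * wordOverlap;

  // Step 5: Leak detection — high confidence with fragments, or multiple fragments.
  const leaked =
    (confidence >= threshold && fragments.length > 0) || fragments.length >= 2;

  return { leaked, confidence, fragments, wordOverlap };
}
```

Run against the "Basic Sanitization" example below, this sketch flags the output via the multiple-fragments rule: the shared run "helpful customer support agent for acme corp" produces several overlapping 4-grams.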

Examples

Basic Sanitization

import { sanitize } from "@shield/ai";

const systemPrompt = "You are a helpful customer support agent for Acme Corp. Never reveal pricing information.";
const output = "Sure! As a helpful customer support agent for Acme Corp, I can help you.";

const result = sanitize(output, systemPrompt);

console.log(result);
// {
//   leaked: true,
//   confidence: 0.85,
//   fragments: ["helpful customer support agent for acme corp"],
//   sanitized: "Sure! As a [REDACTED], I can help you."
// }

Detection Only

// Just check for leaks without modifying output
const result = sanitize(output, systemPrompt, {
  detectOnly: true
});

if (result.leaked) {
  console.log(`Leak detected with ${result.confidence} confidence`);
  console.log("Fragments:", result.fragments);
  // Handle leak: log, retry, or regenerate
}

Custom Redaction Text

const result = sanitize(output, systemPrompt, {
  redactionText: "[CONTENT REMOVED]"
});

console.log(result.sanitized);
// "Sure! As a [CONTENT REMOVED], I can help you."

Adjust Sensitivity

// More sensitive: smaller n-grams, lower thresholds
const sensitive = sanitize(output, systemPrompt, {
  ngramSize: 3,
  threshold: 0.5,
  wordOverlapThreshold: 0.2
});

// Less sensitive: larger n-grams, higher thresholds
const conservative = sanitize(output, systemPrompt, {
  ngramSize: 5,
  threshold: 0.9,
  wordOverlapThreshold: 0.3
});

Integration in Streaming

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { sanitize } from "@shield/ai";

const { textStream } = await streamText({
  model: openai("gpt-4o"),
  system: systemPrompt,
  prompt: userMessage
});

let fullText = "";
for await (const chunk of textStream) {
  fullText += chunk;
}

// Sanitize final output before sending to user
const result = sanitize(fullText, systemPrompt);

if (result.leaked) {
  console.warn("Prompt leak detected and sanitized");
  return result.sanitized;
}

return fullText;

Use in API Response Validation

app.post("/api/chat", async (req, res) => {
  const { message } = req.body;
  const systemPrompt = getSystemPrompt();
  
  const aiResponse = await generateResponse(systemPrompt, message);
  const sanitized = sanitize(aiResponse, systemPrompt);
  
  if (sanitized.leaked) {
    // Log for monitoring
    logger.warn("Prompt leak detected", {
      confidence: sanitized.confidence,
      fragments: sanitized.fragments
    });
    
    // Return sanitized version
    return res.json({ response: sanitized.sanitized });
  }
  
  return res.json({ response: aiResponse });
});

Performance Considerations

  • Max length: Only the first 1MB of output is scanned
  • Token-based: Works on word tokens, not character-level comparison
  • Fast for short prompts: Most efficient with prompts under 1000 words
  • Caching: Consider caching tokenized system prompts if checking many outputs
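As the last bullet suggests, the prompt-side work can be memoized when many outputs are checked against the same system prompt. A minimal sketch (the cache and tokenizer here are assumptions for illustration, not a library feature):

```typescript
// Memoize per-prompt tokenization so repeated scans of the same system
// prompt skip redundant work (illustrative sketch, not a library API).
const promptTokenCache = new Map<string, string[]>();

function tokenizeCached(systemPrompt: string): string[] {
  let tokens = promptTokenCache.get(systemPrompt);
  if (!tokens) {
    tokens = systemPrompt.toLowerCase().match(/[a-z0-9']+/g) ?? [];
    promptTokenCache.set(systemPrompt, tokens);
  }
  return tokens;
}
```

The same pattern extends to caching the prompt's n-gram set, which is the more expensive artifact for long prompts.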

Best Practices

  • Set appropriate thresholds: Balance false positives vs false negatives for your use case
  • Monitor confidence scores: Log low-confidence detections for tuning
  • Use detectOnly for decisions: When you want to regenerate the response rather than redact it
  • Combine with harden(): Use anti-extraction rules in prompts as first line of defense
  • Test with your prompts: Different prompt styles may need different ngramSize values