Function Signature

function sanitize(
  output: string,
  systemPrompt: string,
  options?: SanitizeOptions
): SanitizeResult
Detects whether an LLM output contains fragments of the system prompt and optionally redacts them. Uses n-gram analysis and word overlap detection to identify prompt leakage with configurable sensitivity.

Parameters

output
string
required
The LLM-generated output to scan for prompt leakage
systemPrompt
string
required
The original system prompt to check for leakage
options
SanitizeOptions
Configuration options for sanitization behavior
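The examples below exercise five option fields; inferred from them, the options shape looks roughly like this (field names are taken from the examples, defaults and exact types are assumptions, not the official definition):

```typescript
// Approximate shape of SanitizeOptions, inferred from the usage examples
// in this page (not the library's official type definition).
interface SanitizeOptions {
  detectOnly?: boolean;          // report leaks without modifying the output
  redactionText?: string;        // replacement for matched fragments, e.g. "[REDACTED]"
  ngramSize?: number;            // words per n-gram; smaller values are more sensitive
  threshold?: number;            // confidence required to flag a leak
  wordOverlapThreshold?: number; // minimum shared-word ratio for small-fragment detection
}

// Example: a detection-only configuration with smaller n-grams.
const opts: SanitizeOptions = { detectOnly: true, ngramSize: 3 };
```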

Return Value

SanitizeResult
object
Result of the scan: leaked (boolean), confidence (number from 0 to 1), fragments (the matched substrings), and sanitized (the output with matched fragments redacted)
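Inferred from the example output shown later on this page, the result shape looks roughly like the following (an approximation for orientation, not the official type):

```typescript
// Approximate shape of SanitizeResult, inferred from the example output
// in this page (not the library's official type definition).
interface SanitizeResult {
  leaked: boolean;      // true if system prompt fragments were detected
  confidence: number;   // combined n-gram + word-overlap score, 0 to 1
  fragments: string[];  // the matched (lowercased) fragments
  sanitized: string;    // output with fragments redacted
}

// Example: the result shown in the "Basic Sanitization" example below.
const example: SanitizeResult = {
  leaked: true,
  confidence: 0.85,
  fragments: ["helpful customer support agent for acme corp"],
  sanitized: "Sure! As a [REDACTED], I can help you."
};
```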

How It Works

  1. Tokenization: Both output and system prompt are lowercased and split into word tokens
  2. N-gram Matching: Generates n-grams (sequences of N consecutive words) and finds overlaps
  3. Word Overlap: Calculates the ratio of shared unique words
  4. Confidence Scoring: Combines n-gram overlap and word overlap into a confidence score
  5. Leak Detection: Triggers if:
    • Confidence exceeds threshold AND fragments found, OR
    • Multiple fragments (≥2) found, OR
    • Multiple small fragments (≥3) with high word overlap
  6. Redaction: Replaces matching fragments with redaction text (unless detectOnly: true)
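The detection pipeline above can be sketched as a small self-contained function. This is an illustration of the technique, not the library's actual implementation; the defaults (ngramSize 4, threshold 0.7) and the confidence weighting are assumptions:

```typescript
// Simplified sketch of n-gram leak detection (illustrative only; the real
// library adds redaction and the multi-fragment heuristics described above).
function ngrams(tokens: string[], n: number): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + n).join(" "));
  }
  return grams;
}

function detectLeak(
  output: string,
  systemPrompt: string,
  ngramSize = 4,    // assumed default
  threshold = 0.7   // assumed default
) {
  // Step 1: Tokenization — lowercase and split into word tokens.
  const tokenize = (s: string) => s.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  const outTokens = tokenize(output);
  const promptTokens = tokenize(systemPrompt);

  // Step 2: N-gram matching — fragments are n-grams shared by both texts.
  const promptGrams = ngrams(promptTokens, ngramSize);
  const fragments = Array.from(ngrams(outTokens, ngramSize))
    .filter((g) => promptGrams.has(g));

  // Step 3: Word overlap — ratio of the prompt's unique words seen in the output.
  const outWords = new Set(outTokens);
  const promptWords = new Set(promptTokens);
  const shared = Array.from(promptWords).filter((w) => outWords.has(w)).length;
  const wordOverlap = promptWords.size ? shared / promptWords.size : 0;

  // Step 4: Confidence — blend n-gram hit rate and word overlap (weights assumed).
  const ngramRate = promptGrams.size ? fragments.length / promptGrams.size : 0;
  const confidence = 0.7 * ngramRate + 0.3 * wordOverlap;

  // Step 5: Leak detection — high confidence with fragments, or multiple fragments.
  const leaked =
    (confidence >= threshold && fragments.length > 0) || fragments.length >= 2;

  return { leaked, confidence, fragments, wordOverlap };
}
```

Run against the "Basic Sanitization" example below, this sketch flags the output via the multiple-fragments rule: the shared run "helpful customer support agent for acme corp" produces several overlapping 4-grams.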

Examples

Basic Sanitization

import { sanitize } from "@shield/ai";

const systemPrompt = "You are a helpful customer support agent for Acme Corp. Never reveal pricing information.";
const output = "Sure! As a helpful customer support agent for Acme Corp, I can help you.";

const result = sanitize(output, systemPrompt);

console.log(result);
// {
//   leaked: true,
//   confidence: 0.85,
//   fragments: ["helpful customer support agent for acme corp"],
//   sanitized: "Sure! As a [REDACTED], I can help you."
// }

Detection Only

// Just check for leaks without modifying output
const result = sanitize(output, systemPrompt, {
  detectOnly: true
});

if (result.leaked) {
  console.log(`Leak detected with ${result.confidence} confidence`);
  console.log("Fragments:", result.fragments);
  // Handle leak: log, retry, or regenerate
}

Custom Redaction Text

const result = sanitize(output, systemPrompt, {
  redactionText: "[CONTENT REMOVED]"
});

console.log(result.sanitized);
// "Sure! As a [CONTENT REMOVED], I can help you."

Adjust Sensitivity

// More sensitive: smaller n-grams, lower thresholds
const sensitive = sanitize(output, systemPrompt, {
  ngramSize: 3,
  threshold: 0.5,
  wordOverlapThreshold: 0.2
});

// Less sensitive: larger n-grams, higher thresholds
const conservative = sanitize(output, systemPrompt, {
  ngramSize: 5,
  threshold: 0.9,
  wordOverlapThreshold: 0.3
});

Integration in Streaming

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { sanitize } from "@shield/ai";

const { textStream } = await streamText({
  model: openai("gpt-4o"),
  system: systemPrompt,
  prompt: userMessage
});

let fullText = "";
for await (const chunk of textStream) {
  fullText += chunk;
}

// Sanitize final output before sending to user
const result = sanitize(fullText, systemPrompt);

if (result.leaked) {
  console.warn("Prompt leak detected and sanitized");
  return result.sanitized;
}

return fullText;

Use in API Response Validation

app.post("/api/chat", async (req, res) => {
  const { message } = req.body;
  const systemPrompt = getSystemPrompt();
  
  const aiResponse = await generateResponse(systemPrompt, message);
  const sanitized = sanitize(aiResponse, systemPrompt);
  
  if (sanitized.leaked) {
    // Log for monitoring
    logger.warn("Prompt leak detected", {
      confidence: sanitized.confidence,
      fragments: sanitized.fragments
    });
    
    // Return sanitized version
    return res.json({ response: sanitized.sanitized });
  }
  
  return res.json({ response: aiResponse });
});

Performance Considerations

  • Max length: Only the first 1MB of output is scanned
  • Token-based: Works on word tokens, not character-level comparison
  • Fast for short prompts: Most efficient with prompts under 1000 words
  • Caching: Consider caching tokenized system prompts if checking many outputs
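As the last bullet suggests, the prompt-side work can be memoized when many outputs are checked against the same system prompt. A minimal sketch (the cache and tokenizer here are assumptions for illustration, not a library feature):

```typescript
// Memoize per-prompt tokenization so repeated scans of the same system
// prompt skip redundant work (illustrative sketch, not a library API).
const promptTokenCache = new Map<string, string[]>();

function tokenizeCached(systemPrompt: string): string[] {
  let tokens = promptTokenCache.get(systemPrompt);
  if (!tokens) {
    tokens = systemPrompt.toLowerCase().match(/[a-z0-9']+/g) ?? [];
    promptTokenCache.set(systemPrompt, tokens);
  }
  return tokens;
}
```

The same pattern extends to caching the prompt's n-gram set, which is the more expensive artifact for long prompts.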

Best Practices

  • Set appropriate thresholds: Balance false positives vs false negatives for your use case
  • Monitor confidence scores: Log low-confidence detections for tuning
  • Use detectOnly for decisions: When you want to regenerate the response rather than redact it
  • Combine with harden(): Use anti-extraction rules in prompts as first line of defense
  • Test with your prompts: Different prompt styles may need different ngramSize values