
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/facebookresearch/audioseal/llms.txt

Use this file to discover all available pages before exploring further.

Overview

The AudioSeal detector (AudioSealDetector class) identifies watermarked audio segments and decodes embedded messages with sample-level precision. Unlike traditional watermark detectors that output a single binary decision, AudioSeal provides frame-by-frame probabilities, enabling localized detection in edited or concatenated audio.

Detector Architecture

The detector is simpler than the generator, consisting of two main components:

1. SEANet Encoder (Keep Dimension)

# From audioseal/models.py:355
class AudioSealDetector(torch.nn.Module):
    def __init__(
        self,
        encoder: SEANetEncoderKeepDimension,
        normalizer: Optional[NormalizationProcessor] = None,
        nbits: int = 0,
    ):
        super().__init__()
        last_layer = torch.nn.Conv1d(encoder.output_dim, 2 + nbits, 1)
        self.detector = torch.nn.Sequential(encoder, last_layer)
        self.nbits = nbits
Key Difference from Generator: the detector uses SEANetEncoderKeepDimension instead of the regular SEANetEncoder.
The standard encoder downsamples audio by a factor of 320 (with the default ratios), collapsing temporal information. The detector needs to maintain temporal resolution to provide frame-by-frame detection probabilities. SEANetEncoderKeepDimension processes audio while preserving the temporal dimension, enabling localized watermark detection.
  • Same convolutional structure as the generator encoder
  • No temporal downsampling (or compensated with appropriate padding/upsampling)
  • Outputs: (batch, output_dim=32, frames) where frames ≈ input_samples
  • Much larger output than the standard (downsampling) encoder produces
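The shape contract above can be illustrated with a toy stand-in (an assumption for illustration, not the real SEANet architecture): stride-1 convolutions with symmetric padding keep the frame count equal to the sample count.

```python
import torch

# Toy stand-in for a "keep-dimension" encoder (not the real
# SEANetEncoderKeepDimension): stride-1 convs with symmetric padding
# preserve the temporal dimension.
class ToyKeepDimEncoder(torch.nn.Module):
    def __init__(self, output_dim: int = 32):
        super().__init__()
        self.output_dim = output_dim
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(1, 16, kernel_size=7, padding=3),
            torch.nn.ELU(),
            torch.nn.Conv1d(16, output_dim, kernel_size=7, padding=3),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

x = torch.randn(2, 1, 16000)   # (batch, channels, samples)
h = ToyKeepDimEncoder()(x)
print(tuple(h.shape))          # frame count matches input sample count
```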

2. Detection Head (1x1 Convolution)

A simple 1x1 convolution projects the encoder output to detection logits:
last_layer = torch.nn.Conv1d(encoder.output_dim, 2 + nbits, 1)
Output Channels:
  • Channels 0-1: Detection logits (watermark absent/present)
  • Channels 2 to (1+nbits): Message decoding logits (16 channels for a 16-bit message)
The 1x1 convolution acts as a learned linear projection applied independently to each time frame, enabling efficient frame-by-frame prediction.
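To see why a 1x1 convolution is a per-frame linear projection, the sketch below (shapes taken from the defaults above) checks that Conv1d with kernel size 1 matches a Linear layer applied independently at each time frame:

```python
import torch

# Shapes assumed from the defaults above: output_dim=32, nbits=16.
output_dim, nbits = 32, 16
last_layer = torch.nn.Conv1d(output_dim, 2 + nbits, kernel_size=1)

features = torch.randn(1, output_dim, 100)   # (batch, channels, frames)
logits = last_layer(features)                # (1, 2 + nbits, frames)

# The same map expressed as a per-frame Linear on the channel dimension:
linear = torch.nn.Linear(output_dim, 2 + nbits)
with torch.no_grad():
    linear.weight.copy_(last_layer.weight.squeeze(-1))
    linear.bias.copy_(last_layer.bias)
logits_lin = linear(features.transpose(1, 2)).transpose(1, 2)

assert torch.allclose(logits, logits_lin, atol=1e-5)
```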

Detection Process

The forward pass consists of several steps:

Step 1: Optional Loudness Normalization

# From audioseal/models.py:444
if self.normalizer is not None and not torch.jit.is_scripting():
    x = self.normalizer.loudness_normalization(x)
Loudness normalization helps maintain consistent detection performance across audio with varying volume levels:
  1. Window Audio: divide the audio into overlapping windows
  2. Compute RMS: calculate the energy of each window
  3. Calculate Gain: scale each window to the target RMS (default: 0.1)
  4. Apply with Hann Window: smooth the scaling to avoid artifacts
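These four steps can be sketched as a windowed overlap-add normalization. This is a minimal sketch with assumed window and hop sizes, not the library's actual NormalizationProcessor:

```python
import torch

def loudness_normalize(x, win=4096, hop=2048, target_rms=0.1, eps=1e-8):
    """Minimal overlap-add sketch of the four steps above; window/hop sizes
    and implementation details are assumptions."""
    out = torch.zeros_like(x)
    norm = torch.zeros_like(x)
    hann = torch.hann_window(win)
    for start in range(0, x.shape[-1] - win + 1, hop):
        seg = x[..., start:start + win]
        rms = seg.pow(2).mean(dim=-1, keepdim=True).sqrt()  # window energy
        gain = target_rms / (rms + eps)                     # scale to target RMS
        out[..., start:start + win] += seg * gain * hann    # Hann-weighted overlap-add
        norm[..., start:start + win] += hann
    return out / norm.clamp(min=eps)

torch.manual_seed(0)
x = 0.01 * torch.randn(1, 16000)   # quiet input, RMS ~0.01
y = loudness_normalize(x)          # interior RMS pulled toward 0.1
```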

Step 2: Encoder Processing

result = self.detector(x)  # Shape: (batch, 2+nbits, frames)
The encoder processes the audio while maintaining temporal dimension, producing a multi-channel output with detection and message information.

Step 3: Detection Probability Calculation

# From audioseal/models.py:452
# Softmax on first 2 channels for detection
result[:, :2, :] = torch.softmax(result[:, :2, :], dim=1)
The first two channels contain raw logits that are converted to probabilities:
  • Channel 0: P(no watermark)
  • Channel 1: P(watermark present)
After softmax, result[:, 0, :] + result[:, 1, :] = 1.0 for each frame, ensuring valid probability distribution.
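A quick check of this property on toy logits:

```python
import torch

torch.manual_seed(0)
# Raw detector output: (batch, 2 + nbits, frames); softmax is applied
# only to the first two channels, independently per frame.
result = torch.randn(1, 18, 5)
result[:, :2, :] = torch.softmax(result[:, :2, :], dim=1)

frame_sums = result[:, 0, :] + result[:, 1, :]
print(frame_sums)   # each frame sums to 1.0
```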

Step 4: Message Decoding

# From audioseal/models.py:421
@torch.jit.export
def decode_message(self, result: torch.Tensor) -> torch.Tensor:
    """
    Decode the message from the watermark result (batch x nbits x frames)
    Returns: The message of size batch x nbits (probability of 1 for each bit)
    """
    decoded_message = result.mean(dim=-1)  # Average across all frames
    return torch.sigmoid(decoded_message)  # Convert to [0, 1] probabilities
The same message is embedded throughout the entire watermarked audio. By averaging predictions across all frames, we:
  • Reduce noise and improve accuracy
  • Aggregate evidence from the entire audio
  • Obtain a single consensus message prediction
After averaging, the raw logits are passed through sigmoid to convert to probabilities in [0, 1], where:
  • Values close to 0 indicate bit = 0
  • Values close to 1 indicate bit = 1
  • Values near 0.5 indicate uncertainty
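The decoding logic can be exercised on synthetic per-frame logits for a known message (a toy 4-bit example; the real model uses 16 bits):

```python
import torch

torch.manual_seed(0)

def decode_message(result: torch.Tensor) -> torch.Tensor:
    # Same averaging + sigmoid as the library method above
    return torch.sigmoid(result.mean(dim=-1))

# Synthetic per-frame logits for a known 4-bit message: strong logits of
# the right sign, plus per-frame noise.
true_bits = torch.tensor([1.0, 0.0, 1.0, 1.0])
logits = (true_bits * 2 - 1).view(1, 4, 1) * 3 + torch.randn(1, 4, 200)

probs = decode_message(logits)       # (1, 4), each value in [0, 1]
decoded = (probs > 0.5).int()
print(decoded.squeeze(0).tolist())   # recovers [1, 0, 1, 1]
```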

High-Level Detection API

The detect_watermark method provides a convenient interface:
# From audioseal/models.py:390
@torch.jit.export
def detect_watermark(
    self,
    x: torch.Tensor,
    sample_rate: Optional[int] = None,
    message_threshold: float = 0.5,
    detection_threshold: float = 0.5,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Returns:
        detect_prob: Probability of audio being watermarked (scalar per batch)
        message: Binary message tensor (batch x nbits)
    """
    result, message = self.forward(x, sample_rate=sample_rate)
    
    # Count frames above threshold
    detect_prob = (
        torch.count_nonzero(
            torch.gt(result[:, 1, :], detection_threshold), dim=-1
        ) / result.shape[-1]
    )
    
    # Convert message probabilities to binary
    message = torch.gt(message, message_threshold).int()
    
    return detect_prob, message
The method works in four steps:
  1. Get Frame Probabilities: run the forward pass to get per-frame detection probabilities
  2. Apply Detection Threshold: count frames where P(watermark) > threshold (default 0.5)
  3. Calculate Overall Probability: the proportion of frames above threshold is the overall detection score
  4. Binarize Message: convert message probabilities to binary using the message threshold
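The thresholding arithmetic in isolation, on toy per-frame probabilities:

```python
import torch

# Toy per-frame watermark probabilities for one clip: (batch, frames)
wm_prob = torch.tensor([[0.9, 0.8, 0.2, 0.95, 0.1]])
detection_threshold = 0.5

# Fraction of frames above the threshold, as in detect_watermark above
detect_prob = (
    torch.count_nonzero(torch.gt(wm_prob, detection_threshold), dim=-1)
    / wm_prob.shape[-1]
)
print(detect_prob)   # 3 of 5 frames exceed 0.5
```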

Threshold Parameters

Two key thresholds control detection behavior:

Detection Threshold

detection_threshold: float = 0.5  # Default

Lower Threshold (e.g., 0.3)

  • More sensitive detection
  • Higher recall (fewer false negatives)
  • More false positives

Higher Threshold (e.g., 0.7)

  • More conservative detection
  • Higher precision (fewer false positives)
  • More false negatives

Message Threshold

message_threshold: float = 0.5  # Default
Determines when a message bit is considered 1 vs 0. Usually kept at 0.5 for balanced classification.
For production systems, tune detection_threshold based on your false positive/false negative tolerance. Use validation data to find the optimal threshold for your use case.

Usage Examples

Basic Detection

from audioseal import AudioSeal

# Load detector
detector = AudioSeal.load_detector("audioseal_detector_16bits")
detector.eval()

# Detect watermark (high-level API)
detect_prob, message = detector.detect_watermark(audio)

print(f"Detection probability: {detect_prob.item():.2%}")
if detect_prob > 0.5:
    print(f"Watermarked! Message: {message}")
else:
    print("No watermark detected")

Low-Level Detection (Frame-by-Frame)

import torch

# Get per-frame probabilities
result, message = detector(audio)

# result shape: (batch, 2, frames)
# Extract watermark probability for each frame
wm_prob_per_frame = result[:, 1, :]  # Shape: (batch, frames)

# Find watermarked regions
watermarked_frames = torch.where(wm_prob_per_frame > 0.5)[1]

print(f"Watermark detected in {len(watermarked_frames)} frames")
print(f"Total frames: {wm_prob_per_frame.shape[1]}")

Custom Thresholds

# More sensitive detection
detect_prob, message = detector.detect_watermark(
    audio,
    detection_threshold=0.3,  # Lower threshold
    message_threshold=0.5
)

# More conservative detection
detect_prob, message = detector.detect_watermark(
    audio,
    detection_threshold=0.7,  # Higher threshold
    message_threshold=0.5
)

Localized Detection in Edited Audio

# Detect watermarks in potentially edited audio
import numpy as np
from scipy.ndimage import label

result, message = detector(edited_audio)
wm_prob = result[:, 1, :]  # Per-frame probabilities

# Find contiguous watermarked segments
watermarked_binary = (wm_prob[0] > 0.5).cpu().numpy()
segments, num_segments = label(watermarked_binary)

print(f"Found {num_segments} watermarked segments")

# Assuming a 16kHz sample rate, 1 frame ≈ 1 sample
for i in range(1, num_segments + 1):
    segment_frames = np.where(segments == i)[0]
    start_time = segment_frames[0] / 16000
    end_time = segment_frames[-1] / 16000
    print(f"Segment {i}: {start_time:.2f}s - {end_time:.2f}s")
This localized detection enables identifying which parts of an audio file are watermarked, even if the audio has been edited or concatenated with unwatermarked content.

Performance Characteristics

Speed

Single forward pass through a convolutional network. Up to 100x faster than iterative decoding methods.

Accuracy

State-of-the-art detection performance even after compression, noise, and editing.

Localization

Frame-level precision enables detection in edited audio at 1/16,000 second resolution.

Scalability

Efficient batch processing for large-scale detection tasks.

Robustness to Audio Transformations

The detector is trained to be robust against common audio manipulations:

Compression
  • MP3 encoding (various bitrates)
  • AAC encoding
  • Opus codec
Detection remains reliable even at moderate compression levels.

Noise
  • Additive Gaussian noise
  • Environmental noise
  • Background music
Loudness normalization helps maintain detection under varying noise conditions.

Editing
  • Cutting and splicing
  • Concatenation
  • Speed changes
  • Volume adjustments
Localized detection enables identifying watermarked segments even in heavily edited audio.

Sample Rates
  • Different sample rates (24kHz, 44.1kHz, 48kHz)
  • Sample rate conversion
The model generalizes well to different sample rates despite being trained on 16kHz.
While AudioSeal is robust to many transformations, extremely aggressive modifications (e.g., very low bitrate compression, severe distortion) may degrade detection performance.

Technical Specifications

Parameter             Value     Description
encoder.output_dim    32        Encoder output channels
nbits                 16        Message length (0 for detection-only)
detection_threshold   0.5       Default frame-level threshold
message_threshold     0.5       Default message bit threshold
frames_per_second     ~16,000   Temporal resolution at 16kHz

Design Choices

Why Frame-by-Frame Detection?

Traditional watermark detectors output a single binary decision for an entire audio file. AudioSeal’s frame-by-frame approach enables:
  1. Localized Detection: Identify which parts are watermarked
  2. Edit Detection: Find where audio was cut or modified
  3. Robustness: Aggregate evidence across multiple frames
  4. Flexibility: Apply different thresholds for different use cases

Why Separate Detection and Message Channels?

The detector outputs both detection logits (2 channels) and message logits (16 channels) simultaneously:
  • Detection is always active and works even without a message
  • Message is optional metadata that doesn’t affect detection
  • Allows using the same model for both 0-bit (detection-only) and 16-bit (detection + message) watermarking
You can train a detector with nbits=0 for detection-only applications, reducing model size and complexity.
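A toy illustration of the channel count for both configurations, using a single padded conv as a stand-in for the real SEANetEncoderKeepDimension:

```python
import torch

torch.manual_seed(0)
# Stand-in "keep-dimension" encoder (an assumption for illustration;
# the real model uses SEANetEncoderKeepDimension).
encoder = torch.nn.Conv1d(1, 32, kernel_size=7, padding=3)

for nbits in (0, 16):
    head = torch.nn.Conv1d(32, 2 + nbits, kernel_size=1)
    out = torch.nn.Sequential(encoder, head)(torch.randn(1, 1, 1000))
    print(nbits, tuple(out.shape))   # 2 channels for nbits=0, 18 for nbits=16
```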

Next Steps

Localized Watermarking

Understand sample-level precision

Generation

Learn about watermark generation

API Reference

Full detector API documentation
