
This guide shows you how to train custom AudioSeal models using your own datasets and configurations.
The training pipeline is built on AudioCraft (version 1.4.0a1 or later) with PyTorch 2.1.0 and torchaudio 2.1.0.

Prerequisites

Before starting, ensure you have the required dependencies:
1. Install AudioCraft

AudioCraft >=1.4.0a1 is required. Install from source for maximum flexibility:
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
2. Install ffmpeg

ffmpeg older than version 5.0.0 is mandatory for AAC augmentation during training:
# On Ubuntu/Debian
sudo apt-get install ffmpeg

# Or with Anaconda/Miniconda
conda install "ffmpeg<5" -c conda-forge
Training will fail without ffmpeg, as AAC augmentation depends on it.
3. Verify Installation

Check that AudioCraft and ffmpeg are properly installed:
python -c "import audiocraft; print(audiocraft.__version__)"
ffmpeg -version

Dataset Preparation

AudioSeal requires datasets in AudioCraft’s format. Here’s how to prepare them:

Using VoxPopuli (Paper Dataset)

VoxPopuli is the dataset used in the AudioSeal paper:
# Download the VoxPopuli tools
git clone https://github.com/facebookresearch/voxpopuli.git
cd voxpopuli

# Download and segment the raw audio
python -m voxpopuli.download_audios --root [ROOT] --subset 400k
python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset 400k

# Prepare the manifest with AudioCraft
cd [PATH_TO_AUDIOCRAFT]
python -m audiocraft.data.audio_dataset [ROOT] egs/voxpopuli/data.jsonl.gz
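
The manifest produced above is a gzip-compressed JSON Lines file with one entry per audio file. A minimal, self-contained sketch of that layout (the field names shown are assumptions based on typical AudioCraft manifests and may differ across versions; real manifests come from `audiocraft.data.audio_dataset`, not hand-written entries):

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical example of a single manifest entry; treat these keys as assumed.
entry = {
    "path": "/data/voxpopuli/segment_0001.wav",
    "duration": 12.8,        # seconds
    "sample_rate": 16000,    # Hz
}

# Manifests are gzip-compressed JSON Lines: one JSON object per audio file.
manifest = Path(tempfile.mkdtemp()) / "data.jsonl.gz"
with gzip.open(manifest, "wt") as f:
    f.write(json.dumps(entry) + "\n")

# Reading the manifest back is the mirror image of writing it.
with gzip.open(manifest, "rt") as f:
    for line in f:
        meta = json.loads(line)
        print(meta["path"], meta["duration"], meta["sample_rate"])
```

Inspecting a few lines of the real `data.jsonl.gz` this way is a quick sanity check that the manifest points at the files you expect.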

Dataset Configuration File

Create a dataset configuration file at [audiocraft_root]/configs/dset/audio/voxpopuli.yaml:
voxpopuli.yaml
# @package __global__

datasource:
  max_sample_rate: 16000
  max_channels: 1

  train: egs/voxpopuli
  valid: egs/voxpopuli
  evaluate: egs/voxpopuli
  generate: egs/voxpopuli

Using Custom Datasets

For your own dataset:
1. Organize Audio Files

Collect your audio files in a directory structure.
2. Create Manifest

Use AudioCraft’s data tool to create a manifest:
python -m audiocraft.data.audio_dataset \
  /path/to/your/audio/files \
  /path/to/output/manifest.jsonl.gz
3. Create Config File

Create a YAML config file in configs/dset/audio/ with your dataset paths.
See the AudioCraft dataset documentation for detailed instructions.

Training with Dora

AudioSeal uses Dora for experiment management and hyperparameter tuning.

Basic Training Command

Test the training pipeline locally:
# Navigate to AudioCraft directory
cd [PATH_TO_AUDIOCRAFT]

# Run training with example dataset
dora run solver=watermark/robustness dset=audio/example

# Run with VoxPopuli
dora run solver=watermark/robustness dset=audio/voxpopuli
By default, checkpoints and experiment files are stored in /tmp/audiocraft_$USER/outputs.

Custom Dora Configuration

To customize output directories and run on a SLURM cluster, create a config file:
my_config.yaml
default:
  dora_dir: /path/to/your/dora/experiments
  partitions:
    global: your_slurm_partition
    team: your_slurm_partition
  reference_dir: /tmp

darwin:  # Mac-specific config for local testing
  dora_dir: /path/to/local/dora/experiments
  partitions:
    global: your_slurm_partition
    team: your_slurm_partition
  reference_dir: /path/to/reference
Run training with custom config:
AUDIOCRAFT_CONFIG=my_config.yaml dora run \
  solver=watermark/robustness \
  dset=audio/voxpopuli

Training Parameters

Common parameters you can override:
# Train with specific number of bits
dora run solver=watermark/robustness \
  dset=audio/voxpopuli \
  +dummy_watermarker.nbits=16

# Adjust model architecture
dora run solver=watermark/robustness \
  dset=audio/voxpopuli \
  seanet.detector.output_dim=32

# Multi-GPU training
dora run solver=watermark/robustness \
  dset=audio/voxpopuli \
  device=cuda \
  ddp.world_size=4

Running Training Grids

For hyperparameter sweeps, use Dora grids:
# Reproduce the HuggingFace AudioSeal model (from ICML paper)
AUDIOCRAFT_CONFIG=my_config.yaml \
AUDIOCRAFT_DSET=audio/voxpopuli \
dora grid watermarking.1315_kbits_seeds
This runs multiple experiments with different hyperparameter combinations. See the AudioCraft watermarking grid for details.

Checkpoint Evaluation

After training completes, evaluate your checkpoints:

Locate Checkpoints

Checkpoints are saved to:
[DORA_DIR]/xps/[HASH-ID]/checkpoint_XXX.th
The HASH-ID is shown in the output log when running dora run.
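If you have run several experiments, a small sketch like the following lists every saved checkpoint under your Dora directory (the `dora_dir` path is a placeholder; substitute the value from your my_config.yaml):

```python
from pathlib import Path

# Placeholder: substitute the dora_dir configured in my_config.yaml.
dora_dir = Path("/path/to/your/dora/experiments")

# Each run gets its own xps/<HASH-ID>/ directory; list every saved
# checkpoint so you can pick one to evaluate or convert.
for ckpt in sorted(dora_dir.glob("xps/*/checkpoint_*.th")):
    print(ckpt.parent.name, ckpt.name)
```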

Evaluate a Checkpoint

AUDIOCRAFT_CONFIG=my_config.yaml dora run \
  solver=watermark/robustness \
  execute_only=evaluate \
  dset=audio/voxpopuli \
  continue_from=/path/to/checkpoint_XXX.th \
  +dummy_watermarker.nbits=16 \
  seanet.detector.output_dim=32
Evaluate with different nbits settings to find the best configuration for your use case.

Converting Checkpoints for Inference

Training checkpoints contain both the generator and detector. Extract them separately for use with the AudioSeal API:

Run Conversion Script

python [AUDIOSEAL_PATH]/src/scripts/checkpoints.py \
  --checkpoint=/path/to/checkpoint_XXX.th \
  --outdir=/path/to/output \
  --suffix=my_model
This creates:
  • generator_my_model.pth
  • detector_my_model.pth

Use Converted Checkpoints

from audioseal import AudioSeal

# Load your custom models
model = AudioSeal.load_generator(
    "/path/to/output/generator_my_model.pth",
    nbits=16
)

detector = AudioSeal.load_detector(
    "/path/to/output/detector_my_model.pth",
    nbits=16
)

# Use as normal
watermark = model.get_watermark(wav)
result, message = detector.detect_watermark(watermarked_audio)
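
The snippet above leaves `result` and `message` unused. A minimal, self-contained sketch of post-processing them, using plain-Python stand-ins for the returned detection score and decoded bit vector (the 0.5 threshold and MSB-first bit order are illustrative assumptions, not part of the AudioSeal API):

```python
# Stand-ins for the values returned by detector.detect_watermark:
# `result` mimics a detection score, `bits` a decoded nbits-long message.
result = 0.97
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]  # nbits=16

THRESHOLD = 0.5  # assumed cutoff: scores above this count as "watermarked"
if result > THRESHOLD:
    # Pack the bit vector into a single integer identifier (MSB first).
    message_id = int("".join(str(b) for b in bits), 2)
    print(f"watermarked, message id = {message_id}")
else:
    print("no watermark detected")
```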

Training Configuration

Key hyperparameters in the training configuration:
# SEANet encoder/decoder configuration
seanet:
  channels: 1
  dimension: 128
  n_filters: 32
  n_residual_layers: 1
  ratios: [8, 5, 4, 2]
  activation: ELU
  norm: weight_norm
  
  encoder:
    output_dim: 128
  
  decoder:
    output_dim: 1
  
  detector:
    output_dim: 32  # 2 + nbits

Troubleshooting

Unsupported Formats Error (Linux)

If you encounter an "Unsupported formats" error:
# Add your conda environment libs to LD_LIBRARY_PATH
LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH \
AUDIOCRAFT_CONFIG=my_config.yaml \
dora run solver=watermark/robustness dset=audio/voxpopuli

ffmpeg Not Found

Ensure ffmpeg is installed and accessible:
which ffmpeg  # Should show path to ffmpeg
ffmpeg -version  # Should show version < 5.0.0

Out of Memory

Reduce batch size or sequence length:
dora run solver=watermark/robustness \
  dset=audio/voxpopuli \
  batch_size=8 \
  max_segment_length=10.0

SLURM Issues

Verify partition names in your config:
partitions:
  global: correct_partition_name
  team: correct_partition_name

Complete Training Workflow

Here’s the complete workflow from dataset to inference:
1. Prepare Dataset

# Download and process VoxPopuli
python -m voxpopuli.download_audios --root /data/voxpopuli --subset 400k
python -m voxpopuli.get_unlabelled_data --root /data/voxpopuli --subset 400k

# Create manifest
python -m audiocraft.data.audio_dataset \
  /data/voxpopuli \
  /data/audiocraft/egs/voxpopuli/data.jsonl.gz
2. Create Configuration

# configs/dset/audio/voxpopuli.yaml
datasource:
  max_sample_rate: 16000
  max_channels: 1
  train: egs/voxpopuli
  valid: egs/voxpopuli
  evaluate: egs/voxpopuli
  generate: egs/voxpopuli
3. Run Training

AUDIOCRAFT_CONFIG=my_config.yaml dora run \
  solver=watermark/robustness \
  dset=audio/voxpopuli \
  +dummy_watermarker.nbits=16
4. Evaluate Checkpoint

AUDIOCRAFT_CONFIG=my_config.yaml dora run \
  solver=watermark/robustness \
  execute_only=evaluate \
  dset=audio/voxpopuli \
  continue_from=[DORA_DIR]/xps/[HASH]/checkpoint_best.th
5. Convert for Inference

python src/scripts/checkpoints.py \
  --checkpoint=[DORA_DIR]/xps/[HASH]/checkpoint_best.th \
  --outdir=./models \
  --suffix=custom_16bit
6. Use in Production

from audioseal import AudioSeal

model = AudioSeal.load_generator("./models/generator_custom_16bit.pth", nbits=16)
detector = AudioSeal.load_detector("./models/detector_custom_16bit.pth", nbits=16)

Next Steps

Attack Robustness

Learn how to evaluate your model’s robustness against attacks

API Reference

Explore the complete API for model loading and usage