Semantic Anomaly Datasets
Semantic anomaly datasets identify anomalies based on semantic attributes like color, object type, or facial features. LAFT provides three datasets designed for evaluating anomaly detection in semantic contexts.
Overview
All semantic datasets inherit from SemanticAnomalyDataset and provide:
Multi-attribute anomalies: Each sample has multiple boolean attributes (False: normal, True: anomaly)
Configurable definitions: Define what constitutes an anomaly via config dictionaries
Subset extraction: Get normal-only samples with get_normal_subset()
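The interface above can be sketched with a toy stand-in. This is illustrative only, not the real laft API: real datasets return torch boolean tensors, while this sketch uses plain tuples of bools.

```python
# Illustrative sketch (hypothetical names, not laft's implementation):
# a dataset with multi-attribute boolean labels and a normal-only subset.
class ToySemanticDataset:
    def __init__(self, samples):
        # samples: list of (image, attrs), attrs a tuple of bools
        # (False = normal, True = anomaly for that attribute)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]

    def get_normal_subset(self):
        # Keep only samples where every attribute is normal (all False)
        return ToySemanticDataset(
            [(img, attrs) for img, attrs in self.samples if not any(attrs)]
        )

data = ToySemanticDataset([
    ("img0", (False, False)),  # normal in both attributes
    ("img1", (True, False)),   # anomalous in the first attribute
])
print(len(data.get_normal_subset()))  # 1: only "img0" survives
```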
Building a Semantic Dataset
Use the build_semantic_dataset() function to load any semantic dataset:
from laft.datasets import build_semantic_dataset
from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])
dataset = build_semantic_dataset(
    name="color_mnist",   # or "waterbirds", "celeba"
    split="train",        # "train", "valid", or "test"
    root="./data",        # data directory
    transform=transform,
    config=None,          # None for default config
)
image, attrs = dataset[0]
print(f"Attributes: {attrs}")  # torch.Tensor of bools [num_attrs]
The config parameter is optional. Passing None uses the dataset’s default anomaly definition.
Color MNIST
Color MNIST combines digit classification with color attributes for multi-attribute anomaly detection.
Configuration
Define anomalies by digit and color:
from laft.datasets import build_semantic_dataset
config = {
    "number": {
        0: False,  # Normal
        1: False,  # Normal
        2: False,  # Normal
        3: False,  # Normal
        4: False,  # Normal
        5: True,   # Anomaly
        6: True,   # Anomaly
        7: True,   # Anomaly
        8: True,   # Anomaly
        9: True,   # Anomaly
    },
    "color": {
        "red": False,   # Normal
        "green": True,  # Anomaly
        "blue": True,   # Anomaly
    },
}
dataset = build_semantic_dataset(
    name="color_mnist",
    split="train",
    root="./data",
    config=config,
    seed=42,  # For reproducible train/valid split
)
Dataset Details
Attributes
number: Digit class (0-9)
color: Red, green, or blue
Splits
train: 45,000 images (4,500 per digit)
valid: 9,000 images (900 per digit)
test: 8,700 images (870 per digit)
Implementation Reference
The coloring process pads MNIST images and applies RGB channels:
# From laft/datasets/color_mnist.py:46-59
def _coloring(image, color: str) -> Image.Image:
    image = torch.constant_pad_nd(image, (28, 28, 28, 28), 0)
    zero_image = torch.zeros_like(image)
    if color == "red":
        image = torch.stack([image, zero_image, zero_image], dim=-1)
    elif color == "green":
        image = torch.stack([zero_image, image, zero_image], dim=-1)
    elif color == "blue":
        image = torch.stack([zero_image, zero_image, image], dim=-1)
    return Image.fromarray(image.numpy())
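The shape arithmetic behind this function: padding a 28×28 digit by 28 pixels on every side gives 84×84, and stacking three channels gives an 84×84×3 RGB array. A quick sanity check on a dummy tensor (values and dtype assumed, independent of laft):

```python
import torch

# Dummy 28x28 grayscale "digit"
digit = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Pad 28 pixels on left/right/top/bottom: 28 + 2*28 = 84 per side
padded = torch.constant_pad_nd(digit, (28, 28, 28, 28), 0)
zero = torch.zeros_like(padded)

# "Red" image: digit values in channel 0, zeros elsewhere
red = torch.stack([padded, zero, zero], dim=-1)

print(tuple(padded.shape))  # (84, 84)
print(tuple(red.shape))     # (84, 84, 3)
```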
Usage Example
from laft.datasets import build_semantic_dataset
from torch.utils.data import DataLoader
# Create dataset
dataset = build_semantic_dataset(
    name="color_mnist",
    split="train",
    root="./data",
)
print(dataset)  # Shows distribution table
# Create normal-only subset
normal_dataset = dataset.get_normal_subset()
# DataLoader
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for images, attrs in loader:
    # images: [batch_size, 3, 84, 84] (padded and colored)
    # attrs: [batch_size, 2] (number, color)
    number_anomaly = attrs[:, 0]  # True if digit is anomaly
    color_anomaly = attrs[:, 1]   # True if color is anomaly
    break
Waterbirds
The Waterbirds dataset contains images of land birds and water birds on land and water backgrounds, and is designed for studying spurious correlations.
Configuration
config = {
    "bird": {
        "land": True,    # Anomaly
        "water": False,  # Normal
    },
    "background": {
        "land": True,    # Anomaly
        "water": False,  # Normal
    },
}
dataset = build_semantic_dataset(
    name="waterbirds",
    split="train",
    root="./data",
    config=config,
)
Dataset Details
Attributes
bird: Land bird or water bird
background: Land or water setting
Source: Based on the Caltech-UCSD Birds 200 and Places datasets
Usage Example
from laft.datasets import build_semantic_dataset
dataset = build_semantic_dataset(
    name="waterbirds",
    split="test",
    root="./data",
)
image, attrs = dataset[0]
bird_is_anomaly = attrs[0]        # True if land bird (default)
background_is_anomaly = attrs[1]  # True if land background (default)
# Check distribution
print(dataset)
The Waterbirds dataset requires downloading from the official source. Place the extracted waterbirds_v1.0 folder in your data directory.
CelebA
CelebA provides facial attribute-based anomaly detection with 40 binary attributes per image.
Configuration
Select any attributes from the 40 available CelebA attributes:
config = {
    "Blond_Hair": False,  # Blond hair present is normal (absence is the anomaly)
    "Eyeglasses": True,   # Eyeglasses present is the anomaly
}
dataset = build_semantic_dataset(
    name="celeba",
    split="train",
    root="./data",
    config=config,
)
Available Attributes
View all 40 CelebA attributes
# From laft/datasets/celeba.py:12-53
ATTRS = [
    "5_o_Clock_Shadow", "Arched_Eyebrows", "Attractive",
    "Bags_Under_Eyes", "Bald", "Bangs", "Big_Lips", "Big_Nose",
    "Black_Hair", "Blond_Hair", "Blurry", "Brown_Hair",
    "Bushy_Eyebrows", "Chubby", "Double_Chin", "Eyeglasses",
    "Goatee", "Gray_Hair", "Heavy_Makeup", "High_Cheekbones",
    "Male", "Mouth_Slightly_Open", "Mustache", "Narrow_Eyes",
    "No_Beard", "Oval_Face", "Pale_Skin", "Pointy_Nose",
    "Receding_Hairline", "Rosy_Cheeks", "Sideburns", "Smiling",
    "Straight_Hair", "Wavy_Hair", "Wearing_Earrings",
    "Wearing_Hat", "Wearing_Lipstick", "Wearing_Necklace",
    "Wearing_Necktie", "Young",
]
Dataset Details
Size
train: 162,770 images
valid: 19,867 images
test: 19,962 images
Flexibility: Configure anywhere from 1 to 40 attributes as anomalies. True = presence of the attribute is the anomaly; False = absence of the attribute is the anomaly.
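This True/False convention can be sketched as a pure function: given a sample's raw present/absent attribute values and a config, the anomaly flag for an attribute is "present" when the config value is True, or "absent" when it is False. The logic below is illustrative, not laft's implementation:

```python
# Hypothetical sketch of the CelebA anomaly convention described above.
def anomaly_flags(raw, config):
    # raw: dict attr -> bool (is the attribute present in the image?)
    # config: dict attr -> bool (True: presence is anomaly;
    #                            False: absence is anomaly)
    return {
        attr: raw[attr] if presence_is_anomaly else not raw[attr]
        for attr, presence_is_anomaly in config.items()
    }

config = {"Eyeglasses": True, "Young": False}
sample = {"Eyeglasses": True, "Young": True}
print(anomaly_flags(sample, config))
# {'Eyeglasses': True, 'Young': False}: glasses present -> anomaly;
# young present -> normal (only absence of Young would be anomalous)
```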
Usage Example
from laft.datasets import build_semantic_dataset
# Multi-attribute anomaly detection
config = {
    "Eyeglasses": True,  # Wearing glasses is anomalous
    "Bald": True,        # Being bald is anomalous
    "Young": False,      # Not being young is anomalous
}
dataset = build_semantic_dataset(
    name="celeba",
    split="train",
    root="./data",
    config=config,
)
image, attrs = dataset[0]
# attrs: [3] - one bool for each configured attribute
print(f"Eyeglasses anomaly: {attrs[0]}")
print(f"Bald anomaly: {attrs[1]}")
print(f"Young anomaly: {attrs[2]}")
CelebA downloads automatically via torchvision.datasets.CelebA. The first run will download ~1.4GB of data.
Working with Attributes
All semantic datasets return attribute tensors:
from laft.datasets import build_semantic_dataset
dataset = build_semantic_dataset(
    name="color_mnist",
    split="train",
    root="./data",
)
image, attrs = dataset[0]
# attrs is a boolean tensor: [num_attributes]
print(f"Attribute names: {dataset.attr_names}")
print(f"Attributes: {attrs}")
# Check if any attribute is anomalous
is_anomaly = attrs.any()
# Get normal subset (no anomalies)
normal_subset = dataset.get_normal_subset()
print(f"Normal samples: {len(normal_subset)}")
Dataset Statistics
View the distribution of attribute combinations:
dataset = build_semantic_dataset(
    name="waterbirds",
    split="test",
    root="./data",
)
# Print formatted statistics table
print(dataset)
Output shows percentage and count for each attribute combination:
╒═════════╤══════════════╤═════════╤═════════╕
│ bird │ background │ per. % │ num. # │
╞═════════╪══════════════╪═════════╪═════════╡
│ False │ False │ 42.3 │ 2255 │
├─────────┼──────────────┼─────────┼─────────┤
│ False │ True │ 8.7 │ 466 │
├─────────┼──────────────┼─────────┼─────────┤
│ True │ False │ 7.2 │ 385 │
├─────────┼──────────────┼─────────┼─────────┤
│ True │ True │ 41.8 │ 2229 │
╘═════════╧══════════════╧═════════╧═════════╛
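The percentages and counts in such a table can be reproduced from the boolean attribute pairs with a simple counter. The sketch below uses made-up data, not the real Waterbirds counts:

```python
from collections import Counter

# Toy attribute pairs (bird_anomaly, background_anomaly); counts are dummies.
attrs = [(False, False)] * 3 + [(True, True)] * 3 + [(False, True)]

counts = Counter(attrs)
total = len(attrs)
for combo, n in sorted(counts.items()):
    # One row per attribute combination: percentage and count
    print(combo, f"{100 * n / total:.1f}%", n)
```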
Why Semantic Datasets?
Semantic datasets are valuable for anomaly detection because they:
Test conceptual understanding: Models must learn semantic features, not just pixel patterns
Enable multi-attribute analysis: Study how multiple factors contribute to anomalies
Support spurious correlation research: Datasets like Waterbirds reveal when models rely on shortcuts
Provide interpretability: Attribute-level labels explain why a sample is anomalous
Reference
API Summary
from laft.datasets import build_semantic_dataset
def build_semantic_dataset(
    name: Literal["color_mnist", "waterbirds", "celeba"],
    split: Literal["train", "valid", "test"],
    root: str = "./data",
    transform: Callable | None = None,
    config: dict | None = None,  # None for default
    **kwargs,  # seed for color_mnist
) -> SemanticAnomalyDataset
Dataset Returns
image, attrs = dataset[index]
# image: PIL.Image or transformed tensor
# attrs: torch.Tensor of shape [num_attributes], dtype=torch.bool
Source Code: View the complete implementation in laft/datasets/