Evaluation
LAFT provides comprehensive evaluation metrics for binary anomaly detection through the laft.metrics module.
Overview
The evaluation system supports:
Standard metrics: AUROC, AUPRC, F1, FPR95
Automatic thresholding: Optimal threshold selection based on F1 score
Multi-seed evaluation: Mean and standard deviation across runs
Result tables: Formatted output with build_table()
Core Metrics
binary_auroc(input, target)
Area Under the Receiver Operating Characteristic (ROC) curve.
import laft
import torch

# Anomaly scores and ground truth labels
scores = torch.tensor([0.1, 0.3, 0.8, 0.9, 0.2])
labels = torch.tensor([0, 0, 1, 1, 0])

auroc = laft.binary_auroc(scores, labels)
print(f"AUROC: {auroc:.3f}")  # Higher is better (0-1)
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N] (0=normal, 1=anomaly)
Returns:
float: AUROC score (0.0 to 1.0, higher is better)
binary_auprc(input, target)
Area Under the Precision-Recall Curve.
auprc = laft.binary_auprc(scores, labels)
print(f"AUPRC: {auprc:.3f}")  # Higher is better (0-1)
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N]
Returns:
float: AUPRC score (0.0 to 1.0, higher is better)
binary_f1_score(input, target, threshold="auto")
F1 score for binary classification.
# Automatic threshold selection (optimal for F1)
f1 = laft.binary_f1_score(scores, labels, threshold="auto")
print(f"F1: {f1:.3f}")

# Manual threshold
f1_manual = laft.binary_f1_score(scores, labels, threshold=0.5)
print(f"F1 (threshold=0.5): {f1_manual:.3f}")
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N]
threshold (float | "auto"): Classification threshold, or "auto" for the F1-optimal value
Returns:
float: F1 score (0.0 to 1.0, higher is better)
binary_fpr95(input, target)
False Positive Rate at 95% True Positive Rate.
fpr95 = laft.binary_fpr95(scores, labels)
print(f"FPR95: {fpr95:.3f}")  # Lower is better (0-1)
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N]
Returns:
float: FPR at 95% recall (0.0 to 1.0, lower is better)
FPR95 measures the false positive rate when the detector is configured to catch 95% of anomalies. Lower values indicate better discrimination.
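The computation behind FPR95 can be sketched in a few lines of plain Python. This is a reference sketch only, not the library's implementation; tie-handling and interpolation may differ:

```python
import math

def fpr_at_95_tpr(scores, labels):
    """FPR at the loosest threshold that still flags >= 95% of anomalies.
    Reference sketch; laft.binary_fpr95 may interpolate differently."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Lowest anomaly score we must still accept to reach 95% TPR
    k = math.ceil(0.95 * len(pos))
    thresh = pos[k - 1]
    # Fraction of normals scored at or above that threshold
    return sum(s >= thresh for s in neg) / len(neg)

scores = [0.1, 0.3, 0.75, 0.2, 0.9, 0.8, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1]
print(fpr_at_95_tpr(scores, labels))  # 0.25: one of four normals crosses 0.7
```

Here the threshold must drop to 0.7 to catch all three anomalies, which sweeps up the 0.75-scored normal sample.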
binary_accuracy(input, target, threshold="auto")
Classification accuracy.
acc = laft.binary_accuracy(scores, labels, threshold="auto")
print(f"Accuracy: {acc:.3f}")
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N]
threshold (float | "auto"): Classification threshold
Returns:
float: Accuracy (0.0 to 1.0, higher is better)
Comprehensive Evaluation
binary_metrics(input, target, threshold="auto", types=("auroc", "auprc", "fpr95"))
Compute multiple metrics at once.
Single Run
Multi-Seed
Custom Metrics
import laft
import torch

scores = torch.randn(1000)             # Anomaly scores
labels = torch.randint(0, 2, (1000,))  # Ground truth

# Compute all default metrics
metrics = laft.binary_metrics(scores, labels)
print(f"AUROC: {metrics['auroc']:.3f}")
print(f"AUPRC: {metrics['auprc']:.3f}")
print(f"FPR95: {metrics['fpr95']:.3f}")
import laft
import torch

# Multiple runs (e.g., different random seeds)
scores_list = [torch.randn(1000) for _ in range(5)]
labels = torch.randint(0, 2, (1000,))

# Compute mean ± std across runs
mean_metrics, std_metrics = laft.binary_metrics(scores_list, labels)
print(f"AUROC: {mean_metrics['auroc']:.3f} ± {std_metrics['auroc']:.3f}")
print(f"AUPRC: {mean_metrics['auprc']:.3f} ± {std_metrics['auprc']:.3f}")
# Specify which metrics to compute
metrics = laft.binary_metrics(
    scores, labels,
    types=("auroc", "auprc", "f1", "accuracy"),
)
print(metrics.keys())
# dict_keys(['auroc', 'auprc', 'f1', 'accuracy'])
Parameters:
input (Tensor | Sequence[Tensor]): Anomaly scores or list of score tensors
target (Tensor): Binary labels [N]
threshold (float | "auto", optional): Threshold for F1/accuracy
types (Sequence[str], optional): Metrics to compute. Default: ("auroc", "auprc", "fpr95")
Returns:
Single run: BinaryMetrics dictionary
Multi-seed: Tuple of (mean_metrics, std_metrics) dictionaries
Optimal Threshold Selection
optimal_threshold(input, target)
Find the threshold that maximizes F1 score.
import laft
import torch

scores = torch.randn(1000)
labels = torch.randint(0, 2, (1000,))

# Find optimal threshold
threshold = laft.optimal_threshold(scores, labels)
print(f"Optimal threshold: {threshold:.3f}")

# Use it for F1 score
f1 = laft.binary_f1_score(scores, labels, threshold=threshold)
print(f"F1 at optimal threshold: {f1:.3f}")
Parameters:
input (Tensor): Anomaly scores [N]
target (Tensor): Binary labels [N]
Returns:
float: Threshold value that maximizes F1 score
The "auto" threshold option in binary_f1_score() and binary_accuracy() calls optimal_threshold() internally.
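The selection can be pictured as a scan over candidate thresholds. The sketch below is pure Python and illustrative only; the library's search may be implemented differently:

```python
def optimal_threshold_sketch(scores, labels):
    """Scan candidate thresholds (the observed scores) and return the one
    maximizing F1. Reference sketch, not laft.optimal_threshold itself."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        # Predict anomaly for every sample scoring at or above t
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

scores = [0.1, 0.3, 0.8, 0.9, 0.2]
labels = [0, 0, 1, 1, 0]
print(optimal_threshold_sketch(scores, labels))  # 0.8 separates the classes perfectly
```

Only observed score values need to be checked, since F1 can change only when the threshold crosses a data point.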
Complete Workflow Example
import laft
import torch

torch.set_grad_enabled(False)

# Load data and model
model, data = laft.get_clip_cached_features(
    "ViT-B-16-quickgelu:dfn2b",
    "color_mnist",
    splits=["train", "test"],
    dataset_kwargs={"seed": 42},
)
train_features, _ = data["train"]
test_features, test_attrs = data["test"]

# Setup prompts
attend_name, ignore_name, attend_labels, ignore_labels = \
    laft.prompts.get_labels("color_mnist", test_attrs, "guide_number")
prompts = laft.prompts.get_prompts("color_mnist", "guide_number")

# Build concept subspace
text_features = model.encode_text(prompts["all"])
pair_diffs = laft.prompt_pair(text_features)
concept_basis = laft.pca(pair_diffs, n_components=24)

# Transform features
train_guided = laft.inner(train_features, concept_basis)
test_guided = laft.inner(test_features, concept_basis)

# Compute anomaly scores
scores = laft.knn(train_guided, test_guided, n_neighbors=30)

# Comprehensive evaluation
metrics = laft.binary_metrics(
    scores,
    attend_labels,
    types=("auroc", "auprc", "f1", "fpr95"),
)
print(f"Attend ({attend_name}):")
for name, value in metrics.items():
    print(f"  {name.upper()}: {value:.3f}")
Output:
Attend (number):
AUROC: 0.892
AUPRC: 0.856
F1: 0.801
FPR95: 0.213
Result Tables
Create formatted tables for results presentation.
import laft

# Organize metrics by category and method
metrics = {
    "number": {
        "Guide/24": {"auroc": 0.892, "auprc": 0.856, "fpr95": 0.213},
        "Guide/50": {"auroc": 0.901, "auprc": 0.868, "fpr95": 0.198},
    },
    "color": {
        "Guide/24": {"auroc": 0.756, "auprc": 0.712, "fpr95": 0.334},
        "Guide/50": {"auroc": 0.769, "auprc": 0.725, "fpr95": 0.318},
    },
}

# Build formatted table
table = laft.utils.build_table(
    metrics,
    group_headers=("Method", "Comp."),
    types=("auroc", "auprc", "fpr95"),
)
print(table)
Output:
Method Comp. Number Color
AUROC AUPRC FPR95 AUROC AUPRC FPR95
Guide 24 89.2 85.6 21.3 75.6 71.2 33.4
Guide 50 90.1 86.8 19.8 76.9 72.5 31.8
Multi-Seed Tables
For experiments with multiple seeds:
import laft

# Collect results from multiple seeds
metrics = {
    "number": {
        "LAFT/4-shot": [
            {"auroc": 0.892, "auprc": 0.856},
            {"auroc": 0.898, "auprc": 0.862},
            {"auroc": 0.885, "auprc": 0.849},
            {"auroc": 0.901, "auprc": 0.868},
            {"auroc": 0.894, "auprc": 0.859},
        ],
    }
}

# Build table with mean ± std
table = laft.utils.build_table(
    metrics,
    group_headers=("Method", "#Shot"),
    types=("auroc", "auprc"),
)
print(table)
Output:
Method #Shot Number
AUROC AUPRC
LAFT 4-shot 89.4 ± 0.6 85.9 ± 0.7
save_table(table, path)
Save table to file.
import laft

table = laft.utils.build_table(metrics)
laft.utils.save_table(table, "results/color_mnist/laft.txt")
Component Sweep Evaluation
Evaluate performance across different numbers of principal components:
import laft
import torch

torch.set_grad_enabled(False)

# Load data
model, data = laft.get_clip_cached_features(
    "ViT-B-16-quickgelu:dfn2b", "waterbirds", splits=["train", "test"]
)
train_features, _ = data["train"]
test_features, test_attrs = data["test"]

# Setup
attend_name, _, attend_labels, _ = \
    laft.prompts.get_labels("waterbirds", test_attrs, "guide_bird")
prompts = laft.prompts.get_prompts("waterbirds", "guide_bird")
text_features = model.encode_text(prompts["all"])
pair_diffs = laft.prompt_pair(text_features)
concept_basis = laft.pca(pair_diffs)  # Compute all components

# Sweep over component counts
results = {}
for n_components in range(2, 101):
    # Use the first n_components
    train_guided = laft.inner(train_features, concept_basis[:n_components])
    test_guided = laft.inner(test_features, concept_basis[:n_components])
    scores = laft.knn(train_guided, test_guided, n_neighbors=30)
    metrics = laft.binary_metrics(scores, attend_labels)
    results[f"Guide/{n_components}"] = metrics

# Build comprehensive table
metrics_dict = {attend_name: results}
table = laft.utils.build_table(
    metrics_dict,
    group_headers=("Method", "Comp."),
    types=("auroc", "auprc", "fpr95"),
)
print(table)
laft.utils.save_table(table, "results/waterbirds/component_sweep.txt")
Statistical Analysis
mean_std(values)
Compute mean and standard deviation.
import laft

auroc_values = [0.892, 0.898, 0.885, 0.901, 0.894]
mean, std = laft.mean_std(auroc_values)
print(f"AUROC: {mean:.3f} ± {std:.3f}")
# AUROC: 0.894 ± 0.006
metric_mean_std(metrics)
Compute mean and std across metric dictionaries.
import laft

# Results from 5 seeds
metrics_list = [
    {"auroc": 0.892, "auprc": 0.856, "fpr95": 0.213},
    {"auroc": 0.898, "auprc": 0.862, "fpr95": 0.198},
    {"auroc": 0.885, "auprc": 0.849, "fpr95": 0.225},
    {"auroc": 0.901, "auprc": 0.868, "fpr95": 0.189},
    {"auroc": 0.894, "auprc": 0.859, "fpr95": 0.206},
]
mean_metrics, std_metrics = laft.metric_mean_std(metrics_list)
print(f"AUROC: {mean_metrics['auroc']:.3f} ± {std_metrics['auroc']:.3f}")
print(f"AUPRC: {mean_metrics['auprc']:.3f} ± {std_metrics['auprc']:.3f}")
print(f"FPR95: {mean_metrics['fpr95']:.3f} ± {std_metrics['fpr95']:.3f}")
Pixel-Level Evaluation
For industrial anomaly detection with pixel-level annotations:
import laft
import torch

# Assume heatmaps and masks are available
heatmaps = torch.randn(100, 224, 224)         # Anomaly heatmaps [N, H, W]
masks = torch.randint(0, 2, (100, 224, 224))  # Ground truth masks [N, H, W]

# Pixel-level AUROC
pixel_auroc = laft.binary_auroc(heatmaps, masks)
print(f"Pixel-level AUROC: {pixel_auroc:.3f}")

# Or use binary_metrics
pixel_metrics = laft.binary_metrics(heatmaps, masks, types=["auroc"])
print(f"Pixel AUROC: {pixel_metrics['auroc']:.3f}")
The metrics automatically handle both image-level [N] and pixel-level [N, H, W] tensors.
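Conceptually, pixel-level evaluation treats every pixel as an independent sample, so handling [N, H, W] inputs presumably amounts to flattening both tensors to one long vector. A shape-only sketch with placeholder data (not the library's internals):

```python
# Placeholder heatmaps/masks; real values would come from a model and dataset
N, H, W = 4, 8, 8
heatmaps = [[[0.0] * W for _ in range(H)] for _ in range(N)]
masks = [[[0] * W for _ in range(H)] for _ in range(N)]

# Flatten [N, H, W] -> [N*H*W]: each pixel becomes one (score, label) pair
flat_scores = [v for image in heatmaps for row in image for v in row]
flat_labels = [v for image in masks for row in image for v in row]
print(len(flat_scores), len(flat_labels))  # 256 256
```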
Best Practices
Device Handling: The metrics automatically align tensor devices, so you don't need to move tensors manually:
scores = torch.randn(1000).cuda()      # GPU
labels = torch.randint(0, 2, (1000,))  # CPU

# Works automatically
auroc = laft.binary_auroc(scores, labels)
Multi-Seed Evaluation: For robust evaluation, run experiments with multiple seeds:
results = []
for seed in range(5):
    # Set seed, run experiment
    scores = run_experiment(seed)
    metrics = laft.binary_metrics(scores, labels)
    results.append(metrics)

mean, std = laft.metric_mean_std(results)
print(f"AUROC: {mean['auroc']:.3f} ± {std['auroc']:.3f}")
Threshold Selection: Use threshold="auto" for F1 and accuracy during evaluation. Don't tune thresholds on test data!
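In practice that means picking the threshold on a held-out validation split and freezing it before touching the test set. A pure-Python sketch with hypothetical scores (f1_at is an illustrative helper, not a laft function):

```python
def f1_at(scores, labels, t):
    """F1 when predicting anomaly for every score >= t (reference sketch)."""
    tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical scores: tune the threshold on validation only...
val_scores, val_labels = [0.2, 0.4, 0.7, 0.9], [0, 0, 1, 1]
test_scores, test_labels = [0.1, 0.6, 0.8, 0.95], [0, 1, 1, 1]

# ...then reuse the frozen threshold on test, unchanged
t = max(set(val_scores), key=lambda c: f1_at(val_scores, val_labels, c))
print(t, f1_at(test_scores, test_labels, t))  # 0.7, then test F1 at that fixed t
```

Re-optimizing the threshold on test data would leak label information and inflate the reported F1.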
Metric Interpretation
AUROC: Measures overall discrimination ability. Good for comparing methods; not affected by class imbalance.
Range: 0.0-1.0 (higher is better). Interpretation: 0.5 = random, 1.0 = perfect.
AUPRC: Better for imbalanced datasets; focuses on the precision-recall trade-off.
Range: 0.0-1.0 (higher is better). Use: When anomalies are rare.
FPR95: False positive rate at 95% recall.
Range: 0.0-1.0 (lower is better). Use: When high recall is required.
F1 Score: Harmonic mean of precision and recall; requires threshold selection.
Range: 0.0-1.0 (higher is better). Use: For deployment decisions.
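The 0.5-for-random baseline follows from AUROC's rank interpretation: it is the probability that a randomly chosen anomaly outscores a randomly chosen normal sample. A from-scratch sketch verifying this (pure Python, not the library implementation):

```python
import random

def auroc_from_scratch(scores, labels):
    """AUROC as the fraction of (anomaly, normal) pairs where the anomaly
    scores higher; ties count half. Reference sketch only."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation -> AUROC 1.0
print(auroc_from_scratch([0.9, 0.8, 0.1], [1, 1, 0]))  # 1.0

# Uninformative random scores -> AUROC near 0.5
random.seed(0)
labels = [random.randint(0, 1) for _ in range(2000)]
scores = [random.random() for _ in range(2000)]
print(round(auroc_from_scratch(scores, labels), 2))  # close to 0.5
```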
Next Steps
Basic Usage Learn the complete LAFT workflow from start to finish
Prompts Understand how to use and create prompts for concept subspaces