What is LAFT?
LAFT (Language-Assisted Feature Transformation) is a novel approach to anomaly detection that uses natural language guidance to transform image features extracted from vision-language models such as CLIP. By leveraging semantic understanding from text prompts, LAFT can emphasize or suppress specific visual concepts to improve anomaly detection performance.

LAFT was published at ICLR 2025: “Language-Assisted Feature Transformation for Anomaly Detection” by EungGu Yun, Heonjin Ha, Yeongwoo Nam, and Bryan Dongik Lee.
Core Methodology
The LAFT methodology consists of three key stages:

1. Text Prompt Design
Define descriptive text prompts that capture the semantic concepts relevant to your detection task.

2. Concept Subspace Construction
LAFT constructs a concept subspace that captures the semantic direction of interest by:

- Computing pairwise differences between prompt embeddings
- Aligning these difference vectors so they point in a consistent direction
- Extracting principal components via PCA
The result, concept_basis, is an orthonormal basis spanning the subspace that represents the semantic concept.
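Steps 1 and 2 above can be sketched as follows. This is a hedged illustration, not the repository's actual code: random vectors stand in for CLIP text embeddings, and the prompt wording, dimensions, and the number of retained components are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 512

# Step 1: paired prompt groups that differ only in the concept of interest.
prompts_a = ["a photo of a landbird", "a picture of a landbird"]
prompts_b = ["a photo of a waterbird", "a picture of a waterbird"]

# Stand-ins for CLIP text embeddings of the two prompt groups.
emb_a = rng.standard_normal((len(prompts_a), emb_dim))
emb_b = rng.standard_normal((len(prompts_b), emb_dim))

# Step 2: pairwise differences between the two groups' embeddings.
# Taking a - b for every pair keeps the difference directions consistent.
diffs = (emb_a[:, None, :] - emb_b[None, :, :]).reshape(-1, emb_dim)

# Extract principal components via SVD (equivalent to PCA on the differences).
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
k = 2                      # number of components to keep (a tunable choice)
concept_basis = vt[:k]     # rows are orthonormal

# The rows of concept_basis form an orthonormal basis of the concept subspace.
print(np.allclose(concept_basis @ concept_basis.T, np.eye(k)))  # True
```

Because the rows of Vt from the SVD are orthonormal by construction, no extra orthogonalization step is needed after truncating to k components.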
3. Feature Transformation
Once the concept subspace is constructed, LAFT applies one of two projection operations:

- Inner Projection (Guide): projects features onto the concept subspace, keeping only variation along the concept.
- Orthogonal Projection (Ignore): projects features onto the orthogonal complement of the subspace, removing variation along the concept.

The LAFT Workflow
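At a high level, the pieces fit together as: extract features, transform them with a concept projection, then score anomalies against transformed normal features. The sketch below uses random vectors in place of CLIP features, a randomly generated orthonormal basis, and a simple nearest-neighbor distance as the anomaly score; all of these are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 512, 2

# An orthonormal concept basis (rows), e.g. from the PCA step above.
q, _ = np.linalg.qr(rng.standard_normal((d, k)))
basis = q.T  # shape (k, d), orthonormal rows

def transform(x, basis, mode="guide"):
    # Component of each feature inside the concept subspace.
    proj = x @ basis.T @ basis
    return proj if mode == "guide" else x - proj

# Stand-ins for CLIP image features of normal (train) and query (test) images.
train = transform(rng.standard_normal((100, d)), basis, mode="guide")
test = transform(rng.standard_normal((5, d)), basis, mode="guide")

# Simple anomaly score: distance to the closest transformed normal feature.
dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
scores = dists.min(axis=1)
print(scores.shape)  # (5,)
```

Switching mode to "ignore" suppresses the concept instead of emphasizing it; the rest of the pipeline is unchanged.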
Vision-Language Models and CLIP
LAFT leverages CLIP (Contrastive Language-Image Pre-training), a vision-language model that learns joint representations of images and text. CLIP consists of:

- Image Encoder: Extracts visual features from images (e.g., ViT-B/16)
- Text Encoder: Extracts semantic features from text (e.g., Transformer)
- Shared Embedding Space: Both modalities are projected into a common space where semantically similar images and text have high cosine similarity
The LAFT CLIP wrapper in laft/clip.py extends OpenCLIP with convenient ensemble encoding: when you pass a list of prompt lists, it automatically averages the embeddings within each group.

Key Advantages
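The ensemble-encoding behavior can be sketched as follows. The function names and the stand-in encoder here are hypothetical illustrations of the idea, not the actual API in laft/clip.py.

```python
import numpy as np

def fake_text_encoder(prompts):
    # Stand-in for a CLIP text encoder: one 512-d vector per prompt.
    rng = np.random.default_rng(len(prompts))
    return rng.standard_normal((len(prompts), 512))

def encode_ensemble(prompt_groups):
    # One averaged embedding per group of synonymous prompts.
    return np.stack([fake_text_encoder(g).mean(axis=0) for g in prompt_groups])

groups = [
    ["a photo of a bird", "an image of a bird"],
    ["a photo of a forest", "an image of a forest"],
]
emb = encode_ensemble(groups)
print(emb.shape)  # (2, 512)
```

Averaging over paraphrased prompts is a common way to reduce sensitivity to any single prompt's wording.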
Language Guidance
Use natural language to specify which visual concepts to emphasize or suppress, without requiring labeled training data for those concepts.
Flexible Transformations
Choose between inner (guide) and orthogonal (ignore) projections based on your detection objective.
Interpretable
The concept subspace has clear semantic meaning derived from your text prompts, making the transformation interpretable.
Few-Shot Capable
Works effectively even with limited normal training samples, leveraging CLIP’s pre-trained knowledge.
Use Cases
Semantic Anomaly Detection
Detect spurious correlations in datasets like Waterbirds:

- Guide bird type: Emphasize bird species while ignoring background
- Ignore background: Suppress background features to focus on the bird
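For the Waterbirds case, the two strategies correspond to two different prompt-group designs. The prompts below are hypothetical examples of this pattern, not the exact prompts from the LAFT paper.

```python
# Guide bird type: build the concept subspace from bird-species prompts
# and apply the inner (guide) projection.
bird_prompts = [
    ["a photo of a landbird", "a picture of a landbird"],
    ["a photo of a waterbird", "a picture of a waterbird"],
]

# Ignore background: build a subspace from background prompts and apply
# the orthogonal (ignore) projection to remove background variation.
background_prompts = [
    ["a photo of a forest", "a photo of land"],
    ["a photo of the ocean", "a photo of water"],
]

print(len(bird_prompts), len(background_prompts))  # 2 2
```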
Industrial Defect Detection
Identify manufacturing defects in products:

- Encode “damaged”, “scratched”, “broken” states
- Project away from normal “flawless”, “perfect” states
Mathematical Formulation
Given:

- Image features x from the CLIP image encoder (a d-dimensional vector)
- Concept basis B (a k × d matrix with orthonormal rows)

the two transformations are:

- Inner projection (guide): x' = B^T B x
- Orthogonal projection (ignore): x' = x - B^T B x
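Because B has orthonormal rows, P = B^T B is an orthogonal projector, so the two transformations split x into complementary, mutually orthogonal parts. This small numeric check (with a random orthonormal basis and feature vector) illustrates those identities:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 64, 3

# Random orthonormal basis B (rows) and feature vector x.
q, _ = np.linalg.qr(rng.standard_normal((d, k)))
B = q.T                       # (k, d), orthonormal rows
x = rng.standard_normal(d)

P = B.T @ B                   # projector onto the concept subspace
inner = P @ x                 # guide:  x' = B^T B x
orth = x - P @ x              # ignore: x' = x - B^T B x

print(np.allclose(P @ P, P))          # True: P is idempotent
print(np.allclose(inner + orth, x))   # True: the two parts recompose x
print(np.isclose(inner @ orth, 0.0))  # True: the parts are orthogonal
```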
Next Steps
Feature Transformation
Deep dive into inner and orthogonal projections
Concept Subspace
Learn how to construct effective concept subspaces
