Training NeuralVaultFewShot with Triplet Metric Learning

Neural Vault’s core encoder is NeuralVaultFewShot — a compact Transformer network trained to map fMRI feature vectors into a metric space where embeddings from the same identity cluster tightly together and embeddings from different identities are pushed apart. Training uses the triplet loss objective and episodic sampling: each mini-batch contains anchor, positive, and negative samples assembled on the fly from the labeled dataset. The result is a unit-hypersphere embedding space where cosine distance directly measures biometric similarity.

Architecture

NeuralVaultFewShot composes four stages: a linear feature projection, sinusoidal positional encoding, a stack of Transformer encoder layers, and a multi-layer embedding head that projects the temporally pooled representation down to latent_dim dimensions before L2 normalization.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(
            torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):                              # x: (B, T, D)
        return x + self.pe[:, :x.size(1)]

PositionalEncoding uses the classic sinusoidal formulation from “Attention Is All You Need”. Even and odd dimensions receive sin and cos encodings at geometrically spaced frequencies. The buffer is registered (not a parameter) so it moves to the correct device with .to(device) without participating in backpropagation.
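As a sanity check on the formula, the same table can be rebuilt without PyTorch. This is a minimal sketch, not part of the Neural Vault code; the name sinusoidal_table is hypothetical, and d_model is assumed to be even, which the slicing in PositionalEncoding also requires.

```python
import math

def sinusoidal_table(max_len, d_model):
    """Pure-Python version of the sinusoidal table (hypothetical helper)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as `div` in PositionalEncoding:
            # exp(i * -ln(10000) / d_model) == 10000 ** (-i / d_model)
            freq = math.exp(i * (-math.log(10000.0) / d_model))
            table[pos][i] = math.sin(pos * freq)      # even dimension
            table[pos][i + 1] = math.cos(pos * freq)  # odd dimension
    return table

pe = sinusoidal_table(max_len=4, d_model=8)
print(pe[0])  # position 0: sin(0) on even dims, cos(0) on odd dims
```

Each pair of dimensions shares one frequency, and the frequencies shrink geometrically as the dimension index grows, so low dimensions vary quickly with position and high dimensions vary slowly.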

class NeuralVaultFewShot(nn.Module):
    """
    Prototypical Transformer Network:
    Feature projection → Positional Encoding → Transformer Encoder → Embedding Head.
    Learns a metric space where genuine samples cluster together.
    """
    def __init__(self, input_dim, d_model=128,
                 nhead=8, n_layers=2, latent_dim=128):
        super().__init__()
        self.proj_in   = nn.Linear(input_dim, d_model)
        self.pos_enc   = PositionalEncoding(d_model)
        enc_layer      = nn.TransformerEncoderLayer(
                            d_model=d_model, nhead=nhead,
                            dim_feedforward=d_model * 4,
                            dropout=0.1, batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

        self.embedding_head = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, latent_dim)
        )

    def forward(self, x):
        """x: (B, T, F) or (B, F) → Embedding (B, latent_dim)"""
        if x.dim() == 2:
            x = x.unsqueeze(1)
        h = self.proj_in(x)
        h = self.pos_enc(h)
        h = self.transformer(h)
        h = h.mean(dim=1)  # Temporal pooling
        return F.normalize(self.embedding_head(h), p=2, dim=1)

Key architectural choices:

norm_first=True: applies layer normalization before each sub-layer, attention and feed-forward alike (Pre-LN), which tends to stabilize gradient flow during training.
dim_feedforward=d_model * 4: follows the standard 4× expansion ratio used in the original Transformer paper.
h.mean(dim=1): mean pooling over the time axis aggregates information from all frames into a single context vector before the embedding head.
F.normalize(..., p=2, dim=1): L2-normalizes the output so every embedding lies on the unit hypersphere. On this manifold, dot products equal cosine similarities, making verification a simple inner product.
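Because the embeddings are unit-length, squared Euclidean distance and cosine similarity carry the same information: ||a - b||² = 2 - 2(a · b). The dependency-free sketch below checks that identity on two arbitrary vectors. The helper names and sample values are illustrative and not taken from the project, and l2_normalize omits the small epsilon that F.normalize uses to guard against zero vectors.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 norm (plain-Python stand-in for F.normalize)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cos_sim = dot(a, b)  # for unit vectors the dot product is the cosine
print(cos_sim, sq_dist(a, b), 2 - 2 * cos_sim)
```

The last two printed values agree up to floating-point rounding, so ranking by squared Euclidean distance and ranking by cosine similarity give the same order on normalized embeddings.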

Triplet Loss

The training objective is the standard margin-based triplet loss:

@staticmethod
def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard Triplet Loss for Metric Learning"""
    pos_dist = (anchor - positive).pow(2).sum(1)
    neg_dist = (anchor - negative).pow(2).sum(1)
    return F.relu(pos_dist - neg_dist + margin).mean()

The loss is zero when the squared Euclidean distance to the negative exceeds the squared distance to the positive by at least the margin (0.3 by default). F.relu ensures only violated constraints contribute to the gradient: once a triplet is well separated, it stops driving updates. The loss is averaged over the batch.
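A worked example makes the hinge behaviour concrete. The sketch below restates the per-triplet term in plain Python with made-up squared distances. The name triplet_term is hypothetical, and the batch mean taken by the real loss is omitted.

```python
def triplet_term(pos_dist, neg_dist, margin=0.3):
    """Per-triplet hinge: max(0, pos_dist - neg_dist + margin)."""
    return max(0.0, pos_dist - neg_dist + margin)

# Well separated: the negative is more than `margin` farther away, so the term is 0.
easy = triplet_term(pos_dist=0.2, neg_dist=0.9)

# Violated: the negative is only 0.1 farther away, so about 0.2 of margin is missing.
hard = triplet_term(pos_dist=0.5, neg_dist=0.6)

print(easy, hard)
```

Since the embeddings are unit-length, every squared distance lies between 0 and 4, which gives a sense of scale for the 0.3 margin.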

Training Workflow

Initialize model and optimizer

Instantiate NeuralVaultFewShot with input_dim set to the number of features in the scaled data matrix, and configure AdamW with lr=1e-4 and weight_decay=1e-4:

model = NeuralVaultFewShot(
    input_dim=X_scaled.shape[1],
    d_model=128,
    nhead=8,
    n_layers=2,
    latent_dim=128
)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4
)
X_tensor = build_sequence_tensor(X_scaled)
class_indices = {c: np.where(labels_np == c)[0] for c in range(N_CLASSES)}
epochs = 100
batch_size = min(128, len(X_tensor))

class_indices caches a per-class index array upfront so episodic sampling during the loop avoids repeated np.where calls.

Episodic triplet sampling

Each epoch consists of a single mini-batch: sample batch_size anchor indices at random without replacement, then for each anchor select a positive from the same class and a negative from a different class:

for epoch in range(1, epochs + 1):
    model.train()
    anchor_idx = np.random.choice(len(X_tensor), batch_size, replace=False)
    pos_idx = []
    neg_idx = []

    for idx in anchor_idx:
        cls = labels_np[idx]
        same_class = class_indices[cls]
        if len(same_class) > 1:
            pos_candidates = same_class[same_class != idx]
            pos_idx.append(np.random.choice(pos_candidates))
        else:
            pos_idx.append(idx)

        neg_cls = np.random.choice([c for c in range(N_CLASSES) if c != cls])
        neg_idx.append(np.random.choice(class_indices[neg_cls]))

    anc = X_tensor[anchor_idx]
    pos = X_tensor[pos_idx]
    neg = X_tensor[neg_idx]

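The self-exclusion check (same_class[same_class != idx]) prevents degenerate triplets where the anchor and positive are the same sample, which would contribute a zero positive distance and no pull between distinct samples. If a class has only one member, the anchor is reused as the positive. This fallback is safe but uninformative on the positive side: the positive distance is zero apart from dropout noise between the two forward passes, so that triplet's loss is driven almost entirely by the negative.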
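The sampling rules can be isolated into a small function that is easy to test on its own. The following is a sketch under stated assumptions rather than the project's code: sample_triplet_indices is a hypothetical name, it uses the standard-library random module instead of np.random, and it draws negative classes only from classes that actually occur in labels.

```python
import random

def sample_triplet_indices(labels, batch_size, rng):
    """Return (anchor, positive, negative) index lists for one batch.

    Hypothetical helper: `labels` is a list of class ids, `rng` a random.Random.
    Assumes at least two distinct classes so that a negative always exists.
    """
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    if len(by_class) < 2:
        raise ValueError("need at least two classes to draw negatives")

    anchors = rng.sample(range(len(labels)), batch_size)  # without replacement
    positives, negatives = [], []
    for idx in anchors:
        cls = labels[idx]
        candidates = [j for j in by_class[cls] if j != idx]
        # Singleton class: fall back to the anchor itself, as in the loop above.
        positives.append(rng.choice(candidates) if candidates else idx)
        other = rng.choice([c for c in by_class if c != cls])
        negatives.append(rng.choice(by_class[other]))
    return anchors, positives, negatives

labels = [0, 0, 0, 1, 1, 2]  # class 2 has a single member
anc, pos, neg = sample_triplet_indices(labels, batch_size=6, rng=random.Random(0))
print(list(zip(anc, pos, neg)))
```

Passing the random generator in explicitly keeps the sampling reproducible, which helps when comparing training runs.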

Forward pass and loss computation

Run all three legs of the triplet through the model, compute embeddings, and evaluate the triplet loss:

optimizer.zero_grad()
emb_anc = model(anc)
emb_pos = model(pos)
emb_neg = model(neg)
loss = NeuralVaultFewShot.triplet_loss(emb_anc, emb_pos, emb_neg)

Backward pass with gradient clipping

Backpropagate, clip gradients to prevent exploding norms, and step the optimizer:

loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

clip_grad_norm_ rescales the entire parameter gradient vector so its L2 norm does not exceed 1.0. This is particularly important in the early epochs when triplet distances are large and gradients can spike.
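The rescaling itself is simple arithmetic. The sketch below mirrors it in plain Python for illustration only. The name clip_by_global_norm is hypothetical, the eps guard against division by zero is an assumption, and training code should keep calling nn.utils.clip_grad_norm_.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is <= max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scale factor is capped at 1.0, so small gradients pass through unchanged.
    coef = min(1.0, max_norm / (total_norm + eps))
    return [[g * coef for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, before = clip_by_global_norm(grads)
after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(before, after)
```

All gradients are scaled by the same factor, so clipping shortens the update without changing its direction.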

model.py Variant: Cosine Annealing and Noise Negatives

The model.py training loop uses an alternative triplet strategy: negatives are drawn from random Gaussian noise rather than from labeled impostor classes. This is appropriate for single-user enrollment where no labeled impostor data is available at training time. It also uses a cosine annealing learning rate schedule over 150 epochs:

from tqdm import trange  # assumed source of trange: tqdm's progress-bar range

optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=150)

for epoch in trange(150, desc="epoch"):
    indices = torch.randperm(len(user_seqs))
    anc_idx = indices[:len(user_seqs)//2]
    pos_idx = indices[len(user_seqs)//2 : (len(user_seqs)//2)*2]

    anc = user_seqs[anc_idx].to(DEVICE)
    pos = user_seqs[pos_idx].to(DEVICE)
    # Negative: random Gaussian noise or drifted user data
    neg = torch.randn_like(anc) * 1.5

    optimiser.zero_grad()
    emb_anc = model(anc)
    emb_pos = model(pos)
    emb_neg = model(neg)

    loss = NeuralVaultFewShot.triplet_loss(emb_anc, emb_pos, emb_neg)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimiser.step()
    scheduler.step()

CosineAnnealingLR with T_max=150 smoothly decays the learning rate from 1e-4 to near zero over the full training run, helping the model converge to a tighter minimum without abrupt learning rate drops.
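The shape of the schedule follows from its closed form, lr(t) = eta_min + (lr_0 - eta_min) · (1 + cos(π t / T_max)) / 2. The sketch below evaluates it in plain Python, assuming eta_min = 0 (the scheduler's default, which the call above does not override). The name cosine_lr is hypothetical.

```python
import math

def cosine_lr(step, base_lr=1e-4, t_max=150, eta_min=0.0):
    """Closed-form cosine annealing value after `step` scheduler steps."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2

for step in (0, 75, 149, 150):
    print(step, cosine_lr(step))
```

Because scheduler.step() runs after optimiser.step(), the 150 updates use the values at steps 0 through 149. The rate is at half of its starting value at the midpoint, and the final update is taken at roughly 1e-8 rather than exactly zero.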

Few-Shot Evaluation

After training, model quality is assessed through episodic few-shot classification. evaluate_fewshot computes per-class prototype embeddings from a small support set and classifies queries by nearest prototype using squared Euclidean distance:

def euclidean_dist(x, y):
    """Computes pairwise squared Euclidean distances between query embeddings and prototypes."""
    n, m, d = x.size(0), y.size(0), x.size(1)
    if d != y.size(1):
        raise ValueError("Embedding feature-dimension mismatch.")
    x = x.unsqueeze(1).expand(n, m, d)
    y = y.unsqueeze(0).expand(n, m, d)
    return torch.pow(x - y, 2).sum(2)

def evaluate_fewshot(model, X_train, y_train, X_test, y_test):
    """Classifies each query by minimum distance to the class prototypes."""
    model.eval()
    with torch.no_grad():
        X_tr = torch.from_numpy(X_train).float().unsqueeze(1)
        y_tr = torch.from_numpy(y_train).long()
        X_te = torch.from_numpy(X_test).float().unsqueeze(1)

        emb_tr = model(X_tr)
        emb_te = model(X_te)

        prototypes = torch.stack([emb_tr[y_tr == c].mean(0) for c in range(N_CLASSES)])
        dists = euclidean_dist(emb_te, prototypes)
        preds = torch.argmin(dists, dim=1).numpy()
        probs = F.softmax(-dists, dim=1).numpy()
    return preds, probs

euclidean_dist expands both tensors to shape (n, m, d) and computes all n × m pairwise squared distances in a single vectorized operation. The negated distances are passed through softmax to produce class probability estimates used for ROC-AUC calculation. Across 40 evaluation episodes with 4-shot support sets, Neural Vault achieves an accuracy of 98.12%, an F1 score of 0.9810, and a ROC-AUC of 0.9995.
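The step from distances to predictions and scores can be traced by hand for a single query. The sketch below uses plain Python and made-up distances to three prototypes. The name softmax_neg is hypothetical and stands in for F.softmax(-dists, dim=1) applied to one row.

```python
import math

def softmax_neg(dists):
    """Softmax over negated distances: smaller distance -> larger probability."""
    m = max(-d for d in dists)  # subtract the max for numerical stability
    exps = [math.exp(-d - m) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]

# Squared distances from one query to three class prototypes (made-up values).
dists = [0.10, 1.60, 2.40]
probs = softmax_neg(dists)
pred = min(range(len(dists)), key=dists.__getitem__)  # argmin over distances
print(pred, probs)
```

The class with the smallest distance always receives the largest probability, so the argmin prediction and the softmax scores agree. The scores preserve the distance ranking, which is the property ROC-AUC depends on.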
