Biometric Security Metrics: EER, d-prime, ROC-AUC, and Entropy
Reference for all security and performance metrics used in Neural Vault benchmarks: Equal Error Rate, d-prime, ROC-AUC, key entropy, bit balance, and avalanche effect.
Neural Vault evaluates biometric security along two independent axes: identity separation (how well genuine and impostor distributions are pulled apart) and key quality (how close the derived 256-bit keys are to ideal uniform random strings). The metrics below cover both axes and map directly to the functions in main.py and model.py that compute them. Understanding what each metric measures — and what target values to aim for — is essential for interpreting benchmark output and tuning the system.
The Equal Error Rate is the single most widely cited metric in biometric system evaluation. It identifies the operating threshold at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). Because the two error types trade off against each other as the decision threshold varies, the EER represents the balanced worst-case performance of the system: a system with a low EER is simultaneously good at rejecting impostors and accepting genuine users.
Lower EER is always better. A random classifier has EER = 50%; a perfect system has EER = 0%.
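Find the threshold $t^*$ such that
$$\mathrm{FAR}(t^*) = \mathrm{FRR}(t^*), \quad \text{where} \quad \mathrm{FAR}(t) = \frac{FP}{FP + TN}, \quad \mathrm{FRR}(t) = \frac{FN}{FN + TP}$$
In practice, the crossing point is found by root-finding on the interpolated ROC curve, where $x$ is the false positive rate:
$$x^* = \mathrm{brentq}\big(\lambda x:\ 1 - x - \mathrm{TPR}(x),\ 0,\ 1\big)$$
At the root, $\mathrm{FRR} = 1 - \mathrm{TPR}(x^*) = x^* = \mathrm{FAR}$, so $x^*$ is the EER itself, and $t^*$ is the decision threshold that corresponds to that point on the curve.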
compute_eer uses scipy.optimize.brentq on a scipy.interpolate.interp1d interpolation of the ROC curve. It returns the EER as a percentage (0–100), or 50.0 if either input array is empty.
    from main import compute_eer
    import numpy as np

    genuine = np.random.normal(0.8, 0.05, 500)   # high similarity scores
    impostor = np.random.normal(0.3, 0.10, 500)  # low similarity scores

    eer = compute_eer(genuine, impostor)
    print(f"EER: {eer:.2f}%")
EER values below 1% are considered excellent for a biometric system. Values above 5% suggest that genuine and impostor distributions overlap significantly and the system is unsuitable for high-security applications.
d-prime is a dimensionless separability index borrowed from signal detection theory. It measures the gap between the means of the genuine and impostor score distributions in units of their pooled standard deviation. Unlike EER, which is a single operating-point metric, d-prime characterises the intrinsic separability of the two distributions regardless of any chosen threshold.
Higher d-prime is always better. Values above 3 are considered excellent for a biometric system.
$$d' = \frac{|\mu_{\text{impostor}} - \mu_{\text{genuine}}|}{\sqrt{0.5\,(\sigma_{\text{genuine}}^2 + \sigma_{\text{impostor}}^2)}}$$
A small stability epsilon of 1e-9 is added to the denominator to prevent division by zero when all samples are identical.
    from main import compute_d_prime
    import numpy as np

    genuine = np.random.normal(0.8, 0.05, 500)
    impostor = np.random.normal(0.3, 0.10, 500)

    dp = compute_d_prime(genuine, impostor)
    print(f"d' = {dp:.3f}")  # e.g. d' = 5.95
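The d-prime formula can be sketched without any project imports. This is a minimal standard-library version for illustration, not the compute_d_prime implementation in main.py; the name d_prime_sketch is hypothetical, and the use of population variance (dividing by n rather than n - 1) is an assumption.

```python
import math
import statistics

def d_prime_sketch(genuine, impostor, eps=1e-9):
    """Gap between the two score means in units of pooled standard deviation."""
    mean_gap = abs(statistics.fmean(impostor) - statistics.fmean(genuine))
    # Assumption: population variance (divide by n); the real code may differ.
    pooled_var = 0.5 * (statistics.pvariance(genuine) + statistics.pvariance(impostor))
    # eps keeps the division finite when every score is identical.
    return mean_gap / (math.sqrt(pooled_var) + eps)

genuine = [0.7, 0.9]    # mean 0.8, population variance 0.01
impostor = [0.2, 0.4]   # mean 0.3, population variance 0.01
print(f"d' = {d_prime_sketch(genuine, impostor):.3f}")  # d' = 5.000
```

In the toy data the means are 0.5 apart and the pooled standard deviation is 0.1, so the index comes out at 5.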
The Area Under the Receiver Operating Characteristic curve summarises performance across all possible decision thresholds. The ROC curve plots True Acceptance Rate (TAR = 1 − FRR) against False Acceptance Rate (FAR) as the threshold varies. AUC = 1.0 indicates perfect separability; AUC = 0.5 equals random guessing.
ROC-AUC is used in two distinct contexts within Neural Vault:
Classification AUC — computed per few-shot episode over the one-vs-rest softmax probabilities for all five classes. Reported as the macro-average across 40 episodes.
Biometric verification AUC — computed directly on genuine/impostor cosine distances for the NeuralVault prototype comparison.
compute_roc_from_scores internally labels genuine samples as class 0 and impostor samples as class 1, then calls sklearn.metrics.roc_curve. The returned (fpr, tpr, thresholds) tuple is compatible with sklearn.metrics.auc.
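    from main import compute_roc_from_scores
    from sklearn.metrics import auc

    # genuine and impostor are score arrays, e.g. the cosine distances
    # produced by verify_similarity
    fpr, tpr, thresholds = compute_roc_from_scores(genuine, impostor)
    roc_auc = auc(fpr, tpr)
    print(f"ROC-AUC: {roc_auc:.4f}")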
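To make the class labelling concrete: with genuine scores as class 0 and impostor scores as class 1, the AUC equals the probability that a randomly chosen impostor score is higher than a randomly chosen genuine score, with ties counted as one half. The sketch below computes that quantity directly in plain Python. It illustrates what the number means and is not the sklearn-based code path; auc_pairwise_sketch is a hypothetical name, and the scores are assumed to be distances, so larger means more impostor-like.

```python
def auc_pairwise_sketch(genuine, impostor):
    """AUC as P(impostor score > genuine score), counting ties as one half."""
    wins = 0.0
    for imp in impostor:
        for gen in genuine:
            if imp > gen:
                wins += 1.0
            elif imp == gen:
                wins += 0.5
    return wins / (len(genuine) * len(impostor))

# Hypothetical distance scores: genuine pairs are close, impostor pairs are far.
genuine = [0.10, 0.20, 0.30]
impostor = [0.25, 0.60, 0.70]

# 8 of the 9 (impostor, genuine) pairs are ordered correctly.
print(f"ROC-AUC: {auc_pairwise_sketch(genuine, impostor):.4f}")  # ROC-AUC: 0.8889
```

The double loop is quadratic in the number of scores, so it only suits small sanity checks; a curve-based computation scales better on full benchmark runs.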
Hamming distance is the primary metric for comparing binary cryptographic keys. It is defined as the fraction of bit positions that differ between two keys of equal length. Neural Vault uses scipy.spatial.distance.cdist with metric='hamming' to compute pairwise Hamming distances between reference prototype keys and per-sample keys.
$$d_H(a, b) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[a_i \neq b_i]$$
For a 256-bit key, the value ranges from 0 (identical keys) to 1 (all bits differ). Random 256-bit keys have an expected Hamming distance of exactly 0.5.
A large gap between genuine and impostor Hamming distances (visible as a large d-prime on the Hamming distance distributions) confirms that the key generation method is both stable and discriminative.
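As a small illustration of the definition, the fractional Hamming distance between two raw keys can be computed with an XOR and a bit count. This standard-library sketch works on bytes rather than on the bit arrays passed to cdist, and hamming_fraction is a hypothetical helper name.

```python
def hamming_fraction(key_a: bytes, key_b: bytes) -> float:
    """Fraction of bit positions that differ between two equal-length keys."""
    if len(key_a) != len(key_b):
        raise ValueError("keys must have the same length")
    # XOR leaves a 1 exactly where the two keys disagree.
    diff = int.from_bytes(key_a, "big") ^ int.from_bytes(key_b, "big")
    return bin(diff).count("1") / (8 * len(key_a))

zeros = bytes(32)            # 256 bits, all 0
ones = bytes([0xFF]) * 32    # 256 bits, all 1
half = bytes([0x0F]) * 32    # low nibble of every byte set

print(hamming_fraction(zeros, zeros))  # 0.0
print(hamming_fraction(zeros, ones))   # 1.0
print(hamming_fraction(zeros, half))   # 0.5
```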
Cosine similarity measures the angle between two embedding vectors on the unit hypersphere. Because NeuralVaultFewShot L2-normalises all output embeddings, the cosine similarity reduces to a plain dot product, making it computationally efficient.
$$\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|} = a \cdot b \quad (\text{since } \|a\| = \|b\| = 1)$$
scipy.spatial.distance.cdist with metric='cosine' computes cosine distance (1 - cosine_similarity), so the genuine and impostor arrays produced by verify_similarity hold distances: a value of 0 means identical direction, 1 means orthogonal, and 2 means opposite direction.
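The relationship between the dot product and cosine distance for unit vectors can be checked with a few lines of plain Python. The helper names l2_normalise and cosine_distance_unit are hypothetical and stand in for whatever the model and cdist do internally.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance_unit(a, b):
    """Cosine distance for vectors that are already unit length: 1 - a.b"""
    return 1.0 - sum(x * y for x, y in zip(a, b))

a = l2_normalise([3.0, 4.0])    # [0.6, 0.8]
b = l2_normalise([1.0, 0.0])    # [1.0, 0.0]
c = l2_normalise([-4.0, 3.0])   # orthogonal to a

print(round(cosine_distance_unit(b, b), 6))  # 0.0 (identical direction)
print(round(cosine_distance_unit(a, b), 6))  # 0.4
print(round(cosine_distance_unit(a, c), 6))  # 1.0 (orthogonal)
```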
From benchmark results, the NeuralVault cosine-similarity EER threshold is approximately 0.316. Samples with cosine similarity above this value are accepted as genuine; those below are rejected.
The EER threshold is dataset-dependent. Always re-compute it by running compute_eer on genuine and impostor cosine similarity scores from your own enrollment data rather than relying on the default 0.316.
A cryptographic key is only as strong as its randomness. Even a 256-bit key that always has its first 128 bits set to zero has at most 128 bits of effective entropy. Neural Vault measures two complementary aspects of key quality per bit position across a population of derived keys.
For each bit position $i$, let $p_i$ be the fraction of keys in the population where bit $i$ is 1:
$$H_i = -p_i \log_2 p_i - (1 - p_i) \log_2 (1 - p_i)$$
The reported entropy is the mean of $H_i$ across all $n$ bit positions:
$$H = \frac{1}{n} \sum_{i=1}^{n} H_i$$
An ideal key has $H = 1.0$ bits/bit: every position is equally likely to be 0 or 1 across the user population.
Balance is a simpler, linear measure of the same property:
$$\text{balance} = 1 - 2 \cdot \operatorname{mean}_i\big(|p_i - 0.5|\big)$$
A value of 1.0 means every bit position has exactly 50% probability of being 1; a value of 0.0 means every bit is always the same.
    from main import compute_key_entropy_balance, raw_key_to_bitarray
    import numpy as np

    # Collect keys from all samples
    bit_rows = np.vstack([raw_key_to_bitarray(k) for k in key_list])
    entropy, balance = compute_key_entropy_balance(bit_rows)
    print(f"Entropy: {entropy:.4f} bits/bit")  # ideal: 1.0
    print(f"Balance: {balance:.4f}")           # ideal: 1.0
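The two key-quality formulas can be worked through on a toy population. The sketch below is a plain-Python illustration, not the compute_key_entropy_balance implementation; entropy_balance_sketch is a hypothetical name, and treating 0 · log2(0) as 0 is the usual convention assumed here.

```python
import math

def entropy_balance_sketch(bit_rows):
    """Mean per-bit Shannon entropy and balance over a population of keys.

    bit_rows: list of equal-length lists of 0/1 values, one row per key.
    """
    n_keys = len(bit_rows)
    n_bits = len(bit_rows[0])
    # p[i] = fraction of keys whose bit i is 1
    p = [sum(row[i] for row in bit_rows) / n_keys for i in range(n_bits)]

    def bit_entropy(pi):
        # Convention: 0 * log2(0) = 0, so a constant bit carries no entropy.
        if pi in (0.0, 1.0):
            return 0.0
        return -pi * math.log2(pi) - (1 - pi) * math.log2(1 - pi)

    entropy = sum(bit_entropy(pi) for pi in p) / n_bits
    balance = 1 - 2 * sum(abs(pi - 0.5) for pi in p) / n_bits
    return entropy, balance

# Two toy 2-bit keys: bit 0 is split 50/50, bit 1 is always 1.
rows = [[0, 1], [1, 1]]
print(entropy_balance_sketch(rows))  # (0.5, 0.5)
```

The first bit contributes full entropy and the second contributes none, so both statistics land halfway between their worst and ideal values.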
Entropy and balance are population-level statistics: you need keys derived from multiple distinct users or sessions to compute meaningful values. With a single key, the fraction of 1s at each bit position is either 0 or 1, so both entropy and balance evaluate to 0.
The avalanche effect measures the cryptographic sensitivity of key derivation: a small change in the input embedding should flip approximately 50% of the output key bits. This property is essential for preventing an attacker from inferring nearby keys by interpolating from a known one.
A perturbation of 0.05 in a single embedding dimension (out of latent_dim = 40) triggers a bit-flip rate of approximately 50% in the derived 256-bit key. This confirms that HKDF-SHA256 provides a strong diffusion layer over the quantised embedding bytes.
An avalanche rate well below 50% would indicate that the key derivation is insufficiently sensitive, allowing an attacker to reconstruct nearby keys from a stolen prototype.
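An avalanche check of this kind can be sketched with the standard library alone. The HKDF function below follows the extract-and-expand construction of RFC 5869; everything else is an assumption made for illustration: the quantise helper (clip to [-1, 1], one byte per dimension), the toy embedding, and the info label are hypothetical and do not describe how Neural Vault quantises embeddings or parameterises HKDF.

```python
import hashlib
import hmac
import math

def hkdf_sha256(ikm: bytes, length: int = 32, salt: bytes = b"", info: bytes = b"") -> bytes:
    """HKDF extract-and-expand (RFC 5869) with SHA-256."""
    if not salt:
        salt = bytes(hashlib.sha256().digest_size)
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                             # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def quantise(embedding):
    """Hypothetical quantiser: clip to [-1, 1] and map each value to one byte."""
    return bytes(round((min(1.0, max(-1.0, x)) + 1.0) / 2.0 * 255) for x in embedding)

def bit_flip_rate(key_a: bytes, key_b: bytes) -> float:
    """Fraction of bits that differ between two equal-length keys."""
    diff = int.from_bytes(key_a, "big") ^ int.from_bytes(key_b, "big")
    return bin(diff).count("1") / (8 * len(key_a))

# Deterministic toy embedding with 40 dimensions, L2-normalised.
raw = [math.sin(i + 1) for i in range(40)]
norm = math.sqrt(sum(x * x for x in raw))
embedding = [x / norm for x in raw]

perturbed = list(embedding)
perturbed[0] += 0.05  # nudge a single dimension

key = hkdf_sha256(quantise(embedding), info=b"demo-key")  # info label is illustrative
key_perturbed = hkdf_sha256(quantise(perturbed), info=b"demo-key")
print(f"bit-flip rate: {bit_flip_rate(key, key_perturbed):.3f}")
```

The flip rate only moves away from zero when the perturbation is large enough to change at least one quantised byte; with this 8-bit quantiser a shift of 0.05 moves the affected byte by several levels.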
False Acceptance Rate and False Rejection Rate are the fundamental operating-point metrics from which EER is derived. At any fixed decision threshold $t$:
$$\mathrm{FAR}(t) = \frac{\text{impostors accepted}}{\text{total impostors}}, \quad \mathrm{FRR}(t) = \frac{\text{genuine users rejected}}{\text{total genuine users}}$$
The two rates trade off: lowering the threshold to be more permissive reduces FRR but increases FAR, and vice versa. The EER threshold is the unique point where both rates are equal.
FAR and FRR at the EER threshold are equal by definition, but float arithmetic means they may differ by a small rounding error. The EER value reported by compute_eer is the average of the two at the nearest discrete threshold.
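The discrete-threshold view described here can be illustrated with a brute-force sweep over the observed scores. This sketch uses a plain sweep in place of the brentq interpolation, so it is a different technique from compute_eer and may give slightly different numbers. The function names are hypothetical, and scores are assumed to be similarities, where a score at or above the threshold means accept.

```python
def far_frr(genuine, impostor, threshold):
    """Error rates when a similarity score >= threshold means 'accept'."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine users rejected
    return far, frr

def eer_sweep_sketch(genuine, impostor):
    """Return (EER in percent, threshold) at the observed score where FAR and FRR are closest."""
    def gap(t):
        far, frr = far_frr(genuine, impostor, t)
        return abs(far - frr)

    candidates = sorted(set(genuine) | set(impostor))
    best_t = min(candidates, key=gap)
    far, frr = far_frr(genuine, impostor, best_t)
    # Average the two rates, since they rarely coincide exactly on finite data.
    return 100.0 * (far + frr) / 2.0, best_t

# Toy similarity scores with one overlapping pair (0.4 genuine, 0.6 impostor).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]

eer, threshold = eer_sweep_sketch(genuine, impostor)
print(f"EER: {eer:.2f}% at threshold {threshold}")  # EER: 25.00% at threshold 0.6
```

At the threshold 0.6 one of four impostors is accepted and one of four genuine users is rejected, so both rates equal 25%.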