Biometric Security Metrics: EER, d-prime, ROC-AUC, and Entropy

Neural Vault evaluates biometric security along two independent axes: identity separation (how well genuine and impostor distributions are pulled apart) and key quality (how close the derived 256-bit keys are to ideal uniform random strings). The metrics below cover both axes and map directly to the functions in main.py and model.py that compute them. Understanding what each metric measures — and what target values to aim for — is essential for interpreting benchmark output and tuning the system.

Equal Error Rate (EER)

The Equal Error Rate is the single most widely cited metric in biometric system evaluation. It identifies the operating threshold at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). Because the two error types trade off against each other as the decision threshold varies, the EER represents the balanced worst-case performance of the system: a system with a low EER is simultaneously good at rejecting impostors and accepting genuine users. Lower EER is always better. A random classifier has EER = 50%; a perfect system has EER = 0%.

Formula

Find the threshold t^* such that:

\text{FAR}(t^*) = \text{FRR}(t^*), \quad \text{where} \quad \text{FAR}(t) = \frac{FP}{FP + TN}, \quad \text{FRR}(t) = \frac{FN}{FN + TP}

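In practice, the crossing is found by root-finding on the interpolated ROC curve, where x is the FAR and TPR(x) = 1 − FRR is the interpolated true-positive rate at that FAR:

\text{EER} = \text{brentq}\!\left(\lambda\, x:\; 1 - x - \text{TPR}(x),\; 0,\; 1\right)

The root is the error rate itself (the FAR at which FAR = FRR), not a score threshold; t^* is the decision threshold that produces that FAR.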
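The definition can also be evaluated without interpolation. The following is a minimal, dependency-free sketch, not the compute_eer implementation: it assumes the scores are distances (lower means more likely genuine), sweeps every observed score as a candidate threshold, and keeps the one where FAR and FRR are closest. The names far_frr and sweep_eer are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when a distance <= threshold is accepted."""
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr


def sweep_eer(genuine, impostor):
    """Return (eer, threshold) where |FAR - FRR| is smallest over observed scores."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy distances: one genuine sample (0.30) overlaps one impostor sample (0.25).
genuine = [0.02, 0.05, 0.08, 0.10, 0.30]
impostor = [0.25, 0.45, 0.48, 0.50, 0.55]
eer, threshold = sweep_eer(genuine, impostor)
print(f"EER = {eer * 100:.1f}% at threshold {threshold}")  # EER = 20.0% at threshold 0.25
```

With only a handful of scores the sweep can land exactly on a crossing, as it does here. On real data the two rates rarely coincide at a discrete threshold, which is why an interpolated root-finder is the more common choice.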

`compute_eer`

def compute_eer(genuine: np.ndarray, impostor: np.ndarray) -> float

Uses scipy.optimize.brentq on a scipy.interpolate.interp1d interpolation of the ROC curve. Returns EER as a percentage (0–100). Returns 50.0 if either input array is empty.

from main import compute_eer
import numpy as np

genuine  = np.random.normal(0.8, 0.05, 500)   # high similarity scores
impostor = np.random.normal(0.3, 0.10, 500)   # low similarity scores
eer = compute_eer(genuine, impostor)
print(f"EER: {eer:.2f}%")

Benchmark Results

Method	EER
Neural (binarised embeddings)	0.75%
NeuralVault (cosine similarity)	0.94%
BioHashing	varies
SHA256	varies
HMAC	varies

EER values below 1% are considered excellent for a biometric system. Values above 5% suggest that genuine and impostor distributions overlap significantly and the system is unsuitable for high-security applications.

d-prime (d′)

d-prime is a dimensionless separability index borrowed from signal detection theory. It measures the distance between the means of the genuine and impostor score distributions in units of their root-mean-square standard deviation. Unlike EER, which describes a single operating point, d-prime characterises the intrinsic separability of the two distributions regardless of any chosen threshold. Higher d-prime is always better. Values above 3 are considered excellent for a biometric system.

Formula

d' = \frac{|\mu_{\text{impostor}} - \mu_{\text{genuine}}|}{\sqrt{0.5\,(\sigma^2_{\text{genuine}} + \sigma^2_{\text{impostor}})}}

A small stability epsilon 1e-9 is added to the denominator to prevent division by zero when all samples are identical.
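As a worked check of the formula, the sketch below evaluates d' both from known distribution parameters and from samples. It is an illustration only: the function names are hypothetical, and the use of the population variance is an assumption, since the source does not say which variance estimator compute_d_prime uses.

```python
import math
import statistics


def d_prime(genuine, impostor, eps=1e-9):
    """Separation of the means in units of the RMS standard deviation."""
    mu_g = statistics.fmean(genuine)
    mu_i = statistics.fmean(impostor)
    # Population variance is an assumption made for this sketch.
    var_g = statistics.pvariance(genuine)
    var_i = statistics.pvariance(impostor)
    # eps keeps the denominator non-zero when all samples are identical.
    return abs(mu_i - mu_g) / (math.sqrt(0.5 * (var_g + var_i)) + eps)


def d_prime_from_params(mu_g, sd_g, mu_i, sd_i):
    """Closed-form d' for known means and standard deviations."""
    return abs(mu_i - mu_g) / math.sqrt(0.5 * (sd_g ** 2 + sd_i ** 2))


print(f"{d_prime_from_params(0.8, 0.05, 0.3, 0.10):.2f}")  # 6.32
print(f"{d_prime([0.1, 0.2, 0.3], [0.5, 0.6, 0.7]):.2f}")  # 4.90
print(d_prime([0.5, 0.5], [0.5, 0.5]))                     # 0.0, no division by zero
```

The first value follows directly from the formula: 0.5 / \sqrt{0.5\,(0.05^2 + 0.10^2)} = 0.5 / \sqrt{0.00625} \approx 6.32.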

Interpretation Guide

d-prime	Separation Quality
> 3.0	Excellent — distributions barely overlap
1.0 – 3.0	Good — usable for most biometric applications
< 1.0	Fair — significant overlap, high error rates expected

`compute_d_prime`

def compute_d_prime(genuine: np.ndarray, impostor: np.ndarray) -> float

from main import compute_d_prime
import numpy as np

genuine  = np.random.normal(0.8, 0.05, 500)
impostor = np.random.normal(0.3, 0.10, 500)
dp = compute_d_prime(genuine, impostor)
print(f"d' = {dp:.3f}")   # the formula gives roughly 6.3 for these synthetic distributions

Benchmark Results

Method	d-prime
Neural (binarised embeddings)	5.95
NeuralVault (cosine similarity)	4.84

ROC-AUC

The Area Under the Receiver Operating Characteristic curve summarises performance across all possible decision thresholds. The ROC curve plots True Acceptance Rate (TAR = 1 − FRR) against False Acceptance Rate (FAR) as the threshold varies. AUC = 1.0 indicates perfect separability; AUC = 0.5 equals random guessing. ROC-AUC is used in two distinct contexts within Neural Vault:

Classification AUC — computed per few-shot episode over the one-vs-rest softmax probabilities for all five classes. Reported as the macro-average across 40 episodes.
Biometric verification AUC — computed directly on genuine/impostor cosine distances for the NeuralVault prototype comparison.

`compute_roc_from_scores`

def compute_roc_from_scores(
    genuine: np.ndarray,
    impostor: np.ndarray,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]

Internally labels genuine samples as class 0 and impostor samples as class 1, then calls sklearn.metrics.roc_curve. The returned (fpr, tpr, thresholds) tuple is compatible with sklearn.metrics.auc.

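from main import compute_roc_from_scores
from sklearn.metrics import auc

# genuine and impostor are 1-D NumPy score arrays prepared beforehand
# (genuine is labelled class 0, impostor class 1)
fpr, tpr, thresholds = compute_roc_from_scores(genuine, impostor)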
roc_auc = auc(fpr, tpr)
print(f"ROC-AUC: {roc_auc:.4f}")
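The AUC has a direct probabilistic reading that can help when interpreting the numbers: with impostors as the positive class and distance as the score, it is the probability that a randomly chosen impostor distance exceeds a randomly chosen genuine distance, with ties counted as one half. The sketch below computes that pairwise statistic in plain Python. The function name auc_from_distances is hypothetical, and the result should agree with the trapezoidal area under the empirical ROC curve up to floating-point error.

```python
def auc_from_distances(genuine, impostor):
    """P(impostor distance > genuine distance), counting ties as one half."""
    wins = 0.0
    for g in genuine:
        for i in impostor:
            if i > g:
                wins += 1.0
            elif i == g:
                wins += 0.5
    return wins / (len(genuine) * len(impostor))


# One genuine distance (0.30) is larger than one impostor distance (0.25),
# so 24 of the 25 pairs are ordered correctly.
genuine = [0.02, 0.05, 0.08, 0.10, 0.30]
impostor = [0.25, 0.45, 0.48, 0.50, 0.55]
print(auc_from_distances(genuine, impostor))      # 0.96
print(auc_from_distances([0.1, 0.2], [0.8, 0.9])) # 1.0
```

The pairwise loop is quadratic in the number of scores, so it is meant for intuition and small sanity checks rather than for full benchmark runs.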

Benchmark Results

Context	AUC
NeuralVaultFewShot (classification)	0.9995
NeuralVault (cosine verification)	≈ 1.0000

Hamming Distance

Hamming distance is the primary metric for comparing binary cryptographic keys. It is defined as the fraction of bit positions that differ between two keys of equal length. Neural Vault uses scipy.spatial.distance.cdist with metric='hamming' to compute pairwise Hamming distances between reference prototype keys and per-sample keys.

d_H(a, b) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[a_i \neq b_i]

For a 256-bit key, the value ranges from 0 (identical keys) to 1 (all bits differ). Two independent, uniformly random 256-bit keys have an expected Hamming distance of exactly 0.5, with a standard deviation of 0.5 / \sqrt{256} \approx 0.031.
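The definition can be checked directly on raw key bytes. This is a small standard-library sketch, separate from the cdist-based code used in the benchmark; hamming_fraction is a hypothetical helper name.

```python
import secrets


def hamming_fraction(a: bytes, b: bytes) -> float:
    """Fraction of bit positions that differ between two equal-length keys."""
    if len(a) != len(b):
        raise ValueError("keys must have equal length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing / (len(a) * 8)


key = secrets.token_bytes(32)               # one random 256-bit key
complement = bytes(b ^ 0xFF for b in key)   # every bit inverted
print(hamming_fraction(key, key))           # 0.0
print(hamming_fraction(key, complement))    # 1.0

# Average distance over many independent random key pairs: close to 0.5.
trials = 2000
mean_distance = sum(
    hamming_fraction(secrets.token_bytes(32), secrets.token_bytes(32))
    for _ in range(trials)
) / trials
print(f"mean over {trials} random pairs: {mean_distance:.3f}")
```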

Interpretation

Scenario	Expected Hamming Distance
Genuine pair (same user, same class)	Near 0.0 — stable key
Impostor pair (different user/class)	Near 0.5 — effectively random

A large gap between genuine and impostor Hamming distances (visible as a large d-prime on the Hamming distance distributions) confirms that the key generation method is both stable and discriminative.

Usage in `evaluate_keygen_method`

genuine_dists.extend(
    cdist(ref_key_matrix, gen_keys, metric='hamming').flatten()
)
impostor_dists.extend(
    cdist(ref_key_matrix, imp_keys, metric='hamming').flatten()
)

The resulting genuine_dists and impostor_dists arrays feed directly into compute_eer and compute_d_prime.

Cosine Similarity

Cosine similarity measures the angle between two embedding vectors on the unit hypersphere. Because NeuralVaultFewShot L2-normalises all output embeddings, the cosine similarity reduces to a plain dot product, making it computationally efficient.

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \mathbf{a} \cdot \mathbf{b} \quad (\text{since } \|\mathbf{a}\|=\|\mathbf{b}\|=1)

scipy.spatial.distance.cdist computes cosine distance (1 - cosine_similarity), so the genuine and impostor arrays produced by verify_similarity use distance — a value of 0 means identical direction, and 1 means orthogonal.
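The relationship between the dot product, cosine similarity, and cosine distance can be seen on a pair of small vectors. This is an illustrative sketch in plain Python; l2_normalise and dot are hypothetical helpers, not functions from main.py or model.py.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalise([3.0, 4.0])   # [0.6, 0.8]
b = l2_normalise([4.0, 3.0])   # [0.8, 0.6]

similarity = dot(a, b)         # plain dot product, valid because both are unit length
distance = 1.0 - similarity    # the convention cdist uses for metric='cosine'
print(f"similarity = {similarity:.2f}, distance = {distance:.2f}")  # similarity = 0.96, distance = 0.04

# Orthogonal unit vectors: similarity 0, distance 1.
print(1.0 - dot([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Keeping track of which of the two conventions a score array uses matters, because the accept/reject comparison flips direction between them.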

EER Decision Threshold

From benchmark results, the NeuralVault cosine-similarity EER threshold is approximately 0.316. Samples with cosine similarity above this value are accepted as genuine; those below are rejected.

# model.py verification demo
score = np.dot(user_emb, prototype_vec)   # cosine similarity (high = genuine)
eer_thr = 0.316  # cosine similarity at EER
if score > eer_thr:
    print("Authenticated")
else:
    print("Rejected")

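The EER threshold is dataset-dependent. Always re-derive it from genuine and impostor cosine similarity scores computed on your own enrollment data rather than relying on the default 0.316. Note that compute_eer returns the error rate as a percentage, not the threshold; the threshold itself has to be read from the ROC thresholds, for example those returned by compute_roc_from_scores.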

Key Entropy and Bit Balance

A cryptographic key is only as strong as its randomness. A 256-bit key whose first 128 bits are always zero has at most 128 bits of effective entropy. Neural Vault measures two complementary aspects of key quality, both computed per bit position across a population of derived keys.

Shannon Entropy (per-bit)

For each bit position i, let p_i be the fraction of keys in the population where bit i is 1:

H_i = -p_i \log_2 p_i - (1-p_i)\log_2(1-p_i)

The reported entropy is the mean of H_i across all n bit positions:

H = \frac{1}{n}\sum_{i=1}^{n} H_i

An ideal key has H = 1.0 bits/bit: every position is equally likely to be 0 or 1 across the user population.

Bit Balance

Balance is a simpler, linear measure of the same property:

\text{balance} = 1 - 2 \cdot \text{mean}_i(|p_i - 0.5|)

A value of 1.0 means every bit position has exactly 50% probability of being 1; a value of 0.0 means every bit position is constant across the population (always 0 or always 1).
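Both statistics can be reproduced on a tiny bit matrix. The following is a plain-Python sketch of the two formulas, not the compute_key_entropy_balance implementation; the name entropy_and_balance is hypothetical, and treating 0 \log_2 0 as 0 for constant bit positions is an assumption of this sketch.

```python
import math


def entropy_and_balance(bit_rows):
    """Mean per-bit Shannon entropy and bit balance over a population of keys.

    bit_rows is a list of equal-length lists of 0/1 values, one row per key.
    """
    n_keys = len(bit_rows)
    n_bits = len(bit_rows[0])
    entropy_sum = 0.0
    deviation_sum = 0.0
    for i in range(n_bits):
        p = sum(row[i] for row in bit_rows) / n_keys
        if 0.0 < p < 1.0:  # constant positions contribute zero entropy
            entropy_sum += -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        deviation_sum += abs(p - 0.5)
    return entropy_sum / n_bits, 1.0 - 2.0 * deviation_sum / n_bits


balanced = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0]]
half_stuck = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
single_key = [[1, 0, 1, 1]]

print(entropy_and_balance(balanced))    # (1.0, 1.0)
print(entropy_and_balance(half_stuck))  # (0.5, 0.5)
print(entropy_and_balance(single_key))  # (0.0, 0.0)
```

The half_stuck population mirrors the example of a key whose leading bits never change: the two constant positions contribute nothing, so both statistics drop to 0.5.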

`compute_key_entropy_balance`

def compute_key_entropy_balance(bit_matrix: np.ndarray) -> tuple[float, float]

from main import compute_key_entropy_balance, raw_key_to_bitarray
import numpy as np

# key_list holds the derived keys, one per sample (not defined in this snippet)
bit_rows = np.vstack([raw_key_to_bitarray(k) for k in key_list])
entropy, balance = compute_key_entropy_balance(bit_rows)
print(f"Entropy: {entropy:.4f} bits/bit")   # ideal: 1.0
print(f"Balance: {balance:.4f}")             # ideal: 1.0

Entropy and balance are population-level statistics: you need keys derived from multiple distinct users or sessions to compute meaningful values. With a single key, every p_i is either 0 or 1, so both entropy and balance collapse to 0.

Avalanche Effect

The avalanche effect measures the cryptographic sensitivity of key derivation: a small change in the input embedding should flip approximately 50% of the output key bits. This property is essential for preventing an attacker from inferring nearby keys by interpolating from a known one.

Measurement

From model.py:

def bit_diff(a: bytes, b: bytes) -> float:
    # Fraction of differing bits; int.bit_count() requires Python 3.10+
    return (int.from_bytes(a, 'big') ^ int.from_bytes(b, 'big')).bit_count() / (len(a) * 8)

perturb = prototype_vec.copy()
perturb[0] += 0.05                        # 0.05-unit shift in one dimension
avalanche_key, _ = derive_key(perturb)
avalanche_pct = bit_diff(base_key_bytes, avalanche_key) * 100

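A perturbation of 0.05 in a single embedding dimension (out of latent_dim = 40) flips approximately 50% of the bits in the derived 256-bit key. This is the behaviour expected from HKDF-SHA256 acting as a diffusion layer over the quantised embedding bytes: once the shift is large enough to change at least one quantised byte, the derived key is effectively unrelated to the original.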

Benchmark Result

Perturbation	Bit-flip Rate
+0.05 in dimension 0	≈ 50%

An avalanche rate well below 50% would indicate that the key derivation is insufficiently sensitive, allowing an attacker to reconstruct nearby keys from a stolen prototype.
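The measurement can be reproduced in isolation with the standard library. The sketch below is not the derive_key function from model.py: hkdf_sha256 is a minimal HKDF (RFC 5869) written for this example, and quantise is a hypothetical one-byte-per-dimension quantiser chosen only to make the example runnable. The real quantisation, salt, and info parameters are not given in the source.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes = b"", info: bytes = b"", length: int = 32) -> bytes:
    """Minimal HKDF-SHA256 (extract, then expand)."""
    if not salt:
        salt = bytes(hashlib.sha256().digest_size)  # default salt: HashLen zero bytes
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def quantise(vec) -> bytes:
    """Hypothetical quantiser: clip to [-1, 1], then map each value to one byte."""
    return bytes(round((min(1.0, max(-1.0, x)) + 1.0) / 2.0 * 255) for x in vec)


def bit_diff(a: bytes, b: bytes) -> float:
    """Fraction of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b)) / (len(a) * 8)


vec = [0.1] * 40                       # stand-in for a 40-dimensional prototype
perturbed = list(vec)
perturbed[0] += 0.05                   # same perturbation size as the benchmark
tiny = list(vec)
tiny[0] += 1e-6                        # too small to change any quantised byte

base_key = hkdf_sha256(quantise(vec))
flip_rate = bit_diff(base_key, hkdf_sha256(quantise(perturbed)))
tiny_rate = bit_diff(base_key, hkdf_sha256(quantise(tiny)))
print(f"0.05 shift: {flip_rate * 100:.1f}% of bits flipped")
print(f"1e-6 shift: {tiny_rate * 100:.1f}% of bits flipped")  # 0.0%
```

The second case shows the limit of the test: a perturbation that does not survive quantisation leaves the key unchanged, which is what makes the key stable for genuine samples in the first place.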

FAR and FRR

False Acceptance Rate and False Rejection Rate are the fundamental operating-point metrics from which EER is derived. At any fixed decision threshold t:

\text{FAR}(t) = \frac{\text{impostors accepted}}{\text{total impostors}}, \qquad \text{FRR}(t) = \frac{\text{genuine users rejected}}{\text{total genuine users}}

The two rates trade off. For similarity scores, lowering the threshold makes the system more permissive, which reduces FRR but increases FAR; raising it does the opposite. For distance scores the direction is reversed. The EER threshold is the point where the two rates are equal.
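The trade-off is easy to see on a handful of similarity scores. This is a toy sketch with made-up scores, using the accept-if-above rule from the verification demo; far_frr_similarity is a hypothetical helper name.

```python
def far_frr_similarity(genuine, impostor, threshold):
    """FAR and FRR when a similarity score above the threshold is accepted."""
    far = sum(s > threshold for s in impostor) / len(impostor)
    frr = sum(s <= threshold for s in genuine) / len(genuine)
    return far, frr


# Made-up similarity scores: higher means more likely genuine.
genuine = [0.55, 0.70, 0.80, 0.90]
impostor = [0.10, 0.20, 0.35, 0.60]

for t in (0.30, 0.50, 0.65):
    far, frr = far_frr_similarity(genuine, impostor, t)
    print(f"threshold {t:.2f}: FAR = {far:.2f}, FRR = {frr:.2f}")
# threshold 0.30: FAR = 0.50, FRR = 0.00
# threshold 0.50: FAR = 0.25, FRR = 0.00
# threshold 0.65: FAR = 0.00, FRR = 0.25
```

Raising the threshold from 0.30 to 0.65 removes every false acceptance at the cost of rejecting one genuine sample.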

At the EER Threshold

Method	FAR @ EER	FRR @ EER
Neural (Hamming)	≈ 0.75%	≈ 0.75%
NeuralVault (cosine)	≈ 0.94%	≈ 0.94%

In the Neural Vault benchmark these values are computed and printed directly:

# model.py
far = fpr_arr[eer_idx]
frr = fnr_arr[eer_idx]
print(f"FAR @ EER: {far*100:.3f}%")
print(f"FRR @ EER: {frr*100:.3f}%")

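FAR and FRR at the EER threshold are equal by definition, but the printed values may differ slightly: with a finite number of scores the empirical rates are step functions, so eer_idx selects the nearest discrete threshold rather than an exact crossing. compute_eer, which root-finds on the interpolated ROC curve, can therefore report a value that differs slightly from both.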
