Membrane exposes behavioral metrics via GetMetrics and ships a comprehensive evaluation suite covering retrieval quality, revision semantics, decay curves, trust gating, and vector-aware recall.

GetMetrics

GetMetrics returns a point-in-time `*metrics.Snapshot` collected by scanning all records in the store.

```go
// m is an initialized Membrane instance; ctx is a context.Context.
snap, err := m.GetMetrics(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Total: %d, Usefulness: %.2f\n", snap.TotalRecords, snap.RetrievalUsefulness)
```

Example snapshot

```json
{
  "collected_at": "2026-02-05T14:23:10Z",
  "total_records": 142,
  "records_by_type": {
    "episodic": 80,
    "semantic": 35,
    "competence": 15,
    "plan_graph": 7,
    "working": 5
  },
  "avg_salience": 0.62,
  "avg_confidence": 0.78,
  "salience_distribution": {
    "0.0-0.2": 12,
    "0.2-0.4": 18,
    "0.4-0.6": 30,
    "0.6-0.8": 45,
    "0.8-1.0": 37
  },
  "active_records": 130,
  "pinned_records": 3,
  "total_audit_entries": 890,
  "memory_growth_rate": 0.15,
  "retrieval_usefulness": 0.42,
  "competence_success_rate": 0.85,
  "plan_reuse_frequency": 2.3,
  "revision_rate": 0.08
}
```
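The per-type counts and ratios can be consumed directly from the snapshot struct. The sketch below assumes the Go field names mirror the JSON keys above (for example, `RecordsByType` for `records_by_type`); only `TotalRecords` and `RetrievalUsefulness` appear in the earlier example, so treat the rest as illustrative.

```go
// Sketch: per-type share of the store. RecordsByType is an assumed
// field name derived from the "records_by_type" JSON key.
snap, err := m.GetMetrics(ctx)
if err != nil {
    log.Fatal(err)
}
for memType, count := range snap.RecordsByType {
    share := float64(count) / float64(snap.TotalRecords)
    fmt.Printf("%-12s %4d (%.0f%%)\n", memType, count, share*100)
}
```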

Metrics reference

| Metric | Type | Description |
| --- | --- | --- |
| `collected_at` | string | RFC 3339 timestamp when the snapshot was collected |
| `total_records` | int | Total number of records in the store |
| `records_by_type` | map[string]int | Count of records per memory type |
| `avg_salience` | float64 | Mean salience across all records |
| `avg_confidence` | float64 | Mean confidence across all records |
| `salience_distribution` | map[string]int | Record counts in 0.2-wide salience buckets |
| `active_records` | int | Records with salience > 0 |
| `pinned_records` | int | Records with lifecycle.pinned = true |
| `total_audit_entries` | int | Total audit log entries across all records |
| `memory_growth_rate` | float64 | Fraction of records created in the last 24 hours |
| `retrieval_usefulness` | float64 | Ratio of reinforce audit actions to total audit entries |
| `competence_success_rate` | float64 | Average SuccessRate across all competence records |
| `plan_reuse_frequency` | float64 | Average ExecutionCount across all plan_graph records |
| `revision_rate` | float64 | Fraction of audit entries that are supersede, fork, or merge operations |

Interpreting key metrics

- `retrieval_usefulness`: A high value (near 1.0) means retrieved records are frequently reinforced after use, indicating good retrieval quality. A low value may indicate retrieval is surfacing records that aren't actually useful. In the example snapshot above, 0.42 corresponds to roughly 374 reinforce actions out of 890 audit entries.
- `memory_growth_rate`: Tracks how fast the substrate is growing. A rate near 1.0 means almost all records were created in the last 24 hours, which may indicate runaway ingestion or a freshly started agent.
- `revision_rate`: Measures how often the knowledge base is being revised. A very low rate may indicate the agent isn't learning from feedback; a very high rate may indicate instability in the knowledge base.
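These rules of thumb translate naturally into an automated health check. The sketch below is illustrative: the thresholds are assumptions to tune per workload, not values shipped with Membrane, and the `MemoryGrowthRate` and `RevisionRate` field names are inferred from the JSON keys above.

```go
// Illustrative health check over a snapshot. Thresholds are example
// values; field names other than RetrievalUsefulness are assumed to
// mirror the JSON keys.
func checkHealth(snap *metrics.Snapshot) []string {
    var warnings []string
    if snap.RetrievalUsefulness < 0.2 {
        warnings = append(warnings, "low retrieval usefulness: retrieved records are rarely reinforced")
    }
    if snap.MemoryGrowthRate > 0.9 {
        warnings = append(warnings, "high growth rate: possible runaway ingestion or a fresh start")
    }
    if snap.RevisionRate > 0.5 {
        warnings = append(warnings, "high revision rate: knowledge base may be unstable")
    }
    return warnings
}
```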

Evaluation suite

The eval suite covers functional correctness across all major subsystems.

Run everything

```bash
make eval-all
```

This runs all Go-based eval tests and the vector end-to-end evaluation script.

Targeted capability evals

```bash
make eval-typed          # Memory type handling
make eval-revision       # Revision semantics
make eval-decay          # Decay curves and pruning
make eval-trust          # Trust-gated retrieval
make eval-competence     # Competence learning
make eval-plan           # Plan graph operations
make eval-consolidation  # Episodic consolidation
make eval-metrics        # Observability metrics
make eval-invariants     # System invariants
make eval-grpc           # gRPC endpoint coverage
```

Each target maps to a `go test` run against the ./tests package with a specific -run filter.

Recall regression tests

The recall regression test validates that the retrieval layer returns expected records given a known corpus:

```bash
go test ./tests -run TestRetrievalRecallAtK
```

This test checks recall@k for a fixed set of records and queries, failing if recall drops below the configured threshold. Use it as a canary to detect regressions in retrieval ordering or trust filtering.
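The shape of such a check is straightforward. The following is a self-contained sketch of a recall@k assertion, not the body of TestRetrievalRecallAtK; the corpus, results, and threshold are made up for illustration.

```go
package tests

import "testing"

// Sketch of a recall@k regression check. All data here is hypothetical.
func TestRecallAtKSketch(t *testing.T) {
    const k, minRecall = 5, 0.90

    // Known-relevant record IDs for a fixed query.
    relevant := map[string]bool{"rec-1": true, "rec-2": true, "rec-3": true}

    // Stand-in for the top-k records the retrieval layer returned.
    topK := []string{"rec-1", "rec-3", "rec-7", "rec-2", "rec-9"}

    hits := 0
    for _, id := range topK[:k] {
        if relevant[id] {
            hits++
        }
    }
    recall := float64(hits) / float64(len(relevant))
    if recall < minRecall {
        t.Fatalf("recall@%d = %.2f, want >= %.2f", k, recall, minRecall)
    }
}
```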

Vector end-to-end metrics

The vector E2E evaluation requires Python and measures recall, precision, MRR, and NDCG over a synthetic corpus with pgvector-backed retrieval.
1. Install Python dependencies:

   ```bash
   python3 -m pip install -r tools/eval/requirements.txt
   ```

2. Run the eval:

   ```bash
   make eval
   ```

   This invokes tools/eval/run.sh, which spins up the eval corpus and measures retrieval metrics at multiple k values.

3. Check the results. The script reports recall@k, precision@k, MRR@k, and NDCG@k and fails with a non-zero exit code if any metric falls below its threshold.

Metric definitions

| Metric | Description |
| --- | --- |
| recall@k | Fraction of relevant records found in the top-k results |
| precision@k | Fraction of top-k results that are relevant |
| MRR@k | Mean reciprocal rank of the first relevant result in top-k |
| NDCG@k | Normalized discounted cumulative gain; measures ranking quality |
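For intuition, the ranking metrics reduce to a few lines of code under binary relevance. The Go sketch below is illustrative only; the actual measurement is done by the Python tooling under tools/eval.

```go
package main

import (
    "fmt"
    "math"
)

// mrrAtK returns the reciprocal rank of the first relevant result in
// the top k, or 0 if none is relevant.
func mrrAtK(ranked []string, relevant map[string]bool, k int) float64 {
    if k > len(ranked) {
        k = len(ranked)
    }
    for i, id := range ranked[:k] {
        if relevant[id] {
            return 1.0 / float64(i+1)
        }
    }
    return 0
}

// ndcgAtK computes NDCG@k with binary gains: DCG of the observed
// ranking divided by the DCG of an ideal ranking.
func ndcgAtK(ranked []string, relevant map[string]bool, k int) float64 {
    if k > len(ranked) {
        k = len(ranked)
    }
    var dcg float64
    for i, id := range ranked[:k] {
        if relevant[id] {
            dcg += 1.0 / math.Log2(float64(i+2)) // rank positions are 1-indexed
        }
    }
    ideal := len(relevant)
    if ideal > k {
        ideal = k
    }
    var idcg float64
    for i := 0; i < ideal; i++ {
        idcg += 1.0 / math.Log2(float64(i+2))
    }
    if idcg == 0 {
        return 0
    }
    return dcg / idcg
}

func main() {
    ranked := []string{"a", "x", "b", "y", "c"}
    relevant := map[string]bool{"a": true, "b": true, "c": true}
    fmt.Printf("MRR@5 = %.3f, NDCG@5 = %.3f\n",
        mrrAtK(ranked, relevant, 5), ndcgAtK(ranked, relevant, 5))
}
```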

Environment variable overrides

Thresholds can be overridden via environment variables (for example, `MEMBRANE_EVAL_MIN_RECALL=0.95 make eval`):

| Variable | Default | Description |
| --- | --- | --- |
| `MEMBRANE_EVAL_MIN_RECALL` | 0.90 | Minimum acceptable recall@k |
| `MEMBRANE_EVAL_MIN_PRECISION` | 0.20 | Minimum acceptable precision@k |
| `MEMBRANE_EVAL_MIN_MRR` | 0.90 | Minimum acceptable MRR@k |
| `MEMBRANE_EVAL_MIN_NDCG` | 0.90 | Minimum acceptable NDCG@k |

Latest benchmark results

Local run (Feb 5, 2026):

| Suite | Result |
| --- | --- |
| Unit/Integration | 22 top-level eval tests + 7 subtests = 29 test cases, 0 failures (~0.40s) |
| Vector E2E | 35 records, 18 queries; recall@k 1.000, precision@k 0.267, MRR@k 0.956, NDCG@k 0.955 |
End-to-end recall depends on ingestion quality, trust filters, and reinforcement behavior. Treat recall tests as scenario-level regression guards rather than universal benchmarks.
