NISIRA includes a fully self-contained evaluation layer that measures retrieval quality and answer fidelity for every query — without calling external APIs or requiring an OpenAI key. The RAGAS dependency has been removed and replaced by CustomMetricsEvaluator (api/custom_evaluator.py), which computes six accuracy metrics locally alongside the latency metrics already captured by MetricsTracker (api/metrics_tracker.py). All data is persisted in two Django models — QueryMetrics and RAGASMetrics — and surfaced through the Admin Panel and a JSON API endpoint.

Data Models

QueryMetrics — Per-query latency

Saved automatically after every rag_enhanced_chat request by MetricsTracker.save_metrics().

| Field | Type | Description |
| --- | --- | --- |
| query_id | CharField | Unique identifier for the query |
| query_text | TextField | Full question text |
| total_latency | FloatField | Wall-clock time from question receipt to response complete (seconds) |
| time_to_first_token | FloatField | Time until the first LLM token is streamed (seconds) |
| retrieval_time | FloatField | Time spent in vector store search (seconds) |
| generation_time | FloatField | Time spent in LLM generation (seconds) |
| is_complex_query | BooleanField | True when the query is classified as complex (based on length, keywords, and question count) |
| query_complexity_score | FloatField | Complexity score (0.0–1.0) |
| documents_retrieved | IntegerField | Number of chunks returned by hybrid search |
| top_k | IntegerField | The adaptive top_k value used for this query |
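The latency fields above can be captured with nested wall-clock timers. The sketch below is a minimal illustration and not the actual MetricsTracker code: LatencyRecord, StageTimer, and timed_query are hypothetical names, and the retrieve and generate callables stand in for the vector store search and the LLM call.

```python
import time
from dataclasses import dataclass


@dataclass
class LatencyRecord:
    """Hypothetical in-memory stand-in for a QueryMetrics row."""
    query_id: str
    retrieval_time: float = 0.0
    generation_time: float = 0.0
    total_latency: float = 0.0


class StageTimer:
    """Context manager that measures the wall-clock duration of one stage."""

    def __enter__(self):
        self.elapsed = 0.0
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # never swallow exceptions from the timed stage


def timed_query(query_id, retrieve, generate):
    """Run retrieval, then generation, recording per-stage latency in seconds."""
    record = LatencyRecord(query_id=query_id)
    with StageTimer() as total:
        with StageTimer() as t_retrieve:
            chunks = retrieve()
        with StageTimer() as t_generate:
            answer = generate(chunks)
    record.retrieval_time = t_retrieve.elapsed
    record.generation_time = t_generate.elapsed
    record.total_latency = total.elapsed
    return answer, record
```

Because the outer timer wraps both stages, total_latency is always at least retrieval_time plus generation_time; the difference is the overhead between stages.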

RAGASMetrics — Per-query accuracy

Created alongside QueryMetrics when RAGAS_ENABLED=true or when the custom evaluator runs. Linked via query_metrics foreign key.

| Field | Type | Description |
| --- | --- | --- |
| evaluation_id | CharField | Unique evaluation record ID |
| precision_at_k | FloatField | Precision@k score (0.0–1.0) |
| recall_at_k | FloatField | Recall@k score (0.0–1.0) |
| faithfulness_score | FloatField | Faithfulness score (0.0–1.0) |
| answer_relevancy | FloatField | Answer relevancy score (0.0–1.0) |
| hallucination_rate | FloatField | Derived: 1.0 - faithfulness_score |
| wer_score | FloatField | Word Error Rate (0.0–∞); populated only when ground truth is available |
| k_value | IntegerField | The k used for precision/recall calculations |
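The wer_score field has no upper bound because a generated answer can contain more wrong or extra words than the reference has words in total. A minimal sketch of a word-level Levenshtein WER follows. The function name word_error_rate and the normalization by reference length (the conventional WER definition) are assumptions here, not confirmed details of CustomMetricsEvaluator.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a one-word reference "a" compared with the answer "a b c" needs two edits, which gives a WER of 2.0.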

Metrics Reference

| Metric | Range | Calculation Method |
| --- | --- | --- |
| Precision@k | 0–1 | Jaccard similarity between each retrieved chunk and the generated answer; a chunk is "relevant" if its similarity exceeds the 8% threshold |
| Recall@k | 0–1 | 3-word n-gram extraction from each chunk; a chunk is "used" if any of its n-grams appears in the answer |
| Faithfulness | 0–1 | Response split into sentences; a sentence is "supported" if >60% of its keywords appear in the combined context |
| Hallucination Rate | 0–1 | 1.0 − Faithfulness; lower is better |
| Answer Relevancy | 0–1 | Keyword overlap between the question and the answer, with a +0.1 bonus for answers 20–300 words long |
| WER | 0–∞ | Levenshtein distance (word-level) between the generated answer and a ground-truth reference; only computed when a reference is provided |

The custom evaluator runs entirely in Python using only the standard library and NumPy: no network calls, no API keys, no quota limits. Evaluation adds negligible latency because it runs asynchronously after the response has already been returned to the user.
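The two retrieval metrics described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the code in api/custom_evaluator.py: the function names, the regex tokenizer, and the choice to divide by the number of retrieved chunks are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def precision_at_k(chunks, answer, threshold=0.08):
    """Fraction of retrieved chunks whose similarity to the answer exceeds the threshold."""
    if not chunks:
        return 0.0
    relevant = sum(1 for chunk in chunks if jaccard(chunk, answer) > threshold)
    return relevant / len(chunks)


def ngrams(tokens, n=3):
    """Set of consecutive n-token tuples; empty if there are fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def recall_at_k(chunks, answer, n=3):
    """Fraction of retrieved chunks that share at least one n-gram with the answer."""
    if not chunks:
        return 0.0
    answer_ngrams = ngrams(tokenize(answer), n)
    used = sum(1 for chunk in chunks if ngrams(tokenize(chunk), n) & answer_ngrams)
    return used / len(chunks)
```

Both functions compare chunks against the generated answer instead of a labeled relevance set, so they measure how much of the retrieved context the answer drew on, not ground-truth retrieval quality.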

Enabling the Evaluator

Metric evaluation is gated by the RAGAS_ENABLED environment variable (default false). Set it to activate per-query accuracy scoring:
RAGAS_ENABLED=true
When disabled, latency metrics (QueryMetrics) are still captured; only the accuracy scores (RAGASMetrics) are skipped.
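One way to read the flag in application code is sketched below. The helper name ragas_enabled is hypothetical, and accepting "1" and "yes" in addition to "true" is an assumption; the source only documents the value true.

```python
import os


def ragas_enabled(environ=None) -> bool:
    """Return True when RAGAS_ENABLED is set to a truthy value (default: false)."""
    env = os.environ if environ is None else environ
    raw = env.get("RAGAS_ENABLED", "false")
    # Accepting "1" and "yes" is an assumption; only "true" is documented.
    return raw.strip().lower() in {"1", "true", "yes"}
```

Passing a plain dict as environ makes the helper easy to exercise in tests without touching the real process environment.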

Accessing Metrics

Admin Panel

Navigate to Admin Panel → Metrics tab to see aggregated charts and per-query breakdowns.

REST API

GET /api/admin/metrics/
Authorization: Bearer <admin_JWT>
Example response:
{
  "success": true,
  "metrics": {
    "performance": {
      "avgResponseTime": 2.34,
      "timeToFirstToken": 0.45,
      "complexQueryTime": 3.12,
      "totalQueries": 156
    },
    "precision": {
      "precisionAtK": 0.85,
      "recallAtK": 0.78,
      "hallucinationRate": 0.08,
      "faithfulness": 0.92
    },
    "metadata": {
      "lastUpdated": "2025-11-14T03:50:00",
      "dataSource": "real_database_custom_metrics",
      "kValue": 5,
      "isRealData": true
    }
  }
}
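A client for this endpoint could look like the following sketch, which uses only the standard library. The helper names, the base URL, and the timeout are hypothetical; the path, the Bearer header, and the response keys are taken from the example above.

```python
import json
import urllib.request


def build_metrics_request(base_url, admin_jwt):
    """Build the GET request for the admin metrics endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/admin/metrics/",
        headers={
            "Authorization": f"Bearer {admin_jwt}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_metrics(base_url, admin_jwt, timeout=10):
    """Send the request and decode the JSON body (performs a network call)."""
    request = build_metrics_request(base_url, admin_jwt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Extract a few headline numbers from the response shape shown above."""
    metrics = payload["metrics"]
    return {
        "avg_response_time": metrics["performance"]["avgResponseTime"],
        "faithfulness": metrics["precision"]["faithfulness"],
        "hallucination_rate": metrics["precision"]["hallucinationRate"],
    }
```

In the example response, hallucinationRate (0.08) equals 1 − faithfulness (0.92), consistent with how hallucination_rate is derived.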

Django ORM Queries

Query the models directly from a Django shell or management command:
from django.db.models import Avg
from api.models import QueryMetrics, RAGASMetrics

# Average latency across all queries
QueryMetrics.objects.aggregate(avg_latency=Avg('total_latency'))
# → {'avg_latency': 2.34}

# Average precision and faithfulness
RAGASMetrics.objects.aggregate(
    avg_precision=Avg('precision_at_k'),
    avg_faithfulness=Avg('faithfulness_score'),
    avg_wer=Avg('wer_score'),
)
# → {'avg_precision': 0.85, 'avg_faithfulness': 0.92, 'avg_wer': 0.07}

# Latency for complex queries only
QueryMetrics.objects.filter(is_complex_query=True).aggregate(
    avg_complex_latency=Avg('total_latency')
)

A/B Experiment Tracking

The ExperimentRun model stores the results of A/B experiments comparing configuration variants (e.g. different top_k values, embedding models, or prompt templates).

| Field | Description |
| --- | --- |
| baseline_* | Metric values for the control configuration |
| variant_* | Metric values for the experimental configuration |
| delta_* | Auto-computed difference variant − baseline; editable=False (set in save()) |
| guardrail_passed | True if the variant meets all minimum quality thresholds |
| guardrail_reason | Human-readable explanation when a guardrail fails |
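The relationship between the baseline, variant, delta, and guardrail fields can be illustrated with plain dictionaries. This is a sketch only: the metric names, the minimum thresholds, and both function names are hypothetical, and the real ExperimentRun.save() logic may differ.

```python
def compute_deltas(baseline, variant):
    """Return delta_<metric> = variant - baseline for every baseline metric."""
    return {f"delta_{name}": variant[name] - baseline[name] for name in baseline}


def check_guardrails(variant, minimums):
    """Return (passed, reason); reason lists every metric below its minimum."""
    failures = [
        f"{name}={variant[name]:.2f} is below minimum {floor:.2f}"
        for name, floor in minimums.items()
        if variant[name] < floor
    ]
    return (not failures, "; ".join(failures))


# Hypothetical experiment: the variant improves precision but hurts faithfulness.
baseline = {"precision": 0.80, "faithfulness": 0.90}
variant = {"precision": 0.85, "faithfulness": 0.70}

deltas = compute_deltas(baseline, variant)
passed, reason = check_guardrails(variant, {"faithfulness": 0.80})
```

Here the precision delta is positive, but the guardrail fails because faithfulness falls below the assumed 0.80 minimum, so reason carries the explanation that would be stored in guardrail_reason.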

Experiment Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/experiments/ | List all experiment runs (admin only) |
| POST | /api/experiments/create/ | Record a new experiment run |
| GET | /api/experiments/latest/ | Fetch the most recent experiment |

All experiment and metrics endpoints require an admin JWT (is_staff=True). Requests from non-staff users receive a 403 Forbidden response.

Metric Calculation Walk-Through

The following example shows how each metric is computed for a single real query, using the method signatures from CustomMetricsEvaluator.

Question: "¿Qué es ISO 27001 y cuáles son sus controles principales?" ("What is ISO 27001 and what are its main controls?")

Retrieved chunks (k=5):
  1. "ISO 27001 define requisitos para establecer..." ("ISO 27001 defines requirements for establishing..."): overlap 45% → relevant
  2. "Los controles de ISO 27001 incluyen gestión de acceso..." ("ISO 27001 controls include access management..."): overlap 38% → relevant
  3. "La norma ISO 27002 proporciona guías..." ("The ISO 27002 standard provides guidelines..."): overlap 25% → relevant
  4. "Documento sobre GDPR y privacidad..." ("Document about GDPR and privacy..."): overlap 5% → not relevant
  5. "Manual de configuración de firewalls..." ("Firewall configuration manual..."): overlap 2% → not relevant

| Metric | Calculation | Score |
| --- | --- | --- |
| Precision@5 | 3 relevant / 5 retrieved | 0.60 |
| Recall@5 | Response uses phrases from chunks 1, 2, 3 → 3/5 covered | 0.60 |
| Faithfulness | 7 of 8 response sentences have >60% keyword coverage in context | 0.875 |
| Hallucination Rate | 1 − 0.875 | 0.125 |
| Answer Relevancy | All 4 question keywords present; response 50 words long (+0.1 bonus) | 1.0 |
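The faithfulness and answer relevancy rows can be reproduced with a short sketch. Several details below are assumptions and not confirmed behavior of CustomMetricsEvaluator: the stopword list, the rule that keywords are tokens longer than two characters, the sentence splitter, and the cap at 1.0 (inferred from the walk-through, where full keyword overlap plus the bonus is reported as 1.0).

```python
import re

# Assumed stopword list; the real evaluator may use a different one.
STOPWORDS = {"the", "and", "what", "are", "its", "for", "with", "that", "this"}


def words(text):
    """Set of lowercase word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def keywords(text):
    """Tokens longer than two characters that are not stopwords (assumed rule)."""
    return {w for w in words(text) if len(w) > 2 and w not in STOPWORDS}


def faithfulness(answer, context, min_coverage=0.6):
    """Share of answer sentences whose keyword coverage in the context exceeds 60%."""
    context_words = words(context)
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    keyword_sets = [k for k in (keywords(s) for s in sentences) if k]
    if not keyword_sets:
        return 0.0
    supported = sum(
        1 for k in keyword_sets if len(k & context_words) / len(k) > min_coverage
    )
    return supported / len(keyword_sets)


def answer_relevancy(question, answer, bonus=0.1):
    """Question-keyword overlap with the answer, plus a length bonus, capped at 1.0."""
    question_keywords = keywords(question)
    if not question_keywords:
        return 0.0
    score = len(question_keywords & words(answer)) / len(question_keywords)
    if 20 <= len(answer.split()) <= 300:
        score += bonus
    return min(score, 1.0)


def hallucination_rate(faithfulness_score):
    """Derived metric: 1.0 minus faithfulness."""
    return 1.0 - faithfulness_score
```

With 7 supported sentences out of 8, faithfulness is 7/8 = 0.875 and hallucination_rate(0.875) returns 0.125, matching the table above.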
