QA evaluation metrics score generated answers against gold-standard answers. Each metric below exposes calculate_metric_scores, which returns a pooled score averaged over all examples together with per-example scores.

QAExactMatch

Measures whether the predicted answer exactly matches any of the gold answers after normalization.

Usage

from remem.evaluation.qa_eval import QAExactMatch

metric = QAExactMatch()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["Paris", "paris"], ["42"]],
    predicted_answers=["Paris", "forty-two"]
)
print(pooled_results)  # {"ExactMatch": 0.5}

Parameters

global_config
Optional[BaseConfig]
Global configuration object (optional)

Methods

calculate_metric_scores

Calculates the Exact Match (EM) score. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_answers
List[List[str]]
required
List of lists containing ground truth answers. Each inner list contains multiple acceptable answers for that example.
predicted_answers
List[str]
required
List of predicted answers, one per example.
aggregation_fn
Callable
default:"np.max"
Function to aggregate scores across multiple gold answers. Defaults to taking the maximum score.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with the averaged EM score across all examples
  • List[Dict[str, float]]: Per-example results with EM scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Perfect Score: 1.0 means all predicted answers exactly match at least one gold answer
  • Use Case: Best for tasks where exact answer matching is required (e.g., factoid QA)
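As a rough illustration of how the pooled and per-example results relate, the following sketch recomputes the usage example by hand. It is not the library's implementation: `simple_normalize` and `exact_match_scores` are hypothetical names, normalization is reduced to trimming and lowercasing, and the per-example key name is an assumption.

```python
from typing import Dict, List, Tuple


def simple_normalize(text: str) -> str:
    # Simplified stand-in for the library's normalization: trim and lowercase only.
    return text.strip().lower()


def exact_match_scores(
    gold_answers: List[List[str]], predicted_answers: List[str]
) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    example_results = []
    for golds, pred in zip(gold_answers, predicted_answers):
        # Score the prediction against every gold answer, then keep the best (max).
        per_gold = [float(simple_normalize(pred) == simple_normalize(g)) for g in golds]
        example_results.append({"ExactMatch": max(per_gold)})
    # The pooled score is the mean of the per-example scores.
    pooled = sum(r["ExactMatch"] for r in example_results) / len(example_results)
    return {"ExactMatch": pooled}, example_results


pooled, per_example = exact_match_scores(
    gold_answers=[["Paris", "paris"], ["42"]],
    predicted_answers=["Paris", "forty-two"],
)
print(pooled)       # {'ExactMatch': 0.5}
print(per_example)  # [{'ExactMatch': 1.0}, {'ExactMatch': 0.0}]
```

The second example scores 0.0 because "forty-two" is not among its gold answers; listing alternative spellings in the gold list is how such variants earn credit.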

QAF1Score

Measures token-level overlap between predicted and gold answers using F1 score (harmonic mean of precision and recall).

Usage

from remem.evaluation.qa_eval import QAF1Score

metric = QAF1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The capital of France is Paris"]],
    predicted_answers=["Paris is the capital"]
)
print(pooled_results)  # {"F1": 0.75}  (after normalization: precision 3/3, recall 3/5)

Parameters

global_config
Optional[BaseConfig]
Global configuration object (optional)

Methods

calculate_metric_scores

Calculates the F1 score based on token overlap. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_answers
List[List[str]]
required
List of lists containing ground truth answers.
predicted_answers
List[str]
required
List of predicted answers.
aggregation_fn
Callable
default:"np.max"
Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged F1 score
  • List[Dict[str, float]]: Per-example F1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Calculation: F1 = 2 * (precision * recall) / (precision + recall)
    • Precision: Fraction of predicted tokens that appear in gold answer
    • Recall: Fraction of gold answer tokens that appear in prediction
  • Use Case: Gives partial credit when an answer overlaps with a gold answer but does not match it exactly
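The formula above can be sketched in a few lines. This is a minimal illustration rather than the library's code: `token_f1` is a hypothetical helper, tokens are split on whitespace, and the inputs are assumed to be already normalized (lowercased, articles and punctuation removed).

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Multiset intersection: a token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens found in gold
    recall = overlap / len(gold_tokens)     # share of gold tokens found in prediction
    return 2 * precision * recall / (precision + recall)


# "Paris is the capital" vs. "The capital of France is Paris", after normalization.
score = token_f1("paris is capital", "capital of france is paris")
print(round(score, 3))  # 0.75
```

Here all three predicted tokens appear in the gold answer (precision 1.0), but only three of the five gold tokens are covered (recall 0.6), which gives F1 = 0.75.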

QABleu1Score

Evaluates answer quality using BLEU-1 (unigram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu1Score

metric = QABleu1Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is in Paris"]],
    predicted_answers=["The tower is in Paris"]
)
print(pooled_results)  # {"BLEU-1": 0.8}

Parameters

global_config
Optional[BaseConfig]
Global configuration object (optional)

Methods

calculate_metric_scores

Calculates the BLEU-1 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_answers
List[List[str]]
required
List of lists containing ground truth answers.
predicted_answers
List[str]
required
List of predicted answers.
aggregation_fn
Callable
default:"np.max"
Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-1 score
  • List[Dict[str, float]]: Per-example BLEU-1 scores

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
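  • Measures: Unigram precision (the fraction of predicted words that also appear in the gold answer)
  • Use Case: Good for measuring word-level overlap; the brevity penalty lowers the score of predictions that are shorter than the gold answer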
Requires the evaluate library: pip install evaluate
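To make the interaction between unigram precision and the brevity penalty concrete, here is a hand-rolled sketch of the standard BLEU-1 definition. It does not call the evaluate library and may not match its output exactly: the helper names are hypothetical, and tokenization here is plain whitespace splitting with case-sensitive matching, which is an assumption.

```python
import math
from collections import Counter


def unigram_precision(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    if not pred_tokens:
        return 0.0
    gold_counts = Counter(gold.split())
    # Clip each predicted token's count by how often it occurs in the gold answer.
    clipped = sum(min(n, gold_counts[tok]) for tok, n in Counter(pred_tokens).items())
    return clipped / len(pred_tokens)


def brevity_penalty(prediction: str, gold: str) -> float:
    c, r = len(prediction.split()), len(gold.split())
    if c == 0:
        return 0.0
    # No penalty when the prediction is at least as long as the reference.
    return 1.0 if c >= r else math.exp(1 - r / c)


pred = "The tower is in Paris"
gold = "The Eiffel Tower is in Paris"
precision = unigram_precision(pred, gold)  # 4 of 5 tokens match ("tower" != "Tower")
penalty = brevity_penalty(pred, gold)      # exp(1 - 6/5)
print(round(precision, 3))                 # 0.8
print(round(precision * penalty, 3))       # 0.655
```

With a tokenizer that lowercases input, "tower" would match "Tower" and the numbers would change, so treat these values as specific to this sketch.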

QABleu4Score

Evaluates answer quality using BLEU-4 (up to 4-gram precision) score.

Usage

from remem.evaluation.qa_bleu import QABleu4Score

metric = QABleu4Score()
pooled_results, example_results = metric.calculate_metric_scores(
    gold_answers=[["The Eiffel Tower is located in Paris, France"]],
    predicted_answers=["The Eiffel Tower is in Paris"]
)
print(pooled_results)  # {"BLEU-4": ...}

Parameters

global_config
Optional[BaseConfig]
Global configuration object (optional)

Methods

calculate_metric_scores

Calculates the BLEU-4 score between predicted and gold answers. Signature:
def calculate_metric_scores(
    gold_answers: List[List[str]],
    predicted_answers: List[str],
    aggregation_fn: Callable = np.max,
    **kwargs
) -> Tuple[Dict[str, float], List[Dict[str, float]]]
Arguments:
gold_answers
List[List[str]]
required
List of lists containing ground truth answers.
predicted_answers
List[str]
required
List of predicted answers.
aggregation_fn
Callable
default:"np.max"
Function to aggregate scores across multiple gold answers.
Returns: A tuple containing:
  • Dict[str, float]: Pooled results with averaged BLEU-4 score
  • List[Dict[str, float]]: Per-example BLEU-4 scores

calculate_corpus_bleu

Calculate corpus-level BLEU score (alternative evaluation method). Signature:
def calculate_corpus_bleu(
    gold_answers: List[List[str]],
    predicted_answers: List[str]
) -> Dict[str, float]
Returns: Dictionary containing the corpus BLEU score computed over the entire corpus rather than averaging individual sentence-level scores.
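The difference between averaging sentence-level scores and scoring the corpus as a whole can be seen in a simplified sketch. The assumptions are deliberate: it uses unigram precision only, with no brevity penalty and no higher-order n-grams, and the function names are hypothetical rather than part of the library.

```python
from collections import Counter
from typing import List, Tuple


def unigram_counts(prediction: str, gold: str) -> Tuple[int, int]:
    """Return (clipped matches, prediction length) for one example."""
    pred_tokens = prediction.split()
    gold_counts = Counter(gold.split())
    matches = sum(min(n, gold_counts[t]) for t, n in Counter(pred_tokens).items())
    return matches, len(pred_tokens)


def sentence_average(pairs: List[Tuple[str, str]]) -> float:
    # Score each example separately, then average: every example weighs the same.
    stats = [unigram_counts(p, g) for p, g in pairs]
    return sum(m / n for m, n in stats) / len(stats)


def corpus_level(pairs: List[Tuple[str, str]]) -> float:
    # Pool counts over all examples before dividing: longer predictions weigh more.
    stats = [unigram_counts(p, g) for p, g in pairs]
    return sum(m for m, _ in stats) / sum(n for _, n in stats)


pairs = [("Paris", "Paris"), ("the answer is probably Lyon", "Paris")]
print(round(sentence_average(pairs), 3))  # 0.5
print(round(corpus_level(pairs), 3))      # 0.167
```

The long wrong answer contributes five unmatched tokens to the pooled count, so the corpus-level figure drops well below the per-example average.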

Interpretation

  • Score Range: 0.0 to 1.0
  • Higher is Better: Yes
  • Measures: N-gram precision up to 4-grams (captures phrase-level similarity)
  • Use Case: Better for longer answers where phrase structure matters
  • Note: Stricter than BLEU-1 because it also requires matching 2-, 3-, and 4-word sequences
Requires the evaluate library: pip install evaluate

Common Patterns

Multiple Gold Answers

All metrics support multiple acceptable answers per example:
from remem.evaluation.qa_eval import QAExactMatch

gold_answers = [
    ["Paris", "paris", "PARIS"],  # Multiple acceptable forms
    ["42", "forty-two", "forty two"]  # Different representations
]
predicted_answers = ["paris", "42"]

em_metric = QAExactMatch()
results, _ = em_metric.calculate_metric_scores(gold_answers, predicted_answers)

Custom Aggregation

By default, metrics use np.max to take the best score across gold answers. You can customize this:
import numpy as np

from remem.evaluation.qa_eval import QAF1Score

# Use mean instead of max
metric = QAF1Score()
results, _ = metric.calculate_metric_scores(
    gold_answers=[["answer1", "answer2"]],
    predicted_answers=["answer1"],
    aggregation_fn=np.mean  # Average across gold answers
)
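The effect of the aggregation function can be illustrated without the library. In this sketch, `aggregate_over_golds` is a hypothetical helper, and the built-in `max` and `statistics.mean` stand in for `np.max` and `np.mean`. The per-gold scores are those of the example above, where the prediction "answer1" scores 1.0 against "answer1" and 0.0 against "answer2".

```python
from statistics import mean
from typing import Callable, Sequence


def aggregate_over_golds(
    per_gold_scores: Sequence[float], aggregation_fn: Callable = max
) -> float:
    # Collapse one example's per-gold scores into a single example score.
    return float(aggregation_fn(per_gold_scores))


per_gold_scores = [1.0, 0.0]  # prediction vs. "answer1", prediction vs. "answer2"
best = aggregate_over_golds(per_gold_scores)            # default: best match wins
average = aggregate_over_golds(per_gold_scores, mean)   # every gold answer counts
print(best)     # 1.0
print(average)  # 0.5
```

Taking the maximum rewards matching any one acceptable answer, while the mean penalizes a prediction for each gold variant it does not match.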

Answer Normalization

EM and F1 metrics automatically normalize answers by:
  • Converting to lowercase
  • Removing articles (a, an, the)
  • Removing punctuation
  • Removing extra whitespace
This ensures fair comparison across formatting variations.
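A minimal sketch of these four steps, in the style commonly used for SQuAD-type evaluation, might look as follows. The function name `normalize_answer` and the order in which the steps are applied are assumptions; the library's own implementation may differ.

```python
import re
import string


def normalize_answer(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = "".join(ch for ch in text if ch not in string.punctuation) # drop punctuation
    text = re.sub(r"\b(a|an|the)\b", " ", text)                       # drop articles
    return " ".join(text.split())                                     # collapse whitespace


print(normalize_answer("The  Eiffel Tower!"))  # eiffel tower
print(normalize_answer("Paris.") == normalize_answer("paris"))  # True
```

Only ASCII punctuation is stripped in this sketch, so answers containing other punctuation marks would need extra handling.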
