Concepts
ROUGE-1
Measures unigram overlap — the count of individual tokens shared between the candidate and the reference, regardless of order. It captures vocabulary coverage but ignores word sequence.
ROUGE-L
Measures the Longest Common Subsequence (LCS) — the longest sequence of tokens that appears in both texts in the same relative order, but not necessarily contiguously. It captures sentence-level fluency and structure.
ROUGE1
Computes ROUGE-1 precision, recall, and F1 between candidates and refs. Duplicate tokens are handled correctly: if a token appears twice in both the candidate and the reference, it counts as two matches. Returns zero for all outputs when either slice is empty.
The tokenised candidate text as a slice of strings.
The tokenised reference text as a slice of strings.
float64 values:
Fraction of candidate tokens that appear in the reference.
Fraction of reference tokens that appear in the candidate.
Harmonic mean of precision and recall (F1 score).
Example
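A runnable sketch of the documented behaviour. The library's import path is not shown in this reference, so the lowercase rouge1 helper below re-implements the logic locally rather than calling the package; it is illustrative, not the library function itself.

```go
package main

import "fmt"

// rouge1 sketches the documented ROUGE-1: multiset unigram overlap,
// so duplicate tokens in both slices count once per occurrence.
func rouge1(cand, ref []string) (p, r, f float64) {
	if len(cand) == 0 || len(ref) == 0 {
		return // zero for all outputs when either slice is empty
	}
	counts := map[string]int{}
	for _, t := range ref {
		counts[t]++
	}
	match := 0
	for _, t := range cand {
		if counts[t] > 0 {
			counts[t]-- // each reference token is matched at most once
			match++
		}
	}
	p = float64(match) / float64(len(cand))
	r = float64(match) / float64(len(ref))
	if p+r > 0 {
		f = 2 * p * r / (p + r) // harmonic mean
	}
	return
}

func main() {
	cand := []string{"the", "cat", "sat", "on", "the", "mat"}
	ref := []string{"the", "cat", "is", "on", "the", "mat"}
	p, r, f := rouge1(cand, ref)
	fmt.Printf("precision=%.3f recall=%.3f f1=%.3f\n", p, r, f)
	// precision=0.833 recall=0.833 f1=0.833
}
```

Five of the six candidate tokens (both "the", "cat", "on", "mat") appear in the reference, giving precision and recall of 5/6.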
ROUGEL
Computes ROUGE-L precision, recall, and F1 between candidates and refs. Unlike ROUGE-1, ROUGE-L requires tokens to appear in the same relative order, making it sensitive to word sequence and sentence structure. Returns zero for all outputs when either slice is empty.
The tokenised candidate text as a slice of strings.
The tokenised reference text as a slice of strings.
float64 values:
LCS length divided by the number of candidate tokens.
LCS length divided by the number of reference tokens.
Harmonic mean of precision and recall.
Example
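A runnable sketch under the same caveat as above: the library itself is not imported, so lcs and rougeL are local stand-ins for the documented functions.

```go
package main

import "fmt"

// lcs computes LCS length with a one-row dynamic programme:
// after processing row i, dp[j] holds LCS(a[:i], b[:j]).
func lcs(a, b []string) int {
	dp := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		prev := 0 // dp value for (i-1, j-1)
		for j := 1; j <= len(b); j++ {
			cur := dp[j]
			if a[i-1] == b[j-1] {
				dp[j] = prev + 1
			} else if dp[j-1] > dp[j] {
				dp[j] = dp[j-1]
			}
			prev = cur
		}
	}
	return dp[len(b)]
}

// rougeL sketches the documented ROUGE-L scores.
func rougeL(cand, ref []string) (p, r, f float64) {
	if len(cand) == 0 || len(ref) == 0 {
		return
	}
	l := float64(lcs(cand, ref))
	p = l / float64(len(cand))
	r = l / float64(len(ref))
	if p+r > 0 {
		f = 2 * p * r / (p + r)
	}
	return
}

func main() {
	cand := []string{"the", "cat", "sat", "on", "the", "mat"}
	ref := []string{"the", "cat", "is", "on", "the", "mat"}
	p, r, f := rougeL(cand, ref)
	fmt.Printf("precision=%.3f recall=%.3f f1=%.3f\n", p, r, f)
	// precision=0.833 recall=0.833 f1=0.833
}
```

The LCS here is "the cat on the mat" (length 5), so every shared token also lies on the LCS and the scores equal the ROUGE-1 scores for this pair.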
ROUGE-1 and ROUGE-L produce the same scores for this example because the shared tokens happen to form the LCS. In general, ROUGE-L ≤ ROUGE-1 since a common subsequence is a subset of unigram overlaps. The difference becomes apparent when shared tokens appear in a different order (e.g.,
["a","b","c"] vs ["c","b","a"] has ROUGE-1=1.0 but ROUGE-L≈0.33).
ROUGELsum
Computes summary-level ROUGE-L over tokenised sentences, accumulating LCS statistics across all sentences before computing the scores. Returns zero for all outputs when either slice is empty.
A slice of tokenised candidate sentences. Each inner slice is one sentence represented as a sequence of string tokens.
A slice of tokenised reference sentences. Each inner slice is one reference sentence.
float64 values:
Total LCS tokens divided by total candidate tokens across all sentences.
Total LCS tokens divided by total reference tokens across all sentences.
Harmonic mean of precision and recall.
Example
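A runnable sketch of the accumulation described above. Two assumptions are flagged here: the rougeLSum helper is a local stand-in, not the library function, and it pairs the i-th candidate sentence with the i-th reference sentence, which is one plausible reading of "across all sentences" — the library may align sentences differently.

```go
package main

import "fmt"

// lcs: one-row dynamic programme for LCS length (see the LCS section).
func lcs(a, b []string) int {
	dp := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		prev := 0
		for j := 1; j <= len(b); j++ {
			cur := dp[j]
			if a[i-1] == b[j-1] {
				dp[j] = prev + 1
			} else if dp[j-1] > dp[j] {
				dp[j] = dp[j-1]
			}
			prev = cur
		}
	}
	return dp[len(b)]
}

// rougeLSum totals LCS lengths over sentence pairs (index-wise
// pairing is an assumption), then divides by total token counts.
func rougeLSum(cands, refs [][]string) (p, r, f float64) {
	totalLCS, totalCand, totalRef := 0, 0, 0
	for i := 0; i < len(cands) && i < len(refs); i++ {
		totalLCS += lcs(cands[i], refs[i])
	}
	for _, s := range cands {
		totalCand += len(s)
	}
	for _, s := range refs {
		totalRef += len(s)
	}
	if totalCand == 0 || totalRef == 0 {
		return
	}
	p = float64(totalLCS) / float64(totalCand)
	r = float64(totalLCS) / float64(totalRef)
	if p+r > 0 {
		f = 2 * p * r / (p + r)
	}
	return
}

func main() {
	cands := [][]string{{"the", "cat", "sat"}, {"it", "was", "happy"}}
	refs := [][]string{{"the", "cat", "sat", "down"}, {"it", "was", "glad"}}
	p, r, f := rougeLSum(cands, refs)
	fmt.Printf("precision=%.3f recall=%.3f f1=%.3f\n", p, r, f)
	// precision=0.833 recall=0.714 f1=0.769
}
```

The two sentence pairs contribute LCS lengths 3 and 2, so the totals are 5 LCS tokens over 6 candidate tokens and 7 reference tokens.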
Overlap
Counts the overlapping tokens between a and b. Each token in b is matched against tokens in a at most once, correctly handling duplicate tokens in both slices. This is the matching function used internally by ROUGE1.
The first token sequence (typically the candidate).
The second token sequence (typically the reference).
int — the number of matched token pairs.
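The at-most-once matching amounts to a multiset intersection, which can be sketched as follows (the lowercase overlap helper is illustrative, not the library function):

```go
package main

import "fmt"

// overlap counts tokens of b that can be matched to distinct
// tokens of a, i.e. the size of the multiset intersection.
func overlap(a, b []string) int {
	counts := map[string]int{}
	for _, t := range a {
		counts[t]++
	}
	n := 0
	for _, t := range b {
		if counts[t] > 0 {
			counts[t]-- // consume one occurrence from a
			n++
		}
	}
	return n
}

func main() {
	// a contains only two "a" tokens, so the third "a" in b is unmatched.
	fmt.Println(overlap([]string{"a", "a", "b"}, []string{"a", "a", "a"}))
	// 2
}
```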
LCS
Computes the length of the longest common subsequence of a and b using dynamic programming. A common subsequence is a sequence of tokens that appears in both slices in the same relative order but not necessarily contiguously. This is the matching function used internally by ROUGEL and ROUGELsum.
The first token sequence.
The second token sequence.
int — the length of the longest common subsequence.
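The classic O(len(a)·len(b)) table formulation of this dynamic programme can be sketched as follows; lcsLen is an illustrative local name, not the library function:

```go
package main

import "fmt"

// lcsLen fills dp[i][j] = LCS length of a[:i] and b[:j]:
// extend the diagonal on a match, otherwise carry the best
// of dropping the last token of a or of b.
func lcsLen(a, b []string) int {
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] >= dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

func main() {
	// "b" and "d" match in order even though other tokens interrupt them.
	fmt.Println(lcsLen(
		[]string{"a", "b", "c", "d"},
		[]string{"b", "x", "d"},
	))
	// 2
}
```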
F1
Computes the F1 score, equivalent to FBeta(precision, recall, 1.0). Returns 0.0 when both inputs are zero.
The precision value in [0, 1].
The recall value in [0, 1].
float64 — the F1 score.
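Written out, F1 is the harmonic mean of precision and recall; a minimal sketch of the documented helper (the lowercase name is illustrative):

```go
package main

import "fmt"

// f1 = 2PR / (P + R), with the documented zero guard.
func f1(precision, recall float64) float64 {
	if precision+recall == 0 {
		return 0
	}
	return 2 * precision * recall / (precision + recall)
}

func main() {
	fmt.Printf("%.3f\n", f1(0.75, 0.6))
	// 0.667
}
```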
FBeta
Computes the F-beta score. With beta = 1.0 this is identical to F1. Values of beta > 1 weight recall more heavily; values of beta < 1 weight precision more heavily. Returns 0.0 when both precision and recall are zero.
The precision value in [0, 1].
The recall value in [0, 1].
The weighting factor. Use 1.0 for balanced F1, 2.0 to emphasise recall (F2), or 0.5 to emphasise precision (F0.5).
float64 — the F-beta score.
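The standard F-beta formula is F_beta = (1 + beta²)·P·R / (beta²·P + R); a minimal sketch with an illustrative local name:

```go
package main

import "fmt"

// fBeta weights recall by beta^2 relative to precision.
func fBeta(precision, recall, beta float64) float64 {
	if precision == 0 && recall == 0 {
		return 0
	}
	b2 := beta * beta
	return (1 + b2) * precision * recall / (b2*precision + recall)
}

func main() {
	p, r := 0.9, 0.5
	fmt.Printf("F1=%.3f F2=%.3f F0.5=%.3f\n",
		fBeta(p, r, 1.0), fBeta(p, r, 2.0), fBeta(p, r, 0.5))
	// F1=0.643 F2=0.549 F0.5=0.776
}
```

With precision high and recall low, F2 is pulled towards the weaker recall while F0.5 is pulled towards the stronger precision, bracketing the balanced F1.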