ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics for evaluating text generation quality by comparing a candidate text against one or more reference texts. The reval package implements ROUGE-1, ROUGE-L, and ROUGE-Lsum, along with the underlying helper functions.

Concepts

ROUGE-1

Measures unigram overlap — the count of individual tokens shared between the candidate and the reference, regardless of order. It captures vocabulary coverage but ignores word sequence.

ROUGE-L

Measures the Longest Common Subsequence (LCS) — the longest sequence of tokens that appears in both texts in the same relative order, but not necessarily contiguously. It captures sentence-level fluency and structure.

ROUGE-Lsum

Extends ROUGE-L to multi-sentence summaries: it computes a per-sentence LCS against the best-matching reference sentence, then aggregates across all candidate sentences.

All three variants return precision, recall, and F1, so you can choose which aspect of quality to optimise for.

ROUGE1

func ROUGE1(candidates, refs []string) (precision, recall, f1 float64)
Returns the ROUGE-1 score based on unigram (token) overlap between candidates and refs. Duplicate tokens are handled correctly: if a token appears twice in both the candidate and the reference, it counts as two matches. Returns zero for all outputs when either slice is empty.
candidates
[]string
required
The tokenised candidate text as a slice of strings.
refs
[]string
required
The tokenised reference text as a slice of strings.
Returns three float64 values:
precision
float64
Fraction of candidate tokens that appear in the reference.
recall
float64
Fraction of reference tokens that appear in the candidate.
f1
float64
Harmonic mean of precision and recall (F1 score).

Example

func ExampleROUGE1() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7143, 0.8333, 0.7692
}

ROUGEL

func ROUGEL(candidates, refs []string) (precision, recall, f1 float64)
Returns the ROUGE-L score based on the Longest Common Subsequence between candidates and refs. Unlike ROUGE-1, ROUGE-L requires tokens to appear in the same relative order, making it sensitive to word sequence and sentence structure. Returns zero for all outputs when either slice is empty.
candidates
[]string
required
The tokenised candidate text as a slice of strings.
refs
[]string
required
The tokenised reference text as a slice of strings.
Returns three float64 values:
precision
float64
LCS length divided by the number of candidate tokens.
recall
float64
LCS length divided by the number of reference tokens.
f1
float64
Harmonic mean of precision and recall.

Example

func ExampleROUGEL() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGEL(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7143, 0.8333, 0.7692
}
ROUGE-1 and ROUGE-L produce the same scores for this example because the shared tokens happen to form the LCS. In general, ROUGE-L ≤ ROUGE-1, since every token in a common subsequence also counts as a unigram match. The difference shows when shared tokens appear in a different order: ["a","b","c"] vs ["c","b","a"] scores ROUGE-1 = 1.0 but ROUGE-L ≈ 0.33, because the LCS has length 1.

ROUGELsum

func ROUGELsum(candidates, refs [][]string) (precision, recall, f1 float64)
Returns the ROUGE-Lsum score for multi-sentence summaries. For each candidate sentence, the function finds the best-matching reference sentence by LCS length, then accumulates across all candidate sentences. Returns zero for all outputs when either slice is empty.
candidates
[][]string
required
A slice of tokenised candidate sentences. Each inner slice is one sentence represented as a sequence of string tokens.
refs
[][]string
required
A slice of tokenised reference sentences. Each inner slice is one reference sentence.
Returns three float64 values:
precision
float64
Total LCS tokens divided by total candidate tokens across all sentences.
recall
float64
Total LCS tokens divided by total reference tokens across all sentences.
f1
float64
Harmonic mean of precision and recall.

Example

func ExampleROUGELsum() {
	candidates := [][]string{
		{"the", "cat", "is", "on", "the", "mat"},
		{"it", "is", "cute"},
	}

	refs := [][]string{
		{"the", "dog", "is", "on", "the", "mat"},
		{"the", "animal", "is", "cute"},
		{"the", "pet", "sleeps", "well"},
	}

	precision, recall, f1 := reval.ROUGELsum(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7778, 0.5000, 0.6087
}

Overlap

func Overlap(a, b []string) int
Returns the count of overlapping tokens between slices a and b. Each token in b is matched against tokens in a at most once, correctly handling duplicate tokens in both slices. This is the matching function used internally by ROUGE1.
a
[]string
required
The first token sequence (typically the candidate).
b
[]string
required
The second token sequence (typically the reference).
Returns int — the number of matched token pairs.
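The duplicate-aware matching can be sketched with a token frequency map. This is an illustrative reimplementation of the behaviour described above, not necessarily the package's own code:

```go
package main

import "fmt"

// overlap counts tokens shared between a and b, matching each
// occurrence at most once: min(count in a, count in b) per token.
func overlap(a, b []string) int {
	remaining := make(map[string]int, len(a))
	for _, t := range a {
		remaining[t]++
	}
	n := 0
	for _, t := range b {
		if remaining[t] > 0 {
			remaining[t]--
			n++
		}
	}
	return n
}

func main() {
	// "the" appears twice in both slices, so it counts as two matches.
	a := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	b := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(overlap(a, b)) // 5
}
```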

LCS

func LCS(a, b []string) int
Returns the length of the Longest Common Subsequence between a and b using dynamic programming. A common subsequence is a sequence of tokens that appears in both slices in the same relative order but not necessarily contiguously. This is the matching function used internally by ROUGEL and ROUGELsum.
a
[]string
required
The first token sequence.
b
[]string
required
The second token sequence.
Returns int — the length of the longest common subsequence.
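A standard dynamic-programming formulation is sketched below for illustration; the package's actual implementation may differ in detail:

```go
package main

import "fmt"

// lcsLen computes the LCS length of a and b with an O(len(a)*len(b))
// dynamic program, keeping only one row of the table in memory.
func lcsLen(a, b []string) int {
	dp := make([]int, len(b)+1) // dp[j] = LCS(a[:i], b[:j]) for the current i
	for i := 1; i <= len(a); i++ {
		prev := 0 // holds dp[i-1][j-1] from the previous row
		for j := 1; j <= len(b); j++ {
			cur := dp[j]
			if a[i-1] == b[j-1] {
				dp[j] = prev + 1
			} else if dp[j-1] > dp[j] {
				dp[j] = dp[j-1]
			}
			prev = cur
		}
	}
	return dp[len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(lcsLen(cand, ref)) // 5: "the cat on the mat"
}
```

Note that the common subsequence "the cat on the mat" skips "is sitting" in the candidate and "sat" in the reference, which is exactly what distinguishes a subsequence from a substring.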

F1

func F1(precision, recall float64) float64
Returns the F1 score as the harmonic mean of precision and recall. Equivalent to FBeta(precision, recall, 1.0). Returns 0.0 when both inputs are zero.
precision
float64
required
The precision value in [0, 1].
recall
float64
required
The recall value in [0, 1].
Returns float64 — the F1 score.

FBeta

func FBeta(precision, recall, beta float64) float64
Returns the F-beta score, a generalisation of F1 that allows you to weight precision and recall differently. The formula is:
F_β = (1 + β²) × precision × recall / (β² × precision + recall)
When beta = 1.0 this is identical to F1. Values of beta > 1 weight recall more heavily; values of beta < 1 weight precision more heavily. Returns 0.0 when both precision and recall are zero.
precision
float64
required
The precision value in [0, 1].
recall
float64
required
The recall value in [0, 1].
beta
float64
required
The weighting factor. Use 1.0 for balanced F1, 2.0 to emphasise recall (F2), or 0.5 to emphasise precision (F0.5).
Returns float64 — the F-beta score.
