ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics for evaluating text generation quality by comparing a candidate text against one or more reference texts. The reval package implements ROUGE-1, ROUGE-L, and ROUGE-Lsum, along with the underlying helper functions.

Concepts

ROUGE-1

Measures unigram overlap — the count of individual tokens shared between the candidate and the reference, regardless of order. It captures vocabulary coverage but ignores word sequence.

ROUGE-L

Measures the Longest Common Subsequence (LCS) — the longest sequence of tokens that appears in both texts in the same relative order, but not necessarily contiguously. It captures sentence-level fluency and structure.

ROUGE-Lsum

Extends ROUGE-L to multi-sentence summaries: it computes a per-sentence LCS against the best-matching reference sentence, then aggregates across all candidate sentences.

All three variants return precision, recall, and F1, so you can choose which aspect of quality to optimise for.

ROUGE1

func ROUGE1(candidates, refs []string) (precision, recall, f1 float64)
Returns the ROUGE-1 score based on unigram (token) overlap between candidates and refs. Duplicate tokens are handled correctly: if a token appears twice in both the candidate and the reference, it counts as two matches. Returns zero for all outputs when either slice is empty.
candidates
[]string
required
The tokenised candidate text as a slice of strings.
refs
[]string
required
The tokenised reference text as a slice of strings.
Returns three float64 values:
precision
float64
Fraction of candidate tokens that appear in the reference.
recall
float64
Fraction of reference tokens that appear in the candidate.
f1
float64
Harmonic mean of precision and recall (F1 score).

Example

func ExampleROUGE1() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7143, 0.8333, 0.7692
}

ROUGEL

func ROUGEL(candidates, refs []string) (precision, recall, f1 float64)
Returns the ROUGE-L score based on the Longest Common Subsequence between candidates and refs. Unlike ROUGE-1, ROUGE-L requires tokens to appear in the same relative order, making it sensitive to word sequence and sentence structure. Returns zero for all outputs when either slice is empty.
candidates
[]string
required
The tokenised candidate text as a slice of strings.
refs
[]string
required
The tokenised reference text as a slice of strings.
Returns three float64 values:
precision
float64
LCS length divided by the number of candidate tokens.
recall
float64
LCS length divided by the number of reference tokens.
f1
float64
Harmonic mean of precision and recall.

Example

func ExampleROUGEL() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGEL(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7143, 0.8333, 0.7692
}
ROUGE-1 and ROUGE-L produce the same scores for this example because the shared tokens happen to form the LCS. In general, ROUGE-L ≤ ROUGE-1, since every token in a common subsequence also counts as a unigram match. The difference shows when shared tokens appear in a different order: ["a","b","c"] vs ["c","b","a"] scores ROUGE-1 = 1.0 but ROUGE-L ≈ 0.33, because the LCS has length 1.

ROUGELsum

func ROUGELsum(candidates, refs [][]string) (precision, recall, f1 float64)
Returns the ROUGE-Lsum score for multi-sentence summaries. For each candidate sentence, the function finds the best-matching reference sentence by LCS length, then accumulates across all candidate sentences. Returns zero for all outputs when either slice is empty.
candidates
[][]string
required
A slice of tokenised candidate sentences. Each inner slice is one sentence represented as a sequence of string tokens.
refs
[][]string
required
A slice of tokenised reference sentences. Each inner slice is one reference sentence.
Returns three float64 values:
precision
float64
Total LCS tokens divided by total candidate tokens across all sentences.
recall
float64
Total LCS tokens divided by total reference tokens across all sentences.
f1
float64
Harmonic mean of precision and recall.

Example

func ExampleROUGELsum() {
	candidates := [][]string{
		{"the", "cat", "is", "on", "the", "mat"},
		{"it", "is", "cute"},
	}

	refs := [][]string{
		{"the", "dog", "is", "on", "the", "mat"},
		{"the", "animal", "is", "cute"},
		{"the", "pet", "sleeps", "well"},
	}

	precision, recall, f1 := reval.ROUGELsum(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.7778, 0.5000, 0.6087
}

Overlap

func Overlap(a, b []string) int
Returns the count of overlapping tokens between slices a and b. Each token in b is matched against tokens in a at most once, correctly handling duplicate tokens in both slices. This is the matching function used internally by ROUGE1.
a
[]string
required
The first token sequence (typically the candidate).
b
[]string
required
The second token sequence (typically the reference).
Returns int — the number of matched token pairs.
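The duplicate-aware matching can be sketched with a token frequency map. This is an illustrative reimplementation of the behaviour described above, not necessarily the package's own code:

```go
package main

import "fmt"

// overlap counts tokens shared between a and b, matching each
// occurrence at most once: min(count in a, count in b) per token.
func overlap(a, b []string) int {
	remaining := make(map[string]int, len(a))
	for _, t := range a {
		remaining[t]++
	}
	n := 0
	for _, t := range b {
		if remaining[t] > 0 {
			remaining[t]--
			n++
		}
	}
	return n
}

func main() {
	// "the" appears twice in both slices, so it counts as two matches.
	a := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	b := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(overlap(a, b)) // 5
}
```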

LCS

func LCS(a, b []string) int
Returns the length of the Longest Common Subsequence between a and b using dynamic programming. A common subsequence is a sequence of tokens that appears in both slices in the same relative order but not necessarily contiguously. This is the matching function used internally by ROUGEL and ROUGELsum.
a
[]string
required
The first token sequence.
b
[]string
required
The second token sequence.
Returns int — the length of the longest common subsequence.
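A standard dynamic-programming formulation is sketched below for illustration; the package's actual implementation may differ in detail:

```go
package main

import "fmt"

// lcsLen computes the LCS length of a and b with an O(len(a)*len(b))
// dynamic program, keeping only one row of the table in memory.
func lcsLen(a, b []string) int {
	dp := make([]int, len(b)+1) // dp[j] = LCS(a[:i], b[:j]) for the current i
	for i := 1; i <= len(a); i++ {
		prev := 0 // holds dp[i-1][j-1] from the previous row
		for j := 1; j <= len(b); j++ {
			cur := dp[j]
			if a[i-1] == b[j-1] {
				dp[j] = prev + 1
			} else if dp[j-1] > dp[j] {
				dp[j] = dp[j-1]
			}
			prev = cur
		}
	}
	return dp[len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(lcsLen(cand, ref)) // 5: "the cat on the mat"
}
```

Note that the common subsequence "the cat on the mat" skips "is sitting" in the candidate and "sat" in the reference, which is exactly what distinguishes a subsequence from a substring.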

F1

func F1(precision, recall float64) float64
Returns the F1 score as the harmonic mean of precision and recall. Equivalent to FBeta(precision, recall, 1.0). Returns 0.0 when both inputs are zero.
precision
float64
required
The precision value in [0, 1].
recall
float64
required
The recall value in [0, 1].
Returns float64 — the F1 score.

FBeta

func FBeta(precision, recall, beta float64) float64
Returns the F-beta score, a generalisation of F1 that allows you to weight precision and recall differently. The formula is:
F_β = (1 + β²) × precision × recall / (β² × precision + recall)
When beta = 1.0 this is identical to F1. Values of beta > 1 weight recall more heavily; values of beta < 1 weight precision more heavily. Returns 0.0 when both precision and recall are zero.
precision
float64
required
The precision value in [0, 1].
recall
float64
required
The recall value in [0, 1].
beta
float64
required
The weighting factor. Use 1.0 for balanced F1, 2.0 to emphasise recall (F2), or 0.5 to emphasise precision (F0.5).
Returns float64 — the F-beta score.
