Text generation evaluation measures how closely a model’s output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline.
reval provides two complementary approaches:
- ROUGE — lexical overlap between candidate and reference tokens
- BERTScore — semantic similarity using dense vector embeddings
## Tokenization
All reval text functions operate on `[]string` token slices, not raw strings. You are responsible for tokenizing text before passing it in.
```go
import "strings"

candidate := "the cat is sitting on the mat"
tokens := strings.Fields(candidate) // ["the", "cat", "is", "sitting", "on", "the", "mat"]
```
`strings.Fields` splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
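As one small step beyond whitespace splitting, here is a sketch of a tokenizer that lowercases and drops punctuation. The `tokenize` helper is hypothetical (not part of reval) and uses only the standard library:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the input and splits on any rune that is
// neither a letter nor a digit, so punctuation never sticks to tokens.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(tokenize("The cat, sitting on the mat!"))
	// [the cat sitting on the mat]
}
```

Whatever tokenizer you choose, apply the exact same one to candidates and references; mixing tokenizers silently deflates every overlap metric below.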
## ROUGE-1
ROUGE-1 measures unigram (single token) overlap between candidate and reference.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7143, 0.8333, 0.7692
}
```
The function returns all three values: precision (the fraction of candidate tokens that appear in the reference), recall (the fraction of reference tokens that appear in the candidate), and F1 (their harmonic mean).
## ROUGE-L
ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGEL(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7143, 0.8333, 0.7692
}
```
ROUGE-L is more sensitive to word order than ROUGE-1. Use ROUGE-L when the ordering of key phrases matters (e.g., translation quality). Use ROUGE-1 for bag-of-words overlap tasks like keyword coverage in summaries.
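To see the order sensitivity concretely, here is a standalone sketch of LCS length via the standard dynamic-programming recurrence (a hypothetical helper, independent of reval's implementation):

```go
package main

import "fmt"

// lcs returns the length of the longest common subsequence of a and b,
// filled in over a (len(a)+1) x (len(b)+1) table.
func lcs(a, b []string) int {
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] > dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(lcs(cand, ref)) // 5: "the cat ... on the mat"

	// Reversing the candidate leaves its unigram counts (and so
	// ROUGE-1) unchanged, but shrinks the LCS — this is exactly
	// the reordering penalty ROUGE-L adds.
	rev := []string{"mat", "the", "on", "sitting", "is", "cat", "the"}
	fmt.Println(lcs(rev, ref)) // smaller than 5
}
```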
## ROUGE-Lsum
ROUGELsum evaluates multi-sentence candidates against multiple reference sentences. It finds the best-matching reference for each candidate sentence and accumulates LCS scores across all sentences.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := [][]string{
		{"the", "cat", "is", "on", "the", "mat"},
		{"it", "is", "cute"},
	}
	refs := [][]string{
		{"the", "dog", "is", "on", "the", "mat"},
		{"the", "animal", "is", "cute"},
		{"the", "pet", "sleeps", "well"},
	}

	precision, recall, f1 := reval.ROUGELsum(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7778, 0.5000, 0.6087
}
```
Use ROUGELsum when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference summaries to compare against.
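Since ROUGELsum takes `[][]string`, raw text first needs sentence splitting and per-sentence tokenization. A minimal sketch of that preparation step — the `splitSentences` helper is hypothetical, and splitting on `"."` is deliberately naive (production code should use a real sentence segmenter):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences splits on "." and whitespace-tokenizes each
// sentence, dropping empties, to build a [][]string input.
func splitSentences(text string) [][]string {
	var out [][]string
	for _, s := range strings.Split(text, ".") {
		tokens := strings.Fields(strings.ToLower(s))
		if len(tokens) > 0 {
			out = append(out, tokens)
		}
	}
	return out
}

func main() {
	candidate := splitSentences("The cat is on the mat. It is cute.")
	fmt.Println(candidate)
	// [[the cat is on the mat] [it is cute]]
}
```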
## BERTScore
BERTScore computes semantic similarity using pre-computed token embeddings. It greedily matches each candidate embedding to the most similar reference embedding via dot product, then returns precision, recall, and F1.
BERTScore expects embeddings you compute yourself — for example, using a BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	// Each inner slice is one token's embedding vector.
	// In practice, generate these with a real embedding model.
	candidates := [][]float64{
		{0.1, 0.2, 0.3},
		{0.4, 0.5, 0.6},
	}
	refs := [][]float64{
		{0.1, 0.2, 0.3},
		{0.7, 0.8, 0.9},
	}

	precision, recall, f1 := reval.BERTScore(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.8600, 0.7700, 0.8125
}
```
### L2-normalizing embeddings
If your embedding model does not produce unit-norm vectors, normalize them before scoring to ensure dot product equals cosine similarity:
```go
for i, emb := range candidates {
	candidates[i] = reval.Normalize(emb)
}
for i, emb := range refs {
	refs[i] = reval.Normalize(emb)
}
```
## ROUGE vs BERTScore
| | ROUGE | BERTScore |
|---|---|---|
| Measures | Lexical token overlap | Semantic vector similarity |
| Requires embeddings | No | Yes |
| Sensitive to synonyms | No | Yes |
| Fast to compute | Yes | Depends on embedding model |
| Best for | Quick offline eval, shared tasks | Semantic quality, paraphrase tolerance |