Text generation evaluation measures how closely a model’s output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline. reval provides two complementary approaches:
  • ROUGE — lexical overlap between candidate and reference tokens
  • BERTScore — semantic similarity using dense vector embeddings

Tokenization

All reval text functions operate on []string token slices, not raw strings. You are responsible for tokenizing text before passing it in.
import "strings"

candidate := "the cat is sitting on the mat"
tokens := strings.Fields(candidate) // ["the", "cat", "is", "sitting", "on", "the", "mat"]
strings.Fields splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
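As a starting point, here is a minimal normalizing tokenizer sketch: it lowercases and splits on any non-alphanumeric rune, so punctuation and casing differences between candidate and reference disappear. The `tokenize` helper is our own illustration, not part of reval.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the input and splits on any rune that is not
// a letter or digit, so "The cat, the mat!" and "the cat the mat"
// yield identical token slices.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(tokenize("The cat, the mat!")) // [the cat the mat]
}
```

Whatever tokenizer you choose, apply the identical function to candidates and references; mixing tokenization schemes silently deflates every score below.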

ROUGE-1

ROUGE-1 measures unigram (single token) overlap between candidate and reference.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGE1(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}
The function returns all three values: precision (the fraction of candidate tokens that appear in the reference), recall (the fraction of reference tokens that appear in the candidate), and F1 (their harmonic mean).
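Under the hood, ROUGE-1 is clipped unigram counting: each candidate token can match at most as many times as it occurs in the reference. A minimal reimplementation sketch (our own code, not reval's internals) that reproduces the numbers above:

```go
package main

import "fmt"

// rouge1 counts candidate tokens that match the reference, clipping
// each token's matches at its reference frequency, then derives
// precision, recall, and F1.
func rouge1(cand, ref []string) (p, r, f1 float64) {
	counts := map[string]int{}
	for _, t := range ref {
		counts[t]++
	}
	match := 0
	for _, t := range cand {
		if counts[t] > 0 {
			counts[t]--
			match++
		}
	}
	p = float64(match) / float64(len(cand))
	r = float64(match) / float64(len(ref))
	if p+r > 0 {
		f1 = 2 * p * r / (p + r)
	}
	return p, r, f1
}

func main() {
	p, r, f1 := rouge1(
		[]string{"the", "cat", "is", "sitting", "on", "the", "mat"},
		[]string{"the", "cat", "sat", "on", "the", "mat"},
	)
	fmt.Printf("%.4f, %.4f, %.4f\n", p, r, f1) // 0.7143, 0.8333, 0.7692
}
```

Here 5 of the 7 candidate tokens match ("the" twice, "cat", "on", "mat"), giving precision 5/7 and recall 5/6.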

ROUGE-L

ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
    refs := []string{"the", "cat", "sat", "on", "the", "mat"}

    precision, recall, f1 := reval.ROUGEL(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7143, 0.8333, 0.7692
}
ROUGE-L is more sensitive to word order than ROUGE-1. Use ROUGE-L when the ordering of key phrases matters (e.g., translation quality). Use ROUGE-1 for bag-of-words overlap tasks like keyword coverage in summaries.
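To see why order matters, consider the LCS computation itself. A standard dynamic-programming sketch (our own `lcs` helper, not reval's code): a candidate with the reference's words in order scores a longer subsequence than the same words reversed.

```go
package main

import "fmt"

// lcs returns the length of the longest common subsequence of a and b
// via the classic (len(a)+1) x (len(b)+1) dynamic-programming table.
func lcs(a, b []string) int {
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			if a[i-1] == b[j-1] {
				dp[i][j] = dp[i-1][j-1] + 1
			} else if dp[i-1][j] > dp[i][j-1] {
				dp[i][j] = dp[i-1][j]
			} else {
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

func main() {
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(lcs([]string{"the", "cat", "sat"}, ref)) // 3: all in order
	fmt.Println(lcs([]string{"sat", "cat", "the"}, ref)) // 2: order broken
}
```

ROUGE-1 would score both candidates identically, since they contain the same bag of tokens; only the LCS-based score distinguishes them.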

ROUGE-Lsum

ROUGE-Lsum evaluates a multi-sentence candidate against multiple reference sentences. It finds the best-matching reference sentence for each candidate sentence and accumulates LCS scores across all sentences.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    candidates := [][]string{
        {"the", "cat", "is", "on", "the", "mat"},
        {"it", "is", "cute"},
    }
    refs := [][]string{
        {"the", "dog", "is", "on", "the", "mat"},
        {"the", "animal", "is", "cute"},
        {"the", "pet", "sleeps", "well"},
    }

    precision, recall, f1 := reval.ROUGELsum(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.7778, 0.5000, 0.6087
}
Use ROUGELsum when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference summaries to compare against.
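Since ROUGELsum takes `[][]string` (one token slice per sentence), you also need a sentence splitter. A naive sketch that splits on periods and tokenizes with strings.Fields (the `sentences` helper is our own; production code should use a real sentence segmenter):

```go
package main

import (
	"fmt"
	"strings"
)

// sentences splits text on "." and tokenizes each non-empty piece
// with strings.Fields. This ignores abbreviations, decimals, and
// other punctuation; a proper sentence splitter handles those.
func sentences(text string) [][]string {
	var out [][]string
	for _, s := range strings.Split(text, ".") {
		if toks := strings.Fields(s); len(toks) > 0 {
			out = append(out, toks)
		}
	}
	return out
}

func main() {
	fmt.Println(sentences("the cat is on the mat. it is cute."))
	// [[the cat is on the mat] [it is cute]]
}
```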

BERTScore

BERTScore computes semantic similarity using pre-computed token embeddings. It greedily matches each candidate embedding to the most similar reference embedding via dot product, then returns precision, recall, and F1.
BERTScore expects embeddings you compute yourself — for example, using a BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
package main

import (
    "fmt"
    "github.com/itsubaki/reval"
)

func main() {
    // Each inner slice is one token's embedding vector.
    // In practice, generate these with a real embedding model.
    candidates := [][]float64{
        {0.1, 0.2, 0.3},
        {0.4, 0.5, 0.6},
    }
    refs := [][]float64{
        {0.1, 0.2, 0.3},
        {0.7, 0.8, 0.9},
    }

    precision, recall, f1 := reval.BERTScore(candidates, refs)
    fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
    // Output: 0.8600, 0.7700, 0.8125
}
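To make the greedy matching concrete, here is a sketch of the scoring step (our own reimplementation for illustration, not reval's internals). Precision averages each candidate vector's best dot product against the references; recall swaps the roles; the sketch reproduces the numbers above.

```go
package main

import "fmt"

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// greedyScore averages, over each vector in xs, its highest
// dot-product similarity to any vector in ys.
func greedyScore(xs, ys [][]float64) float64 {
	var sum float64
	for _, x := range xs {
		best := dot(x, ys[0])
		for _, y := range ys[1:] {
			if d := dot(x, y); d > best {
				best = d
			}
		}
		sum += best
	}
	return sum / float64(len(xs))
}

func main() {
	cand := [][]float64{{0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}}
	refs := [][]float64{{0.1, 0.2, 0.3}, {0.7, 0.8, 0.9}}
	p := greedyScore(cand, refs) // candidates matched to references
	r := greedyScore(refs, cand) // references matched to candidates
	fmt.Printf("%.4f, %.4f, %.4f\n", p, r, 2*p*r/(p+r))
	// 0.8600, 0.7700, 0.8125
}
```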

L2-normalizing embeddings

If your embedding model does not produce unit-norm vectors, normalize them before scoring to ensure dot product equals cosine similarity:
for i, emb := range candidates {
    candidates[i] = reval.Normalize(emb)
}
for i, emb := range refs {
    refs[i] = reval.Normalize(emb)
}
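If you want to see what normalization does (or prefer not to rely on a library helper), a minimal L2 normalization sketch — scale each vector by the inverse of its Euclidean length:

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize scales v to unit Euclidean length. Zero vectors are
// returned unchanged to avoid dividing by zero.
func l2Normalize(v []float64) []float64 {
	var sq float64
	for _, x := range v {
		sq += x * x
	}
	if sq == 0 {
		return v
	}
	norm := math.Sqrt(sq)
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	u := l2Normalize([]float64{3, 4}) // length 5, so components shrink 5x
	fmt.Printf("%.1f %.1f\n", u[0], u[1]) // 0.6 0.8
}
```

After normalization the dot product of two vectors equals their cosine similarity, which is what BERTScore's matching step assumes.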

ROUGE vs BERTScore

                       ROUGE                             BERTScore
Measures               Lexical token overlap             Semantic vector similarity
Requires embeddings    No                                Yes
Sensitive to synonyms  No                                Yes
Fast to compute        Yes                               Depends on embedding model
Best for               Quick offline eval, shared tasks  Semantic quality, paraphrase tolerance
