Text generation evaluation measures how closely a model’s output matches one or more reference texts. Use these metrics to evaluate summarization, machine translation, or answers generated by a RAG pipeline.
reval provides two complementary approaches:
- ROUGE — lexical overlap between candidate and reference tokens
- BERTScore — semantic similarity using dense vector embeddings
## Tokenization
All reval text functions operate on `[]string` token slices, not raw strings. You are responsible for tokenizing text before passing it in.
```go
import "strings"

candidate := "the cat is sitting on the mat"
tokens := strings.Fields(candidate) // ["the", "cat", "is", "sitting", "on", "the", "mat"]
```
`strings.Fields` splits on whitespace and is fine for quick experiments. For production evaluation, use a proper tokenizer that handles punctuation, casing, and stemming consistently between candidates and references.
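As one small step beyond whitespace splitting, here is a sketch of a tokenizer that lowercases and drops punctuation. The `tokenize` helper is hypothetical (not part of reval) and uses only the standard library:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the input and splits on any rune that is
// neither a letter nor a digit, so punctuation never sticks to tokens.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Println(tokenize("The cat, sitting on the mat!"))
	// [the cat sitting on the mat]
}
```

Whatever tokenizer you choose, apply the exact same one to candidates and references; mixing tokenizers silently deflates every overlap metric below.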
## ROUGE-1
ROUGE-1 measures unigram (single token) overlap between candidate and reference.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7143, 0.8333, 0.7692
}
```
The function returns all three values: precision (the fraction of candidate tokens that appear in the reference), recall (the fraction of reference tokens that appear in the candidate), and F1 (their harmonic mean).
## ROUGE-L
ROUGE-L measures the Longest Common Subsequence (LCS) between candidate and reference, capturing in-order word matches without requiring them to be contiguous.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGEL(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7143, 0.8333, 0.7692
}
```
ROUGE-L is more sensitive to word order than ROUGE-1. Use ROUGE-L when the ordering of key phrases matters (e.g., translation quality). Use ROUGE-1 for bag-of-words overlap tasks like keyword coverage in summaries.
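To see the order sensitivity concretely, here is a standalone sketch of LCS length via the standard dynamic-programming recurrence (a hypothetical helper, independent of reval's implementation):

```go
package main

import "fmt"

// lcs returns the length of the longest common subsequence of a and b,
// filled in over a (len(a)+1) x (len(b)+1) table.
func lcs(a, b []string) int {
	dp := make([][]int, len(a)+1)
	for i := range dp {
		dp[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			switch {
			case a[i-1] == b[j-1]:
				dp[i][j] = dp[i-1][j-1] + 1
			case dp[i-1][j] > dp[i][j-1]:
				dp[i][j] = dp[i-1][j]
			default:
				dp[i][j] = dp[i][j-1]
			}
		}
	}
	return dp[len(a)][len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}
	fmt.Println(lcs(cand, ref)) // 5: "the cat ... on the mat"

	// Reversing the candidate leaves its unigram counts (and so
	// ROUGE-1) unchanged, but shrinks the LCS — this is exactly
	// the reordering penalty ROUGE-L adds.
	rev := []string{"mat", "the", "on", "sitting", "is", "cat", "the"}
	fmt.Println(lcs(rev, ref)) // smaller than 5
}
```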
## ROUGE-Lsum
ROUGELsum evaluates multi-sentence candidates against multiple reference sentences. It finds the best-matching reference for each candidate sentence and accumulates LCS scores across all sentences.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := [][]string{
		{"the", "cat", "is", "on", "the", "mat"},
		{"it", "is", "cute"},
	}
	refs := [][]string{
		{"the", "dog", "is", "on", "the", "mat"},
		{"the", "animal", "is", "cute"},
		{"the", "pet", "sleeps", "well"},
	}

	precision, recall, f1 := reval.ROUGELsum(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.7778, 0.5000, 0.6087
}
```
Use ROUGELsum when your candidate is a multi-sentence document (e.g., an extractive summary) and you have multiple reference summaries to compare against.
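Since ROUGELsum takes `[][]string`, raw text first needs sentence splitting and per-sentence tokenization. A minimal sketch of that preparation step — the `splitSentences` helper is hypothetical, and splitting on `"."` is deliberately naive (production code should use a real sentence segmenter):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences splits on "." and whitespace-tokenizes each
// sentence, dropping empties, to build a [][]string input.
func splitSentences(text string) [][]string {
	var out [][]string
	for _, s := range strings.Split(text, ".") {
		tokens := strings.Fields(strings.ToLower(s))
		if len(tokens) > 0 {
			out = append(out, tokens)
		}
	}
	return out
}

func main() {
	candidate := splitSentences("The cat is on the mat. It is cute.")
	fmt.Println(candidate)
	// [[the cat is on the mat] [it is cute]]
}
```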
## BERTScore
BERTScore computes semantic similarity using pre-computed token embeddings. It greedily matches each candidate embedding to the most similar reference embedding via dot product, then returns precision, recall, and F1.
BERTScore expects embeddings you compute yourself — for example, using a BERT, Sentence-BERT, or any other embedding model. The function does not perform tokenization or encoding; it only handles the matching and scoring step.
```go
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	// Each inner slice is one token's embedding vector.
	// In practice, generate these with a real embedding model.
	candidates := [][]float64{
		{0.1, 0.2, 0.3},
		{0.4, 0.5, 0.6},
	}
	refs := [][]float64{
		{0.1, 0.2, 0.3},
		{0.7, 0.8, 0.9},
	}

	precision, recall, f1 := reval.BERTScore(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output: 0.8600, 0.7700, 0.8125
}
```
### L2-normalizing embeddings
If your embedding model does not produce unit-norm vectors, normalize them before scoring to ensure dot product equals cosine similarity:
```go
for i, emb := range candidates {
	candidates[i] = reval.Normalize(emb)
}
for i, emb := range refs {
	refs[i] = reval.Normalize(emb)
}
```
## ROUGE vs BERTScore
| | ROUGE | BERTScore |
|---|---|---|
| Measures | Lexical token overlap | Semantic vector similarity |
| Requires embeddings | No | Yes |
| Sensitive to synonyms | No | Yes |
| Fast to compute | Yes | Depends on embedding model |
| Best for | Quick offline eval, shared tasks | Semantic quality, paraphrase tolerance |