1. Install

Add reval to your Go module:
go get github.com/itsubaki/reval
The library has zero external dependencies and requires Go 1.24.5 or later.
2. Import

Import the package in your Go source file:
import "github.com/itsubaki/reval"
3. Compute Precision@K

Precision measures how many of the top K predicted results are relevant. Pass a ranked list of predicted item IDs, a map of relevance scores (any value ≥ 1 counts as relevant), and the cutoff K.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{
		"A": 3,
		"B": 2,
		"C": 0,
		"D": 0,
		"E": 3,
	}

	s := reval.Precision(predicted, relevance, 3)
	fmt.Println("Precision@3:", s)
	// Output:
	// Precision@3: 0.6666666666666666
}
Of the top 3 results (A, B, C), two are relevant (A and B), so Precision@3 = 2/3.
Use AveragePrecision to reward systems that rank relevant items higher, and MeanAveragePrecision to aggregate scores across multiple queries.
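To see what AveragePrecision rewards, here is a standalone sketch of the Average Precision formula: the mean of Precision@i taken at each rank i where a relevant item appears. This is illustrative code, not reval's implementation; it normalizes by the number of relevant items actually retrieved, and reval.AveragePrecision may use a different convention (for example, dividing by all relevant items).

```go
package main

import "fmt"

// averagePrecision sketches the Average Precision formula by hand:
// at every rank where a relevant item appears, take Precision@that-rank,
// then average those values over the retrieved relevant items.
func averagePrecision(predicted []string, relevance map[string]int) float64 {
	var hits, sum float64
	for i, id := range predicted {
		if relevance[id] >= 1 {
			hits++
			sum += hits / float64(i+1) // Precision@(i+1) at this hit
		}
	}
	if hits == 0 {
		return 0
	}
	return sum / hits
}

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0, "D": 0, "E": 3}

	fmt.Println("AP:", averagePrecision(predicted, relevance))
	// A hits at rank 1 (1/1) and B at rank 2 (2/2), so AP = (1 + 1) / 2 = 1
}
```

Note that this ordering scores higher than one that buries A and B at the bottom of the list, even though Precision@4 would be identical — that position sensitivity is the point of Average Precision.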
4. Compute ROUGE-1

ROUGE1 measures unigram overlap between a candidate sequence and a reference sequence. It returns precision, recall, and F1.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output:
	// 0.7143, 0.8333, 0.7692
}
Both candidates and refs are pre-tokenized slices of strings — reval does not perform tokenization itself.
Use ROUGEL for longest common subsequence scoring, or ROUGELsum when comparing multi-sentence summaries.
5. Compute BERTScore

BERTScore computes semantic similarity between candidate and reference token embeddings using maximum pairwise dot products. Pass pre-computed embedding vectors — reval does not call any model or external service.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := [][]float64{
		{0.1, 0.2, 0.3},
		{0.4, 0.5, 0.6},
	}
	refs := [][]float64{
		{0.1, 0.2, 0.3},
		{0.7, 0.8, 0.9},
	}

	precision, recall, f1 := reval.BERTScore(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output:
	// 0.8600, 0.7700, 0.8125
}
Each inner slice is one token’s embedding vector. Use reval.Normalize to L2-normalize vectors before scoring if your embeddings are not already unit-normalized.
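For reference, L2 normalization itself is a one-liner per vector. This standalone sketch shows the operation reval.Normalize is described as performing (it is not reval's code): divide each component by the vector's Euclidean length.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize scales a vector to unit Euclidean length.
// Zero vectors are returned unchanged to avoid dividing by zero.
func l2Normalize(v []float64) []float64 {
	var ss float64
	for _, x := range v {
		ss += x * x
	}
	norm := math.Sqrt(ss)
	if norm == 0 {
		return v
	}
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	u := l2Normalize([]float64{3, 4}) // length 5, so components become 3/5 and 4/5
	fmt.Printf("%.1f %.1f\n", u[0], u[1])
	// 0.6 0.8
}
```

Once all vectors are unit-length, a dot product equals the cosine similarity of the two tokens, which is what the maximum-pairwise-dot-product matching in BERTScore assumes.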

Explore the metrics reference

Precision & MAP

Precision, AveragePrecision, MeanAveragePrecision, and the QueryResult type.

Recall

Recall — fraction of all relevant items retrieved in the top K results.
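Reusing the data from the Precision@K step, the definition can be sketched by hand. This is an illustrative standalone helper, not reval's code; reval.Recall is the library's implementation.

```go
package main

import "fmt"

// recallAtK sketches Recall@K: of all relevant items anywhere in the
// relevance map (score >= 1), the fraction found in the top K predictions.
func recallAtK(predicted []string, relevance map[string]int, k int) float64 {
	if k > len(predicted) {
		k = len(predicted)
	}
	var total float64
	for _, score := range relevance {
		if score >= 1 {
			total++
		}
	}
	if total == 0 {
		return 0
	}
	var hits float64
	for _, id := range predicted[:k] {
		if relevance[id] >= 1 {
			hits++
		}
	}
	return hits / total
}

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0, "D": 0, "E": 3}

	fmt.Println("Recall@3:", recallAtK(predicted, relevance, 3))
	// A and B retrieved out of 3 relevant items (A, B, E): 2/3
}
```

Note how the never-predicted item E lowers recall but not precision — the two metrics penalize different failure modes.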

NDCG

NDCG and DCG — graded ranking quality with position-based discounting.
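As an illustration of position-based discounting, here is a standalone sketch using one common DCG form, gain (2^rel − 1) divided by log2(rank + 1). This is not reval's code, and reval.DCG may use a variant (e.g. linear gain); NDCG is then DCG divided by the DCG of the ideal ordering.

```go
package main

import (
	"fmt"
	"math"
)

// dcg sketches Discounted Cumulative Gain over the top K predictions:
// each item's graded gain is discounted by the log of its rank, so
// relevant items placed lower in the list contribute less.
func dcg(predicted []string, relevance map[string]int, k int) float64 {
	if k > len(predicted) {
		k = len(predicted)
	}
	var sum float64
	for i, id := range predicted[:k] {
		gain := math.Pow(2, float64(relevance[id])) - 1
		sum += gain / math.Log2(float64(i+2)) // rank i+1 → discount log2(i+2)
	}
	return sum
}

func main() {
	predicted := []string{"A", "B", "C"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0}

	fmt.Printf("DCG@3: %.4f\n", dcg(predicted, relevance, 3))
	// 7/log2(2) + 3/log2(3) + 0 ≈ 8.8928
}
```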

ROUGE

ROUGE1, ROUGEL, ROUGELsum — text overlap for summarization evaluation.
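The longest common subsequence underlying ROUGE-L can be sketched with standard dynamic programming; this standalone helper illustrates the idea and is not reval's implementation. ROUGE-L then derives precision and recall from the LCS length over the candidate and reference lengths.

```go
package main

import "fmt"

// lcsLen computes the longest-common-subsequence length of two token
// slices with a two-row DP table (O(len(b)) memory instead of a full grid).
func lcsLen(a, b []string) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for _, x := range a {
		for j, y := range b {
			switch {
			case x == y:
				cur[j+1] = prev[j] + 1 // extend the subsequence
			case prev[j+1] >= cur[j]:
				cur[j+1] = prev[j+1] // skip a token of a
			default:
				cur[j+1] = cur[j] // skip a token of b
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}

	fmt.Println("LCS length:", lcsLen(cand, ref))
	// "the cat ... on the mat" → 5
}
```

Unlike the clipped unigram counting of ROUGE-1, the LCS requires matches to appear in the same order, so it also captures sentence-level word order.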

BERTScore

BERTScore — semantic similarity using dense embedding vectors.
