1. Install

Add reval to your Go module:
go get github.com/itsubaki/reval
The library has zero external dependencies and requires Go 1.24.5 or later.
2. Import

Import the package in your Go source file:
import "github.com/itsubaki/reval"
3. Compute Precision@K

Precision measures how many of the top K predicted results are relevant. Pass a ranked list of predicted item IDs, a map of relevance scores (any value ≥ 1 counts as relevant), and the cutoff K.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{
		"A": 3,
		"B": 2,
		"C": 0,
		"D": 0,
		"E": 3,
	}

	s := reval.Precision(predicted, relevance, 3)
	fmt.Println("Precision@3:", s)
	// Output:
	// Precision@3: 0.6666666666666666
}
Of the top 3 results (A, B, C), two are relevant (A and B), so Precision@3 = 2/3.
Use AveragePrecision to reward systems that rank relevant items higher, and MeanAveragePrecision to aggregate scores across multiple queries.
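To see what AveragePrecision rewards, here is a standalone sketch of the Average Precision formula: the mean of Precision@i taken at each rank i where a relevant item appears. This is illustrative code, not reval's implementation; it normalizes by the number of relevant items actually retrieved, and reval.AveragePrecision may use a different convention (for example, dividing by all relevant items).

```go
package main

import "fmt"

// averagePrecision sketches the Average Precision formula by hand:
// at every rank where a relevant item appears, take Precision@that-rank,
// then average those values over the retrieved relevant items.
func averagePrecision(predicted []string, relevance map[string]int) float64 {
	var hits, sum float64
	for i, id := range predicted {
		if relevance[id] >= 1 {
			hits++
			sum += hits / float64(i+1) // Precision@(i+1) at this hit
		}
	}
	if hits == 0 {
		return 0
	}
	return sum / hits
}

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0, "D": 0, "E": 3}

	fmt.Println("AP:", averagePrecision(predicted, relevance))
	// A hits at rank 1 (1/1) and B at rank 2 (2/2), so AP = (1 + 1) / 2 = 1
}
```

Note that this ordering scores higher than one that buries A and B at the bottom of the list, even though Precision@4 would be identical — that position sensitivity is the point of Average Precision.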
4. Compute ROUGE-1

ROUGE1 measures unigram overlap between a candidate sequence and a reference sequence. It returns precision, recall, and F1.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	refs := []string{"the", "cat", "sat", "on", "the", "mat"}

	precision, recall, f1 := reval.ROUGE1(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output:
	// 0.7143, 0.8333, 0.7692
}
Both candidates and refs are pre-tokenized slices of strings — reval does not perform tokenization itself.
Use ROUGEL for longest common subsequence scoring, or ROUGELsum when comparing multi-sentence summaries.
5. Compute BERTScore

BERTScore computes semantic similarity between candidate and reference token embeddings using maximum pairwise dot products. Pass pre-computed embedding vectors — reval does not call any model or external service.
package main

import (
	"fmt"

	"github.com/itsubaki/reval"
)

func main() {
	candidates := [][]float64{
		{0.1, 0.2, 0.3},
		{0.4, 0.5, 0.6},
	}
	refs := [][]float64{
		{0.1, 0.2, 0.3},
		{0.7, 0.8, 0.9},
	}

	precision, recall, f1 := reval.BERTScore(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)
	// Output:
	// 0.8600, 0.7700, 0.8125
}
Each inner slice is one token’s embedding vector. Use reval.Normalize to L2-normalize vectors before scoring if your embeddings are not already unit-normalized.
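For reference, L2 normalization itself is a one-liner per vector. This standalone sketch shows the operation reval.Normalize is described as performing (it is not reval's code): divide each component by the vector's Euclidean length.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize scales a vector to unit Euclidean length.
// Zero vectors are returned unchanged to avoid dividing by zero.
func l2Normalize(v []float64) []float64 {
	var ss float64
	for _, x := range v {
		ss += x * x
	}
	norm := math.Sqrt(ss)
	if norm == 0 {
		return v
	}
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	u := l2Normalize([]float64{3, 4}) // length 5, so components become 3/5 and 4/5
	fmt.Printf("%.1f %.1f\n", u[0], u[1])
	// 0.6 0.8
}
```

Once all vectors are unit-length, a dot product equals the cosine similarity of the two tokens, which is what the maximum-pairwise-dot-product matching in BERTScore assumes.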

Explore the metrics reference

Precision & MAP

Precision, AveragePrecision, MeanAveragePrecision, and the QueryResult type.

Recall

Recall — fraction of all relevant items retrieved in the top K results.
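Reusing the data from the Precision@K step, the definition can be sketched by hand. This is an illustrative standalone helper, not reval's code; reval.Recall is the library's implementation.

```go
package main

import "fmt"

// recallAtK sketches Recall@K: of all relevant items anywhere in the
// relevance map (score >= 1), the fraction found in the top K predictions.
func recallAtK(predicted []string, relevance map[string]int, k int) float64 {
	if k > len(predicted) {
		k = len(predicted)
	}
	var total float64
	for _, score := range relevance {
		if score >= 1 {
			total++
		}
	}
	if total == 0 {
		return 0
	}
	var hits float64
	for _, id := range predicted[:k] {
		if relevance[id] >= 1 {
			hits++
		}
	}
	return hits / total
}

func main() {
	predicted := []string{"A", "B", "C", "D"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0, "D": 0, "E": 3}

	fmt.Println("Recall@3:", recallAtK(predicted, relevance, 3))
	// A and B retrieved out of 3 relevant items (A, B, E): 2/3
}
```

Note how the never-predicted item E lowers recall but not precision — the two metrics penalize different failure modes.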

NDCG

NDCG and DCG — graded ranking quality with position-based discounting.
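As an illustration of position-based discounting, here is a standalone sketch using one common DCG form, gain (2^rel − 1) divided by log2(rank + 1). This is not reval's code, and reval.DCG may use a variant (e.g. linear gain); NDCG is then DCG divided by the DCG of the ideal ordering.

```go
package main

import (
	"fmt"
	"math"
)

// dcg sketches Discounted Cumulative Gain over the top K predictions:
// each item's graded gain is discounted by the log of its rank, so
// relevant items placed lower in the list contribute less.
func dcg(predicted []string, relevance map[string]int, k int) float64 {
	if k > len(predicted) {
		k = len(predicted)
	}
	var sum float64
	for i, id := range predicted[:k] {
		gain := math.Pow(2, float64(relevance[id])) - 1
		sum += gain / math.Log2(float64(i+2)) // rank i+1 → discount log2(i+2)
	}
	return sum
}

func main() {
	predicted := []string{"A", "B", "C"}
	relevance := map[string]int{"A": 3, "B": 2, "C": 0}

	fmt.Printf("DCG@3: %.4f\n", dcg(predicted, relevance, 3))
	// 7/log2(2) + 3/log2(3) + 0 ≈ 8.8928
}
```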

ROUGE

ROUGE1, ROUGEL, ROUGELsum — text overlap for summarization evaluation.
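The longest common subsequence underlying ROUGE-L can be sketched with standard dynamic programming; this standalone helper illustrates the idea and is not reval's implementation. ROUGE-L then derives precision and recall from the LCS length over the candidate and reference lengths.

```go
package main

import "fmt"

// lcsLen computes the longest-common-subsequence length of two token
// slices with a two-row DP table (O(len(b)) memory instead of a full grid).
func lcsLen(a, b []string) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for _, x := range a {
		for j, y := range b {
			switch {
			case x == y:
				cur[j+1] = prev[j] + 1 // extend the subsequence
			case prev[j+1] >= cur[j]:
				cur[j+1] = prev[j+1] // skip a token of a
			default:
				cur[j+1] = cur[j] // skip a token of b
			}
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	cand := []string{"the", "cat", "is", "sitting", "on", "the", "mat"}
	ref := []string{"the", "cat", "sat", "on", "the", "mat"}

	fmt.Println("LCS length:", lcsLen(cand, ref))
	// "the cat ... on the mat" → 5
}
```

Unlike the clipped unigram counting of ROUGE-1, the LCS requires matches to appear in the same order, so it also captures sentence-level word order.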

BERTScore

BERTScore — semantic similarity using dense embedding vectors.
