BERTScore measures semantic similarity between a candidate text and a reference text using dense token embeddings. Rather than counting exact token matches like ROUGE, BERTScore compares meaning by looking at how similar the embedding vectors are — so paraphrases and synonyms can still earn high scores.
BERTScore expects pre-computed token embeddings — for example, the contextual vectors produced by a BERT model for each token in a sentence. You must extract these embeddings yourself before calling this function; reval does not perform tokenisation or model inference.

Internally, the function computes a greedy alignment using dot-product similarity: each token in the candidate is matched to its most similar token in the reference (and vice versa), and the scores are averaged to produce precision, recall, and F1.

For best results, L2-normalise your embeddings with Normalize before passing them in, so that the dot product equals cosine similarity.

BERTScore

func BERTScore(candidates, refs [][]float64) (precision, recall, f1 float64)
Returns the BERTScore between a candidate and a reference, both represented as sequences of dense token embedding vectors. Precision is computed by greedily matching each candidate token to its best reference token; recall is computed in the reverse direction; F1 is their harmonic mean. Returns zero for all outputs when either slice is empty.
candidates
[][]float64
required
A sequence of token embedding vectors for the candidate text. Each inner slice is the embedding for one token. All vectors should have the same dimensionality.
refs
[][]float64
required
A sequence of token embedding vectors for the reference text. Each inner slice is the embedding for one token. All vectors should have the same dimensionality as those in candidates.
Returns three float64 values:
precision
float64
Average maximum similarity from each candidate token to the most similar reference token.
recall
float64
Average maximum similarity from each reference token to the most similar candidate token.
f1
float64
Harmonic mean of precision and recall.

Example

func ExampleBERTScore() {
	candidates := [][]float64{
		{0.1, 0.2, 0.3},
		{0.4, 0.5, 0.6},
	}
	refs := [][]float64{
		{0.1, 0.2, 0.3},
		{0.7, 0.8, 0.9},
	}

	precision, recall, f1 := reval.BERTScore(candidates, refs)
	fmt.Printf("%.4f, %.4f, %.4f\n", precision, recall, f1)

	// Output:
	// 0.8600, 0.7700, 0.8125
}

DotProduct

func DotProduct(a, b []float64) float64
Returns the dot product of two vectors. This is the similarity function used internally by BERTScore for greedy token alignment. When vectors are L2-normalised, the dot product equals cosine similarity. Returns 0 if the vectors have different lengths.
a
[]float64
required
The first vector.
b
[]float64
required
The second vector. Must have the same length as a, otherwise 0 is returned.
Returns float64 — the sum of element-wise products.
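A short usage sketch, using a local implementation that mirrors the documented behavior (sum of element-wise products, 0 on a length mismatch):

```go
package main

import "fmt"

// dotProduct mirrors the documented behavior of reval.DotProduct:
// the sum of element-wise products, or 0 when lengths differ.
func dotProduct(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	fmt.Println(dotProduct([]float64{1, 2, 3}, []float64{4, 5, 6})) // 1*4 + 2*5 + 3*6 = 32
	fmt.Println(dotProduct([]float64{1, 2}, []float64{1, 2, 3}))    // length mismatch → 0
}
```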

L2Norm

func L2Norm(a []float64) float64
Returns the L2 (Euclidean) norm of vector a — the square root of the sum of squared elements. This is the magnitude used to normalise vectors before computing cosine similarity.
a
[]float64
required
The input vector.
Returns float64 — the Euclidean length of the vector.
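For illustration, a local implementation mirroring the documented definition (square root of the sum of squared elements):

```go
package main

import (
	"fmt"
	"math"
)

// l2Norm mirrors the documented behavior of reval.L2Norm:
// the Euclidean length of the vector.
func l2Norm(a []float64) float64 {
	s := 0.0
	for _, v := range a {
		s += v * v
	}
	return math.Sqrt(s)
}

func main() {
	fmt.Println(l2Norm([]float64{3, 4})) // sqrt(9 + 16) = 5
}
```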

Normalize

func Normalize(a []float64) []float64
Returns a new vector that is the L2-normalised version of a — that is, a divided by its L2 norm so that the result has unit length. If the norm is zero (the zero vector), the original slice is returned unchanged.
a
[]float64
required
The input vector to normalise.
Returns []float64 — a new slice with the same direction as a and a magnitude of 1.0.
Normalise your embeddings before passing them to BERTScore. When both the candidate and reference embeddings are unit vectors, the dot product computed internally is equivalent to cosine similarity, which is the standard similarity measure used in the original BERTScore paper.
