Evaluation happens in the TESTING block of sumo.Rmd. The workflow is: split the labelled history into training and test sets, fit both models on the training set, generate predictions on the test set, binarise those predictions, then compare them against the known results.

Train/test split

spl <- data.split(history, 0.85)
tr  <- spl[[1]]   # training set  (~85 % of rows)
ev  <- spl[[2]]   # evaluation set (~15 % of rows)
data.split() randomly samples a fraction of rows for training and puts the rest aside for evaluation:
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  return(list(data[train,], data[test,]))
}
A ratio of 0.85 means 85% of completed bouts train the models and 15% are held back for unbiased testing.
With a small dataset the 15 % evaluation split may contain only a handful of rows. Accuracy figures from a single split can vary significantly — run multiple splits and average the results for a more stable estimate.
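The averaging suggested above can be sketched as a small helper. This is not part of sumo.Rmd: mean.glm.accuracy is a hypothetical function, and data.split is repeated here so the sketch is self-contained. It refits the GLM on several random splits and averages the per-split accuracy:

```r
# Repeated from above so this sketch runs on its own.
data.split <- function(data, ratio) {
  n     <- round(nrow(data) * ratio)
  train <- sample(1:nrow(data), n)
  test  <- setdiff(1:nrow(data), train)
  list(data[train, ], data[test, ])
}

# Hypothetical helper: average GLM accuracy over k random splits.
# Expects a data frame with a logical `result` column, like `history`.
mean.glm.accuracy <- function(data, k = 20, ratio = 0.85) {
  accs <- replicate(k, {
    spl  <- data.split(data, ratio)
    fit  <- glm(result ~ ., spl[[1]], family = 'binomial')
    pred <- predict.glm(fit, spl[[2]]) > 0   # binarise on the link scale
    mean(pred == spl[[2]]$result)            # per-split accuracy
  })
  mean(accs)                                 # more stable estimate
}
```

Calling mean.glm.accuracy(history) then gives an accuracy figure that is far less sensitive to which rows happened to land in the 15% evaluation split.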

Fitting on the training set

bin <- glm(result ~ ., tr, family = 'binomial')

nn <- neuralnet(result ~ ., normalize(tr), hidden = c(4, 2),
                linear.output = FALSE, act.fct = 'logistic')
Note that the neural network receives a normalised version of tr (min–max scaled to [0, 1]) while the GLM receives the raw values. The evaluation set ev is passed unnormalised to both predict calls; since the network was trained on normalised inputs, this scale mismatch can distort its predictions.
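normalize() is defined elsewhere in sumo.Rmd and is not shown in this section. A typical min–max scaler matching the [0, 1] range described above might look like this (a sketch, not the file's exact definition):

```r
# Min–max scale every numeric column into [0, 1], leaving
# non-numeric columns (such as a logical result) untouched.
normalize <- function(data) {
  as.data.frame(lapply(data, function(col) {
    if (is.numeric(col)) {
      (col - min(col)) / (max(col) - min(col))  # NaN if a column is constant
    } else {
      col
    }
  }))
}
```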

Generating and binarising predictions

Raw model outputs are collected and then converted to logical (TRUE/FALSE) using the threshold rule for each model:
bin.ans <- predict.glm(bin, ev)  # link-scale (log-odds) scores
nn.ans  <- predict(nn, ev)       # probabilities in [0, 1]

results <- tibble(bin.ans) %>% cbind(nn.ans) %>% cbind(ev$result)

GLM threshold: bin.ans > 0

results$bin.ans <- results$bin.ans > 0
The GLM returns a log-odds score. Any positive value means the model favours wrestler 1; any negative value means it favours wrestler 2.
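Thresholding the log-odds at 0 is the same decision rule as thresholding the implied probability at 0.5, because the logistic function maps 0 to exactly 0.5. A quick check using base R's plogis() (the logistic CDF):

```r
# A log-odds score of 0 corresponds to a probability of exactly 0.5,
# so `score > 0` on the link scale and `p > 0.5` on the probability
# scale always agree.
scores <- c(-2, -0.1, 0, 0.1, 2)          # example link-scale outputs
plogis(scores)                            # the implied probabilities
(scores > 0) == (plogis(scores) > 0.5)    # identical decisions
```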

Neural network threshold: nn.ans >= 0.5

results$nn.ans <- results$nn.ans >= 0.5
The neural network (with linear.output = FALSE) returns a probability. The natural decision boundary is 0.5.

The results tibble

After binarisation, column names are standardised:
colnames(results) <- c('glm', 'nn', 'truth')
Column  Type     Description
glm     logical  GLM's binarised prediction (TRUE = left wins)
nn      logical  Neural network's binarised prediction
truth   logical  Actual match result from the evaluation set

Confusion matrix

caret’s confusionMatrix() compares each model’s predictions against the ground truth:
confusionMatrix(as.factor(results$glm), as.factor(results$truth))
confusionMatrix(as.factor(results$nn),  as.factor(results$truth))
The function reports accuracy, sensitivity, specificity, and other metrics. The GLM consistently achieves around 70% accuracy on the held-out evaluation set — higher than the neural network.
Both vectors must be converted with as.factor() before passing to confusionMatrix(). Passing raw logical vectors will raise an error.
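One subtlety with the as.factor() conversion: if a small evaluation split happens to contain only one outcome, the two factors can end up with different level sets and confusionMatrix() will complain about mismatched levels. Supplying explicit levels avoids this; the to.factor helper below is a defensive sketch, not code from sumo.Rmd:

```r
# Coerce a logical vector to a factor with a fixed level set, so the
# prediction and truth factors always have matching levels even when
# one of them contains only TRUE or only FALSE.
to.factor <- function(x) factor(x, levels = c(FALSE, TRUE))
# Usage: confusionMatrix(to.factor(results$glm), to.factor(results$truth))
```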

Interpreting the results

Metric       What it tells you
Accuracy     Overall fraction of correctly predicted bouts
Sensitivity  How often the model correctly predicts wrestler 1 wins when they do
Specificity  How often the model correctly predicts wrestler 2 wins when they do
Kappa        Agreement corrected for chance; useful when classes are imbalanced
A ~70% accuracy from the GLM means it picks the correct winner roughly 7 times out of 10, substantially above the 50% random baseline. The neural network’s accuracy sits below the GLM’s on this dataset, likely due to the small training size relative to the network’s capacity.
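Accuracy and the majority-class baseline can also be computed directly from the results tibble. The data frame below is purely illustrative, standing in for the real results built above:

```r
# Illustrative stand-in for the real `results` tibble from above.
results <- data.frame(
  glm   = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  truth = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)
mean(results$glm == results$truth)                 # accuracy
max(mean(results$truth), 1 - mean(results$truth))  # majority-class baseline
```

Comparing against the majority-class baseline rather than a flat 50% matters when one wrestler position wins more often than the other.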
